VladimirAlexiev commented 5 years ago

@sandrolabruzzo What can we use to parse the output format? It's not quite line-oriented JSON, e.g. (newlines added for readability):

"{'publisher': None, 
'issn': [{'type': u'print', 'value': u'0003-9942'}], 
'doi': u'10.1001/archneur.66.1.141', 'license': [], 
'published-print': u'2009-1-1', 
'title': [u'Guidelines for Letters'], 'issued': u'2009-1-1', 'abstract': [], 
'doi-url': u'http://dx.doi.org/10.1001/archneur.66.1.141', 
'instances': [{'url': u'http://jamanetwork.com/journals/jamaneurology/fullarticle/796320', 'provenance': u'CrossRef', 'access-rights': u'UNKNOWN'}], 
'authors': [], 
'collectedFrom': [u'CrossRef'], 
'accepted': None, 
'type': u'journal-article', 
'published-online': None, 
'subject': [u'Arts and Humanities (miscellaneous)', u'Clinical Neurology']}"

The whole line is surrounded by double quotes.
I'm not sure the u'STRING' notation is valid JSON.
I think JSON uses null, not None?

This looks like some Pythonic adaptation of JSON?
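
If each line really is a Python dict repr wrapped in double quotes, one way to read it is `ast.literal_eval`, which in Python 3 accepts `None`, single-quoted strings and the `u'...'` prefix. A minimal sketch under that assumption; the function name `repr_line_to_json` is made up here, and the guess that inner double quotes appear as `\"` may not hold for every line. Doubled-backslash Unicode escapes are not repaired by this.

```python
import ast
import json


def repr_line_to_json(line: str) -> str:
    """Convert one dump line (a quoted Python dict repr) into a JSON object string."""
    text = line.strip()
    # Drop the double quotes that surround the whole line.
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        text = text[1:-1]
    # Assumption: inner double quotes are written as \" and nothing else
    # needs unescaping at this outer level.
    text = text.replace('\\"', '"')
    # literal_eval only evaluates literals, so it is safe on untrusted input.
    record = ast.literal_eval(text)
    return json.dumps(record, ensure_ascii=False)


sample = "\"{'publisher': None, 'issn': [{'type': u'print', 'value': u'0003-9942'}], 'authors': []}\""
print(repr_line_to_json(sample))
```

The result is one JSON object per line, so the converted file can be read as NDJSON.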

cc @nikolatulechki

VladimirAlexiev commented 5 years ago

List of problems:

lines are surrounded by " (not conforming to NDJSON)
the field \"delay-in-days starts with a quote
some dates have 1-digit month and day number (not conforming to XSD)
some chars are escaped as \xXX but JSON supports only \uXXXX
Unicode escapes use quadruple backslashes (sometimes double backslashes eg \\u2018juridique\\u2019), should be single
JSON strings/keys should use double quotes not single quotes
JSON strings don't use a "u" prefix. So instead of u'string' we should have "string"
apostrophes in JSON don't need (nor admit) backslash escapes
sometimes 'UnpayWall' appears without prefix u'..'
some URLs include forbidden chars <>, eg http://dx.doi.org/10.1002/1097-010x(20010201)289:2%3c130::aid-jez6>3.0.co;2-#
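
Two of the items above (1-digit dates and forbidden URL characters) could be repaired after parsing. A rough sketch, not what the dump currently does: `pad_date` and `escape_url` are hypothetical helper names, and the set of characters left untouched in URLs is my assumption. In particular `%` is kept so existing escapes like `%3c` are not double-encoded, and `#` is left alone.

```python
from urllib.parse import quote


def pad_date(value: str) -> str:
    """Zero-pad month and day, so '2009-1-1' becomes '2009-01-01'."""
    parts = value.split("-")
    # The year is kept as is; any month/day parts are padded to two digits.
    return "-".join([parts[0]] + [p.zfill(2) for p in parts[1:]])


# Assumption: reserved URL characters and '%' stay untouched,
# everything else outside the unreserved set gets percent-encoded.
URL_SAFE = ":/?#[]@!$&'()*+,;=%"


def escape_url(url: str) -> str:
    """Percent-encode characters such as < and > that may not appear raw in a URL."""
    return quote(url, safe=URL_SAFE)


print(pad_date("2009-1-1"))  # 2009-01-01
print(escape_url("http://dx.doi.org/10.1002/1097-010x(20010201)289:2%3c130::aid-jez6>3.0.co;2-#"))
```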

sandrolabruzzo commented 5 years ago

You're right. I'll try to fix the output format this week and escape the characters properly. Once the new dump is ready on Zenodo, I'll let you know in a comment here.

VladimirAlexiev commented 5 years ago

6 (perl) fixes a lot of this, but not all.

When a string contains a single quote, u\"...\" is used as the string delimiter, e.g.:

'value': u\"http://en.wikipedia.org/wiki/Ministry_of_Health_of_the_People's_Republic_of_China\"

I'll fix this too, but of course it's better to fix it in the Python code so the original dataset is correct.

@nikolatulechki says this is PySpark output, so hopefully you can just tell it to use a different output format (JSON).
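
To illustrate what I mean: I'm guessing the records are Python dicts in an RDD written with `saveAsTextFile`, which writes the string representation of each element. If so, serializing each record with `json.dumps` first should give proper JSON lines. The names `records_rdd` and `output_dir` below are placeholders, not taken from the doiBoost code.

```python
import json

record = {
    "publisher": None,
    "title": ["Guidelines for Letters"],
    "doi": "10.1001/archneur.66.1.141",
}

# What ends up in the file when the dict is written via its string form:
as_repr = str(record)
# What a JSON serializer produces for the same record:
as_json = json.dumps(record, ensure_ascii=False)

print(as_repr)
print(as_json)

# Hypothetical PySpark equivalent (assumes an RDD of dicts named records_rdd):
# records_rdd.map(lambda r: json.dumps(r, ensure_ascii=False)).saveAsTextFile("output_dir")
```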

sandrolabruzzo / doiBoost

what is the output format? #1
