sandrolabruzzo / doiBoost


what is the output format? #1

Open VladimirAlexiev opened 5 years ago

VladimirAlexiev commented 5 years ago

@sandrolabruzzo What can we use to parse the output format? It's not quite line-oriented JSON, e.g. (newlines added for readability):

"{'publisher': None, 
'issn': [{'type': u'print', 'value': u'0003-9942'}], 
'doi': u'10.1001/archneur.66.1.141', 'license': [], 
'published-print': u'2009-1-1', 
'title': [u'Guidelines for Letters'], 'issued': u'2009-1-1', 'abstract': [], 
'doi-url': u'http://dx.doi.org/10.1001/archneur.66.1.141', 
'instances': [{'url': u'http://jamanetwork.com/journals/jamaneurology/fullarticle/796320', 'provenance': u'CrossRef', 'access-rights': u'UNKNOWN'}], 
'authors': [], 
'collectedFrom': [u'CrossRef'], 
'accepted': None, 
'type': u'journal-article', 
'published-online': None, 
'subject': [u'Arts and Humanities (miscellaneous)', u'Clinical Neurology']}"

This looks like some Pythonic adaptation of JSON?
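
For reference, a minimal sketch of one way to read such a record in Python 3. It assumes each line is a double-quoted string wrapping a Python 2 dict repr, as in the sample above; `parse_record` and `sample` are hypothetical names for illustration, not part of doiBoost.

```python
import ast
import json


def parse_record(line):
    """Parse one record, assumed to be a double-quoted Python dict repr."""
    text = line.strip()
    # Assumption: the outer double quotes wrap the whole record and any
    # inner double quotes are backslash-escaped, so the wrapper itself can
    # be read as a Python string literal.
    if text.startswith('"') and text.endswith('"'):
        text = ast.literal_eval(text)
    # literal_eval accepts only literals (dict, list, str, None, ...), so it
    # does not execute arbitrary code. Python 3 accepts the u'' prefix.
    return ast.literal_eval(text)


# Shortened version of the sample record from this issue.
sample = "\"{'publisher': None, 'issn': [{'type': u'print', 'value': u'0003-9942'}], 'doi': u'10.1001/archneur.66.1.141'}\""
record = parse_record(sample)

# Re-serialize as real JSON: None becomes null, strings get double quotes.
print(json.dumps(record))
```

This only holds if every line follows the same quoting rules as the sample; lines that deviate would raise `ValueError` or `SyntaxError` and need separate handling.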

cc @nikolatulechki

VladimirAlexiev commented 5 years ago

List of problems:

sandrolabruzzo commented 5 years ago

You're right. I'll try to fix the output format this week and escape characters properly. Once the new dump is ready on Zenodo, I'll let you know in a comment here.

VladimirAlexiev commented 5 years ago

6 (perl) fixes a lot of this, but not all.

If a string contains a single quote, u\"...\" is used as the string delimiter, e.g.:

'value': u\"http://en.wikipedia.org/wiki/Ministry_of_Health_of_the_People's_Republic_of_China\"

I'll fix this too, but of course it's better to fix it in the Python code so the original dataset is correct.

@nikolatulechki says this is PySpark output, so hopefully you can just tell it to use a different output format (JSON).
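
A rough sketch of what that could look like, assuming the dump is written from an RDD of Python dicts with `saveAsTextFile`, which writes the string form of each element (hence the dict repr). The names `to_json_line`, `records_rdd` and `output_path` are hypothetical; the actual job may be structured differently.

```python
import json


def to_json_line(record):
    """Serialize one record dict as a single line of JSON."""
    # json.dumps escapes quotes and newlines inside strings, so each record
    # stays on one line. ensure_ascii=False keeps non-ASCII text readable.
    return json.dumps(record, ensure_ascii=False)


record = {
    "doi": "10.1001/archneur.66.1.141",
    "publisher": None,
    "title": ["Guidelines for Letters"],
}

print(str(record))           # Python repr, similar to the current dump
print(to_json_line(record))  # line-oriented JSON

# In the PySpark job this would be applied before writing, e.g.
# (hypothetical variable names):
# records_rdd.map(to_json_line).saveAsTextFile(output_path)
```

With this, every output line can be read back with `json.loads`, regardless of which quote characters appear in the values.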