Open VladimirAlexiev opened 5 years ago
List of problems:
- Each line is wrapped in `"` (not conforming to NDJSON)
- `\"delay-in-days` starts with a quote
- `\xXX` escapes are used, but JSON supports only `\uXXXX`
- Backslashes are doubled (eg `\\u2018juridique\\u2019`), should be single
- `u'string'`: we should have `"string"`
- `<>` in URLs, eg http://dx.doi.org/10.1002/1097-010x(20010201)289:2%3c130::aid-jez6>3.0.co;2-#
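Several of these points can be checked against a strict JSON parser. Below is a minimal sketch using Python's standard `json` module. The helper name `is_valid_json` and the sample value `caf\xe9` are made up for illustration; the other strings are taken from the list above.

```python
import json


def is_valid_json(text):
    """Return True if `text` parses as a JSON document (hypothetical helper)."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# \uXXXX is a legal JSON escape; \xXX is not.
unicode_escape_ok = is_valid_json(r'"\u2018juridique\u2019"')
hex_escape_ok = is_valid_json(r'"caf\xe9"')  # illustrative value, not from the dump

# A doubled backslash still parses, but decodes to a literal backslash
# followed by "u2018" instead of the curly quote character.
doubled = json.loads(r'"\\u2018juridique\\u2019"')
single = json.loads(r'"\u2018juridique\u2019"')

# Python-style u'string' is not JSON; "string" is.
py_string_ok = is_valid_json("u'string'")
json_string_ok = is_valid_json('"string"')
```

So the doubled backslashes would not be rejected by a parser, they would just silently produce the wrong text, which is arguably worse than the `\xXX` case that fails loudly.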
You're right. I'll try to fix the output format this week and escape the characters properly. Once the new dump is ready on Zenodo, I'll let you know in a comment.
In case a string includes a single quote, `u\"...\"` is used as the string delimiter, eg `'value': u\"http://en.wikipedia.org/wiki/Ministry_of_Health_of_the_People's_Republic_of_China\"`
I'll fix this too, but of course it's better to fix it in the Python code so the original dataset is correct.
@nikolatulechki says this is pyspark output, hopefully you can just tell it to use a different output format (JSON).
@sandrolabruzzo What can we use to parse the output format? It's not quite line-oriented JSON, eg (newlines added for readability):
- `u'STRING'` notation
- JSON has `null`, not `None`?

This looks like some Pythonic adaptation of JSON?
cc @nikolatulechki
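If the lines really are Python literals, one possible way to read them is `ast.literal_eval` from the standard library, followed by `json.dumps` to re-emit proper JSON. This is only a sketch under that assumption: the function name `python_literal_to_json` and the `'other': None` field are made up for illustration, and any outer quote wrapping around a line would have to be removed first.

```python
import ast
import json


def python_literal_to_json(line):
    """Convert one line holding a Python literal (dict, list, ...) to a JSON string.

    Hypothetical helper: assumes the line is a plain Python literal such as
    {u'key': u'value', u'other': None}.
    """
    # literal_eval accepts only literals, so it does not run arbitrary code.
    record = ast.literal_eval(line.strip())
    # ensure_ascii=False keeps non-ASCII characters readable in the output.
    return json.dumps(record, ensure_ascii=False)


# The 'value' field is the example quoted in this thread; 'other' is invented
# here to show how None is mapped to null.
sample = (
    "{u'value': u\"http://en.wikipedia.org/wiki/"
    "Ministry_of_Health_of_the_People's_Republic_of_China\", u'other': None}"
)
converted = python_literal_to_json(sample)
```

The `u'...'` prefix is still accepted by Python 3, so this should work without a Python 2 interpreter, but it stays a workaround: producing real JSON at the source is cleaner.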