Closed (sckott closed 4 years ago)
Does this explain why abstracts get messed up too? I was looking at some malformatted abstracts, which ultimately turned out to be due to pubchunks: all of the abstract section titles get slammed into the regular text because the markup is stripped, without even a separating space (so they look like "BackgroundThe health benefits of regular", even though the original XML is fine: <abstract><sec><title>Background</title><p>The health benefits of regular). I was hoping for an option to convert to HTML or Markdown, or even just not strip the XML, but pubchunks seems to do this automatically as soon as you try to process any fulltext object.
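To make the symptom concrete, here is a minimal sketch in Python using only the standard library. It is not pubchunks code and says nothing about how pubchunks works internally; the sample XML and the function names `flatten` and `abstract_to_markdown` are made up for illustration. It assumes a JATS-style abstract whose `<sec>` elements each hold a `<title>` and one or more `<p>` children, with no XML namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample abstract (not taken from any real article).
SAMPLE = (
    "<abstract>"
    "<sec><title>Background</title><p>First paragraph.</p></sec>"
    "<sec><title>Methods</title><p>Second paragraph.</p></sec>"
    "</abstract>"
)


def flatten(xml_text):
    """Join every text node with no separator, which reproduces the
    run-together output described above."""
    root = ET.fromstring(xml_text)
    return "".join(root.itertext())


def abstract_to_markdown(xml_text):
    """Keep section titles as bold Markdown lines and paragraphs as
    separate blocks. Nested <sec> elements are not handled."""
    root = ET.fromstring(xml_text)
    # Unstructured abstracts have <p> directly under <abstract>.
    sections = root.findall("sec") or [root]
    blocks = []
    for sec in sections:
        title = sec.find("title")
        if title is not None:
            blocks.append("**" + "".join(title.itertext()).strip() + "**")
        for p in sec.findall("p"):
            # Collapse internal whitespace left over from the XML layout.
            blocks.append(" ".join("".join(p.itertext()).split()))
    return "\n\n".join(blocks)


if __name__ == "__main__":
    print(flatten(SAMPLE))
    print(abstract_to_markdown(SAMPLE))
```

The first call shows titles fused onto the following text; the second keeps each title on its own line, which is roughly what a "convert to Markdown" option would need to produce.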
Thanks for the comment, @gwern. Can you please open a new issue with an example or two of what you are talking about? Thanks!
one example looks like:
will probably need to make a custom parser for these references - for now we just yank out all text into a single string