ropensci-archive / pubchunks

:warning: ARCHIVED :warning: Get chunks of XML format scholarly articles

meaningful whitespace in references in some cases being stripped out #2

Closed sckott closed 4 years ago

sckott commented 6 years ago

One example looks like:

<ce:bib-reference id="BIB6">
   <ce:label>6.</ce:label>
   <sb:reference>
      <sb:contribution langtype="en">
         <sb:authors>
            <sb:author>
               <ce:given-name>F.</ce:given-name>
               <ce:surname>Morse</ce:surname>
            </sb:author>
         </sb:authors>
         <sb:title>
            <sb:maintitle>Kolebaniia i zvuk</sb:maintitle>
         </sb:title>
         <sb:translated-title>
            <sb:maintitle>Vibrations and Sound</sb:maintitle>
         </sb:translated-title>
      </sb:contribution>
      <sb:host>
         <sb:book>
            <sb:date>1949</sb:date>
            <sb:publisher>
               <sb:name>Gostekhizdat</sb:name>
               <sb:location>Moscow-Leningrad</sb:location>
            </sb:publisher>
         </sb:book>
      </sb:host>
   </sb:reference>
</ce:bib-reference>

We will probably need to write a custom parser for these references. For now we just yank out all the text into a single string.
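As a rough illustration of the whitespace problem, here is a minimal Python sketch (not the package's R code) that flattens the reference above into one string while keeping a separator between text nodes. The namespace URIs and the `reference_text` helper are assumptions made for this example, because the snippet in this issue does not declare its `ce:` and `sb:` prefixes.

```python
import xml.etree.ElementTree as ET

# Assumption: placeholder namespace URIs, since the snippet in this issue
# does not declare the ce: and sb: prefixes.
REFERENCE = """<ce:bib-reference xmlns:ce="urn:example:ce" xmlns:sb="urn:example:sb" id="BIB6">
<ce:label>6.</ce:label>
<sb:reference>
<sb:contribution langtype="en">
<sb:authors><sb:author>
<ce:given-name>F.</ce:given-name><ce:surname>Morse</ce:surname>
</sb:author></sb:authors>
<sb:title><sb:maintitle>Kolebaniia i zvuk</sb:maintitle></sb:title>
<sb:translated-title><sb:maintitle>Vibrations and Sound</sb:maintitle></sb:translated-title>
</sb:contribution>
<sb:host><sb:book>
<sb:date>1949</sb:date>
<sb:publisher>
<sb:name>Gostekhizdat</sb:name><sb:location>Moscow-Leningrad</sb:location>
</sb:publisher>
</sb:book></sb:host>
</sb:reference>
</ce:bib-reference>"""


def reference_text(xml_string, sep=" "):
    """Hypothetical helper: join every non-empty text node with `sep`."""
    root = ET.fromstring(xml_string)
    # itertext() yields text and tail strings in document order;
    # strip each piece and drop the whitespace-only ones.
    pieces = (piece.strip() for piece in root.itertext())
    return sep.join(piece for piece in pieces if piece)


flat = reference_text(REFERENCE)
print(flat)
```

Joining with an explicit separator keeps adjacent fields such as the given name and surname apart even when the source XML has no whitespace between the elements. A real reference parser would presumably go further and map each element to a named field.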

gwern commented 4 years ago

Does this explain why abstracts get messed up too? I was looking at some malformatted abstracts which ultimately turned out to be due to pubchunks: all of the abstract section headings get slammed into the regular text because the markup is stripped, without even a space in between (so they look like "BackgroundThe health benefits of regular" even though the original XML is fine, <abstract><sec><title>Background</title><p>The health benefits of regular). I was hoping for an option to convert to HTML or Markdown, or even just not strip the XML, but pubchunks seems to do this automatically as soon as you try to process any fulltext object.
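To make the request concrete, here is a small Python sketch of the kind of conversion meant here. It is only an illustration, not anything pubchunks provides. The `abstract_to_markdown` helper is hypothetical, and the sample abstract is the truncated fragment quoted above with closing tags added so that it parses.

```python
import xml.etree.ElementTree as ET

# The fragment quoted above, closed at the point where the quote stops.
ABSTRACT = (
    "<abstract><sec><title>Background</title>"
    "<p>The health benefits of regular</p></sec></abstract>"
)

# Plain text extraction with no separator reproduces the fused output.
fused = "".join(ET.fromstring(ABSTRACT).itertext())


def abstract_to_markdown(xml_string):
    """Hypothetical helper: bold section titles, one block per paragraph."""
    root = ET.fromstring(xml_string)
    blocks = []
    for element in root.iter():
        # Collapse internal whitespace in the element's full text.
        text = " ".join("".join(element.itertext()).split())
        if not text:
            continue
        if element.tag == "title":
            blocks.append(f"**{text}**")
        elif element.tag == "p":
            blocks.append(text)
    return "\n\n".join(blocks)


markdown = abstract_to_markdown(ABSTRACT)
print(fused)
print(markdown)
```

Treating titles and paragraphs as separate blocks is what keeps "Background" from running into the first sentence.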

sckott commented 4 years ago

Thanks for the comment, @gwern. Can you please open a new issue with an example or two of what you are talking about? Thanks!