`parse_pubmed_xml` erases sections/header labels in PMC abstracts

When extracting abstracts from PubMed, the section headers/labels are erased entirely: they are neither inlined in the text nor present anywhere in the resulting object. This makes structured abstracts substantially harder to read. The labels should be preserved somehow (perhaps inlined as <h3>$Label</h3>, or in a separate object field which can be combined with the abstract text fields to reconstruct the original).
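To make the "separate object field" option concrete, here is a minimal sketch of what such output could look like. It assumes the XML package is installed; `abstract_sections` is a hypothetical helper name, not an existing rentrez function, and the inline XML is a cut-down stand-in for a real PubMed record.

```r
library(XML)

# Hypothetical helper (NOT part of rentrez): one row per <AbstractText>,
# keeping the Label and NlmCategory attributes next to the section text.
abstract_sections <- function(xml_text) {
  doc   <- xmlParse(xml_text, asText = TRUE)
  nodes <- getNodeSet(doc, "//Abstract/AbstractText")
  get_attr <- function(node, name) xmlGetAttr(node, name, default = NA_character_)
  data.frame(
    label    = vapply(nodes, get_attr, character(1), name = "Label"),
    category = vapply(nodes, get_attr, character(1), name = "NlmCategory"),
    text     = vapply(nodes, xmlValue, character(1)),
    stringsAsFactors = FALSE
  )
}

# Cut-down stand-in for a PubMed record with a structured abstract.
example_xml <- '<PubmedArticleSet><PubmedArticle><MedlineCitation><Article><Abstract>
<AbstractText Label="DESIGN" NlmCategory="METHODS">Placebo-controlled, double-blind, randomized crossover study.</AbstractText>
<AbstractText Label="SETTING" NlmCategory="METHODS">Sleep laboratory in temporal isolation unit.</AbstractText>
</Abstract></Article></MedlineCitation></PubmedArticle></PubmedArticleSet>'

sections <- abstract_sections(example_xml)
print(sections[, c("label", "category")])
```

Unlabelled abstracts would simply get `NA` in the `label` column, so callers could fall back to the plain text without special-casing.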

An example, using a modafinil paper: the PMC abstract is fully sectionized, with section labels in <h3> on the website, and the semantics are present in the raw XML as elements like <AbstractText Label="SETTING" NlmCategory="METHODS"> (where the Label is what appears as "Setting"). But the rentrez object returned by parse_pubmed_xml is merely a character vector of strings, with the labels stripped away. Inspecting the object, I can't find them anywhere in it, and the rentrez XML-parsing code looks like it just drops them (it only queries for //AbstractText or whatever):

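library(fulltext)
library(rentrez)
library(pubchunks)

pmcidSearch <- "PMC2910532"
# Look up the PubMed record for this PMC ID, then fetch its raw XML.
paper    <- entrez_search(db="pubmed", term=pmcidSearch)
rawXML   <- entrez_fetch(db="pubmed", id=paper$id, rettype="xml")
# Parse with rentrez; the abstract comes back as a character vector.
fulltext <- parse_pubmed_xml(rawXML)
abstract <- fulltext$abstract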

abstract
# [1] "Modafinil may promote wakefulness by increasing cerebral dopaminergic neurotransmission, which importantly depends on activity of catechol-O-methyltransferase (COMT) in prefrontal cortex. The effects of modafinil on sleep homeostasis in humans are unknown. Employing a novel sleep-pharmacogenetic approach, we investigated the interaction of modafinil with sleep deprivation to study dopaminergic mechanisms of sleep homeostasis."
# [2] "Placebo-controlled, double-blind, randomized crossover study."
# [3] "Sleep laboratory in temporal isolation unit."
# [4] "22 healthy young men (23.4 +/- 0.5 years) prospectively enrolled based on genotype of the functional Val158Met polymorphism of COMT(10 Val/Val and 12 Met/Met homozygotes)."
# [5] "2 x 100 mg modafinil and placebo administered at 11 and 23 hours during 40 hours prolonged wakefulness."
# [6] "Subjective sleepiness and EEG markers of sleep homeostasis in wakefulness and sleep were equally affected by sleep deprivation in Val/Val and Met/Met allele carriers (placebo condition). Modafinil attenuated the evolution of sleepiness and EEG 5-8 Hz activity during sleep deprivation in both genotypes. In contrast to caffeine, modafinil did not reduce EEG slow wave activity (0.75-4.5 Hz) in recovery sleep, yet specifically increased 3.0-6.75 Hz and > 16.75 Hz activity in NREM sleep in the Val/Val genotype of COMT."
# [7] "The Val158Met polymorphism of COMT modulates the effects of modafinil on the NREM sleep EEG in recovery sleep after prolonged wakefulness. The sleep EEG changes induced by modafinil markedly differ from those of caffeine, showing that pharmacological interference with dopaminergic and adenosinergic neurotransmission during sleep deprivation differently affects sleep homeostasis."
str(abstract)
# chr [1:7] "Modafinil may promote wakefulness by increasing cerebral dopaminergic neurotransmission, which importantly depe"| __truncated__ ...
rawXML
# [1] "<?xml version=\"1.0\" ?>\n<!DOCTYPE PubmedArticleSet PUBLIC \"-//NLM//DTD PubMedArticle, 1st January 2019//EN\" \"https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd\">\n<PubmedArticleSet>\n<PubmedArticle>\n    <MedlineCitation Status=\"MEDLINE\" Owner=\"NLM\">\n        <PMID Version=\"1\">20815183</PMID>\n        <DateCompleted>\n            <Year>2010</Year>\n            <Month>09</Month>\n            <Day>23</Day>\n        </DateCompleted>\n        <DateRevised>\n            <Year>2019</Year>\n            <Month>05</Month>\n            <Day>13</Day>\n        </DateRevised>\n        <Article PubModel=\"Print\">\n            <Journal>\n                <ISSN IssnType=\"Print\">0161-8105</ISSN>\n                <JournalIssue CitedMedium=\"Print\">\n                    <Volume>33</Volume>\n                    <Issue>8</Issue>\n                    <PubDate>\n                        <Year>2010</Year>\n                        <Month>Aug</Month>\n                    </PubDate>\n                </JournalIssue>\n                <Title>Sleep</Title>\n                <ISOAbbreviation>Sleep</ISOAbbreviation>\n            </Journal>\n            <ArticleTitle>Effects of modafinil on the sleep EEG depend on Val158Met genotype of COMT.</ArticleTitle>\n            <Pagination>\n                <MedlinePgn>1027-35</MedlinePgn>\n            </Pagination>\n            
# <Abstract>\n                
# <AbstractText Label=\"STUDY OBJECTIVES\" NlmCategory=\"OBJECTIVE\">Modafinil may promote wakefulness by increasing cerebral dopaminergic neurotransmission, which importantly depends on activity of catechol-O-methyltransferase (COMT) in prefrontal cortex. The effects of modafinil on sleep homeostasis in humans are unknown. Employing a novel sleep-pharmacogenetic approach, we investigated the interaction of modafinil with sleep deprivation to study dopaminergic mechanisms of sleep homeostasis.</AbstractText>\n                
# <AbstractText Label=\"DESIGN\" NlmCategory=\"METHODS\">Placebo-controlled, double-blind, randomized crossover study.</AbstractText>\n                <AbstractText Label=\"SETTING\" NlmCategory=\"METHODS\">Sleep laboratory in temporal isolation unit.</AbstractText>\n                
# <AbstractText Label=\"PARTICIPANTS\" NlmCategory=\"METHODS\">22 healthy young men (23.4 +/- 0.5 years) prospectively enrolled based on genotype of the functional Val158Met polymorphism of COMT(10 Val/Val and 12 Met/Met homozygotes).</AbstractText>\n                
# <AbstractText Label=\"INTERVENTIONS\" NlmCategory=\"METHODS\">2 x 100 mg modafinil and placebo administered at 11 and 23 hours during 40 hours prolonged wakefulness.</AbstractText>\n                <AbstractText Label=\"MEASUREMENTS AND RESULTS\" NlmCategory=\"RESULTS\">Subjective sleepiness and EEG markers of sleep homeostasis in wakefulness and sleep were equally affected by sleep deprivation in Val/Val and Met/Met allele carriers (placebo condition). Modafinil attenuated the evolution of sleepiness and EEG 5-8 Hz activity during sleep deprivation in both genotypes. In contrast to caffeine, modafinil did not reduce EEG slow wave activity (0.75-4.5 Hz) in recovery sleep, yet specifically increased 3.0-6.75 Hz and &gt; 16.75 Hz activity in NREM sleep in the Val/Val genotype of COMT.</AbstractText>\n                <AbstractText Label=\"CONCLUSIONS\" NlmCategory=\"CONCLUSIONS\">The Val158Met polymorphism of COMT modulates the effects of modafinil on the NREM sleep EEG in recovery sleep after prolonged wakefulness. The sleep EEG changes induced by modafinil markedly differ from those of caffeine, showing that pharmacological interference with dopaminergic and adenosinergic neurotransmission during sleep deprivation differently affects sleep homeostasis.</AbstractText>\n            </Abstract>\n ...

That is, the abstract elements 1:7 are missing their corresponding c("Study Objectives", "Design", "Setting", "Participants", "Interventions", "Measurements and Results", "Conclusions") labels. Pasted together without them, the result is one solid block:

Modafinil may promote wakefulness by increasing cerebral dopaminergic neurotransmission, which importantly depends on activity of catechol-O-methyltransferase (COMT) in prefrontal cortex. The effects of modafinil on sleep homeostasis in humans are unknown. Employing a novel sleep-pharmacogenetic approach, we investigated the interaction of modafinil with sleep deprivation to study dopaminergic mechanisms of sleep homeostasis. Placebo-controlled, double-blind, randomized crossover study. Sleep laboratory in temporal isolation unit. 22 healthy young men (23.4 +/- 0.5 years) prospectively enrolled based on genotype of the functional Val158Met polymorphism of COMT(10 Val/Val and 12 Met/Met homozygotes). 2 x 100 mg modafinil and placebo administered at 11 and 23 hours during 40 hours prolonged wakefulness. Subjective sleepiness and EEG markers of sleep homeostasis in wakefulness and sleep were equally affected by sleep deprivation in Val/Val and Met/Met allele carriers (placebo condition). Modafinil attenuated the evolution of sleepiness and EEG 5-8 Hz activity during sleep deprivation in both genotypes. In contrast to caffeine, modafinil did not reduce EEG slow wave activity (0.75-4.5 Hz) in recovery sleep, yet specifically increased 3.0-6.75 Hz and > 16.75 Hz activity in NREM sleep in the Val/Val genotype of COMT. The Val158Met polymorphism of COMT modulates the effects of modafinil on the NREM sleep EEG in recovery sleep after prolonged wakefulness. The sleep EEG changes induced by modafinil markedly differ from those of caffeine, showing that pharmacological interference with dopaminergic and adenosinergic neurotransmission during sleep deprivation differently affects sleep homeostasis.

Not good.

I didn't find anything in the docs or on Google about rentrez having an alternative way to parse the XML that I am supposed to be using here; it seems to be parse_pubmed_xml or bust.

Thanks for the XML code. That works well for me, as I can combine it with the text nicely without too much formatting code:

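    ...
    library(XML)    # xmlGetAttr(), xmlValue()
    library(tools)  # toTitleCase()
    # Fetch as a parsed XML document rather than a raw string.
    parsed_XML <- entrez_fetch(db="pubmed", id=paper$id, rettype="xml", parsed=TRUE)
    # One Label attribute per <AbstractText> section.
    labels <- sapply(parsed_XML["//Abstract/AbstractText"], xmlGetAttr, "Label")
    # "STUDY OBJECTIVES" -> "<strong>Study Objectives</strong>"
    labelsFormatted <- sapply(tolower(labels),
      function(s) paste0("<strong>", toTitleCase(s), "</strong>"))
    text <- sapply(parsed_XML["//Abstract/AbstractText"], xmlValue)
    # "Label: text" per section, sections separated by blank lines.
    combined <- paste0(labelsFormatted, ": ", text, collapse="\n\n")
    ...
    # Only use the labelled version when the abstract actually has sections.
    abstract <- if (length(labels) > 1) combined else fulltext$abstract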

which gives the results I need, such as "<p><strong>Background</strong>: Cannabis from hemp (Cannabis sativa and C. indica) is one of the most common illegal drugs used by drug abusers. Indian cannabis contains around 70 alkaloids, and delta-9-tetrahydrocannabinol (delta-9-THC) is the most psychoactive substance. Animal intoxications occur rarely and are mostly accidental. According to the US Animal Poison Control Center, cannabis intoxication mostly affects dogs (96%). The most common cause of such intoxication is unintentional ingestion of a cannabis product, but it may also occur after the exposure to marijuana smoke.</p> <p><strong>Case Presentation</strong>: A 6-year-old Persian cat was brought to the veterinary clinic due to strong psychomotor agitation turning into aggression. During hospitalisation for 14 days, the cat behaved normally and had no further attacks of unwanted behaviour. It was returned to its home but shortly after it developed neurological signs again and was re-hospitalised. On presentation, the patient showed no neurological abnormalities except for symmetric mydriasis and scleral congestion. During the examination, the behaviour of the cat changed dramatically. It developed alternate states of agitation and apathy, each lasting several minutes. On interview it turned out that the cat had been exposed to marijuana smoke. Blood toxicology tests by gas chromatography tandem mass spectrometry revealed the presence of delta-9-tetrahydrocannabinol (THC) at 5.5 ng/mL, 11-hydroxy-delta-9-THC at 1.2 ng/mL, and 11-carboxy-delta-9-THC at 13.8 ng/mL. The cat was given an isotonic solution of NaCl 2.5 and 2.5% glucose at a dose of 40 mL/kg/day parenterally and was hospitalised. After complete recovery, the cat was returned to it’s owner and future isolation of the animal from marijuana smoke was advised.</p> <p><strong>Conclusions</strong>: This is the first case of a delta-9-tetrahydrocannabinol intoxication in a cat with both description of the clinical findings and the blood concentration of delta-9-THC and its main metabolites.</p>" etc.
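The <p> wrapping in that output comes from a later step that is elided in the snippet. For anyone wanting something similar, here is a minimal sketch of one way to get from labels plus text to that shape. `sections_to_html` is a hypothetical helper of my own naming, the sample strings are placeholders, and the text is not HTML-escaped, which a real pipeline would probably want.

```r
library(tools)  # toTitleCase()

# Hypothetical helper: wrap each labelled section in <p>, label in <strong>.
# Assumption: `text` is already safe to embed in HTML (no escaping is done here).
sections_to_html <- function(labels, text) {
  stopifnot(length(labels) == length(text))
  # "BACKGROUND" -> "Background"
  pretty <- vapply(tolower(labels), toTitleCase, character(1), USE.NAMES = FALSE)
  paste0("<p><strong>", pretty, "</strong>: ", text, "</p>", collapse = " ")
}

html <- sections_to_html(
  c("BACKGROUND", "CONCLUSIONS"),
  c("First section text.", "Last section text.")
)
cat(html, "\n")
```

Keeping this as a separate step means the label/text extraction stays format-agnostic, and the same pairs could be rendered as <h3> headings instead.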

You may not have noticed the section monkey business if you don't work with abstracts much, or don't check them against the PMC version, but structured abstracts seem to be reasonably common. (I had vaguely noticed the issue before but had put off thinking about it until a reader complained about how unreadable solid blocks of text were for some PMC links.) Watching my rebuild, there were at least 89 PMC links on gwern.net affected by the section omission, so those link annotations will be much more readable now.

ropensci / rentrez

`parse_pubmed_xml` erases sections/header labels in PMC abstracts #170