Open gwern opened 3 years ago
Hi @gwern,
parse_pubmed_xml is the only specialised parser (for anything at all, let alone PubMed XML) in rentrez. I haven't run into these sectioned abstracts before, and because the section names are only included in the Label attribute, they will indeed be dropped.
I can look into whether parse_pubmed_xml can handle these better while keeping others working OK. In the meantime, I would parse the XML directly. You could get the label and the text separately, for instance.
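library(rentrez)
library(XML)
# Fetch the record as a parsed XML document instead of going through parse_pubmed_xml
parsed_XML <- entrez_fetch(db="pubmed", id=paper$id, rettype="xml", parsed=TRUE)
# One entry per <AbstractText> node: its Label attribute, then its text content
labels <- sapply(parsed_XML["//Abstract/AbstractText"], xmlGetAttr, "Label")
text <- sapply(parsed_XML["//Abstract/AbstractText"], xmlValue)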
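The same idea can be tried offline on a small inline record. The following is only a sketch: abstract_sections is a hypothetical helper (not part of rentrez), the sample XML is invented for illustration, and it assumes the XML package is installed. It returns labels and text side by side, using NA for any section that has no Label attribute.

```r
library(XML)

# Hypothetical helper, not part of rentrez: one row per <AbstractText> node.
abstract_sections <- function(doc) {
  nodes <- getNodeSet(doc, "//Abstract/AbstractText")
  data.frame(
    # default = NA_character_ records unlabelled sections as NA instead of NULL
    label = vapply(nodes, xmlGetAttr, character(1),
                   name = "Label", default = NA_character_),
    text = vapply(nodes, xmlValue, character(1)),
    stringsAsFactors = FALSE
  )
}

# Invented two-section record, used only so the sketch runs without network access.
sample_xml <- '<PubmedArticle><Abstract>
<AbstractText Label="SETTING" NlmCategory="METHODS">A sleep laboratory.</AbstractText>
<AbstractText>An unlabelled closing paragraph.</AbstractText>
</Abstract></PubmedArticle>'

sections <- abstract_sections(xmlParse(sample_xml, asText = TRUE))
print(sections)
```

With rentrez, the document returned by entrez_fetch(..., rettype="xml", parsed=TRUE) could presumably be passed to the helper in place of the sample. Note that the XPath matches every article in the document, so this sketch assumes a single record per fetch.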
Thanks for the XML code. That works well for me, as I can combine it with the text nicely without too much formatting code:
...
library(XML)
library(tools)
parsed_XML <- entrez_fetch(db="pubmed", id=paper$id, rettype="xml", parsed=TRUE)
labels <- sapply(parsed_XML["//Abstract/AbstractText"], xmlGetAttr, "Label")
labelsFormatted <- sapply(tolower(labels),
function(s) { paste0("<strong>", toTitleCase(s), "</strong>"); })
text <- sapply(parsed_XML["//Abstract/AbstractText"], xmlValue)
combined <- paste0(paste0(labelsFormatted, rep(": ", length(labelsFormatted)), text), collapse="\n\n")
...
abstract <- { if (length(labels) > 1) { combined; } else { fulltext$abstract; } }
which gives my necessary results like "<p><strong>Background</strong>: Cannabis from hemp (Cannabis sativa and C. indica) is one of the most common illegal drugs used by drug abusers. Indian cannabis contains around 70 alkaloids, and delta-9-tetrahydrocannabinol (delta-9-THC) is the most psychoactive substance. Animal intoxications occur rarely and are mostly accidental. According to the US Animal Poison Control Center, cannabis intoxication mostly affects dogs (96%). The most common cause of such intoxication is unintentional ingestion of a cannabis product, but it may also occur after the exposure to marijuana smoke.</p> <p><strong>Case Presentation</strong>: A 6-year-old Persian cat was brought to the veterinary clinic due to strong psychomotor agitation turning into aggression. During hospitalisation for 14 days, the cat behaved normally and had no further attacks of unwanted behaviour. It was returned to its home but shortly after it developed neurological signs again and was re-hospitalised. On presentation, the patient showed no neurological abnormalities except for symmetric mydriasis and scleral congestion. During the examination, the behaviour of the cat changed dramatically. It developed alternate states of agitation and apathy, each lasting several minutes. On interview it turned out that the cat had been exposed to marijuana smoke. Blood toxicology tests by gas chromatography tandem mass spectrometry revealed the presence of delta-9-tetrahydrocannabinol (THC) at 5.5 ng/mL, 11-hydroxy-delta-9-THC at 1.2 ng/mL, and 11-carboxy-delta-9-THC at 13.8 ng/mL. The cat was given an isotonic solution of NaCl 2.5 and 2.5% glucose at a dose of 40 mL/kg/day parenterally and was hospitalised. After complete recovery, the cat was returned to it’s owner and future isolation of the animal from marijuana smoke was advised.</p> <p><strong>Conclusions</strong>: This is the first case of a delta-9-tetrahydrocannabinol intoxication in a cat with both description of the clinical findings and the blood concentration of delta-9-THC and its main metabolites.</p>"
etc.
You may not have noticed the section monkey business if you don't work with abstracts much, or check them against the PMC version, but they seem to be reasonably common. (I had vaguely noticed the issue before but had put off thinking about it until a reader complained about how unreadable solid blocks of text were for some PMC links.) Watching my rebuild, there were at least 89 PMC links on gwern.net affected by the section omission, so those link annotations will be much more readable now.
When extracting abstracts from Pubmed, the section headers/labels are erased entirely and not present anywhere in the resulting objects or inlined. This makes abstracts substantially harder to read. They should be incorporated somehow (perhaps inlined as <h3>$Label</h3> or a separate object field which can be combined with the abstract text fields to reconstruct the original).
An example of this using a modafinil paper - the PMC abstract is fully sectionized, with section labels in <h3> on the website, and the semantics are present in the raw XML as elements like <AbstractText Label="SETTING" NlmCategory="METHODS"> (where the Label is what appears as "Setting"), but the rentrez object after parse_pubmed_xml is merely a list of strings, with the labels stripped away. Inspecting the object, I can't find them anywhere in it, and the rentrez XML code looks like it's just dropped (only querying for \\AbstractText or whatever):
That is, the abstract elements 1:7 are missing their corresponding c("Study Objectives", "Design", "Setting", "Participants", "Interventions", "Measurements and Results", "Conclusions") labels. The result then looks like:
Not good.
I didn't find anything in the docs or Google about rentrez having alternative ways to parse the PMC XML I am supposed to be using here; it seems to be parse_pubmed_xml or bust.