jakelever / biotext

Get a nicely-chunked local copy of the biomedical literature (to use for other projects)!
MIT License

Proposed Parsing Updates #16

Open creisle opened 1 year ago

creisle commented 1 year ago

Since I've been going through these in such detail, I've noticed a few cases where the output doesn't look like what I would expect. I want to clear them with you, @jakelever, before I make the changes, so I've listed them in the table below.

| Input XML | Proposed Output | Current Output |
| --- | --- | --- |
| `incubator containing 5% CO<sub>2</sub>` | `incubator containing 5% CO2` | `incubator containing 5% CO 2` |
| `10<sup>4</sup>` | `10^4` | `10 4` |
| `especially in <italic>CBL</italic>-W802* cells` | `especially in CBL-W802* cells` | `especially in CBL -W802* cells` |
| `influenced by the presence of allelic variants&#x2014;GSTP1 Ile<sub>105</sub>Val (rs1695) and <italic>GSTP1</italic> Ala<sub>114</sub>Val (rs1138272), with homozygote` | `influenced by the presence of allelic variants--GSTP1 Ile105Val (rs1695) and GSTP1 Ala114Val (rs1138272), with homozygote` | `influenced by the presence of allelic variants—GSTP1 Ile 105 Val (rs1695) and GSTP1 Ala 114 Val (rs1138272), with homozygote` |
| `breast cancer, clear cell renal carcinoma, and colon cancer<xref ref-type="bibr" rid="b6">6</xref><xref ref-type="bibr" rid="b7">7</xref> <xref ref-type="bibr" rid="b8">8</xref> <xref ref-type="bibr" rid="b9">9</xref> <xref ref-type="bibr" rid="b10">10</xref> have successfully identified` | `breast cancer, clear cell renal carcinoma, and colon cancer have successfully identified` | `breast cancer, clear cell renal carcinoma, and colon cancerhave successfully identified` |
| `, and in the transgenic\nGATA-1,\n<sup>low</sup> mouse` | `, and in the transgenic GATA-1, low mouse` | `, and in the transgenicGATA-1, low mouse` |
| `we selected an allele (designated <italic>cic</italic><sup><italic>4</italic></sup>) that removes` | `we selected an allele (designated cic^4) that removes` | `we selected an allele (designated cic 4) that removes` |
| `regulation of the Wnt-&#x3B2;-catenin pathway` | `regulation of the Wnt-beta-catenin pathway` | `regulation of the Wnt-β-catenin pathway` |
| `the specific HPV<sup>+</sup> gene expression` | `the specific HPV+ gene expression` | `the specific HPV + gene expression` |
| `known to be resistant to 1<sup>st</sup> and 2<sup>nd</sup> generation EGFR-TKIS, osimertinib` | `known to be resistant to 1st and 2nd generation EGFR-TKIS, osimertinib` | `known to be resistant to 1 st and 2 nd generation EGFR-TKIS, osimertinib` |
| `at 37&#xB0;C in a humidified 5% CO<sub>2</sub> incubator` | `at 37 deg C in a humidified 5% CO2 incubator` | `at 37°C in a humidified 5% CO 2 incubator` |
| `seeded at concentrations below 1 &#xD7; 10<sup>6</sup>/ml, selected` | `seeded at concentrations below 1 x 10^6/ml, selected` | `seeded at concentrations below 1 × 10 6 /ml, selected` |
| `9 patients with a <italic>BRAF</italic>-mutant tumour` | `9 patients with a BRAF-mutant tumour` | `9 patients with a BRAF -mutant tumour` |
| `patients with <italic>BRAF</italic><sup>WT</sup> tumours` | `patients with BRAF-WT tumours` | `patients with BRAF WT tumours` |
| `MSI<sup>hi</sup> tumours` | `MSI-hi tumours` | `MSI hi tumours` |
| `upper limit of normal, creatinine clearance &#x2A7E;30&#x2009;ml&#x2009;min<sup>&#x2212;1</sup>,` | `upper limit of normal, creatinine clearance ⩾30 ml min^-1,` | `upper limit of normal, creatinine clearance ⩾30 ml min −1,` |
| `the oncometabolite R(&#x2013;)-2-hydroxyglutarate at the` | `the oncometabolite R(-)-2-hydroxyglutarate at the` | `the oncometabolite R-2-hydroxyglutarate at the` |
| `[<sup>3</sup>H]-Thymidine` | `[3H]-Thymidine` | `[ 3 H]-Thymidine` |

jakelever commented 1 year ago

These all look good to me
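
For reference, a minimal Python sketch of the unambiguous rows in the proposal: `<italic>` and `<sub>` are unwrapped with no padding, `<xref>` citation markers are dropped, a few characters are replaced, and whitespace is collapsed. The names `xml_to_text`, `flatten`, `CHAR_REPLACEMENTS` and `EXAMPLES` are hypothetical and are not taken from the biotext code. Superscripts are deliberately left out, since the proposed outputs for them (`^`, `-`, or direct concatenation) depend on context.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical replacement table, limited to characters from the proposal.
CHAR_REPLACEMENTS = {
    "\u2014": "--",      # em dash
    "\u2013": "-",       # en dash
    "\u00d7": "x",       # multiplication sign
    "\u00b0": " deg ",   # degree sign
    "\u03b2": "beta",    # Greek small letter beta
}

UNWRAPPED_TAGS = {"p", "italic", "sub"}  # keep inner text, add no padding
DROPPED_TAGS = {"xref"}                  # drop inner text, keep the tail


def flatten(elem):
    """Return the text of elem plus its tail, applying the tag rules above."""
    if elem.tag in DROPPED_TAGS:
        inner = ""
    elif elem.tag in UNWRAPPED_TAGS:
        inner = (elem.text or "") + "".join(flatten(child) for child in elem)
    else:
        # Assumption: fail loudly instead of guessing, e.g. for <sup>.
        raise ValueError(f"no rule for <{elem.tag}>")
    return inner + (elem.tail or "")


def xml_to_text(fragment):
    """Convert an inline XML fragment to plain text (sketch only)."""
    root = ET.fromstring(f"<p>{fragment}</p>")
    text = flatten(root)
    for char, replacement in CHAR_REPLACEMENTS.items():
        text = text.replace(char, replacement)
    # Collapse newlines and runs of spaces left behind by dropped tags.
    return re.sub(r"\s+", " ", text).strip()


# (input XML, proposed output) pairs copied from the proposal.
EXAMPLES = [
    ("incubator containing 5% CO<sub>2</sub>",
     "incubator containing 5% CO2"),
    ("especially in <italic>CBL</italic>-W802* cells",
     "especially in CBL-W802* cells"),
    ('colon cancer<xref ref-type="bibr" rid="b6">6</xref>'
     '<xref ref-type="bibr" rid="b7">7</xref> have successfully identified',
     "colon cancer have successfully identified"),
    ("regulation of the Wnt-&#x3B2;-catenin pathway",
     "regulation of the Wnt-beta-catenin pathway"),
    ("at 37&#xB0;C in a humidified 5% CO<sub>2</sub> incubator",
     "at 37 deg C in a humidified 5% CO2 incubator"),
    ("the oncometabolite R(&#x2013;)-2-hydroxyglutarate at the",
     "the oncometabolite R(-)-2-hydroxyglutarate at the"),
]
```

Looping over `EXAMPLES` and comparing `xml_to_text(source)` with the expected string would be one way to turn these rows into regression tests.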

creisle commented 1 year ago

Another weird case I am not sure what to do with:

<sec><title>Title of a thing</title><p>paragraph content</p></sec>

becomes

<passage>Title of a thing</passage><passage>paragraph content</passage>

which is less of an issue for bioc itself, but when the passages get concatenated together it comes out as

Title of a thingparagraph content

We need whitespace between the two. Should we be adding a trailing single space or new line to the first passage when we parse the XML?
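
To make the question concrete, here is a small Python sketch of one option: keep the passage texts as they are and add the separator when they are joined, recording where each passage starts in the combined string. The function name `join_passages` and the default newline separator are assumptions for illustration only, not existing biotext behaviour.

```python
def join_passages(passages, separator="\n"):
    """Concatenate passage texts with a separator between them.

    Returns the joined text and the start offset of each passage in it.
    """
    chunks = []
    offsets = []
    position = 0
    for index, text in enumerate(passages):
        if index > 0:
            # The separator sits between passages, never after the last one.
            chunks.append(separator)
            position += len(separator)
        offsets.append(position)
        chunks.append(text)
        position += len(text)
    return "".join(chunks), offsets


text, offsets = join_passages(["Title of a thing", "paragraph content"])
```

With the two passages from the example, `text` has a newline between the title and the paragraph, and `offsets` shows the second passage shifted by one character. Appending the whitespace to the first passage at parse time would give the same joined text but would change the stored passage text itself.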

creisle commented 1 year ago

Another weird special case on the superscripts to add to the tests

Compared with <italic>KRAS</italic> wild type and empty vector controls, <italic>KRAS</italic> <sup>10</sup>G<sup>11</sup> and <sup>11</sup>GA<sup>12</sup> significantly enhanced in vivo tumor growth

should be

Compared with KRAS wild type and empty vector controls, KRAS 10G11 and 11GA12 significantly enhanced in vivo tumor growth

creisle commented 1 year ago

| Input XML | Proposed Output | Current Output |
| --- | --- | --- |
| `The 2-year invasive disease-free survival rate was 93·9%` | `The 2-year invasive disease-free survival rate was 93.9%` | `The 2-year invasive disease-free survival rate was 93*9%` |
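
A possible way to handle this case, sketched in Python under the assumption that the middle dot (U+00B7) should become a decimal point only when it sits between two digits. The helper name `normalize_decimal_dot` is hypothetical.

```python
import re

MIDDLE_DOT = "\u00b7"

# Match a middle dot only when a digit is on both sides of it.
_DECIMAL_DOT = re.compile(rf"(?<=\d){MIDDLE_DOT}(?=\d)")


def normalize_decimal_dot(text):
    """Replace a middle dot used as a decimal separator with a period."""
    return _DECIMAL_DOT.sub(".", text)
```

The digit-on-both-sides check is only a heuristic: a hydrate formula such as CuSO4·5H2O would also match, so a stricter rule may be needed if such text appears in the corpus.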