Open petermr opened 5 years ago
Determine the start of the footer and transfer it and all subsequent rows to tfoot
I have created a first pass at this. The footer for the compound column looks like:
<column name="compound" case="insensitive" id="comp.col.comp">
<title id="comp.col.comp.tit">
<query id="comp.col.comp.tit.q">
constituent OR
compound OR
component
NOT class
</query>
</title>
<cell id="comp.col.comp.cell">
<query id="comp.col.comp.cell.q1">@CHEMICAL@</query>
<!-- <query id="comp.col.comp.cell.q2" mode="lookup">@COMPOUND_DICT@</query> -->
</cell>
<footer>
<query>total OR yield OR terpene</query>
</footer>
</column>
This splits the table at the point BEFORE the first match. Typical results are:
AMITableTool cTree: PMC4391421
table: Table 1Chemical composition of thyme EO
column: compound => Constituents*; 64.7
column: percentage => Area % of total; 100.0
AMITableTool cTree: PMC5080681
table: Table 1Chemical composition, concentrations (%) and calculated retention indices, ofT. boveiessential oil as characterized by GC/MS analysis
column: compound => Constituents; 97.1
215 [main] DEBUG org.contentmine.ami.tools.AMITableTool - SPLIT footer 27
215 [main] DEBUG org.contentmine.cproject.util.RectTabColumn - SPLIT
215 [main] DEBUG org.contentmine.ami.tools.AMITableTool - SPLIT [[Trans-geraniol (Lemonol), α-citral (Trans-citral), β-citral (Cis-citral), Cis-geraniol (nerol), 3-octanol, DL-camphor, Eucalyptol(1,8) cineole, 3-octanone, Thymol, β-linalool, β-farnesene, Geranylisobutyrate, L-borneol, Isocaryophyllene, Camphene, Bergamiol, Dihydrocarveol acetate, α-cyclocitral, β-ocimene, Geranyl propionate, β-myrcene, α-terpineol, α-limonene, Nerolidol, α-terpinene, α-phellandrene, β-pinene], [Total, Yield (w/w) %, Number of constituents, Hydrocarbon monoterpenoid, Oxygenated monoterpenoid, Sesquiterpenoid hydrocarbon, Oxygenated sesquiterpenoid, Others]]
column: percentage => %; 100.0
AMITableTool cTree: PMC5132230
AMITableTool cTree: PMC5203915
table: Table 1Percentage of composition of essential oils fromRhaponticum carthamoidesroots of soil-grown plants (SGR) and hairy roots (HR).
column: compound => Constituent; 92.6
271 [main] DEBUG org.contentmine.ami.tools.AMITableTool - SPLIT footer 62
271 [main] DEBUG org.contentmine.cproject.util.RectTabColumn - SPLIT
271 [main] DEBUG org.contentmine.ami.tools.AMITableTool - SPLIT [[α-Pinene, Oct-1-en-3-ol, 2-Pentylfuran, α-Phellandrene, p-Cymene, β-Phellandrene, Limonene, (E)-Oct-2-enal, p-Cymenene, (E)-Non-2-enal, p-Cymen-9-ol, Thymol, Carvacrol, (E,E)-Deca-2,4-dienal, Cyprotene, 13-Norcypera-1(5),11(12)-diene, α-Longipinene, Cyperadiene, Cyclosativene, α-Copaene, α-Funebrene, Petasitene, β-Elemene, Thymol methyl ether, Cyperene, Dehydroisolongifolene, α-Cedrene, β-Caryophyllene, trans-α-Bergamotene, Sesquisabinene A, β-Helmiscapene, α-Helmiscapene, (Z)-β-Farnesene, α-Humulene, β-Santalene, Selina-3,7-diene, Rotundene, α-Acoradiene, γ-Gurjunene, Selina-4,11-diene, Dauca-4(11),8-diene, Nardosina-1(10),11-diene, β-Selinene, Pentadec-1-ene, α-Muurolene, Isorotundene, β-Bisabolene, (Z)-γ-Bisabolene, Premnaspirodiene, δ-Cadinene, Cyperene oxide, α-Calacorene, (E)-Nerolidol, β-Caryophyllene oxide, α-Corocalene, Longifolene aldehyde, 2,5,8-Trimethyl-1-naphthol, β-Himachalol, Cadalene, Aplotaxene, Cyperotundone, Palmitic acid], [Total identified, , , , , ]]
column: percentage => SGR [%]; 100.0
column: percentage => HR [%]; 92.6
AMITableTool cTree: PMC5237462
table: Table 1Major constituents of the essential oils ofM. piperita.
column: compound => Components; 82.4
300 [main] DEBUG org.contentmine.ami.tools.AMITableTool - SPLIT footer 12
300 [main] DEBUG org.contentmine.cproject.util.RectTabColumn - SPLIT
301 [main] DEBUG org.contentmine.ami.tools.AMITableTool - SPLIT [[Thuja-2,4(10)-diene, Verbenene, β-Pinene, Mentha-2,8-diene, β-Ocimene, Linalool, Epizonarene, Epoxyocimene, Sesquiphellandrene, Cadinene, Germacrene B, null], [Monoterpene hydrocarbons, Oxygenated monoterpenes, Sesquiterpene hydrocarbons, null, ]]
column: percentage => Peak Area (%); 100.0
This works well when all non-chemical names are at the bottom. It's enough for us to extract enough "good" compounds to see if we have to increase the dictionary.
(One positive aspect is that the names are presumably contained in the mass spec lookup tables, so they are probably "reasonably well" standardised.)
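A minimal sketch of the split described above, not the AMITableTool implementation. It assumes the footer query `total OR yield OR terpene` behaves as a case-insensitive substring match on the compound column; the function name `split_at_footer` is hypothetical. The cell names are taken from the PMC5080681 log output.

```python
import re

# Assumption: "total OR yield OR terpene" is a case-insensitive substring match.
FOOTER_PATTERN = re.compile(r"total|yield|terpene", re.IGNORECASE)


def split_at_footer(cells, pattern=FOOTER_PATTERN):
    """Split a column just BEFORE the first cell that matches the footer query.

    The matching cell and every later cell go to the footer, whether or not
    the later cells match themselves.
    """
    for index, cell in enumerate(cells):
        if cell is not None and pattern.search(cell):
            return cells[:index], cells[index:]
    return cells, []  # no match: the whole column is body


cells = ["Thymol", "β-pinene", "α-terpinene", "Total", "Yield (w/w) %", "Others"]
body, footer = split_at_footer(cells)
print(len(body), len(footer))  # 3 3
```

"Others" does not match the query but still lands in the footer because it follows "Total". The same logic means that any body cell matching the query would trigger the split too early, which is why this only works when the non-chemical names are all at the bottom.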
for each true composition table there is:
These may be easier to analyse in the browser. The extracted body is cyan and footer is yellow.
The original table has an implied body of 27 terpenes and an implied footer of 8 summary rows starting at Total. Check that the number of rows in each is identical and record any discrepancies:
create new columns
original should contain:
BODY 27
FOOTER 8
and extracted should be identical.
If the extracted disagrees, indicate this with an asterisk in extracted, e.g.
BODY 26 *
If the body or footer is missing, write BODY 0 and/or FOOTER 0.
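The counting convention above could be generated mechanically rather than typed by hand. This is only a sketch: the function name `report_counts` and the tuple layout are assumptions, not part of the existing sheet or tooling.

```python
def report_counts(original, extracted):
    """Build the 'original' and 'extracted' cell text for one table.

    original and extracted are (body_rows, footer_rows) tuples. A missing
    body or footer is passed as 0. Any extracted count that differs from
    the original is flagged with an asterisk.
    """
    labels = ("BODY", "FOOTER")
    original_cell = " ".join(f"{label} {n}" for label, n in zip(labels, original))
    parts = []
    for label, expected, found in zip(labels, original, extracted):
        flag = "" if expected == found else " *"
        parts.append(f"{label} {found}{flag}")
    return original_cell, " ".join(parts)


original_cell, extracted_cell = report_counts((27, 8), (26, 8))
print(original_cell)   # BODY 27 FOOTER 8
print(extracted_cell)  # BODY 26 * FOOTER 8
```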
This is the first 10 article analysis. Added columns
PMCID | raw_table_number | raw_filename | raw_table_title | extracted_subtable_name | matches20191121 | matches20191121_notes | matches20191121_compound | matches20191121_percent | original composition | extracted compoisition | graphic_table | compound_col_name | percent_col_name | additional_percent_col_names | notes | FN | FP |
PMC4391421 | Table 1 | table_1.xml | Chemical composition of thyme EO | composition_extracted_1.html | BODY 15 FOOTER 1 | BODY 12 FOOTER 0 | | Constituents* | Area % of total |
PMC5080681 | Table 1 | table_1.xml | Chemical composition, concentrations (%) and calculated retention indices, of T. bovei essential oil as characterized by GC/MS analysis | composition_extracted_1.html | BODY 27 FOOTER 8 | BODY 27 FOOTER 8 | Constituents | % | |
PMC5132230 | Table 1 | table_1.xml | Chemical composition of the Aeollanthus suaveolens essential oil. | composition_extracted_1.html | BODY 19 FOOTER 5 | Not extracted. | Compounds | Relative Percentage (%) |
PMC5203915 | Table 1 | table_1.xml | Percentage of composition of essential oils from Rhaponticum carthamoides roots of soil-grown plants (SGR) and hairy roots (HR). | composition_extracted_1.html | BODY 62 FOOTER 6 | BODY 62 FOOTER 6 | Constituent ; Class of compound | SGR [%] ; HR [%] | Two EO profiles. |
PMC5237462 | Table 1 | table_1.xml | Major constituents of the essential oils of M. piperita. | composition_extracted_1.html | FN | BODY 11 FOOTER 4 | BODY 11 FOOTER 4 | Components | Peak Area (%) |
PMC5248495 | Table 1 | table_1.xml | Chemical composition of essential oils of Ocimum basilicum var.purpureum, Ocium basilicum var. thyrsiflora, Ocimum citriodorum | composition_extracted_1.html | FN | | BODY 33 FOOTER 0 | BODY 33 FOOTER 0 | Chemical components | | O. basilicumvar.purpureum,%b ; O. basilicumvar.thyrsiflora,% ; O. xcitriodorum, | Three EO profile. |
PMC5282690 | TN | BODY 0 FOOTER 0 | Not extracted. |
PMC5307246 | TN | BODY 0 FOOTER 0 | Not extracted. |
PMC5307902 | Table 3 | table_3.xml | Percentage chemical composition of the essential oil from leaves of P. amboinicus by gas chromatography-mass spectrometry. | FN | FN | | FN | FN | BODY 19 FOOTER 0 | Not extracted. | YES | FN | FN | | No EO composition is extracted. | Compounds; Area (%) | |
PMC5324201 | Table 9 | table_9.xml | Compound composition (% w/w) in the essential oil and water ... | composition_extracted_1.html | | FP - table: Proximate composition of Anethum sowa L. Root ; table: Fatty acid composition of Anethum sowa L. root extract (cold and hot extracts) by GC | | FN | BODY 24 FOOTER 1 | BODY 25 FOOTER 0 | | Name of Compounds | FN | | Not regular title. Multiple column headers are there. | Essential oil - Conc. (%); Water extract part - Conc. (%) | |
Test sheet with added columns original composition and extracted compoisition - testsheetCompositionAnalysis20191126.tsv.
Sir, please go through the updated sheet for composition extraction - compositionAnalysis20191119.tsv.
Added columns - Original composition, Extracted composition and error*.
I have enhanced the software to extract more documents, so please revisit the missing documents and check whether they are now present. If so please analyse them. A typical example is PMC5132230 which now has a "composition" table.
On Wed, Nov 27, 2019 at 9:13 AM Ambarish Kumar notifications@github.com wrote:
Sir, please go through the updated sheet for composition extraction - compositionAnalysis20191119.tsv https://github.com/petermr/CEVOpen/blob/master/articleAnalysis/oil186/raw/compositionAnalysis20191119.tsv .
Added columns - Original composition , Extracted composition and error*.
-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
OK sir.
The composition files have the wrong names. You have composition_extracted_1.html. This file does not exist. It should be composition_2.html.
We cannot work with incorrect data.
On Wed, Nov 27, 2019 at 10:28 AM Ambarish Kumar notifications@github.com wrote:
Updated sheet - https://github.com/petermr/CEVOpen/blob/master/articleAnalysis/oil186/raw/compositionAnalysis20191119.tsv
Sir, please go through these articles in compositionAnalysis20191119.tsv: PMC5590060, PMC5603114, PMC5933692.
Previously composition was extracted but this time it is FN.
Also, should I verify compound_col_name and percent_col_name? Have any changes been made to them?
PMC5590060 | Table 1 | table_1.xml | Composition of E. foetidum essential oils. | FN | FN | BODY 34 FOOTER 5 | **Not extracted**. | | | **Compounds** | **%**
PMC5603114 | Table 1 | table_1.xml | Chemical composition of resin essential oil of P. heptaphyll ... | FN | BODY 23 FOOTER 0 | **Not extracted**. | **Constituents** | | **Area (%) EOPh Com. resins ; Area (%) EOPh Nat. resins** | Two EO profiles. |
PMC5933692 | Table 1 | table_1.xml | Essential oil composition of G. rosmarinifolia. Compounds be ... | | BODY 34 FOOTER 1 | **Not extracted.** | | **Compound | Relative amount (%)** |
Sir, please go through the revised composition extraction sheet - compositionAnalysis20191119.tsv.
I have corrected composition file names and tables as FPs.
compound_col_name and percent_col_name are the same as before.
Most of the HTML tables from EuropePMC have headers (<thead> or <th>) but no explicit footers. However, there is usually a transition in content. Typical example:
Empirical rule: column1 contains "Total"
Task: determine empirical rules for when footer starts.
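As a starting point for testing candidate rules, here is a rough sketch of the transfer step: it applies one rule to the first column and moves the matching row and all subsequent rows into a new tfoot. It uses Python's standard `xml.etree.ElementTree` and assumes well-formed XHTML with an explicit tbody. The names `move_rows_to_tfoot` and `contains_total` are hypothetical, and the single-column table is an illustrative fragment, not a real EuropePMC table.

```python
import xml.etree.ElementTree as ET


def contains_total(text):
    """Empirical rule from this thread: column 1 contains 'Total'."""
    return "total" in text.lower()


def move_rows_to_tfoot(table, starts_footer):
    """Move the first footer row and all later rows from tbody to a new tfoot.

    Returns the index of the first footer row, or None if no row matches
    (in which case the table is left unchanged).
    """
    tbody = table.find("tbody")
    if tbody is None:
        return None
    rows = tbody.findall("tr")
    for index, row in enumerate(rows):
        cells = row.findall("td")
        first_text = "".join(cells[0].itertext()).strip() if cells else ""
        if starts_footer(first_text):
            tfoot = ET.SubElement(table, "tfoot")
            for footer_row in rows[index:]:
                tbody.remove(footer_row)
                tfoot.append(footer_row)
            return index
    return None


# Illustrative fragment only; cell names borrowed from the log output above.
table = ET.fromstring(
    "<table><thead><tr><th>Constituents</th></tr></thead><tbody>"
    "<tr><td>Thymol</td></tr><tr><td>Camphene</td></tr>"
    "<tr><td>Total</td></tr><tr><td>Yield (w/w) %</td></tr>"
    "</tbody></table>"
)
split_index = move_rows_to_tfoot(table, contains_total)
print(split_index, len(table.find("tbody")), len(table.find("tfoot")))  # 2 2 2
```

Passing the rule in as a function keeps the transfer logic fixed while different empirical rules are tried against the same set of tables.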