petermr / CEVOpen

Contentmining of Open phytochemical literature for medicinal activities

create footers for tables #60

Open petermr opened 5 years ago

petermr commented 5 years ago

Most of the HTML tables from EuropePMC have headers (<thead> or <th>) but no explicit footers. However, there is usually a visible transition in content at the point where the footer begins.

Typical example:

<?xml version="1.0" encoding="UTF-8"?>
<table xmlns="http://www.w3.org/1999/xhtml">
 <caption class="caption">
  <label class="label">Table 1</label>
  <p class="p" xmlns="">Chemical composition of thyme EO</p>
 </caption>
 <tbody class="tbody">
  <tr class="tr" xmlns="">
   <th align="center" rowspan="1" colspan="1" class="th">No.</th>
   <th align="center" rowspan="1" colspan="1" class="th">RT (min)</th>
   <th align="center" rowspan="1" colspan="1" class="th">Area % of total</th>
   <th align="center" rowspan="1" colspan="1" class="th">Constituents*</th>
  </tr>
  <tr class="tr" xmlns="">
   <td align="center" rowspan="1" colspan="1" class="td">1</td>
   <td align="center" rowspan="1" colspan="1" class="td">5.39</td>
   <td align="center" rowspan="1" colspan="1" class="td">1.06</td>
   <td align="center" rowspan="1" colspan="1" class="td">alpha-Thujene</td>
  </tr>
  <tr class="tr" xmlns="">
   <td align="center" rowspan="1" colspan="1" class="td">2</td>
   <td align="center" rowspan="1" colspan="1" class="td">5.63</td>
   <td align="center" rowspan="1" colspan="1" class="td">1.07</td>
   <td align="center" rowspan="1" colspan="1" class="td">alpha-Pinene</td>
  </tr>
...
  <tr class="tr" xmlns="">
   <td align="center" rowspan="1" colspan="1" class="td">15</td>
   <td align="center" rowspan="1" colspan="1" class="td">19.03</td>
   <td align="center" rowspan="1" colspan="1" class="td">0.78</td>
   <td align="center" rowspan="1" colspan="1" class="td">Cyclohexene, 1-methyl-4-(5-methyl-1-methylene-4-hexenyl)</td>
  </tr>
  <tr class="tr" xmlns="">
   <th align="center" rowspan="1" colspan="1" class="th">Total</th>
   <td align="center" rowspan="1" colspan="1" class="td"/>
   <th align="center" rowspan="1" colspan="1" class="th">99.91%</th>
   <td align="center" rowspan="1" colspan="1" class="td"/>
  </tr>
  <tr class="tr" xmlns="">
   <td align="center" rowspan="1" colspan="1" class="td">*Constituents presented in the order of elution from the VF 35 MS column.</td>
  </tr>
 </tbody>
</table>

Empirical rule: column1 contains "Total"

Task: determine empirical rules for when footer starts.
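
A minimal sketch of the "column 1 contains Total" rule, assuming the table is well-formed XML like the example above. The names `local_name`, `table_rows` and `footer_start` and the shortened sample are illustrative only; they are not part of the ami software. The rows carry xmlns="" while the table itself is in the XHTML namespace, so the sketch compares local tag names only.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the thyme EO table shown above.
SAMPLE = """<table xmlns="http://www.w3.org/1999/xhtml">
 <tbody class="tbody">
  <tr class="tr" xmlns=""><th>No.</th><th>Constituents*</th></tr>
  <tr class="tr" xmlns=""><td>1</td><td>alpha-Thujene</td></tr>
  <tr class="tr" xmlns=""><td>2</td><td>alpha-Pinene</td></tr>
  <tr class="tr" xmlns=""><th>Total</th><td/></tr>
  <tr class="tr" xmlns=""><td>*Constituents presented in the order of elution.</td></tr>
 </tbody>
</table>"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def table_rows(xml_text):
    """Return each <tr> as a list of cell texts (th and td alike)."""
    root = ET.fromstring(xml_text)
    rows = []
    for el in root.iter():
        if local_name(el.tag) == "tr":
            rows.append(["".join(cell.itertext()).strip() for cell in el])
    return rows


def footer_start(rows, keyword="total"):
    """Index of the first row whose first cell contains keyword, else None."""
    for i, row in enumerate(rows):
        if row and keyword in row[0].lower():
            return i
    return None


print(footer_start(table_rows(SAMPLE)))  # 3
```

Here the footer starts at row index 3 (the "Total" row), so the footnote row after it also falls in the footer.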

petermr commented 5 years ago

split tables into body and footer

Determine the start of the footer and transfer it and all subsequent rows to tfoot
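
A rough sketch of the transfer step, assuming the footer start index is already known. The function name `move_footer_rows` and the tiny sample table are hypothetical and only illustrate the idea; the actual tool may do this differently.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def move_footer_rows(table, start):
    """Move rows[start:] of the first tbody into a new tfoot appended to table."""
    tbody = next(el for el in table.iter() if local_name(el.tag) == "tbody")
    rows = [el for el in tbody if local_name(el.tag) == "tr"]
    tfoot = ET.SubElement(table, "{%s}tfoot" % XHTML)
    for row in rows[start:]:
        tbody.remove(row)
        tfoot.append(row)
    return tfoot


# Hypothetical four-row table: header, one data row, Total, footnote.
SAMPLE = (
    '<table xmlns="http://www.w3.org/1999/xhtml"><tbody>'
    '<tr xmlns=""><th>No.</th></tr>'
    '<tr xmlns=""><td>1</td></tr>'
    '<tr xmlns=""><th>Total</th></tr>'
    '<tr xmlns=""><td>*footnote</td></tr>'
    "</tbody></table>"
)
table = ET.fromstring(SAMPLE)
tfoot = move_footer_rows(table, 2)
```

After the call the tbody keeps the first two rows and the new tfoot holds the "Total" row and the footnote row.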

petermr commented 5 years ago

I have created a first pass at this. The definition for the compound column, including its footer query, looks like:

        <column name="compound" case="insensitive" id="comp.col.comp">
            <title id="comp.col.comp.tit">
                <query id="comp.col.comp.tit.q">
                    constituent OR
                    compound OR
                    component
                    NOT class
                </query>
            </title>
            <cell id="comp.col.comp.cell">
              <query id="comp.col.comp.cell.q1">@CHEMICAL@</query>
              <!-- <query id="comp.col.comp.cell.q2" mode="lookup">@COMPOUND_DICT@</query> -->
            </cell>
            <footer>
                <query>total OR yield OR terpene</query>
            </footer>
        </column>

This splits the table at the point BEFORE the first matching row. Typical results are:

AMITableTool cTree: PMC4391421
  table: Table 1Chemical composition of thyme EO
      column: compound => Constituents*; 64.7
      column: percentage => Area % of total; 100.0
AMITableTool cTree: PMC5080681
  table: Table 1Chemical composition, concentrations (%) and calculated retention indices, ofT. boveiessential oil as characterized by GC/MS analysis
      column: compound => Constituents; 97.1
215 [main] DEBUG org.contentmine.ami.tools.AMITableTool  - SPLIT footer 27
215 [main] DEBUG org.contentmine.cproject.util.RectTabColumn  - SPLIT
215 [main] DEBUG org.contentmine.ami.tools.AMITableTool  - SPLIT [[Trans-geraniol (Lemonol), α-citral (Trans-citral), β-citral (Cis-citral), Cis-geraniol (nerol), 3-octanol, DL-camphor, Eucalyptol(1,8) cineole, 3-octanone, Thymol, β-linalool, β-farnesene, Geranylisobutyrate, L-borneol, Isocaryophyllene, Camphene, Bergamiol, Dihydrocarveol acetate, α-cyclocitral, β-ocimene, Geranyl propionate, β-myrcene, α-terpineol, α-limonene, Nerolidol, α-terpinene, α-phellandrene, β-pinene], [Total, Yield (w/w) %, Number of constituents, Hydrocarbon monoterpenoid, Oxygenated monoterpenoid, Sesquiterpenoid hydrocarbon, Oxygenated sesquiterpenoid, Others]]
      column: percentage => %; 100.0
AMITableTool cTree: PMC5132230
AMITableTool cTree: PMC5203915
  table: Table 1Percentage of composition of essential oils fromRhaponticum carthamoidesroots of soil-grown plants (SGR) and hairy roots (HR).
      column: compound => Constituent; 92.6
271 [main] DEBUG org.contentmine.ami.tools.AMITableTool  - SPLIT footer 62
271 [main] DEBUG org.contentmine.cproject.util.RectTabColumn  - SPLIT
271 [main] DEBUG org.contentmine.ami.tools.AMITableTool  - SPLIT [[α-Pinene, Oct-1-en-3-ol, 2-Pentylfuran, α-Phellandrene, p-Cymene, β-Phellandrene, Limonene, (E)-Oct-2-enal, p-Cymenene, (E)-Non-2-enal, p-Cymen-9-ol, Thymol, Carvacrol, (E,E)-Deca-2,4-dienal, Cyprotene, 13-Norcypera-1(5),11(12)-diene, α-Longipinene, Cyperadiene, Cyclosativene, α-Copaene, α-Funebrene, Petasitene, β-Elemene, Thymol methyl ether, Cyperene, Dehydroisolongifolene, α-Cedrene, β-Caryophyllene, trans-α-Bergamotene, Sesquisabinene A, β-Helmiscapene, α-Helmiscapene, (Z)-β-Farnesene, α-Humulene, β-Santalene, Selina-3,7-diene, Rotundene, α-Acoradiene, γ-Gurjunene, Selina-4,11-diene, Dauca-4(11),8-diene, Nardosina-1(10),11-diene, β-Selinene, Pentadec-1-ene, α-Muurolene, Isorotundene, β-Bisabolene, (Z)-γ-Bisabolene, Premnaspirodiene, δ-Cadinene, Cyperene oxide, α-Calacorene, (E)-Nerolidol, β-Caryophyllene oxide, α-Corocalene, Longifolene aldehyde, 2,5,8-Trimethyl-1-naphthol, β-Himachalol, Cadalene, Aplotaxene, Cyperotundone, Palmitic acid], [Total identified,  ,  ,  ,  ,  ]]
      column: percentage => SGR [%]; 100.0
      column: percentage => HR [%]; 92.6
AMITableTool cTree: PMC5237462
  table: Table 1Major constituents of the essential oils ofM. piperita.
      column: compound => Components; 82.4
300 [main] DEBUG org.contentmine.ami.tools.AMITableTool  - SPLIT footer 12
300 [main] DEBUG org.contentmine.cproject.util.RectTabColumn  - SPLIT
301 [main] DEBUG org.contentmine.ami.tools.AMITableTool  - SPLIT [[Thuja-2,4(10)-diene, Verbenene, β-Pinene, Mentha-2,8-diene, β-Ocimene, Linalool, Epizonarene, Epoxyocimene, Sesquiphellandrene, Cadinene, Germacrene B, null], [Monoterpene hydrocarbons, Oxygenated monoterpenes, Sesquiterpene hydrocarbons, null,  ]]
      column: percentage => Peak Area (%); 100.0

This works well when all non-chemical names are at the bottom. It's enough for us to extract enough "good" compounds to see whether we have to extend the dictionary.

(One positive aspect is that the names are presumably contained in the mass spec lookup tables, so they are probably "reasonably well" standardised.)
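
The split behaviour described above can be sketched as follows. This assumes the footer query "total OR yield OR terpene" acts as a case-insensitive substring match on each cell; that reading, and the names `FOOTER_PATTERN` and `split_column`, are assumptions for illustration, not the actual AMITableTool code.

```python
import re

# Assumed reading of the <footer> query: any of these words, case-insensitive.
FOOTER_PATTERN = re.compile(r"total|yield|terpene", re.IGNORECASE)


def split_column(cells, pattern=FOOTER_PATTERN):
    """Split cells into (body, footer) just before the first matching cell.

    Cells that are None (empty in the table) never match.
    """
    for i, cell in enumerate(cells):
        if cell is not None and pattern.search(cell):
            return cells[:i], cells[i:]
    return cells, []


cells = ["Thymol", "β-pinene", "α-terpinene",
         "Total", "Yield (w/w) %", "Oxygenated monoterpenoid"]
body, footer = split_column(cells)
print(len(body), len(footer))  # 3 3
```

Note that "α-terpinene" stays in the body because it does not contain the literal string "terpene". A table with a summary word in the middle of the compound list would be split too early, which is the limitation mentioned above.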

petermr commented 5 years ago

check extracted body and footer

for each true composition table there is:

These may be easier to analyse in the browser. The extracted body is cyan and footer is yellow.

The original table has an implied body of 27 terpenes and an implied footer of 8 summary rows starting at Total. Check that the extracted body and footer have the same number of rows as the original and record any discrepancies:

create new columns

original should contain:

BODY 27
FOOTER 8

and extracted should be identical

If the extracted count disagrees, indicate this with an asterisk in extracted, e.g. BODY 26 *

If the body or footer is missing, write BODY 0 and/or FOOTER 0.
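
As an illustration of the recording convention above, here is a small sketch that builds the two cell values from row counts. The function name `count_cells` and its argument layout are hypothetical; the counts themselves still have to be made by inspecting the tables.

```python
def count_cells(original, extracted):
    """Build the 'original' and 'extracted' cell texts.

    original and extracted are (body_rows, footer_rows) pairs.
    A missing body or footer may be given as None and is written as 0.
    An extracted count that differs from the original gets ' *'.
    """
    original_parts = []
    extracted_parts = []
    for label, orig, extr in zip(("BODY", "FOOTER"), original, extracted):
        orig = orig or 0
        extr = extr or 0
        original_parts.append(f"{label} {orig}")
        mark = "" if extr == orig else " *"
        extracted_parts.append(f"{label} {extr}{mark}")
    return original_parts, extracted_parts


print(count_cells((27, 8), (26, 8)))
# (['BODY 27', 'FOOTER 8'], ['BODY 26 *', 'FOOTER 8'])
```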

ambarishK commented 5 years ago

This is the analysis of the first 10 articles. Added columns:


PMCID | raw_table_number | raw_filename | raw_table_title | extracted_subtable_name | matches20191121 | matches20191121_notes | matches20191121_compound | matches20191121_percent | original composition | extracted composition | graphic_table | compound_col_name | percent_col_name | additional_percent_col_names | notes | FN | FP |  

PMC4391421 | Table 1 | table_1.xml | Chemical composition of thyme EO | composition_extracted_1.html |  BODY 15 FOOTER 1 | BODY 12 FOOTER 0 |   | Constituents* | Area % of total |  

PMC5080681 | Table 1 | table_1.xml | Chemical composition, concentrations (%) and calculated retention indices, of T. bovei essential oil as characterized by GC/MS analysis | composition_extracted_1.html  | BODY 27 FOOTER 8 | BODY 27 FOOTER 8 |   Constituents | % |     | 
PMC5132230 | Table 1 | table_1.xml | Chemical composition of the Aeollanthus suaveolens essential oil. | composition_extracted_1.html  | BODY 19 FOOTER 5 | Not extracted. |  Compounds | Relative Percentage (%) |  

PMC5203915 | Table 1 | table_1.xml | Percentage of composition of essential oils from Rhaponticum carthamoides roots of soil-grown plants (SGR) and hairy roots (HR). | composition_extracted_1.html |  BODY 62 FOOTER 6 | BODY 62 FOOTER 6 |   Constituent ; Class of compound |   SGR [%] ; HR [%] | Two EO profiles. |    

PMC5237462 | Table 1 | table_1.xml | Major constituents of the essential oils of M. piperita. | composition_extracted_1.html |   FN |  BODY 11 FOOTER 4 | BODY 11 FOOTER 4 |  Components | Peak Area (%) |   

PMC5248495 | Table 1 | table_1.xml | Chemical composition of essential oils of Ocimum basilicum var.purpureum, Ocium basilicum var. thyrsiflora, Ocimum citriodorum | composition_extracted_1.html | FN |   | BODY 33 FOOTER 0 | BODY 33 FOOTER 0 |  Chemical components |   | O. basilicumvar.purpureum,%b ; O. basilicumvar.thyrsiflora,% ; O. xcitriodorum, | Three EO profile. |    

PMC5282690 | TN |  BODY 0 FOOTER 0 | Not extracted. |     

PMC5307246 | TN | BODY 0 FOOTER 0 | Not extracted. |  

PMC5307902 | Table 3 | table_3.xml | Percentage chemical composition of the essential oil from leaves of P. amboinicus by gas chromatography-mass spectrometry. | FN | FN |   | FN | FN | BODY 19 FOOTER 0 | Not extracted. | YES | FN | FN |   | No EO composition is extracted. | Compounds; Area (%) |   |  

PMC5324201 | Table 9 | table_9.xml | Compound composition (% w/w) in the essential oil and water ... | composition_extracted_1.html |   | FP - table:  Proximate composition of Anethum sowa L. Root ; table:  Fatty acid composition of Anethum sowa L. root extract (cold and hot extracts) by GC |   | FN | BODY 24 FOOTER 1 | BODY 25 FOOTER 0 |   | Name of Compounds | FN |   | Not regular title. Multiple column headers are there. | Essential oil - Conc. (%); Water extract part - Conc. (%) |   |  

Test sheet with the added columns "original composition" and "extracted composition": testsheetCompositionAnalysis20191126.tsv.

ambarishK commented 5 years ago

Sir, please go through the updated sheet for composition extraction - compositionAnalysis20191119.tsv.

Added columns: Original composition, Extracted composition and error*.

petermr commented 5 years ago

I have enhanced the software to extract more documents, so please revisit the missing documents and check whether they are now present. If so please analyse them. A typical example is PMC5132230 which now has a "composition" table.

-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

ambarishK commented 5 years ago

OK sir.

ambarishK commented 5 years ago

Updated sheet - https://github.com/petermr/CEVOpen/blob/master/articleAnalysis/oil186/raw/compositionAnalysis20191119.tsv

petermr commented 5 years ago

The composition files have the wrong names. You have composition_extracted_1.html. This file does not exist. It should be composition_2.html.

We cannot work with incorrect data.
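
A quick check of this kind could catch wrong file names before the sheet is submitted. This is only a sketch: the function name `missing_files` is hypothetical, and the directory that holds the composition files for each article is an assumption that must be adapted to the real cTree layout.

```python
from pathlib import Path


def missing_files(directory, names):
    """Return the names from the sheet that are not files in directory."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]


# Hypothetical usage: pass the directory holding one article's extracted
# tables and the file names recorded for that article in the sheet.
# missing_files(article_dir, ["composition_extracted_1.html"])
```

Any name returned by the function is recorded in the sheet but absent on disk and should be corrected.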

ambarishK commented 5 years ago

Sir, please go through these articles in compositionAnalysis20191119.tsv: PMC5590060, PMC5603114, PMC5933692.

Previously the composition was extracted, but this time it is FN.

Also, please tell me whether I should verify compound_col_name and percent_col_name. Have any changes been made to them?


PMC5590060 | Table 1 | table_1.xml | Composition of E. foetidum essential oils.  | FN | FN | BODY 34 FOOTER 5 | **Not extracted**. |   |   | **Compounds** | **%**

PMC5603114 | Table 1 | table_1.xml | Chemical composition of resin essential oil of P. heptaphyll ... | FN |  BODY 23 FOOTER 0 | **Not extracted**. |  **Constituents** |   | **Area (%) EOPh  Com. resins ; Area (%) EOPh  Nat. resins** | Two EO profiles. | 

PMC5933692 | Table 1 | table_1.xml | Essential oil composition of G. rosmarinifolia. Compounds be ... |   | BODY 34 FOOTER 1 | **Not extracted.** |   | **Compound** | **Relative amount (%)** |  

 

ambarishK commented 5 years ago

Sir, Please go through the revised composition extraction sheet - compositionAnalysis20191119.tsv

I have corrected the composition file names and marked tables as FPs.

compound_col_name and percent_col_name are the same as before.