jakelever / biotext

Get a nicely-chunked local copy of the biomedical literature (to use for other projects)!
MIT License

Special Case: Table with multiple separated header lines #18

Open creisle opened 1 year ago

creisle commented 1 year ago

Examples from here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3704624/

Mutations detected in samples.
| BRAFV600 (Sanger) | Melanoma (n=45) | Associated nevus (n=46) | Control nevus (n=25) | Matched melanoma (n=28) |
| -- | -- | -- | -- | -- |
| V600E | 51.1% (n=23) | 63.0% (n=29) | 52.0% (n=13) | 39.3% (n=11) |
| V600K | 0 | 0 | 4.0% (n=1) | 0 |
| Wildtype | 48.9% (n=22) | 37.0% (n=17) | 44.0% (n=11) | 60.7% (n=17) |
| **BRAFV600E (Sanger + VE1 IHC)** | **Melanoma (n=46)** | **Associated nevus (n=46)** | **Control nevus (n=25)** | **Matched melanoma (n=29)** |
| V600E | 63.0% (n=29) | 65.2% (n=30) | 54.2% (n=13) | 41.4% (n=12) |
| Wildtype | 37.0% (n=17) | 34.8% (n=16) | 48.0% (n=12) | 58.6% (n=17) |
| **NRAS Exon 2 (Sanger)** | **Melanoma (n=42)** | **Associated nevus (n=44)** | **Control nevus (n=21)** | **Matched melanoma (n=26)** |
| Silent mutations | 2.4% (n=1; A66A) | 2.3% (n=1; L52L) | 0 | 0 |
| Q61K | 4.8% (n=2) | 4.5% (n=2) | 14.3% (n=3) | 0 |
| Q61L | 2.4% (n=1) | 2.3% (n=1) | 0 | 0 |
| Q61R | 2.4% (n=1) | 9.1% (n=4) | 0 | 7.7% (n=2) |
| Wildtype | 88.1% (n=37) | 81.8% (n=36) | 85.7% (n=18) | 92.3% (n=24) |

This should really be three separate tables, so I'm not sure how we can parse it properly. Right now the parser assumes that nothing in the tbody is a header, so the two bold rows in the middle are treated as ordinary data rows.

creisle commented 1 year ago

Source XML

<table frame="hsides" rules="groups">
            <colgroup span="1">
              <col span="1"/>
              <col span="1"/>
              <col span="1"/>
              <col span="1"/>
              <col span="1"/>
            </colgroup>
            <thead>
              <tr>
                <th rowspan="1" colspan="1">
<bold>BRAF<sup>V600</sup> (Sanger)</bold>
</th>
                <th rowspan="1" colspan="1">
<bold>Melanoma (n=45)</bold>
</th>
                <th rowspan="1" colspan="1">
<bold>Associated nevus (n=46)</bold>
</th>
                <th rowspan="1" colspan="1">
<bold>Control nevus (n=25)</bold>
</th>
                <th rowspan="1" colspan="1">
<bold>Matched melanoma (n=28)</bold>
</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td rowspan="1" colspan="1">V600E</td>
                <td rowspan="1" colspan="1">51.1% (n=23)</td>
                <td rowspan="1" colspan="1">63.0% (n=29)</td>
                <td rowspan="1" colspan="1">52.0% (n=13)</td>
                <td rowspan="1" colspan="1">39.3% (n=11)</td>
              </tr>
              <tr>
                <td rowspan="1" colspan="1">V600K</td>
                <td rowspan="1" colspan="1">0</td>
                <td rowspan="1" colspan="1">0</td>
                <td rowspan="1" colspan="1">4.0% (n=1)</td>
                <td rowspan="1" colspan="1">0</td>
              </tr>
              <tr>
                <td rowspan="1" colspan="1">Wildtype</td>
                <td rowspan="1" colspan="1">48.9% (n=22)</td>
                <td rowspan="1" colspan="1">37.0% (n=17)</td>
                <td rowspan="1" colspan="1">44.0% (n=11)</td>
                <td rowspan="1" colspan="1">60.7% (n=17)</td>
              </tr>
              <tr>
                <td rowspan="1" colspan="1">
<bold>BRAF<sup>V600E</sup> (Sanger + VE1 IHC)</bold>
</td>
                <td rowspan="1" colspan="1">
<bold>Melanoma (n=46)</bold>
</td>
                <td rowspan="1" colspan="1">
<bold>Associated nevus (n=46)</bold>
</td>
                <td rowspan="1" colspan="1">
<bold>Control nevus (n=25)</bold>
</td>
                <td rowspan="1" colspan="1">
<bold>Matched melanoma (n=29)</bold>
</td>
              </tr>
              <tr>
                <td rowspan="1" colspan="1">V600E</td>
                <td rowspan="1" colspan="1">63.0% (n=29)</td>
                <td rowspan="1" colspan="1">65.2% (n=30)</td>
                <td rowspan="1" colspan="1">54.2% (n=13)</td>
                <td rowspan="1" colspan="1">41.4% (n=12)</td>
              </tr>
              <tr>
                <td rowspan="1" colspan="1">Wildtype</td>
                <td rowspan="1" colspan="1">37.0% (n=17)</td>
                <td rowspan="1" colspan="1">34.8% (n=16)</td>
                <td rowspan="1" colspan="1">48.0% (n=12)</td>
                <td rowspan="1" colspan="1">58.6% (n=17)</td>
              </tr>
              <tr>
                <td rowspan="1" colspan="1">
<bold>NRAS Exon 2 (Sanger)</bold>
</td>
                <td rowspan="1" colspan="1">
<bold>Melanoma (n=42)</bold>
</td>
                <td rowspan="1" colspan="1">
<bold>Associated nevus (n=44)</bold>
</td>
                <td rowspan="1" colspan="1">
<bold>Control nevus (n=21)</bold>
</td>
                <td rowspan="1" colspan="1">
<bold>Matched melanoma (n=26)</bold>
</td>
              </tr>
              <tr>
                <td rowspan="1" colspan="1">Silent mutations</td>
                <td rowspan="1" colspan="1">2.4% (n=1; A66A)</td>
                <td rowspan="1" colspan="1">2.3% (n=1; L52L)</td>
                <td rowspan="1" colspan="1">0</td>
                <td rowspan="1" colspan="1">0</td>
              </tr>
              <tr>
                <td rowspan="1" colspan="1">Q61K</td>
                <td rowspan="1" colspan="1">4.8% (n=2)</td>
                <td rowspan="1" colspan="1">4.5% (n=2)</td>
                <td rowspan="1" colspan="1">14.3% (n=3)</td>
                <td rowspan="1" colspan="1">0</td>
              </tr>
              <tr>
                <td rowspan="1" colspan="1">Q61L</td>
                <td rowspan="1" colspan="1">2.4% (n=1)</td>
                <td rowspan="1" colspan="1">2.3% (n=1)</td>
                <td rowspan="1" colspan="1">0</td>
                <td rowspan="1" colspan="1">0</td>
              </tr>
              <tr>
                <td rowspan="1" colspan="1">Q61R</td>
                <td rowspan="1" colspan="1">2.4% (n=1)</td>
                <td rowspan="1" colspan="1">9.1% (n=4)</td>
                <td rowspan="1" colspan="1">0</td>
                <td rowspan="1" colspan="1">7.7% (n=2)</td>
              </tr>
              <tr>
                <td rowspan="1" colspan="1">Wildtype</td>
                <td rowspan="1" colspan="1">88.1% (n=37)</td>
                <td rowspan="1" colspan="1">81.8% (n=36)</td>
                <td rowspan="1" colspan="1">85.7% (n=18)</td>
                <td rowspan="1" colspan="1">92.3% (n=24)</td>
              </tr>
            </tbody>
          </table>
creisle commented 1 year ago

It seems the table is just oddly formatted, even in the original non-PMC article. I will leave this for now, but I am documenting it here so that if we run into another case we can come up with a possible solution.
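
For reference, one possible heuristic is sketched below. It is not what biotext currently does, and the function names (`cell_is_all_bold`, `split_on_bold_rows`) are made up for illustration. It assumes that a tbody row acts as a header when every cell contains nothing but a single `<bold>` element, as in the XML above, and starts a new sub-table at each such row. Tables that bold an ordinary data row would be split incorrectly, so this would need checking against more examples before being adopted.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the table from this issue (two columns only).
SAMPLE = """
<table>
  <thead>
    <tr><th><bold>BRAF<sup>V600</sup> (Sanger)</bold></th><th><bold>Melanoma (n=45)</bold></th></tr>
  </thead>
  <tbody>
    <tr><td>V600E</td><td>51.1% (n=23)</td></tr>
    <tr><td><bold>NRAS Exon 2 (Sanger)</bold></td><td><bold>Melanoma (n=42)</bold></td></tr>
    <tr><td>Q61K</td><td>4.8% (n=2)</td></tr>
  </tbody>
</table>
""".strip()


def cell_is_all_bold(cell):
    """True if the cell holds exactly one <bold> child and no text outside it."""
    children = list(cell)
    if len(children) != 1 or children[0].tag != "bold":
        return False
    loose_text = (cell.text or "").strip() or (children[0].tail or "").strip()
    return not loose_text


def row_texts(row):
    """Flatten each cell of a <tr> to whitespace-normalised plain text."""
    return [" ".join("".join(cell.itertext()).split()) for cell in row]


def split_on_bold_rows(table):
    """Split a <table> element into a list of (header, data_rows) groups.

    Assumed heuristic: a row made only of <th> cells, or only of fully
    bold cells, is a header and opens a new group.
    """
    groups = []
    for row in table.iter("tr"):
        cells = list(row)
        is_header = bool(cells) and (
            all(c.tag == "th" for c in cells)
            or all(cell_is_all_bold(c) for c in cells)
        )
        if is_header:
            groups.append((row_texts(row), []))
            continue
        if not groups:
            # Data rows before any header: keep them under an empty header.
            groups.append(([], []))
        groups[-1][1].append(row_texts(row))
    return groups


groups = split_on_bold_rows(ET.fromstring(SAMPLE))
for header, rows in groups:
    print(header, rows)
```

Applied to the full table in this issue, the same rule would yield three groups, one per bold header row.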