jakelever / biotext

Get a nicely-chunked local copy of the biomedical literature (to use for other projects)!
MIT License

Table normalization cases #20

Open creisle opened 1 year ago

creisle commented 1 year ago
| Case | Example Article | License | table index | Tests |
|---|---|---|---|---|
| header hierarchical colspans | PMC5029658 | CC-BY | 1,2,3 | |
| body hierarchical rowspans | PMC5029658 | CC-BY | 0 | |
| header hierarchical colspans | PMC2873663 | author version | redundant | |
| body full colspans | PMC7461630 | author version | redundant | |
| body full colspans | PMC4919728 | CC BY-NC-ND | 0 | |
| body partial colspans | PMC4049792 | CC-BY NC | 0 | |
| paragraphs inside table cells | PMC6580637 | CC-BY | 2 | |
| empty cells that should be repeated | PMC4816447 | CC-BY | 0 | |

creisle commented 3 months ago

Case 1 - hierarchical header rows

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5029658/

proposed solution:

Cell Line SUP-M2 IC50
TKI Crizotinib 67.75
TKI Ceritinib 15.57
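
A minimal sketch of how the header flattening above could work, assuming each header row has already been parsed into `(text, colspan)` pairs and any rowspans have been expanded. The `Cell` representation, the function names, and the `OTHER` cell line are hypothetical, chosen for illustration; this is not biotext's actual implementation, and the sample header is not the real layout of the PMC5029658 table.

```python
from typing import List, Tuple

Cell = Tuple[str, int]  # (text, colspan); hypothetical representation


def expand_colspans(row: List[Cell]) -> List[str]:
    """Repeat each cell's text once per column it spans."""
    expanded: List[str] = []
    for text, colspan in row:
        expanded.extend([text] * colspan)
    return expanded


def flatten_header(header_rows: List[List[Cell]]) -> List[str]:
    """Merge stacked header rows into one label per column."""
    grid = [expand_colspans(row) for row in header_rows]
    width = max(len(row) for row in grid)
    # Pad short rows so every row covers the full table width.
    grid = [row + [""] * (width - len(row)) for row in grid]
    flat: List[str] = []
    for column in zip(*grid):
        parts: List[str] = []
        for text in column:
            # Skip empty cells and labels repeated directly above.
            if text and (not parts or parts[-1] != text):
                parts.append(text)
        flat.append(" ".join(parts))
    return flat


# Illustrative header only; "OTHER" is a made-up second cell line.
header = [
    [("", 1), ("Cell Line", 2)],
    [("", 1), ("SUP-M2", 1), ("OTHER", 1)],
    [("", 1), ("IC50", 1), ("IC50", 1)],
]
print(flatten_header(header))
```

Joining top-down keeps the broadest label first, which matches the `Cell Line SUP-M2 IC50` header in the proposed solution.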

Case 2 - in body colspans

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7461630/

proposed solution

| | All patients in NTRK gene fusion-positive efficacy-evaluable population (n=54) |
|---|---|
| Age, years | 58 (48-67) |
| Sex: Female | 32 (59%) |
| Sex: Male | 22 (41%) |
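
One way the proposal for Case 2 might be sketched, assuming a full-colspan body row shows up as a single-cell row once the table is parsed, and that it applies to every row until the next full-colspan row. Both assumptions, and the function name `apply_section_rows`, are hypothetical rather than taken from biotext.

```python
from typing import List


def apply_section_rows(rows: List[List[str]], width: int) -> List[List[str]]:
    """Fold full-colspan rows into the labels of the rows that follow them."""
    result: List[List[str]] = []
    prefix = ""
    for row in rows:
        if len(row) == 1 and width > 1:
            # Assumed: a lone cell in a multi-column table is a section heading.
            prefix = row[0]
            continue
        label, *values = row
        full_label = f"{prefix}: {label}" if prefix else label
        result.append([full_label, *values])
    return result


body = [
    ["Age, years", "58 (48-67)"],
    ["Sex"],
    ["Female", "32 (59%)"],
    ["Male", "22 (41%)"],
]
for row in apply_section_rows(body, width=2):
    print(row)
```

A real table would also need a rule for when a section ends (for example, a change in indentation), which this sketch does not attempt.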
creisle commented 3 months ago

Case 3 - In body partial colspans

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4049792/table/T1/?report=objectonly

proposed solution

| | Total | Female | Male | P value |
|---|---|---|---|---|
| n = | 1506 | 617 (41%) | 889 (59%) | |
| Age | 61 ± 11.3 | 59 ± 12.1 | 61 ± 11.2 | 0.014 |
| Tumor site (right vs left): Right | 365 (24.2%) | 177 (28.7%) | 188 (21.1%) | 0.001 |
| Tumor site (right vs left): Left | 1141 (75.8%) | 440 (71.3%) | 701 (78.9%) | 0.001 |
| Tumor site (right vs left vs rectum): Right | 365 (24.2%) | 177 (28.7%) | 188 (21.1%) | < 0.0001 |
| Tumor site (right vs left vs rectum): Left | 538 (35.7%) | 228 (40.0%) | 310 (34.9%) | < 0.0001 |
| Tumor site (right vs left vs rectum): Rectum | 603 (40.1%) | 212 (34.3%) | 391 (44.0%) | < 0.0001 |

@jakelever does this make sense to you? I am not sure about repeating the p-value, since it's for the 2-way and 3-way tests, but there's not another way to make this one work.
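
A rough sketch of the partial-colspan handling proposed for Case 3, assuming cells covered by a colspan or left blank are parsed as `None`. The rules used here are assumptions for illustration: a row whose first value column is empty starts a section, its remaining values (such as the p-value) are copied into the empty cells of the rows below it, and a row with no empty cells closes the section. The function name is hypothetical.

```python
from typing import List, Optional

Row = List[Optional[str]]  # first item is the row label; None marks an empty cell


def normalize_partial_spans(rows: List[Row]) -> List[List[str]]:
    """Prefix child rows with their section label and inherit section values."""
    out: List[List[str]] = []
    section: Optional[Row] = None
    for row in rows:
        label, values = row[0] or "", row[1:]
        if values and not values[0]:
            # Assumed: an empty first value column marks a section row.
            section = row
            continue
        if all(values):
            # A fully populated row is treated as standing on its own.
            section = None
        if section is None:
            out.append([label] + [v or "" for v in values])
        else:
            merged = [v or s or "" for v, s in zip(values, section[1:])]
            out.append([f"{section[0]}: {label}"] + merged)
    return out


table = [
    ["n =", "1506", "617 (41%)", "889 (59%)", None],
    ["Age", "61 ± 11.3", "59 ± 12.1", "61 ± 11.2", "0.014"],
    ["Tumor site (right vs left)", None, None, None, "0.001"],
    ["Right", "365 (24.2%)", "177 (28.7%)", "188 (21.1%)", None],
    ["Left", "1141 (75.8%)", "440 (71.3%)", "701 (78.9%)", None],
]
for row in normalize_partial_spans(table):
    print(row)
```

Copying the section's p-value into each child row is what produces the repeated `0.001` in the proposal; whether that repetition is acceptable is the open question raised above.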

creisle commented 3 months ago

Here's one where the table XML is not formatted correctly, so it didn't use a rowspan for the exon number despite that clearly being the intent. I'm not sure there is a good way to fix this, but I've put in a test that is skipped for now in case we find a way to resolve it in the future.

image
creisle commented 2 months ago

Another hard-to-interpret table example:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6030433/table/T1/?report=objectonly

jakelever commented 2 months ago

These are tricky tables. Your proposed solutions for Cases 1-3 look reasonable to me.