jakelever / biotext

Get a nicely-chunked local copy of the biomedical literature (to use for other projects)!
MIT License
Better table header handling #14

Open creisle opened 1 year ago

creisle commented 1 year ago

I've been using the linearized tables, but one thing I've noticed is that with something complex like a multi-level header, plain linearization means the number of header cells does not always match the number of body cells. For example:

image

example article used: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2873663/

This header currently gets turned into

p53 MUTATION FUNCTIONALa STATUS IARC DATABASEb FEATURESc SOMATIC GERMLINE FAMILIES TOTAL BREAST

We lose a lot of meaning, and it becomes impossible to match the header cells to the cell text from the body of the table (see below).

p53 MUTATION FUNCTIONALa STATUS IARC DATABASEb FEATURESc SOMATIC GERMLINE FAMILIES TOTAL BREAST
T125R ALTERED 2 1 0

So I'd like to try something more complex, where we simplify the header into a single row before we linearize. It would require making the text differ slightly from the original by repeating some words, which I am not sure about. The end result would look like this:

p53 MUTATION FUNCTIONALa STATUS IARC DATABASEb SOMATIC TOTAL IARC DATABASEb SOMATIC BREAST IARC DATABASEb GERMLINE FAMILIES FEATURESc
T125R ALTERED 2 1 0
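
A minimal sketch of the header flattening described above, not the actual implementation. It assumes each header cell is already available as a `(text, rowspan, colspan)` tuple. The name `flatten_header` and the spans in `example_header` are hypothetical: the spans are a guess at the structure of the example header, not values read from the article XML.

```python
from typing import Dict, List, Tuple

Cell = Tuple[str, int, int]  # (text, rowspan, colspan)


def flatten_header(rows: List[List[Cell]]) -> List[str]:
    """Collapse a multi-level header into one label per column."""
    grid: Dict[Tuple[int, int], str] = {}  # (row, col) -> cell text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip slots already occupied by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            # Copy the cell text into every slot the cell spans.
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan

    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1

    labels = []
    for c in range(n_cols):
        parts: List[str] = []
        for r in range(n_rows):
            text = grid.get((r, c), "")
            # Drop consecutive repeats so a rowspan cell appears only once.
            if text and (not parts or parts[-1] != text):
                parts.append(text)
        labels.append(" ".join(parts))
    return labels


# Assumed structure of the example header (spans are a guess).
example_header = [
    [("p53 MUTATION", 3, 1), ("FUNCTIONALa STATUS", 3, 1),
     ("IARC DATABASEb", 1, 3), ("FEATURESc", 3, 1)],
    [("SOMATIC", 1, 2), ("GERMLINE FAMILIES", 2, 1)],
    [("TOTAL", 1, 1), ("BREAST", 1, 1)],
]

labels = flatten_header(example_header)
print(" ".join(labels))
```

Under those assumed spans, each column gets exactly one label, so the header cells can be lined up with the body cells of each row.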

@jakelever what do you think? I've already been implementing this for my own purposes but would be happy to put up a PR if you like the idea

jakelever commented 1 year ago

Sure. Go for it. I don't really use the table information that much, but this does remind me that I need to make sure that CIViCmine is properly aware of the tables. Remind me, there's some metadata on these passages that indicate that they are tables, right?

creisle commented 1 year ago

yup, the xml_path infon can be used for that
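
A rough sketch of how that check could look, assuming a passage's infons are available as a plain dict and that the `xml_path` value is a `/`-separated list of XML element names. Both the path format and the tag names in `TABLE_TAGS` (taken from the JATS tag set) are assumptions, and `is_table_passage` is a hypothetical helper, not part of biotext.

```python
# Assumed JATS element names that mark table content.
TABLE_TAGS = {"table-wrap", "table"}


def is_table_passage(infons: dict) -> bool:
    """Guess whether a passage came from a table, based on its xml_path infon."""
    xml_path = infons.get("xml_path", "")
    # Assumption: the path looks like "body/sec/table-wrap/table/thead/tr",
    # possibly with positional suffixes such as "table-wrap[1]".
    return any(part.split("[")[0] in TABLE_TAGS for part in xml_path.split("/"))


print(is_table_passage({"xml_path": "body/sec/table-wrap/table/thead/tr"}))
print(is_table_passage({"xml_path": "body/sec/p"}))
```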

jakelever commented 1 year ago

Thanks