Open creisle opened 1 year ago
Sure. Go for it. I don't really use the table information that much, but this does remind me that I need to make sure that CIViCmine is properly aware of the tables. Remind me, there's some metadata on these passages that indicate that they are tables, right?
yup, the xml_path infon can be used for that
Thanks
I've been using the linearized tables, but one thing I've noticed is that when we have something complex like a multi-level header, just linearizing means the number of cells doesn't always match up. So something like this
example article used: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2873663/
Currently gets turned into
And we lose a lot of meaning, not to mention it becomes impossible to match these up properly to the cell text from the body of the table. (see below)
So I'd like to try something more complex where we simplify the header into a single row before we linearize, but it would require making the text differ slightly from the original by repeating some words, which I am not sure about. The end result would look like this
@jakelever what do you think? I've already been implementing this for my own purposes but would be happy to put up a PR if you like the idea
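For illustration, the header simplification described above could look roughly like the sketch below. It is not the actual implementation: the `(text, colspan, rowspan)` cell representation, the function names `expand_header` and `flatten_header`, and the example header are all hypothetical, and the example is not taken from the linked article. The idea is to expand spanning cells into a full grid, then join each column top to bottom so every column ends up with exactly one header label.

```python
from typing import Dict, List, Tuple

# Hypothetical cell representation: (text, colspan, rowspan)
Cell = Tuple[str, int, int]


def expand_header(rows: List[List[Cell]]) -> List[List[str]]:
    """Expand spanning header cells into a rectangular grid of strings."""
    n_rows = len(rows)
    grid: List[Dict[int, str]] = [dict() for _ in range(n_rows)]
    for r, row in enumerate(rows):
        col = 0
        for text, colspan, rowspan in row:
            # Skip columns already filled by a rowspan from an earlier row
            while col in grid[r]:
                col += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    if r + dr < n_rows:
                        grid[r + dr][col + dc] = text
            col += colspan
    width = max((max(g) + 1 for g in grid if g), default=0)
    return [[g.get(c, "") for c in range(width)] for g in grid]


def flatten_header(rows: List[List[Cell]]) -> List[str]:
    """Collapse a multi-level header into one label per column."""
    grid = expand_header(rows)
    if not grid:
        return []
    flat = []
    for c in range(len(grid[0])):
        parts: List[str] = []
        for r in range(len(grid)):
            text = grid[r][c].strip()
            # Avoid repeating a label that spans several rows
            if text and (not parts or parts[-1] != text):
                parts.append(text)
        flat.append(" ".join(parts))
    return flat


# Made-up two-level header: "Mutation" spans two sub-columns
header = [
    [("Gene", 1, 2), ("Mutation", 2, 1), ("Response", 1, 2)],
    [("Exon", 1, 1), ("Type", 1, 1)],
]
print(flatten_header(header))
```

The trade-off is the one raised above: the parent label ("Mutation" in this made-up header) is repeated for each sub-column, so the linearized text no longer matches the original exactly, but the number of header cells lines up with the number of cells in each body row.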