neuroquery / pubget

Collecting papers from PubMed Central and extracting text, metadata and stereotactic coordinates.
https://neuroquery.github.io/pubget/
MIT License

add a `table_info.csv` file in `extractedData` #38

Closed jeromedockes closed 2 months ago

jeromedockes commented 10 months ago

at the moment pubget outputs two files for each table: a csv file with the content of the table, and a json file with keys such as the table id, table label, table caption, and the path of the csv file

those are a bit hard to discover & use because they are stored in each article's individual directory, e.g. in `query_0dbec5035dc626ca916d0317b7b9a76f/articles/0b8/pmcid_7664275/tables/`

it would be helpful to create a csv in the `extractedData` directory grouping the info for all tables in all articles. columns could be `pmcid`, `table_number`, `table_id`, `table_label`, `table_caption`, `n_header_rows`, `table_data_file`

`table_data_file` would be the path to the table's csv, probably relative to the root of the query directory. the other keys would be those currently provided in the JSON
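As a rough illustration of the proposal, the sketch below walks a query directory, reads each table's JSON file, and writes one row per table. It is not pubget's implementation. The function name `collect_table_info`, the `*_info.json` file naming, the JSON key names, and the choice to number tables in sorted file order are all assumptions made for the example.

```python
import csv
import json
from collections import defaultdict
from pathlib import Path

# Columns proposed in this issue.
COLUMNS = [
    "pmcid",
    "table_number",
    "table_id",
    "table_label",
    "table_caption",
    "n_header_rows",
    "table_data_file",
]


def collect_table_info(query_dir, output_csv):
    """Group per-table JSON info into a single CSV (hypothetical helper)."""
    query_dir = Path(query_dir)
    tables_seen = defaultdict(int)
    rows = []
    # Assumed layout: articles/<bucket>/pmcid_<id>/tables/<name>_info.json
    pattern = "articles/*/pmcid_*/tables/*_info.json"
    for info_path in sorted(query_dir.glob(pattern)):
        info = json.loads(info_path.read_text(encoding="utf-8"))
        pmcid = info_path.parents[1].name.split("_", 1)[1]
        tables_seen[pmcid] += 1
        # Assumed: the JSON stores the csv path relative to the tables directory.
        data_file = info_path.parent / info["table_data_file"]
        row = {key: info.get(key, "") for key in COLUMNS}
        row["pmcid"] = pmcid
        # Assumed: tables are numbered from 1 in sorted file order.
        row["table_number"] = tables_seen[pmcid]
        # Store the path relative to the root of the query directory.
        row["table_data_file"] = data_file.relative_to(query_dir).as_posix()
        rows.append(row)

    output_csv = Path(output_csv)
    output_csv.parent.mkdir(parents=True, exist_ok=True)
    with open(output_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Keeping `table_data_file` relative to the query directory means the resulting csv stays valid if the whole directory is moved or shared.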

jeromedockes commented 10 months ago

also maybe extract the content of the table-wrap-foot elements? those are footnotes so not sure if they contain useful information
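For reference, a minimal sketch of what pulling the footnote text could look like, using only the standard library. The helper name `extract_table_footnotes` is hypothetical, and the sketch assumes the `table-wrap` element is available as a standalone XML string without namespaces; it does not reflect how pubget parses articles.

```python
import xml.etree.ElementTree as ET


def extract_table_footnotes(table_wrap_xml):
    """Return the text of each table-wrap-foot element (hypothetical helper)."""
    root = ET.fromstring(table_wrap_xml)
    footnotes = []
    for foot in root.iter("table-wrap-foot"):
        # Join all nested text, then collapse runs of whitespace.
        text = " ".join("".join(foot.itertext()).split())
        if text:
            footnotes.append(text)
    return footnotes
```

Whether the result is useful probably depends on the article: footnotes sometimes carry abbreviations or coordinate-space details, and sometimes only significance markers.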

adelavega commented 10 months ago

This would be great!

jeromedockes commented 10 months ago

> also maybe extract the content of the table-wrap-foot elements? those are footnotes so not sure if they contain useful information

this part is done in #40