neuroquery / pubget

Collecting papers from PubMed Central and extracting text, metadata and stereotactic coordinates.
https://neuroquery.github.io/pubget/
MIT License

Extracting table & analysis-level metadata #28

Open adelavega opened 1 year ago

adelavega commented 1 year ago

Hi @jeromedockes

I know that pubget extracts the table_id and table_label alongside the coordinates.

Does pubget have the ability to extract more metadata, such as p-value, region, or contrast name? If not, is there interest in expanding it to do so?

At the level of the table, we could also try to extract the table caption.

I'm asking because it seems ACE can do this, and it may be helpful metadata for neurosynth-compose users.

jeromedockes commented 1 year ago

Hi!

For the table caption: good idea. We actually already extract it, so it is just a question of writing it in a more visible place and documenting it. ATM it is buried in the 'articles' directory, e.g. in query_a64755ef68b219b22aec44cd9fecdb07/articles/d6e/pmcid_9812244/tables/table_000_info.json
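Until the caption is written somewhere more visible, it can be collected by walking those per-table info files. The following is a minimal sketch, not part of pubget itself: the function name `collect_table_captions` is hypothetical, and the key names `table_id`, `table_label` and `table_caption` are assumptions about the contents of `table_*_info.json` (missing keys come back as `None`).

```python
import json
from pathlib import Path


def collect_table_captions(query_dir):
    """Collect table info from every table_*_info.json under <query_dir>/articles.

    Hypothetical helper: the JSON key names used below are assumptions.
    """
    rows = []
    # Directory layout taken from the example path above:
    # articles/<bucket>/pmcid_<id>/tables/table_<nnn>_info.json
    pattern = "articles/*/pmcid_*/tables/table_*_info.json"
    for info_path in sorted(Path(query_dir).glob(pattern)):
        info = json.loads(info_path.read_text(encoding="utf-8"))
        # The article directory is named "pmcid_<id>"; keep only the id part.
        pmcid = info_path.parent.parent.name.split("_", 1)[1]
        rows.append(
            {
                "pmcid": pmcid,
                "table_id": info.get("table_id"),
                "table_label": info.get("table_label"),
                "table_caption": info.get("table_caption"),
                "info_file": str(info_path),
            }
        )
    return rows
```

Calling `collect_table_captions("query_a64755ef68b219b22aec44cd9fecdb07")` would return one dictionary per table, which can then be written to a CSV or joined with the coordinates on `pmcid` and `table_id`.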

For p-value, region, and contrast name: ATM pubget does not have the ability to extract them. I think it would probably be useful, but like extracting coordinates it would require a good amount of trial and error, and then some work to estimate how accurate the extraction is. I would be inclined to wait and see whether, and how many, users ask for it once the tools have been advertised a bit more. However, I think it is well within the scope of the project, so if someone wants to tackle it (including the validation) that would be a welcome contribution.

Do you know if the accuracy of ACE's extraction of p-value, region, and contrast name has been evaluated? Also, AFAIK neurosynth doesn't use them in any way, is that correct?

adelavega commented 1 year ago

That makes sense.

Indeed, ACE has a lot of trial and error in the heuristics it uses to get those fields (although p-value is actually rather easy; it's contrast name and region that are a bit harder).
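To illustrate why the p-value is the easy field, here is a rough sketch of a regex-based matcher. It is not ACE's actual heuristic and not something pubget provides: the pattern, `P_VALUE_RE` and `find_p_values` are all hypothetical, and a real extractor would need validation against curated data.

```python
import re

# Hypothetical pattern: "p" followed by a comparison operator and a number,
# e.g. "p < 0.001", "P=.05", "p <= 1e-5".
P_VALUE_RE = re.compile(
    r"\bp\s*(?P<op><=|>=|<|>|=)\s*(?P<value>(?:\d*\.\d+|\d+)(?:e[-+]?\d+)?)",
    re.IGNORECASE,
)


def find_p_values(text):
    """Return (operator, value) pairs for each p-value-like match in text."""
    return [
        (match.group("op"), float(match.group("value")))
        for match in P_VALUE_RE.finditer(text)
    ]
```

For example, `find_p_values("significant at p < 0.001 and P=.05")` yields two pairs, one per threshold. Region and contrast name have no comparably regular surface form, which is why they need more heuristics.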

I actually have an undergrad working on doing some QA on newly extracted ACE data, so I can have him take a look at that.

Neurosynth doesn't use this data, nor is it in the official neurosynth data output. I'm working on changing that, so that at least neurostore can ingest it.

jeromedockes commented 1 year ago

That's great. The QA work, combined with manually curated coordinates from neurostore, will produce some good validation data for future improvements to table processing, both in ACE and in pubget.

jeromedockes commented 10 months ago

ATM it is buried in the 'articles' directory, eg in query_a64755ef68b219b22aec44cd9fecdb07/articles/d6e/pmcid_9812244/tables/table_000_info.json

Opened #38 to deal with this.