neuroquery / pubget

Collecting papers from PubMed Central and extracting text, metadata and stereotactic coordinates.
https://neuroquery.github.io/pubget/
MIT License

Searching with spaces #29

Closed by adelavega 1 year ago

adelavega commented 1 year ago

This might be a silly question, but I searched for

"functional MRI"[Abstract] in PMC and got >6000 hits

However, in pubget I tried to make the same query using a file with the contents: (functional MRI[Abstract])

and I only got ~3k hits.

Any idea why the discrepancy or what I'm doing wrong?

BTW, goal here is to add studies that don't mention "fMRI" but do mention "functional MRI". This seems to account for a fair amount of "missing" studies.

jeromedockes commented 1 year ago

I think that might be the same question as #27. By default the PMC web interface shows all matches, but pubget only downloads those that are in the open access subset, which is usually around half. To see only those on the website, click on "open access" in the list of "article attributes" on the left.

Interestingly, when I search PMC with that query I only get 5921 results, but that's close.
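
The gap between the two counts can be checked without the website. Below is a minimal sketch, assuming the NCBI E-utilities `esearch` endpoint and assuming that appending `open access[filter]` to the search term mirrors the "open access" article attribute. The helper names `build_count_url`, `parse_count` and `fetch_count` are made up for this illustration and are not part of pubget.

```python
import json
import urllib.parse
import urllib.request

# NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_count_url(term, open_access_only=False):
    """Build an esearch URL that asks only for the number of PMC matches."""
    if open_access_only:
        # Assumption: this filter term corresponds to the website's
        # "open access" article attribute.
        term = f"({term}) AND open access[filter]"
    params = {"db": "pmc", "term": term, "rettype": "count", "retmode": "json"}
    return ESEARCH_URL + "?" + urllib.parse.urlencode(params)


def parse_count(response_text):
    """Extract the hit count from an esearch JSON response."""
    return int(json.loads(response_text)["esearchresult"]["count"])


def fetch_count(term, open_access_only=False):
    """Query PMC (network access required) and return the number of hits."""
    url = build_count_url(term, open_access_only)
    with urllib.request.urlopen(url) as response:
        return parse_count(response.read().decode("utf-8"))


# Example (needs network access, so it is left commented out):
# query = '"functional MRI"[Abstract]'
# print("all PMC matches: ", fetch_count(query))
# print("open access only:", fetch_count(query, open_access_only=True))
```

If the assumption about the filter holds, the second number should be close to what pubget downloads for the same query.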

jeromedockes commented 1 year ago

BTW, goal here is to add studies that don't mention "fMRI" but do mention "functional MRI". This seems to account for a fair amount of "missing" studies.

Interesting! I'll add it to this example query and run it again to see how many papers we get.

adelavega commented 1 year ago

Ah yes, that explains it. I meant ~6000, so the same number you got actually. Closing.

For studies that are indexed in PMC but are not open access, does that mean NIH has the manuscript but it's not freely accessible?

This is actually good to know because my current strategy using ACE has been to skip studies that have a PubMed Central ID, but now I'm worried that I'm skipping a lot of studies that have an ID assigned but that I can't get through PubMed Central.

adelavega commented 1 year ago

Regarding "functional MRI" as a query, I compared it to the example query, and I was able to find an additional 124 studies with coordinates, so a little over 10% more
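
For anyone who wants to try the same comparison, here is a minimal sketch of how both spellings could be combined into a single query string. The example query itself is not reproduced in this thread, so the helper below (`any_of_in_field`, a made-up name) only illustrates the OR-combination and is not the actual query that was run.

```python
def any_of_in_field(phrases, field="Abstract"):
    """Build a PMC query matching any of the phrases in the given field.

    Each phrase is quoted so that multi-word terms such as
    "functional MRI" are searched as exact phrases.
    """
    tagged = [f'"{phrase}"[{field}]' for phrase in phrases]
    return "(" + " OR ".join(tagged) + ")"


query = any_of_in_field(["fMRI", "functional MRI"])
print(query)
```

The resulting string can be saved to a text file and used as the query file mentioned at the top of this thread.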

jeromedockes commented 1 year ago

For studies that are indexed in PMC but are not open access, does that mean NIH has the manuscript but it's not freely accessible?

This is actually good to know because my current strategy using ACE has been to skip studies that have a PubMed Central ID, but now I'm worried that I'm skipping a lot of studies that have an ID assigned but that I can't get through PubMed Central.

Yes, unfortunately. For quite a large number of articles (I would say close to half), NIH has the manuscript and you can read it online on the PMC website, but it cannot be used for text mining. If you try to download one of those articles from the API, you get an error message like "this article's publisher forbids distributing it in XML format".