neuroquery / pubget

Collecting papers from PubMed Central and extracting text, metadata and stereotactic coordinates.
https://neuroquery.github.io/pubget/
MIT License

pubget._entrez Response failed to validate #21

Closed adelavega closed 1 year ago

adelavega commented 1 year ago

I'm running a query using pubget, but for about 4 of the 778 Entrez batches I get the following error:

ERROR 2023-03-06T12:48:44-0600 pubget._entrez Response failed to validate (reason: Status code 400 != 200) for url https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi

Two questions: 1) Is there any way to diagnose the error? 2) If not, I'm totally fine with ignoring this subset, but I'm unable to, because when I run extract_articles I get the following warning:

WARNING 2023-03-06T10:25:31-0600 pubget._articles Previous processing step 'download' was not completed: not all the articles matching the query will be processed.

and finally the following error:

lxml.etree.XMLSyntaxError: Premature end of data in tag pmc-articleset line 3, line 258852, column 1

Any way to robustly continue in this situation, and process the rest of the articles?

Would deleting the offending xml file suffice?

adelavega commented 1 year ago

FWIW, I tried running this again, and I got another API error, but for whatever reason this time it did not cause the next command (extract_articles) any issues...

jeromedockes commented 1 year ago

hi, thanks for reporting!

jeromedockes commented 1 year ago

Thanks a lot for reporting these things. Basically, I think two things happened:

1) Some batches failed to download. That causes a warning, but it's OK: re-running pubget will add these batches (and only these).
2) One batch looked like it was downloaded correctly, with a 200 response and containing an <articleset> tag, but the content was truncated.

ATM the code that checks the batches just looks for that tag and doesn't parse the whole file, because parsing takes some time. If that is indeed what happened, I think we should parse the whole response and check that it's well-formed XML and contains the expected number of articles.
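To make the proposed check concrete, here is a minimal sketch of validating a batch by parsing it fully and counting its articles. It is not pubget's actual code: the function name `check_batch`, the use of the standard-library `xml.etree.ElementTree` instead of lxml, and the assumption that each article is an `article` element directly under the root are all illustrative.

```python
import xml.etree.ElementTree as ET


def check_batch(xml_bytes, expected_count):
    """Return (ok, reason) for one downloaded batch (hypothetical helper)."""
    try:
        # Parsing the whole document fails on truncated content,
        # unlike a check that only looks for the opening tag.
        root = ET.fromstring(xml_bytes)
    except ET.ParseError as error:
        return False, f"not well-formed XML: {error}"
    # Assumption: articles are direct <article> children of the root element.
    n_articles = len(root.findall("article"))
    if n_articles != expected_count:
        return False, f"expected {expected_count} articles, found {n_articles}"
    return True, "ok"


complete = b"<pmc-articleset><article/><article/></pmc-articleset>"
truncated = complete[:30]  # simulate a response cut off mid-download

print(check_batch(complete, 2))
print(check_batch(truncated, 2))
```

The truncated input still starts with the expected opening tag, so a tag-only check would accept it, while the full parse rejects it. The cost is the extra parsing time mentioned above.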

jeromedockes commented 1 year ago

if you could also LMK the pubget version and the query I'll try to see if I can reproduce the problem -- it probably happens sporadically but it doesn't hurt to try

adelavega commented 1 year ago

I believe I have the latest version, potentially even installed from github (0.0.7). running this example query:

https://github.com/neuroquery/pubget/blob/main/docs/example_queries/journal_list_fmri_vbm.txt

I think I deleted the log, sorry :( I deleted the folder and tried the whole process again (which, again, succeeded).

jeromedockes commented 1 year ago

No problem, thanks!! If the next one succeeded, that tells us lxml is working properly, so a truncated batch is probably what happened. I see it complains about EOF at line 258K; I've looked at a few successfully downloaded batches and they're all around 800K lines. I think I'll just start parsing and validating the full content, and maybe add more retries with an increasing delay.

And in the second attempt it got all the batches, i.e. you don't have the "previous step is not complete" warning? If you do have the warning, you can just re-run the same command (without deleting anything) and it will download the missing batches, then re-execute the subsequent steps. If you have the log for the successful download I'd be interested in that too, but if not, no worries.
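For illustration, "retries with an increasing delay" could look roughly like the sketch below. This is a generic exponential backoff helper, not what pubget implements: `retry_with_backoff`, its parameters, and the `flaky_download` stand-in are hypothetical names chosen for the example.

```python
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds, doubling the wait after each failure.

    Hypothetical helper: re-raises the last error once max_attempts is reached.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)


# Stand-in for a batch download that fails twice, then succeeds.
calls = {"n": 0}


def flaky_download():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("Status code 400 != 200")
    return "batch content"


delays = []  # record the delays instead of actually sleeping
result = retry_with_backoff(flaky_download, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the demo instant and makes the delay schedule easy to inspect. In real use the default `time.sleep` would apply.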

jeromedockes commented 1 year ago

BTW you can use pubget run to run all the steps without having to manually run pubget download, then pubget extract etc

jeromedockes commented 1 year ago

I think this should be addressed by #24 and #23. If not, feel free to reopen, or if we have time we can also talk briefly about pubget, ACE & neurostuff on Thursday!