pubget._entrez Response failed to validate

neuroquery / pubget

Collecting papers from PubMed Central and extracting text, metadata and stereotactic coordinates.

https://neuroquery.github.io/pubget/

MIT License

20 stars, 12 forks

pubget._entrez Response failed to validate #21

Closed adelavega closed 1 year ago

adelavega commented 1 year ago

I'm running a query using pubget, but for about 4 of the 778 Entrez batches, I get the following error:

ERROR 2023-03-06T12:48:44-0600 pubget._entrez Response failed to validate (reason: Status code 400 != 200) for url https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi

Two questions: 1) Is there any way to diagnose the error? 2) If not, I'm totally fine with ignoring this subset, but I'm unable to, because when I run extract_articles, I get the following warning:

WARNING 2023-03-06T10:25:31-0600        pubget._articles        Previous processing step 'download' was not completed: not all the articles matching the query will be processed.

and finally the following error:

lxml.etree.XMLSyntaxError: Premature end of data in tag pmc-articleset line 3, line 258852, column 1

Any way to robustly continue in this situation, and process the rest of the articles?

Would deleting the offending xml file suffice?

adelavega commented 1 year ago

FWIW, I tried running this again, and I got another API error, but for whatever reason this time it did not cause the next command (extract_articles) any issues...

jeromedockes commented 1 year ago

hi, thanks for reporting!

1. For diagnosing the download error, unfortunately ATM all we have is the log, which doesn't tell us much. I should definitely change it so it dumps the error responses in a separate folder to help figure out what went wrong. The 400 status code could be a hint, but I've noticed the status codes sent by the eutils are pretty random. Still, could you share the full log if you have it? For example, I'd like to check whether it retried the batch in question and got a 400 response every time.
2. The warning is due to some batches missing (those that failed to download), but I don't think that's what is causing the error. Most likely the content of some other batch is truncated. I have also seen similar error messages due to a broken build of lxml; did you by any chance install lxml with conda? In any case, if you still have the log and the offending XML file, that would be a big help.

jeromedockes commented 1 year ago

Deleting the problematic XML file and running pubget run again should work, as it will re-attempt to download the missing file and then proceed with the next steps.
Re: running it again, it is possible that you saw an API error but the API call was then retried and succeeded. The full log should help figure it out if you still have it.
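
To find which file to delete, one option is to try parsing every downloaded XML file and list the ones that fail. This is only a sketch: `find_unparsable_xml` is a hypothetical helper, not part of pubget, it assumes the batches are stored as `.xml` files somewhere under the directory you pass in, and it uses the standard library parser rather than lxml.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def find_unparsable_xml(directory):
    """Return the .xml files under `directory` that are not well-formed XML.

    Hypothetical helper for illustration; the directory layout is an assumption.
    """
    broken = []
    for path in sorted(Path(directory).rglob("*.xml")):
        try:
            with path.open("rb") as stream:
                # iterparse streams the file, so a large batch is never
                # held in memory all at once.
                for _event, element in ET.iterparse(stream):
                    element.clear()  # drop content we have already checked
        except ET.ParseError:
            # Raised for truncated files, e.g. "no element found" at EOF.
            broken.append(path)
    return broken
```

Calling `find_unparsable_xml("path/to/download/dir")` returns a list of paths; a truncated batch like the one in the traceback above should show up in it.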

Thanks a lot for reporting these things. Basically I think 2 things happened:
1. Some batches failed to download. That will cause a warning but it's OK, and re-running pubget will download just these missing batches.
2. One batch seemed like it was downloaded correctly, with a 200 response and containing an <articleset> tag, but the content was truncated. ATM the code that checks the batches just looks for that tag but doesn't parse the whole file because it takes some time. If that is indeed what happened, I think we should parse the whole response and check that it's well-formed XML and contains the expected number of articles.
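
The difference between the two kinds of check can be sketched as follows. This is not pubget's actual code: both functions are hypothetical, the root tag name `pmc-articleset` is taken from the lxml error message above, and the assumption that each article is a direct `article` child of the root is mine.

```python
import xml.etree.ElementTree as ET


def looks_complete(xml_bytes):
    """Cheap check: only look for the opening tag near the start."""
    return b"<pmc-articleset" in xml_bytes[:1000]


def is_complete(xml_bytes, expected_articles):
    """Full check: parse the whole response and count the articles."""
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError:
        return False  # truncated or otherwise malformed
    # Assumption: articles are direct <article> children of the root.
    return len(root.findall("article")) == expected_articles


full = b"<pmc-articleset><article/><article/></pmc-articleset>"
truncated = full[:30]  # cut off in the middle of the second article

print(looks_complete(truncated))   # the cheap check is fooled
print(is_complete(truncated, 2))   # the full parse is not
print(is_complete(full, 2))
```

The full check costs a complete parse per batch, which is the trade-off mentioned above, but it catches a response that starts correctly and then stops early.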

jeromedockes commented 1 year ago

if you could also LMK the pubget version and the query I'll try to see if I can reproduce the problem -- it probably happens sporadically but it doesn't hurt to try

adelavega commented 1 year ago

I believe I have the latest version (0.0.7), potentially even installed from GitHub. I'm running this example query:

https://github.com/neuroquery/pubget/blob/main/docs/example_queries/journal_list_fmri_vbm.txt

I think I deleted the log, sorry :( I deleted the folder and tried the whole process again (which, again, succeeded).

jeromedockes commented 1 year ago

No problem, thanks!! If the next one succeeded, that tells us lxml is working properly, so a truncated batch is probably what happened. I see it complains about EOF at line 258K; I've looked at a few successfully downloaded batches and they're all around 800K lines. I think I'll just start parsing and validating the full content, and maybe add more retries with an increasing delay.

And in the second attempt it got all the batches, i.e. you don't have the "previous step is not complete" warning? If you do have the warning, you can just re-run the same command (without deleting anything) and it will download the missing batches, then re-execute the subsequent steps. If you have the log for the successful download I'd be interested in that too, but if not, no worries.
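
"Retries with an increasing delay" could look roughly like the sketch below. It is an illustration under assumptions, not what pubget does: `fetch_with_retries`, its parameters, and the doubling schedule are all hypothetical, and `fetch` and `validate` stand in for whatever performs the request and checks the response.

```python
import time


def fetch_with_retries(fetch, validate, max_attempts=5, base_delay=1.0,
                       sleep=time.sleep):
    """Call `fetch` until `validate` accepts the result.

    The wait doubles after each failed attempt (1s, 2s, 4s, ... by default).
    Returns the first valid response, or None if every attempt failed.
    """
    for attempt in range(max_attempts):
        try:
            response = fetch()
            if validate(response):
                return response
        except OSError:
            pass  # network-level failure: handle like an invalid response
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return None
```

Passing `sleep` as a parameter keeps the function easy to test without actually waiting; in real use the default `time.sleep` applies.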

jeromedockes commented 1 year ago

BTW you can use pubget run to run all the steps without having to manually run pubget download, then pubget extract etc

jeromedockes commented 1 year ago

I think this should be addressed by #24 and #23. But if not, feel free to reopen, or if we have time we can also talk briefly about pubget, ACE & neurostuff on Thursday!