neuroquery / pubget

Collecting papers from PubMed Central and extracting text, metadata and stereotactic coordinates.
https://neuroquery.github.io/pubget/
MIT License

Re-try downloading ArticleSet XML files with syntax errors (during parsing) #47

Open adelavega opened 1 month ago

adelavega commented 1 month ago

Several articleset XML files have contained errors that make them unparsable. I'm not sure why; my guess is that PMC sometimes returns files with errors, or it's an HTTP transmission error.

Nonetheless, it would be useful to allow partial re-downloading of specific XML files, or to have a parsing method that is robust to such errors (i.e. skips the affected article, not the entire file).
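
To illustrate the "skip the article, not the entire file" idea, here is a minimal sketch using only the standard library. It is not how pubget parses articlesets; `parse_articles_leniently` is a hypothetical helper, and it assumes each `<article>` element is self-contained (declares its own namespaces) and that the corruption does not hit the `<article>` / `</article>` tags themselves. If a closing `</article>` is garbled, the neighbouring article is lost along with it.

```python
import re
import xml.etree.ElementTree as ET

# Matches one <article ...>...</article> block. The opening tag must be
# followed by whitespace or ">", so <article-meta> is not mistaken for it.
ARTICLE_PATTERN = re.compile(r"<article(?:\s[^>]*)?>.*?</article>", re.DOTALL)


def parse_articles_leniently(articleset_text):
    """Parse each article separately; return (parsed_articles, n_skipped).

    Hypothetical helper: articles that are not well-formed are counted
    and skipped instead of making the whole file fail.
    """
    parsed = []
    n_skipped = 0
    for match in ARTICLE_PATTERN.finditer(articleset_text):
        try:
            parsed.append(ET.fromstring(match.group(0)))
        except ET.ParseError:
            n_skipped += 1
    return parsed, n_skipped
```

The skipped count could then be logged so that the affected file can be re-downloaded later.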

jeromedockes commented 1 month ago

PMC definitely returns bad responses quite frequently, but normally pubget does a quick sanity check on the response and retries the download if it seems to have failed. I'm guessing the sanity check is too superficial. Could you share an example file and maybe the command you used (although it will be hard to reproduce)? Also, which version of pubget are you using?

adelavega commented 1 month ago

Here are some example XML errors. I went through and fixed them manually for a few files, which wasn't the end of the world:

    <td align="left" rowspan="1" colspan4"1">0.8461</td>
    <publisher-loc>New York</publisher-lo`>
    <label>51</habel>

Basically just randomly incorrect characters.

I believe I was using pubget 0.0.9.dev, but I may have upgraded in order to debug.

adelavega commented 1 month ago

Unfortunately, you'd have to parse the whole file to see if it's faulty. It might be useful to tie the re-downloading in with the parsing step.
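
A rough sketch of what tying the two together could look like: fully parse each response and retry when it is not well-formed. This is an assumption about one possible design, not pubget's actual code; `is_well_formed`, `download_with_validation`, and the `fetch` callable are hypothetical names.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_bytes):
    """Return True if the whole document parses without a syntax error."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError:
        return False
    return True


def download_with_validation(fetch, max_attempts=3):
    """Call fetch() until it returns well-formed XML.

    `fetch` is a hypothetical zero-argument callable that returns the raw
    response bytes. Returns the content, or None if every attempt fails.
    """
    for _ in range(max_attempts):
        content = fetch()
        if is_well_formed(content):
            return content
    return None
```

The cost is a full parse of every downloaded file, which is slower than a superficial check but would catch single-character corruption like the examples above.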

adelavega commented 1 month ago

Hmm, this is weird, actually.

This is the validator, correct? https://github.com/neuroquery/pubget/blob/b1a293c515465bf1aabbcce38ea9e0e58e11a34a/src/pubget/_entrez.py#L37

That line on the faulty xml file (on disk) fails, so I'm surprised it was not re-downloaded.

I'm trying again, hoping that perhaps I was just using an old version of pubget.