Open adelavega opened 1 month ago
PMC definitely returns bad responses quite frequently, but normally pubget does a quick sanity check on the response and retries the download if it seems to have failed. I'm guessing the sanity check is too superficial. Could you share an example file and maybe the command you used (although it will be hard to reproduce)? Also, which version of pubget are you using?
Here are some example XML errors. I went through and fixed them manually for a few files, which wasn't the end of the world:
<td align="left" rowspan="1" colspan4"1">0.8461</td>
<publisher-loc>New York</publisher-lo`>
<label>51</habel>
Basically just randomly incorrect characters.
I believe I was using pubget 0.0.9.dev, but I may have upgraded to debug.
Unfortunately, you'd have to parse the whole file to see if it's faulty. It might be useful to tie the re-downloading in with the parsing step.
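To make the idea concrete, here is a minimal sketch of a retry loop that only accepts a response once the whole payload parses. This is not pubget's actual code: `download_validated` and the `fetch` callable are hypothetical names standing in for whatever performs the request, and only the standard library is used.

```python
import xml.etree.ElementTree as ET
from typing import Callable


def download_validated(fetch: Callable[[], bytes], max_attempts: int = 3) -> bytes:
    """Call fetch() until the payload parses as well-formed XML.

    `fetch` is a hypothetical stand-in for the real download call.
    """
    last_error = None
    for _ in range(max_attempts):
        payload = fetch()
        try:
            # Full parse: catches corruption anywhere in the file,
            # not just in the first few bytes.
            ET.fromstring(payload)
        except ET.ParseError as error:
            last_error = error
            continue
        return payload
    raise RuntimeError(
        f"no well-formed XML after {max_attempts} attempts"
    ) from last_error
```

The cost is one extra full parse per batch, which is probably small compared to the download itself.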
Hmm, this is weird, actually.
This is the validator, correct? https://github.com/neuroquery/pubget/blob/b1a293c515465bf1aabbcce38ea9e0e58e11a34a/src/pubget/_entrez.py#L37
Running that line on the faulty XML file (on disk) fails, so I'm surprised it was not re-downloaded.
I'm trying again, hoping that perhaps I was using an old version of pubget.
Several articleset XML files have contained unparsable errors. Not sure why; my guess is that PMC sometimes returns files with errors, or it's an HTTP transmission error?
Nonetheless, it would be useful to allow partial re-downloading of specific XML files, or to have a parsing method robust to such errors (i.e. skip the article, not the entire file).
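As a rough illustration of the "skip the article, not the file" option, the sketch below splits the articleset on `<article>` boundaries as raw bytes and parses each chunk separately. The function name `parse_articles_leniently` is hypothetical, and the approach assumes that `<article>` elements are not nested and that each one carries its own namespace declarations; it is not how pubget currently parses files.

```python
import re
import xml.etree.ElementTree as ET

# Matches <article> or <article attr=...> but not <article-meta>.
ARTICLE_PATTERN = re.compile(rb"<article[\s>].*?</article>", re.DOTALL)


def parse_articles_leniently(articleset: bytes):
    """Return (parsed_articles, n_skipped) for one articleset file."""
    good, n_skipped = [], 0
    for match in ARTICLE_PATTERN.finditer(articleset):
        try:
            good.append(ET.fromstring(match.group(0)))
        except ET.ParseError:
            # Only this article is dropped; the rest of the file survives.
            n_skipped += 1
    return good, n_skipped
```

If the corruption happens to hit an `<article>` or `</article>` tag itself, the damaged chunk merges with its neighbor and both are lost, so the skipped count is a lower bound on the number of affected articles.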