Open khughitt opened 4 years ago
Currently, execute.sh will re-download all files each time it is run, regardless of whether the files have already been successfully downloaded and processed. Since the files involved are quite large, there is a decent chance that the script will need to be run more than once due to interrupted downloads.
It would be great if the script could check each file to see if it has been fully downloaded, and only download those which are missing/incomplete.
One possible approach might be to generate md5sums for each output file and check against them, at least for the files that are only updated at periodic intervals.
Thanks for reaching out. This is a great idea. I agree that there should be some countermeasures in place to handle interrupted downloads. My plate is a bit full at the moment, but will definitely implement your suggestion when I get the chance.
Also, if it's possible to post the md5sums for the downloaded/generated files in the meantime, that would be quite helpful :)
Sounds good. I'll post them as soon as the file has finished generating. One warning, though: enough time has elapsed that the md5sums I'm generating might not match the ones you have on your local machine. Let me know if that's the case.
Last thought -- I'm not sure how consistent the batch query results are from day-to-day, but if they are stable (or can be queried in such a way as to make them so), it could also be useful to reuse the data/temp/batch_xx.xml files instead of re-querying them each time.
Thanks for the suggestion. Will need to do more testing before I can confirm batch stability, but I'll keep this idea in mind.
As promised here are the md5sums: temp_batch_md5sum.txt
Sounds good! Thanks for the quick response and md5sums -- it's much appreciated. Happy to help if you need any help testing in the future.
@danich1 A bit late, but I finally got around to checking the md5sums. I think there may be an issue with the version of the data used to generate the checksums you posted -- there are a large number of redundant md5sums present. When I compare the md5sums for the batches I have previously been able to download, they are all unique, so perhaps there was an issue with an earlier version of the script used to generate the data the checksums are based on?
> When I compare the md5sums for the batches I have been able to download previously, they are all unique, so perhaps there was an issue with an earlier version of the script used to generate the data you based the checksums off of?
Interesting. I'd reckon the redundant md5sums correspond to the empty XML files returned by the server, since I'm checking nearly all PMCIDs against PubTator Central's API. The last time I ran it there were a lot of "empty files".
Hmm. I wonder if the API might have been down or inaccessible the last time the pipeline was run? So far I have 10,530 results, all of which are non-empty.
From the code in download_full_text.py, it appears that you are checking the query response code and raising an exception for non-successes, so I would imagine that would have been detected.
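A status-code check alone wouldn't catch this case, though: the server can return a 200 with a well-formed but empty XML wrapper. A validation step could also require at least one document node (a sketch only -- the `document` element name is a guess at what the API's XML looks like, not confirmed from the repo):

```python
import xml.etree.ElementTree as ET

def response_has_documents(status_code: int, body: bytes) -> bool:
    """Treat a response as usable only if it succeeded AND its XML body
    actually contains document nodes; a 200 with an empty wrapper fails."""
    if status_code != 200:
        return False
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False
    # "document" is an assumed element name; adjust to the real schema.
    return root.find(".//document") is not None
```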
> Hmm. I wonder if the API might have been down / inaccessible the last time the pipeline was run?
When did you run my pipeline? The files I was working with date back to February 14th. I'm pretty sure in that span of time (Feb 14th to now) researchers at PMC incorporated a lot more tagged text.
Ah okay, most of the files were downloaded in June, so that could be it.
> Ah okay, most of the files were downloaded in June, so that could be it.
Yeah. I've been working on other projects, so I haven't gotten around to updating the downloaded files yet. I do want to note that this repository has been updated with a new execution/batch tracker. This time you can start anywhere within the pipeline, and it produces a batch log to keep track of "tested PMIDs". It might be a bit too late since it seems you are far along in the parsing pipeline, but it's worth taking a look.
Thanks for taking the time to put this together and share it with the community!