greenelab / pubtator

Retrieve and process PubTator annotations

Modify execute.sh to only download missing/incomplete files? #22

Open khughitt opened 4 years ago

khughitt commented 4 years ago

Currently, execute.sh will re-download all files each time it is run, regardless of whether the files have already been successfully downloaded and processed.

Since it requires downloading files which are quite large, there is a decent chance that the script will need to be run more than once due to interrupted downloads.

It would be great if the script could check each file to see if it has been fully downloaded, and only download those which are missing/incomplete.

One possible approach might be to generate md5sums for each output, and check against this, at least for the files that are only updated at periodic intervals.
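For illustration, something along these lines could work on the Python side (a rough sketch only; `needs_download` and the source of the expected checksums are hypothetical, not part of this repo):

```python
import hashlib
import os


def needs_download(path, expected_md5):
    """Return True when the file is missing or its md5sum does not match."""
    if not os.path.exists(path):
        return True
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        # Hash in chunks so large downloads never have to fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            md5.update(chunk)
    return md5.hexdigest() != expected_md5
```

execute.sh could then skip any file for which this check comes back False.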

Thanks for taking the time to put this together and share it with the community!

khughitt commented 4 years ago

Also, if it's possible to post the md5sums for the downloaded/generated files in the meantime, that would be quite helpful :)

khughitt commented 4 years ago

Last thought -- I'm not sure how consistent the batch query results are from day-to-day, but if they are stable (or can be queried in such a way to make them so), it could also be useful to reuse the data/temp/batch_xx.xml files instead of re-querying them each time.
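For example, a guard roughly like this (sketch only; `run_query` is a hypothetical stand-in for however the batch query is actually issued):

```python
import os


def fetch_batch(batch_id, run_query):
    """Reuse a cached batch file when present; otherwise query and cache it."""
    path = f"data/temp/batch_{batch_id}.xml"
    # Treat zero-byte files as missing so interrupted writes get retried.
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    xml = run_query(batch_id)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(xml)
    return xml
```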

danich1 commented 4 years ago

> Currently, execute.sh will re-download all files each time it is run, regardless of whether the files have already been successfully downloaded and processed.
>
> Since it requires downloading files which are quite large, there is a decent chance that the script will need to be run more than once due to interrupted downloads.
>
> It would be great if the script could check each file to see if it has been fully downloaded, and only download those which are missing/incomplete.
>
> One possible approach might be to generate md5sums for each output, and check against this, at least for the files that are only updated at periodic intervals.

Thanks for reaching out. This is a great idea. I agree that there should be some countermeasures in place to handle interrupted downloads. My plate is a bit full at the moment, but I will definitely implement your suggestion when I get the chance.

> Also, if it's possible to post the md5sums for the downloaded/generated files in the meantime, that would be quite helpful :)

Sounds good. I'll post them as soon as the file has finished generating. One warning, though: enough time may have elapsed that the md5sums I'm generating won't match the ones on your local machine. Let me know if that's the case.

> Last thought -- I'm not sure how consistent the batch query results are from day-to-day, but if they are stable (or can be queried in such a way to make them so), it could also be useful to reuse the data/temp/batch_xx.xml files instead of re-querying them each time.

Thanks for the suggestion. I'll need to do more testing before I can confirm that the batch results are stable, but I'll keep this idea in mind.

danich1 commented 4 years ago

As promised, here are the md5sums: temp_batch_md5sum.txt

khughitt commented 4 years ago

Sounds good! Thanks for the quick response and the md5sums -- much appreciated. Happy to help with any testing in the future.

khughitt commented 4 years ago

@danich1 A bit late, but I finally got around to checking the md5sums. I think there may be an issue with the version of the data used to generate the checksums you posted: a large number of the md5sums are redundant (many entries share the same checksum). The md5sums for the batches I have previously been able to download are all unique, so perhaps the checksums were generated from data produced by an earlier version of the script?

danich1 commented 4 years ago

> The md5sums for the batches I have previously been able to download are all unique, so perhaps the checksums were generated from data produced by an earlier version of the script?

Interesting. I'd reckon the redundant md5sums come from the empty XML files returned by the server, since I'm checking nearly all PMCIDs against PubTator Central's API. The last time I ran it, there were a lot of "empty files".
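If that's right (i.e., the "empty" responses are byte-identical), the collisions follow directly, since identical content always hashes to the same value. For instance, every zero-byte file shares the md5 of the empty string:

```python
import hashlib

# md5 depends only on content, so all zero-byte files collide on this value.
print(hashlib.md5(b"").hexdigest())  # d41d8cd98f00b204e9800998ecf8427e
```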

khughitt commented 4 years ago

Hmm. I wonder if the API might have been down / inaccessible the last time the pipeline was run? So far I have 10,530 results, all of which are non-empty.

From the code in download_full_text.py, it appears that you are checking the query response code and raising an exception on non-success, so I would imagine an outage would have been detected.
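That is, I'm assuming a check equivalent to the standard requests idiom (my paraphrase, not the repo's exact code):

```python
import requests

# Hypothetical URL; the real request is assembled in download_full_text.py.
response = requests.get("https://example.org/pubtator/batch-query")
response.raise_for_status()  # raises requests.HTTPError for any 4xx/5xx status
# Caveat: a 200 OK with an empty body passes this check, so empty batch
# results would not be caught here.
```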

danich1 commented 4 years ago

> Hmm. I wonder if the API might have been down / inaccessible the last time the pipeline was run?

When did you run my pipeline? The files I was working with date back to February 14th. I'm pretty sure that in that span of time (Feb 14th to now) the researchers at PMC have incorporated a lot more tagged text.

khughitt commented 4 years ago

Ah okay, most of the files were downloaded in June, so that could be it.

danich1 commented 4 years ago

> Ah okay, most of the files were downloaded in June, so that could be it.

Yeah. I've been working on other projects, so I haven't gotten around to updating the downloaded files yet. I do want to note that this repository has been updated with a new execution/batch tracker: you can now start anywhere within the pipeline, and it produces a batch log to keep track of "tested PMIDs". This might be a bit late, as it seems you are already deep into the parsing pipeline, but it's worth a look.