pinder-org / pinder

PINDER: The Protein INteraction Dataset and Evaluation Resource
https://pinder-org.github.io/pinder/
Apache License 2.0

Robustifying download_entry #25

Open alex-hh opened 1 month ago

alex-hh commented 1 month ago

Hi,

Thanks for the amazing resource and tools.

I'm currently working with a script which iterates over a subset of the index, instantiating each pinder system and then processing it further.

I seem to be encountering occasional errors in the files retrieved via download_entry: sometimes a PDB file will be truncated or empty.

I assume this could be due to connection issues or some non-robust behaviour in the Gsutil.cp_paths method. Do you have any suggestions for a fix or a more robust workaround?

One slightly frustrating aspect is that the download_entry call itself doesn't raise any exception, but I get errors downstream.

Appreciate this might not be that easy to reproduce - it happens fairly infrequently but is reliably affecting my script.
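Since the failure only surfaces downstream, one way to catch it earlier is to sanity-check each file right after download. The following is a minimal sketch, not part of pinder: `looks_complete` is a hypothetical helper, and it assumes a healthy PDB file contains at least one ATOM/HETATM record and ends with an END record (the second assumption can be switched off via `require_end` if the files in question do not follow it).

```python
from pathlib import Path


def looks_complete(pdb_file: Path, require_end: bool = True) -> bool:
    """Heuristic check that a PDB file is neither empty nor obviously truncated.

    Hypothetical helper, not part of the pinder API.
    """
    # A missing or zero-byte file is the clearest sign of a failed download.
    if not pdb_file.is_file() or pdb_file.stat().st_size == 0:
        return False

    # Ignore blank lines so a trailing newline does not hide the last record.
    lines = [
        line
        for line in pdb_file.read_text(errors="replace").splitlines()
        if line.strip()
    ]

    # A structure file without any coordinate records is not usable.
    if not any(line.startswith(("ATOM", "HETATM")) for line in lines):
        return False

    # Assumption: complete files end with an END (or ENDMDL) record.
    if require_end and not lines[-1].startswith("END"):
        return False

    return True
```

Calling this on every path immediately after the download step would turn a silent truncation into an explicit, early failure that can be logged together with the offending file name.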

danielkovtun commented 1 month ago

Hi @alex-hh , thanks for reporting the issue. Do you have any stack traces available or know of specific PDB files that this was happening with? I want to verify that it is not a problem with the PDB files themselves.

In general, it's surprising that an incomplete download would not trigger an exception during download via the process_many multi-threaded download function that is used by the Gsutil.cp_paths method: https://github.com/pinder-org/pinder/blob/9c70a92119b844d0d20e35f483b4f1f26b2899c4/src/pinder-core/pinder/core/utils/cloud.py#L378-L397

Also, while this suggestion may be prohibitive if you are limited on disk space, in general I would recommend doing a bulk download of the full dataset vs. on-the-fly downloads via download_entry.

As for workarounds for the on-the-fly download_entry approach, the simplest option I can think of is to handle the exception you encounter while loading the structure via PinderSystem.load_structure. If that specific exception is raised, you could run pdb_file.unlink() followed by another call to PinderSystem().download_entry to force a re-download of the file.
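The unlink-and-retry idea can be sketched generically as follows. This is only an illustration under stated assumptions: `load_with_redownload`, `load`, `redownload`, `retry_on`, and `max_attempts` are hypothetical names, and the two callables stand in for whatever wraps PinderSystem.load_structure and PinderSystem().download_entry in your script, since their exact signatures are not shown in this thread.

```python
from pathlib import Path
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def load_with_redownload(
    pdb_file: Path,
    load: Callable[[Path], T],
    redownload: Callable[[], None],
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    max_attempts: int = 3,
) -> T:
    """Try to load a file; on failure, delete it and fetch it again.

    `load` and `redownload` are caller-supplied stand-ins (assumptions),
    not pinder functions.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return load(pdb_file)
        except retry_on:
            # Give up and surface the original error on the last attempt.
            if attempt == max_attempts:
                raise
            # Remove the suspect file so the next download starts clean.
            pdb_file.unlink(missing_ok=True)  # missing_ok needs Python 3.8+
            redownload()
    raise ValueError("max_attempts must be at least 1")
```

Narrowing `retry_on` to the specific exception seen while loading, once it is known, avoids retrying on unrelated bugs.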

I would be interested to know which exception is raised, and where.

alex-hh commented 1 month ago

Hi - thanks for the suggestions!

I've switched away from the multi-threaded gsutil version, but I'm not sure yet whether this is helping or whether the issue I was experiencing lies somewhere else.

Will let you know if I'm able to track down the error.