plinder-org / plinder

Protein Ligand INteraction Dataset and Evaluation Resource
https://plinder.sh
Apache License 2.0
140 stars 8 forks

Getting train/val/test data #20

Closed patrickbryant1 closed 1 month ago

patrickbryant1 commented 1 month ago

Hi,

Thanks for creating this dataset - it looks nice!

Can you please provide a way to get the train/val/test splits without having to read so much documentation? E.g. wget /path/to/cloud/[train,val,test].compr

Best,

Patrick

patrickbryant1 commented 1 month ago

Hi,

I would appreciate some help here as I find the data incomprehensible.

  1. Where do I find e.g. the PL50 split? This is not annotated in the parquet file gs://plinder/2024-06/v2/splits/splits_metaflow_config_split_v2_split_single_graph_yaml.parquet
  2. Where are the structures? These are not available in the json files as far as I can see?

As I understand it, your objective here is to make the data easily available. I can assure you that very few can parse it as it is now.

Best,

Patrick

tjduigna commented 1 month ago

Hi Patrick,

The easiest way to download the entire dataset is with the gsutil or gcloud storage command-line tool, since the dataset is hosted in a Google Cloud Storage bucket.

export PLINDER_RELEASE=2024-06
export PLINDER_ITERATION=v2
gsutil -m cp -r gs://plinder/${PLINDER_RELEASE}/${PLINDER_ITERATION}/* ~/.local/share/plinder/${PLINDER_RELEASE}/${PLINDER_ITERATION}/

See https://cloud.google.com/storage/docs/gsutil_install for installation details.

I am sorry that you find the data incomprehensible; we are still working on the API to make it easier to consume. Most of this functionality falls under the umbrella of the PyTorch data loader that we aim to provide, which is being tracked in #15.

  1. The PL50 split is this file: gs://plinder/2024-04/v1/splits/batch_6/splits_metaflow_config_split_batch_6_9_yaml_e9ca06e682c3cb2f9340542d1ec1f6dc.parquet. Note that for this split, PLINDER_RELEASE=2024-04 and PLINDER_ITERATION=v1. Please understand that we must be careful about code changes that could alter the dataset; we adopted this approach to avoid issues stemming from iterative development. *Edit: For the plinder-v0 split referenced in the manuscript, please see gs://plinder/2024-04/v1/splits/v0/plinder-v0.parquet

  2. The structures themselves are archived into {two_char_code}.zip files and available under the gs://plinder/${PLINDER_RELEASE}/${PLINDER_ITERATION}/systems/ directory. Do note that, compressed, the collection of systems archives is >140GB and uncompressed closer to 1TB. See here for a reference implementation of how one would "easily" go between tables containing system IDs and their corresponding structure files.
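To make the table-to-archive hop concrete, here is a minimal sketch of mapping a system ID from a split table to its {two_char_code}.zip archive. The `archive_for_system` helper is hypothetical (not part of the plinder package), and the assumption that the two-character code is the middle two characters of the leading PDB ID should be verified against the actual bucket contents.

```python
# Hypothetical helper, not part of the plinder API. Assumes a system ID
# like "4z22__1__1.A__1.C" begins with the 4-character PDB ID, and that the
# {two_char_code}.zip archives are keyed by that ID's middle two characters.
def archive_for_system(system_id: str) -> str:
    pdb_id = system_id.split("__")[0]   # "4z22__1__1.A__1.C" -> "4z22"
    two_char_code = pdb_id[1:3]         # "4z22" -> "z2"
    return f"{two_char_code}.zip"

print(archive_for_system("4z22__1__1.A__1.C"))  # z2.zip
```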

Please feel free to keep us updated with all pain points you're experiencing! This is useful feedback we can use to guide further development.

Thanks, Tom

patrickbryant1 commented 1 month ago

Hi Tom,

Thanks, but this doesn't solve the issue.

  1. You don't provide the ids for version 2?
  2. How come the path to v1 is so illogical? Why not name it PL50?
  3. It is not clear how to get the structure info. Surely, one should not have to load all the data into RAM and then write it out to PDB just to access it? It would be nice if you could provide PDB files for the proteins and sdf/mol files for the ligands, including SMILES, like all other benchmarks.

Best,

Patrick

jamaliki commented 1 month ago

Hi @tjduigna,

Thank you for all of your efforts!

I've been able to get the dataloader working off of your loader branch. What is the recommended way to sample structures based on the clustering you mention in the paper? Is that exposed somewhere?

Best, Kiarash.

tjduigna commented 1 month ago

In an attempt to streamline the delivery of the dataset, we are making a number of clarifications and simplifications in #22. Thank you for bringing these usability concerns to light, it is much appreciated!

You don't provide the ids for version 2?

I'm not sure what you mean by this; versions are self-consistent. A versioned index corresponds to a versioned split.

How come the path to v1 is so illogical? Why not name it PL50?

The chosen split was the result of exhaustive testing and is programmatically consistent, but admittedly the naming scheme got a bit out of hand. This will be remedied shortly.

It is not clear how to get the structure info

The simplest approach to obtain structures manually would be to do the following:

mkdir -p systems/
gsutil -m cp -r gs://plinder/${PLINDER_RELEASE}/${PLINDER_ITERATION}/systems/* systems/
pushd systems
for i in *.zip; do unzip "$i"; done
popd

Apologies for the oversight in explaining this. All the PDBs and SDFs are contained in these archives, organized by system_id. The functionality to automatically (lazily) download these archives is incoming and removes the need to manually unzip the files.
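Once the archives are extracted, collecting a system's files might look like the sketch below. The `structure_files` helper is hypothetical, and the per-system directory layout and file names are assumptions; adjust the globs to whatever you actually find on disk.

```python
import tempfile
from pathlib import Path

# Hypothetical helper: assumes each system unpacks into a directory named by
# its system_id, containing PDB and SDF files. Adjust to the real layout.
def structure_files(systems_root, system_id):
    d = Path(systems_root) / system_id
    return sorted(d.glob("*.pdb")) + sorted(d.glob("*.sdf"))

# Tiny demo on a fake layout standing in for the unzipped archives:
with tempfile.TemporaryDirectory() as root:
    sys_dir = Path(root) / "4z22__1__1.A__1.C"
    sys_dir.mkdir()
    (sys_dir / "receptor.pdb").touch()
    (sys_dir / "ligand.sdf").touch()
    found = [p.name for p in structure_files(root, "4z22__1__1.A__1.C")]

print(found)  # ['receptor.pdb', 'ligand.sdf']
```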

Surely, one should not have to load all the data in RAM

Of course not, and I don't think that's recommended anywhere. Please let me know if anything in particular gives that impression.

tjduigna commented 1 month ago

What is the recommended way to sample structures based on the clustering you mention in the paper?

@jamaliki unfortunately right now you would have to add the logic to sample from the clusters in the data loader. Enhancements to the loader functionality are welcome and encouraged! Right now, the splits files contain a cluster column which you can use to group records for sampling.
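Grouping on that cluster column for sampling can be sketched with pandas as follows, assuming the split table exposes system_id, split, and cluster columns as described above; a toy frame stands in for the real parquet file here.

```python
import pandas as pd

# Toy stand-in for the split table; in practice:
# df = pd.read_parquet("path/to/split.parquet")
df = pd.DataFrame({
    "system_id": ["a", "b", "c", "d", "e"],
    "split":     ["train", "train", "train", "test", "test"],
    "cluster":   ["c1", "c1", "c2", "c3", "c3"],
})

# Draw one representative system per cluster within the training split:
train = df[df["split"] == "train"]
sampled = train.groupby("cluster").sample(n=1, random_state=0)
print(len(sampled))  # one row per training cluster -> 2
```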

naefl commented 1 month ago

@patrickbryant1 this should hopefully be clearer via #22, and it's now explained directly in the main README here: https://github.com/plinder-org/plinder?tab=readme-ov-file#downloading-the-dataset. Let us know if you're still having trouble. We've also updated the naming of the files so it's less confusing. We appreciate the feedback!

jamaliki commented 1 month ago

Hey @tjduigna , thanks for the quick reply! I am happy to add new sampling functionality. Is there a quick example you have somewhere of getting these cluster ids? Where do I find the split files? How do I ingest them?

patrickbryant1 commented 1 month ago

Great - thanks for the replies. I think the README already looks much better and I think most people can follow it now. I see you are creating a 'leaderboard' based on the PL50 v2 split. This means that the data available now is somehow already old? Maybe you should write a disclaimer in the beginning so people catch this?

naefl commented 1 month ago

Yes, we'll be adding some things for the MLSB challenge (which hasn't officially launched yet!) as laid out in the Changelog. That said, v1 is perfectly fine to use and, since it's the split in the paper, should be considered the current version until the MLSB challenge announcement. We do have a disclaimer right at the beginning of the data download section.

naefl commented 1 month ago

@jamaliki you can grab the cluster IDs from the respective split parquet file (gsutil cp gs://plinder/2024-04/v1/splits/plinder-pl50.parquet .). I'll close this issue for now, as the original question has been addressed; we can tackle any loader-related improvements in dedicated issues.
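After copying the file down, the split labels and cluster IDs can be inspected with pandas. A toy frame stands in for the real table below; the column names (split, cluster) follow what's described in this thread.

```python
import pandas as pd

# In practice: df = pd.read_parquet("plinder-pl50.parquet")
df = pd.DataFrame({  # toy stand-in for the split table
    "system_id": ["a", "b", "c", "d"],
    "split":     ["train", "train", "test", "removed"],
    "cluster":   ["c1", "c2", "c3", "c1"],
})

print(df["split"].value_counts().to_dict())               # systems per split
print(df.loc[df["split"] == "test", "cluster"].unique())  # test cluster IDs
```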

patrickbryant1 commented 3 weeks ago

I don't really want to reopen this, but it's still not clear what you used for training and testing. In plinder-pl50.parquet you have this many systems for train/val/test: 'train': 255463, 'removed': 151133, 'test': 15132, 'val': 13896

In the preprint you report 57,602 / 3,453 / 308, which does not correspond to the numbers in the parquet file.

If I take all the unique clusters (I'm not sure what the difference between the clusterings is?), I get:

x[x.split=='test'].cluster.unique().shape -> (5842,)

x[x.split=='test'].cluster_for_val_split.unique().shape -> (1902,)

What has really been used here? Is it possible to get a file which lists what ids are in the 57,602 / 3,453 / 308 from the preprint in a simple csv?

Best,

Patrick