Closed by patrickbryant1 2 months ago
The split for retraining DiffDock is here: gs://plinder/2024-04/v0/splits/plinder-pl50.parquet (57734/3459/3758). The discrepancy is due to DiffDock failures in processing. The split that DiffDock fully processed, with the PoseBusters systems removed from test, matches the numbers in the table and is in gs://plinder/2024-04/v0/splits/plinder-no-posebusters.parquet.
This was in the Changelog in the README, but we're moving docs around and it got removed; it has now been added back.
Note that this split was only on a single ligand subset with (too) strict nonredundancy filters, and was used to demonstrate that an automated split can achieve as low leakage as a curated one. We've made a number of updates to improve annotations and splitting, and would recommend the v2 split.
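For anyone who wants to check the split sizes themselves, here is a minimal sketch of counting rows and unique clusters per split with pandas. It assumes the parquet file has `split` and `cluster` columns (as used elsewhere in this thread); the `system_id` column and the toy values are made up for illustration, and reading directly from `gs://` additionally assumes `gcsfs` and a parquet engine are installed.

```python
import pandas as pd


def split_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the number of rows and of unique clusters for each split label."""
    return df.groupby("split").agg(
        n_rows=("cluster", "size"),        # rows (systems) per split
        n_clusters=("cluster", "nunique"),  # distinct cluster labels per split
    )


# Toy stand-in for the real table; IDs and cluster labels are invented.
toy = pd.DataFrame(
    {
        "system_id": ["a", "b", "c", "d", "e"],  # hypothetical ID column
        "split": ["train", "train", "train", "val", "test"],
        "cluster": ["c1", "c1", "c2", "c3", "c4"],
    }
)
summary = split_summary(toy)
print(summary)

# On the real file (assumption: gcsfs installed and the bucket is readable):
# df = pd.read_parquet("gs://plinder/2024-04/v0/splits/plinder-pl50.parquet")
# print(split_summary(df))
```

Running the same function on both parquet files should make it easy to see where the row counts and the cluster counts diverge.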
Hi,
Thanks for the quick reply. Great, then we get it
Best,
Patrick
'train': 255463, 'removed': 151133, 'test': 15132, 'val': 13896
In the preprint you report 57,602 / 3,453 / 308. This does not correspond to the numbers in the parquet file.
If I take all the unique clusters (I'm not sure what differs between the clusterings), I get: x[x.split=='test'].cluster.unique().shape (5842,)
What has really been used here? Is it possible to get a file that lists which IDs are in the 57,602 / 3,453 / 308 from the preprint, as a simple CSV?
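To make the request concrete, the kind of file meant here could be produced roughly as follows. This is only a sketch: the `split` column and the `removed` label appear in the counts above, but the `system_id` column name and the helper `split_ids_csv` are assumptions for illustration, not part of the plinder code.

```python
import pandas as pd


def split_ids_csv(df, id_col="system_id", splits=("train", "val", "test")):
    """Return CSV text listing each ID with its split, for the chosen splits only."""
    kept = df.loc[df["split"].isin(splits), [id_col, "split"]]
    kept = kept.drop_duplicates().sort_values(["split", id_col])
    # With no path given, DataFrame.to_csv returns the CSV as a string.
    return kept.to_csv(index=False)


# Toy table with invented IDs; "removed" rows are filtered out.
toy_ids = pd.DataFrame(
    {
        "system_id": ["s1", "s2", "s3", "s4"],  # hypothetical ID column
        "split": ["train", "removed", "test", "val"],
    }
)
csv_text = split_ids_csv(toy_ids)
print(csv_text)
```

Writing `csv_text` to disk, or passing a path to `to_csv` directly, would give one row per ID with its split label.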
Best,
Patrick
Originally posted by @patrickbryant1 in https://github.com/plinder-org/plinder/issues/20#issuecomment-2322023047