agitter opened this issue 4 months ago
Hi @agitter,
Thank you for your feedback.
Currently, all the TDC datasets under tdcommons were extracted via the TDC API. We intentionally keep these datasets/benchmarks in their original form to make it possible to compare evaluation results across different versions of the datasets and benchmarks later on.
In Polaris Hub, we have a dedicated package for data curation called Auroris. All dataset curation and benchmark creation steps are available in polaris-recipes. This is to ensure transparency and reproducibility in all data manipulations for datasets hosted on Polaris Hub.
Given our limited bandwidth, we plan to gradually add curated TDC datasets and benchmarks to the hub in the near future. Regarding your concerns about the bbb-martins dataset, we also found 29 molecules with contradictory labels across the training, validation, and test sets. Additionally, another 45 molecules were duplicated in the original dataset. The curation notebook is available at this link.
We highly encourage you to leave comments on or contribute commits to polaris-recipes so we can discuss dataset issues and best practices for data curation.
Thanks for the background and pointers. Feel free to transfer this issue to the polaris-recipes repo if that is more appropriate.
If the goal of maintaining the original version of the TDC data is to facilitate comparing Polaris evaluations and TDC evaluations, I would advise against it. You showed that Auroris can flag problems in this dataset. Breaking backwards compatibility with TDC to fix those problems and remove duplicates that are in different data splits (if any) or have conflicting labels would make the model evaluations more accurate.
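As a concrete example, a quick check for molecules duplicated across the TDC splits could look like this sketch, using the TDC API and RDKit (the `Drug` column name is TDC's default):

```python
from rdkit import Chem
from tdc.single_pred import ADME

# Default TDC split for the BBB_Martins dataset
split = ADME(name="BBB_Martins").get_split()

def canon(smiles):
    """Canonicalize a SMILES string so duplicates can be matched."""
    return Chem.CanonSmiles(smiles)

train = {canon(s) for s in split["train"]["Drug"]}
test = {canon(s) for s in split["test"]["Drug"]}

# Molecules that leak from the training set into the test set
leaked = train & test
print(f"{len(leaked)} molecules appear in both train and test")
```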
A minor note: the data link in the notebook, https://pubmed.ncbi.nlm.nih.gov/26501955, is for PKIS. PKIS is also referenced later in gs://polaris-public/polaris-recipes/org-polaris/drewry2014_pkis1_subset/data/curation/report/index.html.
@agitter I have transferred the issue here.
I understand the concern regarding backward compatibility due to changes in the splits. Taking a data-centric approach, one option could be to evaluate performance exclusively on the molecules shared between the original TDC test set and the curated test set, as sketched below. This would give one benchmark based on bbb-martins; we can design other benchmarks on this dataset for different purposes.
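A rough sketch of that overlap-based evaluation, assuming both test sets are pandas DataFrames with TDC-style `Drug`/`Y` columns and `predict` is whatever model interface is being scored (the helper name is hypothetical):

```python
import pandas as pd
from rdkit import Chem
from sklearn.metrics import roc_auc_score

def evaluate_on_overlap(tdc_test: pd.DataFrame,
                        curated_test: pd.DataFrame,
                        predict) -> float:
    """Score a model only on molecules shared by both test sets.

    Both frames are assumed to have a SMILES column 'Drug' and a
    binary label column 'Y'; `predict` maps SMILES to scores.
    """
    # Canonicalize so the same molecule matches across both sets
    tdc_canon = {Chem.CanonSmiles(s) for s in tdc_test["Drug"]}
    mask = curated_test["Drug"].map(Chem.CanonSmiles).isin(tdc_canon)
    overlap = curated_test[mask]
    return roc_auc_score(overlap["Y"], predict(list(overlap["Drug"])))
```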
Does this make sense to you? I look forward to hearing your thoughts.
I will fix those incorrect links. Thanks!
Zooming out, I would find Polaris most valuable if it focuses on slowly adding high-quality datasets, even if that means excluding some existing datasets like tdcommons/bbb-martins. I don't see a good way to maintain partial compatibility with the TDC version of bbb-martins that also addresses the underlying problems. Further, I don't think Polaris needs that version of the dataset.
There may still be a way to develop an acceptable blood-brain barrier dataset from the original Martins data. By ignoring what MoleculeNet and TDC did with this dataset, you would be free to recreate splits, resolve duplicates and conflicts, etc. So I would propose to only do the "design other benchmarks for this dataset" part.
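A minimal sketch of that rebuild, starting from the TDC copy; the random split at the end is only a placeholder, and a scaffold split would likely be preferable for a benchmark:

```python
from rdkit import Chem
from tdc.single_pred import ADME

df = ADME(name="BBB_Martins").get_data()
df["canonical"] = df["Drug"].map(Chem.CanonSmiles)

# Drop molecules whose duplicate records disagree on the label
label_counts = df.groupby("canonical")["Y"].nunique()
df = df[df["canonical"].isin(label_counts[label_counts == 1].index)]

# Keep one record per molecule, then re-split from scratch
df = df.drop_duplicates("canonical").sample(frac=1, random_state=42)
n = len(df)
train = df.iloc[: int(0.7 * n)]
valid = df.iloc[int(0.7 * n): int(0.8 * n)]
test = df.iloc[int(0.8 * n):]
```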
Polaris version: 0.7.5
Python version: 3.11
Operating system: Windows
Installation: conda
Description
The tdcommons/bbb-martins blood-brain barrier dataset has the same data quality issues that Pat Walters reported in the blog post cited in the Polaris launch announcement: duplicate molecules, including 10 duplicated molecules with conflicting labels.
If this dataset is updated to fix the issues below, will Polaris support dataset version numbers? Combining benchmark results from different versions of a dataset would be confusing.
Steps to reproduce
Here are the 10 duplicates with conflicting labels:

I no longer see 59 total duplicates but rather only 52. Here are the drugs and their counts:
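Both checks can be reproduced from the TDC copy with something like the following sketch, which matches duplicates by RDKit canonical SMILES (the Polaris copy could be checked the same way):

```python
from rdkit import Chem
from tdc.single_pred import ADME

df = ADME(name="BBB_Martins").get_data()
df["canonical"] = df["Drug"].map(Chem.CanonSmiles)

# Molecules that occur more than once
counts = df["canonical"].value_counts()
duplicated = counts[counts > 1]
print(f"{duplicated.sum()} rows cover {len(duplicated)} duplicated molecules")

# Duplicated molecules whose records disagree on the label
labels = df.groupby("canonical")["Y"].nunique()
conflicting = labels[labels > 1]
print(f"{len(conflicting)} molecules have conflicting labels")
print(df[df["canonical"].isin(conflicting.index)]
      .sort_values("canonical")[["Drug_ID", "canonical", "Y"]])
```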
The number of molecules in the dataset is also confusing. The description says the dataset was sourced from TDC via MoleculeNet. The Polaris version has 2030 molecules. TDC has 1975. MoleculeNet has 2050 (their paper says 2053 but the csv file has 2050 rows). With the data being passed from source to source, it is hard to tell what the data processing chain is.
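For reference, the three counts could be pulled with something like this sketch. The MoleculeNet URL is DeepChem's hosted BBBP csv, and the Polaris slug and API calls are my best guess from the docs, so adjust as needed:

```python
import pandas as pd
import polaris as po
from tdc.single_pred import ADME

# TDC copy
print("TDC:", len(ADME(name="BBB_Martins").get_data()))

# MoleculeNet copy, via DeepChem's hosted csv
url = "https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/BBBP.csv"
print("MoleculeNet:", len(pd.read_csv(url)))

# Polaris copy, via the benchmark's train/test split
benchmark = po.load_benchmark("tdcommons/bbb-martins")
train, test = benchmark.get_train_test_split()
print("Polaris:", len(train) + len(test))
```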
Additional output: No response