Data quality problems in tdcommons/bbb-martins dataset

agitter commented 4 months ago

Polaris version

0.7.5

Python Version

3.11

Operating System

Windows

Installation

Using conda

Description

The tdcommons/bbb-martins blood-brain barrier dataset has the same data quality issues that Pat Walters reported in the blog post cited in the Polaris launch announcement. These include duplicate molecules, 10 of which have conflicting labels.

If this dataset will be updated to fix the issues below, will Polaris support dataset version numbers? Combining benchmark results from different versions of a dataset would be confusing.

Steps to reproduce

Here are the 10 duplicates with conflicting labels:

      Drug_ID                    Drug                                                                                    Y  split
  27  acetylsalicylate           CC(=O)Oc1ccccc1C(=O)O                                                                   0  train_val
 127  aspirin                    CC(=O)Oc1ccccc1C(=O)O                                                                   1  train_val
1574  loratadine                 CCOC(=O)N1CCC(=C2c3ccc(Cl)cc3CCc3cccnc32)CC1                                            1  train_val
1573  loratadine                 CCOC(=O)N1CCC(=C2c3ccc(Cl)cc3CCc3cccnc32)CC1                                            0  train_val
1378  loperamide                 CN(C)C(=O)C(CCN1CCC(O)(c2ccc(Cl)cc2)CC1)(c1ccccc1)c1ccccc1                              0  train_val
1377  BRL53080                   CN(C)C(=O)C(CCN1CCC(O)(c2ccc(Cl)cc2)CC1)(c1ccccc1)c1ccccc1                              1  train_val
 308  atropine(hyoscyamine)      CN1C2CCC1CC(OC(=O)C(CO)c1ccccc1)C2                                                      1  train_val
 310  atropine                   CN1C2CCC1CC(OC(=O)C(CO)c1ccccc1)C2                                                      0  train_val
1314  trimetrexate               COc1cc(NCc2ccc3nc(N)nc(N)c3c2C)cc(OC)c1OC                                               1  train_val
1313  Trimetrexate               COc1cc(NCc2ccc3nc(N)nc(N)c3c2C)cc(OC)c1OC                                               0  train_val
1171  indomethacin               COc1ccc2c(c1)c(CC(=O)O)c(C)n2C(=O)c1ccc(Cl)cc1                                          0  train_val
1170  indomethacin(indometacin)  COc1ccc2c(c1)c(CC(=O)O)c(C)n2C(=O)c1ccc(Cl)cc1                                          1  train_val
 542  methylprednisolone         C[C@H]1C[C@H]2[C@@H]3CC[C@](O)(C(=O)CO)[C@@]3(C)C[C@H](O)[C@@H]2[C@@]2(C)C=CC(=O)C=C12  1  train_val
 521  Methylprednisolone         C[C@H]1C[C@H]2[C@@H]3CC[C@](O)(C(=O)CO)[C@@]3(C)C[C@H](O)[C@@H]2[C@@]2(C)C=CC(=O)C=C12  0  train_val
1799  Miconazole                 Clc1ccc(COC(Cn2ccnc2)c2ccc(Cl)cc2Cl)c(Cl)c1                                             0  test
1800  miconazole                 Clc1ccc(COC(Cn2ccnc2)c2ccc(Cl)cc2Cl)c(Cl)c1                                             1  test
  98  levodopa                   N[C@@H](Cc1ccc(O)c(O)c1)C(=O)O                                                          1  train_val
  83  levodopa                   N[C@@H](Cc1ccc(O)c(O)c1)C(=O)O                                                          0  train_val
1974  mequitazine                c1ccc2c(c1)Sc1ccccc1N2CC1CN2CCC1CC2                                                     1  test
1975  mequitazine                c1ccc2c(c1)Sc1ccccc1N2CC1CN2CCC1CC2                                                     0  test
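
For anyone who wants to rerun this kind of check, here is a minimal plain-Python sketch of how label conflicts can be detected. It is not the exact code used for the table above. It assumes two structures are the same molecule when their `Drug` SMILES strings match exactly, so it would miss duplicates written with different but equivalent SMILES. The function name `find_label_conflicts` and the two `toy_*` rows are made up for illustration.

```python
from collections import defaultdict

# (Drug_ID, Drug, Y, split) tuples. The first four rows are copied from the
# table above; the last two are made-up rows whose labels agree.
rows = [
    ("acetylsalicylate", "CC(=O)Oc1ccccc1C(=O)O", 0, "train_val"),
    ("aspirin", "CC(=O)Oc1ccccc1C(=O)O", 1, "train_val"),
    ("mequitazine", "c1ccc2c(c1)Sc1ccccc1N2CC1CN2CCC1CC2", 1, "test"),
    ("mequitazine", "c1ccc2c(c1)Sc1ccccc1N2CC1CN2CCC1CC2", 0, "test"),
    ("toy_a", "CCO", 1, "train_val"),       # hypothetical
    ("toy_a_copy", "CCO", 1, "train_val"),  # hypothetical
]


def find_label_conflicts(rows):
    """Map each SMILES that carries more than one distinct label to its labels."""
    labels_by_smiles = defaultdict(set)
    for _drug_id, smiles, label, _split in rows:
        labels_by_smiles[smiles].add(label)
    return {s: sorted(ls) for s, ls in labels_by_smiles.items() if len(ls) > 1}


conflicts = find_label_conflicts(rows)
for smiles, labels in conflicts.items():
    print(smiles, labels)
```

The duplicated `CCO` rows are not reported because their labels agree; only structures with both a 0 and a 1 label are returned.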

I no longer see 59 total duplicates but rather only 52. Here are the duplicated drugs and their counts:

Drug
CN(C)CCc1ccccn1                                                                           3
CNCCc1ccccn1                                                                              3
NCCc1cn2ccccc2n1                                                                          3
FC(F)(F)CCl                                                                               2
c1cc(CN2CCCCC2)cc(OCCCNc2nc3ccccc3o2)c1                                                   2
CC(=O)Nc1nnc(S(N)(=O)=O)s1                                                                2
O=C(NCCN1CCOCC1)c1ccc(Cl)cc1                                                              2
COc1cc2c(cc1OC)C1CC(=O)C(CC(C)C)CN1CC2                                                    2
CN1C2CCC1CC(OC(=O)C(CO)c1ccccc1)C2                                                        2
CN1CCN(CCCN2c3ccccc3Sc3ccc(C(F)(F)F)cc32)CC1                                              2
CN(C)Cc1ccc(CSCCNc2nc(=O)c(Cc3ccc4ccccc4c3)c[nH]2)o1                                      2
COc1cc(NCc2ccc3nc(N)nc(N)c3c2C)cc(OC)c1OC                                                 2
c1cc(CN2CCCCC2)cc(OCCCNc2nccs2)c1                                                         2
CC1COc2c(N3CCN(C)CC3)c(F)cc3c(=O)c(C(=O)O)cn1c23                                          2
CC(C)(C)OC(=O)CCCc1ccc(N(CCCl)CCCl)cc1                                                    2
CCC(=O)C(CC(C)N(C)C)(c1ccccc1)c1ccccc1                                                    2
CCCCCn1ccc(=O)c(O)c1C                                                                     2
Cc1cc(NS(=O)(=O)c2ccc(N)cc2)no1                                                           2
CC(CN1c2ccccc2Sc2ccccc21)N(C)C                                                            2
OCCCOc1cccc(CN2CCCCC2)c1                                                                  2
COc1ccc2c(c1)N(C[C@H](C)CN(C)C)c1ccccc1S2                                                 2
O=C(NCCCOc1cccc(CN2CCCCC2)c1)c1ccccc1                                                     2
CCOC(=O)N1CCC(=C2c3ccc(Cl)cc3CCc3cccnc32)CC1                                              2
C=COC=C                                                                                   2
Nc1nc(=O)c2ncn(COCCO)c2[nH]1                                                              2
c1ccc2c(c1)Sc1ccccc1N2CC1CN2CCC1CC2                                                       2
CN(C)C/C=C(/c1ccc(Br)cc1)c1cccnc1                                                         2
c1ccc(NCCCOc2cccc(CN3CCCCC3)c2)nc1                                                        2
Clc1ccc(COC(Cn2ccnc2)c2ccc(Cl)cc2Cl)c(Cl)c1                                               2
ClC(Cl)Cl                                                                                 2
CCc1ccccc1                                                                                2
CCCC(=O)Nc1ccc(OCC(O)CNC(C)C)c(C(C)=O)c1                                                  2
FC(F)(F)c1ccc(N2CCNCC2)nc1Cl                                                              2
CN(C)CCC=C1c2ccccc2CCc2ccccc21                                                            2
Cc1ncc2n1-c1ccc(Cl)cc1C(c1ccccc1F)=NC2                                                    2
C[C@H]1C[C@H]2[C@@H]3CC[C@](O)(C(=O)CO)[C@@]3(C)C[C@H](O)[C@@H]2[C@@]2(C)C=CC(=O)C=C12    2
CC(C)c1nc(-c2ncn3c2CN(C)C(=O)c2c(Cl)cccc2-3)no1                                           2
NC(=O)N1c2ccccc2C2OC2c2ccccc21                                                            2
COc1ccc2c(c1)c(CC(=O)O)c(C)n2C(=O)c1ccc(Cl)cc1                                            2
CN(C)CCCN1c2ccccc2CCc2ccccc21                                                             2
CN1Cc2c(-c3noc(C(C)(C)O)n3)ncn2-c2cccc(Cl)c2C1=O                                          2
CCCN(CCC)CCc1cccc2c1CC(=O)N2                                                              2
CCC(=O)c1ccc2c(c1)N(CCCN1CCN(CCO)CC1)c1ccccc1S2                                           2
NCCc1nccs1                                                                                2
CN(C)C(=O)C(CCN1CCC(O)(c2ccc(Cl)cc2)CC1)(c1ccccc1)c1ccccc1                                2
N[C@@H](Cc1ccc(O)c(O)c1)C(=O)O                                                            2
CC(=O)Oc1ccccc1C(=O)O                                                                     2
CC(=O)Nc1ccc(O)cc1                                                                        2
OCCN1CCN(CCCN2c3ccccc3Sc3ccc(Cl)cc32)CC1                                                  2
OCCN1CCN(CC/C=C2/c3ccccc3Sc3ccc(Cl)cc32)CC1                                               2
O=C1NC(=O)C(c2ccccc2)(c2ccccc2)N1                                                         2
ClCCl                                                                                     2
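
A count like the one above can be sketched with `collections.Counter`. This is an illustration under the same exact-string-match assumption, not the original analysis code. The variable `smiles_column` is a hypothetical stand-in for the dataset's `Drug` column, and `CCO` is a made-up singleton.

```python
from collections import Counter

# A few SMILES strings from the list above, plus one made-up singleton.
smiles_column = [
    "CN(C)CCc1ccccn1", "CN(C)CCc1ccccn1", "CN(C)CCc1ccccn1",
    "ClCCl", "ClCCl",
    "CCO",  # hypothetical molecule that appears only once
]

counts = Counter(smiles_column)
# Keep only structures that occur more than once, most frequent first.
duplicated = [(s, n) for s, n in counts.most_common() if n > 1]

print(len(duplicated), "duplicated structures")
for s, n in duplicated:
    print(f"{s:<20}{n}")
```

The length of `duplicated` is the number of distinct structures that occur more than once, which is what the 52 above refers to.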

The number of molecules in the dataset is also confusing. The description says the dataset was sourced from TDC via MoleculeNet. The Polaris version has 2030 molecules. TDC has 1975. MoleculeNet has 2050 (their paper says 2053 but the csv file has 2050 rows). With the data being passed from source to source, it is hard to tell what the data processing chain is.

Additional output

No response

zhu0619 commented 4 months ago

Hi @agitter,

Thank you for your feedback.

Currently, all the TDC datasets under tdcommons were extracted via the TDC API. We intentionally keep these datasets/benchmarks in their original form to facilitate comparison of evaluation results across different versions of datasets and benchmarks later on.

In Polaris Hub, we have a dedicated package for data curation called Auroris. All dataset curation and benchmark creation steps are available in polaris-recipes. This is to ensure transparency and reproducibility in all data manipulations for datasets hosted on Polaris Hub.

Given our limited bandwidth, we plan to gradually add curated TDC datasets and benchmarks to the hub in the near future. Regarding your concerns about the bbb-martins dataset, we have also found 29 molecules with contradictory labels across the training, validation, and test sets. Additionally, another 45 molecules were duplicated in the original dataset. The curation notebook is available at this link.

We highly encourage you to leave comments or commits to polaris-recipes so we can discuss dataset issues and best practices for data curation.

agitter commented 4 months ago

Thanks for the background and pointers. Feel free to transfer this issue to the polaris-recipes repo if that is more appropriate.

If the goal of maintaining the original version of the TDC data is to facilitate comparing Polaris evaluations and TDC evaluations, I would advise against it. You showed that Auroris can flag problems in this dataset. Breaking backwards compatibility with TDC to fix those problems and remove duplicates that are in different data splits (if any) or have conflicting labels would make the model evaluations more accurate.
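
To make the "duplicates that are in different data splits" check concrete, here is a small sketch. All rows are made up, and the function name `cross_split_duplicates` is hypothetical. Structures are compared by exact SMILES string, which is an assumption.

```python
from collections import defaultdict

# Hypothetical (SMILES, split) pairs; none of these rows come from the dataset.
records = [
    ("CCO", "train_val"),
    ("CCO", "test"),       # same structure in both splits: leakage
    ("CCN", "train_val"),
    ("CCN", "train_val"),  # duplicate, but confined to one split
    ("CCC", "test"),
]


def cross_split_duplicates(records):
    """Map each SMILES that appears in more than one split to those splits."""
    splits_by_smiles = defaultdict(set)
    for smiles, split in records:
        splits_by_smiles[smiles].add(split)
    return {s: sorted(sp) for s, sp in splits_by_smiles.items() if len(sp) > 1}


leaks = cross_split_duplicates(records)
print(leaks)
```

An empty result would mean no structure is shared between splits under this matching rule.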

A minor note is that the data link in the notebook https://pubmed.ncbi.nlm.nih.gov/26501955 is for PKIS. PKIS is also referenced in gs://polaris-public/polaris-recipes/org-polaris/drewry2014_pkis1_subset/data/curation/report/index.html later.

zhu0619 commented 4 months ago

@agitter I transferred the issue here.

I understand the concern regarding backward compatibility due to changes in the splits. In a data-centric manner, one approach could be to evaluate performance exclusively on the overlapping data between the original TDC test set and the curated test set. This would be one of the benchmarks based on bbb-martins. We can design other benchmarks for this dataset for different purposes.
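
The overlap idea can be sketched roughly as follows. This is only an illustration: the function `accuracy_on_overlap`, the dict-based inputs, and the toy molecules are all assumptions, and accuracy stands in for whatever metric the benchmark would actually use.

```python
def accuracy_on_overlap(original_test, curated_test, predictions):
    """Accuracy restricted to molecules present in both test sets.

    original_test, curated_test: dict mapping SMILES -> label
    predictions: dict mapping SMILES -> predicted label
    Labels are taken from the curated set.
    """
    shared = sorted(set(original_test) & set(curated_test))
    if not shared:
        raise ValueError("the two test sets share no molecules")
    correct = sum(predictions[s] == curated_test[s] for s in shared)
    return correct / len(shared), shared


# Made-up toy data, not taken from bbb-martins.
original_test = {"CCO": 1, "CCN": 0, "CCC": 1}
curated_test = {"CCO": 1, "CCN": 0, "CCCl": 0}
predictions = {"CCO": 1, "CCN": 1, "CCC": 1, "CCCl": 0}

acc, shared = accuracy_on_overlap(original_test, curated_test, predictions)
print(shared, acc)
```

Here only `CCN` and `CCO` are scored, since `CCC` was dropped by curation and `CCCl` was never in the original test set.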

Does this make sense to you? I look forward to hearing your thoughts.

I will update those wrong links. Thanks!

agitter commented 4 months ago

Zooming out, I would find Polaris most valuable if it focuses on slowly adding high-quality datasets, even if that means excluding some existing datasets like tdcommons/bbb-martins. I don't see a good way to maintain partial compatibility with the TDC version of bbb-martins that also addresses the underlying problems. Further, I don't think Polaris needs that version of the dataset.

There may still be a way to develop an acceptable blood-brain barrier dataset from the original Martins data. By ignoring what MoleculeNet and TDC did with this dataset, you would be free to recreate splits, resolve duplicates and conflicts, etc. So I would propose to only do the "design other benchmarks for this dataset" part.
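
As a rough sketch of what "resolve duplicates and conflicts" and "recreate splits" could look like, assuming one simple policy: drop any structure whose labels disagree, collapse the remaining duplicates, then draw a seeded random split. The policy, the helper names `curate` and `random_split`, and the toy rows are all hypothetical. Other choices, such as rechecking conflicting labels against the primary source or using a non-random split, may be more appropriate.

```python
import random
from collections import defaultdict


def curate(rows):
    """Collapse duplicates and drop structures whose labels disagree."""
    labels = defaultdict(set)
    for smiles, label in rows:
        labels[smiles].add(label)
    return {s: next(iter(ls)) for s, ls in labels.items() if len(ls) == 1}


def random_split(smiles_list, test_fraction=0.2, seed=0):
    """Seeded random split into (train, test) lists of unique SMILES."""
    shuffled = sorted(smiles_list)  # sort first so the result is reproducible
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Made-up (SMILES, label) rows, not taken from the Martins data.
rows = [
    ("CCO", 1), ("CCO", 1),  # consistent duplicate: kept once
    ("CCN", 0), ("CCN", 1),  # conflicting labels: dropped
    ("CCC", 0), ("CCCl", 1), ("CCCC", 0), ("CCBr", 1),
]

clean = curate(rows)
train, test = random_split(list(clean))
print(len(clean), len(train), len(test))
```

Because each structure appears only once after curation, no molecule can end up in both splits.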

polaris-hub / polaris-recipes