openkinome / kinodata

Collection of scripts / notebooks to reliably select datasets
MIT License

Discrepancies in summary data #13

Open corey-taylor opened 3 years ago

corey-taylor commented 3 years ago

The summary data in the last release (https://github.com/openkinome/kinodata/releases/tag/v0.2) differs from what the notebooks report and from the data files they output:

| Dataset | Non-curated | Curated |
| --- | --- | --- |
| ChEMBL 27 | 182 223 | 148 836 |
| ChEMBL 28 | 199 238 | 159 978 |

vs. the number of unique records in the output .csv files, as also reported in the notebooks:

| Dataset | Non-curated | Curated |
| --- | --- | --- |
| ChEMBL 27 | 217 612 | 174 238 |
| ChEMBL 28 | 237 336 | 186 972 |
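
To make the size of the gap explicit, here is a minimal sketch that subtracts the release counts from the notebook counts. The numbers are copied from the two tables above; the variable and function names are only illustrative and are not part of kinodata.

```python
# Counts copied from the two tables above, as (non-curated, curated).
release_v02 = {"ChEMBL 27": (182_223, 148_836), "ChEMBL 28": (199_238, 159_978)}
notebooks = {"ChEMBL 27": (217_612, 174_238), "ChEMBL 28": (237_336, 186_972)}


def gaps(release, observed):
    """Return, per dataset, how many more records `observed` has than `release`."""
    return {
        name: tuple(o - r for o, r in zip(observed[name], release[name]))
        for name in release
    }


differences = gaps(release_v02, notebooks)
for name, (non_curated, curated) in differences.items():
    print(f"{name}: non-curated +{non_curated}, curated +{curated}")
```

In every cell the notebook output is larger than the release summary: by 35 389 and 25 402 records for ChEMBL 27, and by 38 098 and 26 994 records for ChEMBL 28.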

The notebooks appear to run fine, so I have added the data from the notebooks and their output files to the latest release (https://github.com/openkinome/kinodata/releases/tag/v0.3). However, having looked at the data directly, I can't establish where the numbers in the first table come from.
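
For anyone who wants to re-check the counts, one possible way to count unique records in an output .csv is sketched below using only the Python standard library. This is an assumption about how such a check could be done, not the method the notebooks use; the function name `count_unique_records` and the optional `key_columns` parameter are hypothetical.

```python
import csv


def count_unique_records(csv_path, key_columns=None):
    """Count distinct rows in a CSV file that has a header line.

    If key_columns is given, rows are compared on those columns only;
    otherwise every column is used.
    """
    seen = set()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        for row in reader:
            columns = key_columns or reader.fieldnames
            seen.add(tuple(row[column] for column in columns))
    return len(seen)


# Example call (the path is a placeholder, not an actual kinodata file name):
# count_unique_records("path/to/output.csv")
```

Whether two rows count as the same record depends on which columns are compared, so passing an explicit `key_columns` list may give a different number than comparing whole rows.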

AndreaVolkamer commented 8 months ago

@mbackenkoehler and @ijpulidos I'm tagging you both here. Could you check whether this is solved, since you are working on the ChEMBL update anyway?