hnlab / can-ai-do

Predicting or pretending: artificial intelligence for protein-ligand interactions lack of sufficiently large and unbiased datasets
MIT License
12 stars 3 forks

[WIP] Directly sampling decoys from whole ZINC database. #1

Closed 0ut0fcontrol closed 4 years ago

0ut0fcontrol commented 5 years ago

There is a topology bias in the DUD/DUD-E datasets, because decoys are selected to be topologically dissimilar to actives.

I tried to sample decoys that are dissimilar to the actives of the same target but similar to the actives of other targets. This did not reduce the bias: because the decoys were limited to a narrow chemical space, they were still much more similar to other decoys than to actives.
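A minimal sketch of how the "decoys are more similar to decoys than to actives" comparison could be quantified. It assumes fingerprints are represented as Python sets of on-bit indices (in practice they would come from a cheminformatics toolkit); the function names and toy data are hypothetical and not taken from this repository.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def mean_max_similarity(queries, references):
    """Average, over all queries, of the highest similarity to any reference.

    The identical object is skipped, so a set can be compared against itself.
    """
    best = []
    for q in queries:
        sims = [tanimoto(q, r) for r in references if r is not q]
        best.append(max(sims, default=0.0))
    return sum(best) / len(best)


# Toy fingerprints (hypothetical): decoys share bits with each other, not with actives.
actives = [{1, 2, 3, 4}, {2, 3, 4, 5}]
decoys = [{10, 11, 12, 13}, {10, 11, 12, 14}, {11, 12, 13, 14}]

decoy_to_decoy = mean_max_similarity(decoys, decoys)
decoy_to_active = mean_max_similarity(decoys, actives)
print(decoy_to_decoy, decoy_to_active)
```

A large gap between the two values would indicate that decoys cluster in their own region of fingerprint space, which is the bias described above.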

So now I am trying to sample decoys directly from the whole ZINC database, to reduce the similarity between decoys.

0ut0fcontrol commented 4 years ago

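For now it is impossible to generate decoys that cannot be separated from actives by fingerprints (fp), because fingerprints are themselves used as the filter when selecting decoys. This is already merged into master locally, so I am closing this PR.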
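The circularity can be illustrated with a small sketch. It assumes, hypothetically, that a candidate is accepted as a decoy only when its maximum Tanimoto similarity to every active is below a cutoff; a rule that reuses the same fingerprints and cutoff then recognizes every selected decoy by construction. The cutoff value, function names, and toy data are assumptions for illustration, not the filter used in this PR.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def max_sim_to_actives(fp, actives):
    """Highest similarity between one fingerprint and any active."""
    return max(tanimoto(fp, a) for a in actives)


def filter_decoys(candidates, actives, cutoff):
    """Keep candidates whose highest similarity to any active is below the cutoff."""
    return [c for c in candidates if max_sim_to_actives(c, actives) < cutoff]


def looks_like_decoy(fp, actives, cutoff):
    """Classifier that reuses the filter: low similarity to all actives means decoy."""
    return max_sim_to_actives(fp, actives) < cutoff


CUTOFF = 0.3  # hypothetical dissimilarity threshold
actives = [{1, 2, 3, 4}, {2, 3, 4, 5}]
candidates = [{1, 2, 3, 9}, {7, 8, 9}, {4, 5, 6, 7, 8}, {20, 21}]

decoys = filter_decoys(candidates, actives, CUTOFF)

# Every selected decoy passes the same test it was selected with.
decoy_calls = [looks_like_decoy(d, actives, CUTOFF) for d in decoys]

# Each active is scored against the other actives only (leave-one-out).
active_calls = [
    looks_like_decoy(a, [x for x in actives if x is not a], CUTOFF)
    for a in actives
]
print(decoy_calls, active_calls)
```

In this toy data the actives happen to resemble each other, so they are not flagged as decoys; that part is not guaranteed in general, whereas the decoys are flagged by construction.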

AUC EF1