LivC193 opened 4 years ago
You are right about the cross-class CV, and we also expected that it would not work.
However, FP+RF achieved good performance (Figure 3B) in the cross-class CV on DUD-E (MW<=500).
This means that "the model can still use topology bias in DUD-E even after avoiding the property bias."
We found that the topology distribution (bit frequencies) differs between all actives and all decoys in the DUD-E (MW<=500) dataset. This can explain (at least partially) why the model has predictive power in the cross-class CV.
"Therefore, the DUD and DUD-E datasets are not suitable for training models which directly or indirectly utilize the compound topological information."
"However, DUD-E can still serve as an independent dataset to test the prediction power of AI models without using it for training."
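To make the bit-frequency comparison mentioned above concrete, here is a minimal sketch. It is an illustration rather than the authors' actual analysis. It assumes fingerprints are plain lists of 0/1 values; the function name `bit_frequencies` and the toy 4-bit data are hypothetical and not taken from DUD-E.

```python
from typing import List, Sequence


def bit_frequencies(fingerprints: Sequence[Sequence[int]]) -> List[float]:
    """Return, for each bit position, the fraction of molecules with that bit set."""
    n_molecules = len(fingerprints)
    n_bits = len(fingerprints[0])
    return [sum(fp[i] for fp in fingerprints) / n_molecules for i in range(n_bits)]


# Toy 4-bit fingerprints (hypothetical data, not taken from DUD-E).
actives = [[1, 0, 1, 0], [1, 1, 1, 0], [1, 0, 0, 0], [1, 0, 1, 0]]
decoys = [[0, 0, 1, 1], [0, 1, 0, 1], [1, 0, 0, 1], [0, 0, 1, 1]]

freq_actives = bit_frequencies(actives)
freq_decoys = bit_frequencies(decoys)

# Per-bit difference: values far from zero mark bits that separate
# actives from decoys regardless of which target they belong to.
diff = [a - d for a, d in zip(freq_actives, freq_decoys)]
print(diff)
```

If some bits are much more frequent in actives than in decoys across all targets, a fingerprint-based model can exploit them even for target classes it was never trained on.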
@0ut0fcontrol thank you very much for the response. All clear now.
I’m glad it helps. 😀
Can you explain what this means?
"We split DUD-E into three folds based on target classes to perform the cross-class CV study. There are 26 kinases in the first fold, 31 targets in the second fold (including 15 proteases, 11 nuclear receptors, and five G-protein coupled receptors), and the rest of 45 targets in the third fold. We also applied a random CV on DUD-E by randomly splitting the targets into three folds with the same fold sizes as the cross-class CV."
Do you use two folds to train (the kinase fold and the second fold) and the third one (45 targets) to test? If so, why would you expect it to work? The dataset is skewed toward decoys, and whatever you learn about actives for one class of proteins (kinases, say) is impossible to translate to a completely different target class such as nuclear receptors.
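For reference, the split described in the quoted passage can be sketched as follows. This is a hedged illustration, not the paper's code. It assumes a standard three-fold CV in which each fold is held out once while the other two are used for training. The target names, the `CLASS_TO_FOLD` mapping, and all function names are hypothetical.

```python
import random
from typing import Dict, Iterator, List, Tuple

# Assumed mapping from target class to fold index; any other class goes to fold 2.
CLASS_TO_FOLD = {"kinase": 0, "protease": 1, "nuclear receptor": 1, "GPCR": 1}


def cross_class_folds(target_classes: Dict[str, str]) -> List[List[str]]:
    """Group targets into three folds by protein class."""
    folds: List[List[str]] = [[], [], []]
    for target, cls in sorted(target_classes.items()):
        folds[CLASS_TO_FOLD.get(cls, 2)].append(target)
    return folds


def random_folds(targets, sizes: List[int], seed: int = 0) -> List[List[str]]:
    """Randomly split targets into folds with the given sizes."""
    shuffled = sorted(targets)
    random.Random(seed).shuffle(shuffled)
    folds, start = [], 0
    for size in sizes:
        folds.append(shuffled[start:start + size])
        start += size
    return folds


def cv_rounds(folds: List[List[str]]) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (train_targets, test_targets), holding out each fold once."""
    for i, test in enumerate(folds):
        train = [t for j, fold in enumerate(folds) if j != i for t in fold]
        yield train, test


# Toy target list (hypothetical names, not the real DUD-E targets).
toy_targets = {
    "t_kin1": "kinase", "t_kin2": "kinase", "t_prot1": "protease",
    "t_nr1": "nuclear receptor", "t_gpcr1": "GPCR",
    "t_other1": "other", "t_other2": "other",
}
class_folds = cross_class_folds(toy_targets)
fold_sizes = [len(fold) for fold in class_folds]
rand_folds = random_folds(toy_targets, fold_sizes)
rounds = list(cv_rounds(class_folds))
```

With the real DUD-E targets the fold sizes would be 26, 31, and 45, and the random CV would reuse those same sizes so that the two experiments differ only in how targets are assigned to folds.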