Closed juliabuhmann closed 3 years ago
pK is the negative base-10 logarithm of the affinity expressed in molar units, so a nanomolar binder (10^-9 M) has a pK of 9.
We treat Kd/Ki/IC50 the same in mapping to pK.
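A minimal sketch of the conversion described above. The function name `to_pk` and the requirement that the input is already in mol/L are assumptions made for illustration; this is not taken from the gnina code.

```python
import math


def to_pk(affinity_molar: float) -> float:
    """Map a Kd, Ki, or IC50 value given in mol/L to pK = -log10(value).

    All three measurement types are handled identically, as stated above.
    """
    if affinity_molar <= 0:
        raise ValueError("affinity must be a positive concentration in mol/L")
    return -math.log10(affinity_molar)


if __name__ == "__main__":
    print(to_pk(1e-9))  # nanomolar binder, pK close to 9
    print(to_pk(1e-6))  # micromolar binder, pK close to 6
```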
A pK of 0 means there is no affinity data for that protein-ligand pair. A negative pK means there is affinity data, but the pose is bad (>2 Å RMSD). In that case we apply a hinge loss, so that overpredicting the affinity of a bad pose is penalized but underpredicting is not.
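The three label cases can be sketched per example as follows. This is only an illustration: the squared penalty, the choice to give zero loss to pairs without affinity data, and the name `affinity_loss` are assumptions, not the actual gnina loss layer.

```python
def affinity_loss(pred: float, label: float) -> float:
    """Illustrative per-example affinity loss with a hinge for bad poses."""
    if label == 0:
        # No affinity data for this pair (assumed here to add no loss).
        return 0.0
    if label > 0:
        # Good pose: penalize errors in both directions.
        return (pred - label) ** 2
    # Bad pose: the true pK is the negated label.
    true_pk = -label
    # Hinge: only overprediction is penalized, underprediction is free.
    overshoot = max(0.0, pred - true_pk)
    return overshoot ** 2


if __name__ == "__main__":
    print(affinity_loss(10.0, -9.0))  # bad pose, overpredicted: penalized
    print(affinity_loss(7.0, -9.0))   # bad pose, underpredicted: no penalty
    print(affinity_loss(7.0, 9.0))    # good pose, underpredicted: penalized
```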
Great, thanks a lot for the quick and helpful reply! Two follow-up questions.
pK-Value
It would be great if this information were added to the README at https://github.com/gnina/models/tree/master/data/CrossDocked2020 (i.e., by expanding this sentence: "Where the label is 1 if the RMSD to the crystal pose is <=2, and the pK is negative if the pose is >2.").
Thanks again!
@francoep Do we have the crystal ligands somewhere?
The negative pK is the pK negated. If you have a nanomolar binder, the good poses will be labeled 9 and the bad poses -9. The negative is just there to indicate it is a bad pose and make it easier to apply the hinge loss.
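The labeling convention described here could be written as below. It is a sketch under assumptions: the function names, the use of `None` for missing affinity data, and the `cutoff` parameter are hypothetical, and the <=2 threshold follows the README sentence quoted earlier in the thread.

```python
from typing import Optional


def pose_label(rmsd: float, cutoff: float = 2.0) -> int:
    """1 for a good pose (RMSD to the crystal pose <= cutoff), else 0."""
    return 1 if rmsd <= cutoff else 0


def affinity_label(pk: Optional[float], rmsd: float, cutoff: float = 2.0) -> float:
    """Signed pK label: +pK for good poses, -pK for bad poses, 0 if unknown."""
    if pk is None:
        return 0.0  # no affinity data for this protein-ligand pair
    return pk if rmsd <= cutoff else -pk


if __name__ == "__main__":
    # A nanomolar binder (pK 9) with one good and one bad pose.
    print(pose_label(1.2), affinity_label(9.0, 1.2))
    print(pose_label(3.5), affinity_label(9.0, 3.5))
```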
Yes, the RMSD was calculated with respect to the original crystal ligand.
If they are in the provided downloads, then they would be labeled
Thanks for the additional information and for updating the README!
I guess you meant the naming scheme is <PDBid>_<ligname>_lig.pdb, for instance 1g9v_rq3_lig.pdb. I found 22,468 such (unique) files in http://bits.csb.pitt.edu/files/crossdock2020/CrossDocked2020/
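For anyone who wants to pick these files out of a directory listing, here is a small sketch. It assumes the naming scheme guessed above is right and that the PDB ID is always four alphanumeric characters; the pattern and the function name `parse_crystal_ligand` are hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumed pattern: <PDBid>_<ligname>_lig.pdb, e.g. 1g9v_rq3_lig.pdb
CRYSTAL_LIGAND = re.compile(r"(?P<pdbid>[0-9A-Za-z]{4})_(?P<ligname>.+)_lig\.pdb")


def parse_crystal_ligand(filename: str) -> Optional[Tuple[str, str]]:
    """Return (pdbid, ligname) if the file name fits the assumed scheme."""
    match = CRYSTAL_LIGAND.fullmatch(filename)
    if match is None:
        return None
    return match.group("pdbid"), match.group("ligname")


if __name__ == "__main__":
    print(parse_crystal_ligand("1g9v_rq3_lig.pdb"))
    print(parse_crystal_ligand("1g9v_rec.pdb"))
```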
Every atom has an inherent gnina type, and these get aggregated into combined types (one on each line) in files like this: https://github.com/gnina/models/blob/master/acs2018/completelig which is included from the moldatalayer.
I have a question: what is stratify_receptor? I couldn't find it in any of your papers. Does it have something to do with binding affinity?
It samples uniformly across receptors. For example, if receptor A has 1000 poses and receptor B has 10 poses, a batch of size 10 would contain 5 examples from A and 5 from B. This avoids learning biases related to imbalances in the training data.
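The idea can be illustrated with a plain-Python sketch. It is not the actual implementation behind stratify_receptor; the function name, the round-robin order over receptors, and the `(receptor, pose)` tuple format are assumptions chosen to reproduce the 5/5 split from the example above.

```python
import random
from collections import Counter, defaultdict


def stratified_batch(examples, batch_size, rng):
    """Draw a batch that is balanced across receptors.

    examples: list of (receptor, pose) pairs.
    Receptors are visited round-robin, and one pose is drawn at random
    from the chosen receptor each time.
    """
    by_receptor = defaultdict(list)
    for receptor, pose in examples:
        by_receptor[receptor].append(pose)
    receptors = sorted(by_receptor)
    batch = []
    for i in range(batch_size):
        receptor = receptors[i % len(receptors)]
        batch.append((receptor, rng.choice(by_receptor[receptor])))
    return batch


if __name__ == "__main__":
    examples = [("A", i) for i in range(1000)] + [("B", i) for i in range(10)]
    batch = stratified_batch(examples, 10, random.Random(0))
    print(Counter(receptor for receptor, _ in batch))
```

Sampling uniformly over all poses instead would draw almost every example from receptor A, which is the imbalance this setting is meant to avoid.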
Thanks a lot for making those datasets available! Very much appreciated.
Thanks for the clarification!