CrossDocked2020 dataset questions

juliabuhmann commented 3 years ago

Thanks a lot for making those datasets available! Very much appreciated.

The paper gives no equation for how you got from the Kd or Ki values provided on the PDBBind webpage to the pK values used in the paper and provided in the types files.
Are both Kd and Ki values mapped to the same pK?
To understand the dataset better, I checked the pK values in this types file: it0_tt_0_train0.types (from this archive: http://bits.csb.pitt.edu/files/crossdock2020/CrossDocked2020_types.tar.gz)
- About 50% of the lines/samples have a pK value of 0. Is this meaningful?
- About 30% of the pK values in that types file are negative, whereas Figure S12 shows experimental pK values in the range of 2 to 12.

Thanks for the clarification!

dkoes commented 3 years ago

pK is the negative log (base 10). So nanomolar (10^-9) is 9.

We treat Kd/Ki/IC50 the same in mapping to pK.
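
As a minimal sketch of the conversion described above (not the authors' actual preprocessing script), the helper below converts an affinity measurement to pK as the negative base-10 logarithm of its molar value. The function name `affinity_to_pk` and the unit table are hypothetical and introduced only for illustration.

```python
import math

# Hypothetical unit table (an assumption for illustration): factors to molar.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def affinity_to_pk(value, unit="M"):
    """Return pK = -log10(affinity in molar).

    Kd, Ki and IC50 are all handled the same way, as stated in this thread.
    """
    if value <= 0:
        raise ValueError("affinity must be positive")
    return -math.log10(value * UNIT_TO_MOLAR[unit])


# A 1 nM binder maps to a pK of about 9.
print(affinity_to_pk(1.0, "nM"))
```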

A pK of 0 means there is no affinity data for that protein-ligand pair. A negative pK means there is affinity data, but the pose is bad (RMSD > 2 Å). We apply a hinge loss in this case so that overpredicting the affinity of a bad pose is penalized but underpredicting is not.
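
The sign convention above can be checked against a list of pK values read from a types file. The sketch below only classifies values by sign and reports the fraction in each class. The names `pk_label_kind` and `label_fractions` are hypothetical, and reading the pK column from the file is left out because the column layout is not described in this thread.

```python
from collections import Counter


def pk_label_kind(pk):
    """Classify a pK label by sign, following the convention in this thread."""
    if pk == 0:
        return "no_affinity_data"
    # Positive: affinity known, pose within the RMSD cutoff.
    # Negative: affinity known, pose beyond the RMSD cutoff.
    return "good_pose" if pk > 0 else "bad_pose"


def label_fractions(pk_values):
    """Return the fraction of values in each class (empty dict for no input)."""
    counts = Counter(pk_label_kind(pk) for pk in pk_values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: n / total for kind, n in counts.items()}


print(label_fractions([0.0, 0.0, 9.0, -9.0]))
```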

juliabuhmann commented 3 years ago

Great, thanks a lot for the quick and helpful reply! Two follow-up questions.

pK value

As I understand it, the pK column holds true experimental data when pK > 0, and no binding affinity value is known when pK = 0.
1. When pK < 0, how did you compute those negative pK values?

It would be great if this information were added to the README at https://github.com/gnina/models/tree/master/data/CrossDocked2020. (Suggested enhancement of this sentence: "Where the label is 1 if the RMSD to the crystal pose is <=2, and the pK is negative if the pose is >2.")

Am I right to assume that the RMSD values in the types file were computed against the original crystal ligand pose?
- I could not find those original crystal ligand SDF files in the dataset. Do I have to download them from PDBBind directly? If they are included in the dataset, how can I find them (what is the naming scheme for crystal ligands)?

Thanks again!

dkoes commented 3 years ago

@francoep Do we have the crystal ligands somewhere?

The negative pK is the pK negated. If you have a nanomolar binder, the good poses will be labeled 9 and the bad poses -9. The negative is just there to indicate it is a bad pose and make it easier to apply the hinge loss.
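
To make the hinge behaviour concrete, here is a minimal per-example sketch in plain Python. It assumes a squared penalty, which is an illustrative choice and not necessarily the loss gnina actually uses. The function name `affinity_loss` is hypothetical.

```python
def affinity_loss(predicted, label):
    """Per-example loss for a signed pK label, as described in this thread.

    label > 0: good pose, ordinary squared error against the true pK.
    label < 0: bad pose, the true pK is -label; only overprediction is penalized.
    label == 0: no affinity data, so the example contributes no loss.
    The squared penalty is an assumption made for this sketch.
    """
    if label == 0:
        return 0.0
    if label > 0:
        return (predicted - label) ** 2
    true_pk = -label
    overprediction = max(0.0, predicted - true_pk)
    return overprediction ** 2


# Nanomolar binder (pK 9): a bad pose is labeled -9.
print(affinity_loss(10.0, -9.0))  # overpredicted on a bad pose: penalized
print(affinity_loss(8.0, -9.0))   # underpredicted on a bad pose: no penalty
```

Storing the sign in the label lets a single column carry both the affinity and the pose quality, so the loss can branch on the label alone.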

francoep commented 3 years ago

Yes, the RMSD was calculated with respect to the original crystal ligand.

If they are in the provided downloads, then they would be labeled __lig.pdb. I'm not sure if we included all of them.

juliabuhmann commented 3 years ago

Thanks for the additional information and for updating the README!

I guess you meant the naming scheme is <PDBid>_<ligname>_lig.pdb, for instance 1g9v_rq3_lig.pdb. I found 22,468 such (unique) files in http://bits.csb.pitt.edu/files/crossdock2020/CrossDocked2020/
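
For anyone who wants to pick these files out of a directory listing, the following is a small sketch that matches the naming scheme above. It assumes the PDB id is four alphanumeric characters and that the ligand name is everything between the id and the `_lig.pdb` suffix. The helper name `parse_crystal_ligand` is hypothetical.

```python
import re

# Assumption: 4-character alphanumeric PDB id, then the ligand name,
# then the literal suffix "_lig.pdb".
CRYSTAL_LIGAND_RE = re.compile(
    r"^(?P<pdbid>[0-9A-Za-z]{4})_(?P<ligname>.+)_lig\.pdb$"
)


def parse_crystal_ligand(filename):
    """Return (pdbid, ligname) for a crystal ligand filename, else None."""
    match = CRYSTAL_LIGAND_RE.match(filename)
    if match is None:
        return None
    return match.group("pdbid"), match.group("ligname")


print(parse_crystal_ligand("1g9v_rq3_lig.pdb"))  # ('1g9v', 'rq3')
```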

dkoes commented 1 year ago

Every atom has an inherent gnina type, and these get aggregated into combined types (one on each line) in files like this: https://github.com/gnina/models/blob/master/acs2018/completelig which is included from the moldatalayer.

Kerro-junior commented 11 months ago

I have a question: what is stratify_receptor? I couldn't find it in any of your papers. Does it have something to do with binding affinity?

dkoes commented 10 months ago

Sample uniformly across receptors. So if receptor A has 1000 poses and receptor B has 10 poses, if you have a batch of size 10 you would have 5 examples from A and 5 from B. This avoids learning biases related to imbalances in the training data.
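
The idea can be illustrated with a short sketch. This is not the actual data-loading code behind stratify_receptor, only an approximation of the behaviour described above: batch slots are assigned to receptors in round-robin order, and a pose is drawn uniformly from the chosen receptor. The function name `stratified_batch` and the `(receptor, pose)` tuple format are assumptions.

```python
import random
from collections import defaultdict


def stratified_batch(examples, batch_size, rng=None):
    """Build one batch with receptors represented as evenly as possible.

    `examples` is an iterable of (receptor, pose) pairs (an assumed format).
    """
    rng = rng or random.Random()
    by_receptor = defaultdict(list)
    for receptor, pose in examples:
        by_receptor[receptor].append(pose)
    receptors = sorted(by_receptor)
    batch = []
    for slot in range(batch_size):
        # Cycle through receptors so each gets an equal share of slots.
        receptor = receptors[slot % len(receptors)]
        batch.append((receptor, rng.choice(by_receptor[receptor])))
    return batch


# Receptor A has 1000 poses, receptor B has 10; a batch of 10 gets 5 of each.
examples = [("A", i) for i in range(1000)] + [("B", i) for i in range(10)]
batch = stratified_batch(examples, 10, random.Random(0))
print(sum(1 for receptor, _ in batch if receptor == "A"))  # 5
```

Without stratification, a uniformly sampled batch from this data would be dominated by receptor A, since it holds about 99% of the poses.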
