google-deepmind / alphafold3

AlphaFold 3 inference pipeline.

With multiple ligand copies (SMILES), sometimes get "Failed to construct RDKit reference structure" #102

Open smg3d opened 23 hours ago

smg3d commented 23 hours ago

Input is one protein + N copies of the same ligand.

Depending on the value of N (40, 50, 60, 80, 100, ..., 200), I get between 1 and 6 RDKit warnings during "constructing SMILES reference structure". The warning message is:

I1120 02:11:55.910477 140432297644032 features.py:1499] Success constructing SMILES reference structure for: LIG_BS
I1120 02:11:55.964899 140432297644032 features.py:1499] Success constructing SMILES reference structure for: LIG_BT
W1120 02:11:55.997912 140432297644032 features.py:1519] Failed to construct RDKit reference structure for: LIG_BT
W1120 02:11:55.998116 140432297644032 features.py:1558] All ref positions unknown for: LIG_BT
I1120 02:11:56.001026 140432297644032 features.py:1499] Success constructing SMILES reference structure for: LIG_BU
I1120 02:11:56.066736 140432297644032 features.py:1499] Success constructing SMILES reference structure for: LIG_BV
I1120 02:11:56.152411 140432297644032 features.py:1499] Success constructing SMILES reference structure for: LIG_BW

Also, if I get an RDKit warning, I also get the following messages (the number of lines equals the number of atoms in the ligand):

I1120 02:11:56.518429 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.518555 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.518632 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.518705 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.518779 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.518856 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.518932 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.519007 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.519084 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.519158 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.519235 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.519312 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.519390 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.519466 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.519541 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.519618 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.519696 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.519773 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.519850 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.519926 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.520001 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.520076 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.520153 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.520230 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.520307 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.520385 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.520462 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.520538 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.520613 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.520689 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.520765 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.520841 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.

Structure inference proceeds without warnings or errors, and the ligand with the RDKit warning has coordinates:

HETATM 2861 C C21 . LIG_BS BS 34 .   ? -20.036 8.644   -2.704  1.00 37.82 1   BS 1 
HETATM 2862 C C22 . LIG_BS BS 34 .   ? -19.076 9.331   -1.829  1.00 38.11 1   BS 1 
HETATM 2863 O O7  . LIG_BS BS 34 .   ? -20.886 7.213   -9.300  1.00 38.66 1   BS 1 
HETATM 2864 O O8  . LIG_BS BS 34 .   ? -25.223 9.323   -13.970 1.00 40.94 1   BS 1 
HETATM 2865 O O1  . LIG_BT BT 35 .   ? -5.460  -5.732  7.422   1.00 50.92 1   BT 1 
HETATM 2866 P P1  . LIG_BT BT 35 .   ? -4.092  -5.278  7.721   1.00 56.34 1   BT 1 
HETATM 2867 O O2  . LIG_BT BT 35 .   ? -3.877  -5.308  9.260   1.00 40.80 1   BT 1 
HETATM 2868 C C1  . LIG_BT BT 35 .   ? -3.126  -6.282  9.781   1.00 51.54 1   BT 1 
HETATM 2869 C C2  . LIG_BT BT 35 .   ? -3.604  -6.512  11.152  1.00 51.66 1   BT 1 
HETATM 2870 N N1  . LIG_BT BT 35 .   ? -2.710  -7.340  11.881  1.00 35.27 1   BT 1 
HETATM 2871 C C3  . LIG_BT BT 35 .   ? -1.487  -6.620  12.127  1.00 45.03 1   BT 1 
HETATM 2872 C C4  . LIG_BT BT 35 .   ? -2.420  -8.536  11.153  1.00 42.59 1   BT 1 
HETATM 2873 C C5  . LIG_BT BT 35 .   ? -3.315  -7.687  13.119  1.00 43.89 1   BT 1 
HETATM 2874 O O3  . LIG_BT BT 35 .   ? -3.882  -3.765  7.361   1.00 35.57 1   BT 1 
HETATM 2875 C C6  . LIG_BT BT 35 .   ? -4.932  -2.899  7.372   1.00 40.89 1   BT 1 
HETATM 2876 C C7  . LIG_BT BT 35 .   ? -4.675  -1.847  6.347   1.00 42.96 1   BT 1 
HETATM 2877 O O4  . LIG_BT BT 35 .   ? -5.889  -1.204  6.084   1.00 37.67 1   BT 1 
HETATM 2878 C C8  . LIG_BT BT 35 .   ? -5.872  0.119   6.255   1.00 41.91 1   BT 1 
HETATM 2879 C C9  . LIG_BT BT 35 .   ? -6.749  0.847   5.345   1.00 32.44 1   BT 1 
HETATM 2880 C C10 . LIG_BT BT 35 .   ? -6.126  2.123   4.969   1.00 35.99 1   BT 1 
HETATM 2881 C C11 . LIG_BT BT 35 .   ? -6.872  2.717   3.827   1.00 34.30 1   BT 1 
HETATM 2882 C C12 . LIG_BT BT 35 .   ? -6.222  3.988   3.425   1.00 34.18 1   BT 1 
HETATM 2883 C C13 . LIG_BT BT 35 .   ? -6.926  4.564   2.248   1.00 32.43 1   BT 1 
HETATM 2884 C C14 . LIG_BT BT 35 .   ? -6.304  5.835   1.829   1.00 30.69 1   BT 1 
HETATM 2885 O O5  . LIG_BT BT 35 .   ? -5.224  0.634   7.068   1.00 39.24 1   BT 1 
HETATM 2886 C C15 . LIG_BT BT 35 .   ? -4.220  -2.458  5.068   1.00 46.10 1   BT 1 
HETATM 2887 O O6  . LIG_BT BT 35 .   ? -3.778  -1.454  4.222   1.00 45.43 1   BT 1 
HETATM 2888 C C16 . LIG_BT BT 35 .   ? -2.489  -1.455  4.020   1.00 45.46 1   BT 1 
HETATM 2889 C C17 . LIG_BT BT 35 .   ? -1.932  -0.427  3.122   1.00 40.81 1   BT 1 
HETATM 2890 C C18 . LIG_BT BT 35 .   ? -2.939  0.165   2.216   1.00 35.12 1   BT 1 
HETATM 2891 C C19 . LIG_BT BT 35 .   ? -2.288  1.214   1.384   1.00 36.52 1   BT 1 
HETATM 2892 C C20 . LIG_BT BT 35 .   ? -3.278  1.842   0.482   1.00 34.07 1   BT 1 
HETATM 2893 C C21 . LIG_BT BT 35 .   ? -2.619  2.921   -0.314  1.00 33.31 1   BT 1 
HETATM 2894 C C22 . LIG_BT BT 35 .   ? -3.600  3.558   -1.222  1.00 31.37 1   BT 1 
HETATM 2895 O O7  . LIG_BT BT 35 .   ? -1.770  -2.254  4.519   1.00 41.62 1   BT 1 
HETATM 2896 O O8  . LIG_BT BT 35 .   ? -3.037  -6.058  7.053   1.00 41.36 1   BT 1 
HETATM 2897 O O1  . LIG_BU BU 36 .   ? 18.988  0.042   0.142   1.00 43.12 1   BU 1 
HETATM 2898 P P1  . LIG_BU BU 36 .   ? 19.127  -1.382  -0.125  1.00 52.35 1   BU 1 
HETATM 2899 O O2  . LIG_BU BU 36 .   ? 20.431  -1.602  -0.914  1.00 34.40 1   BU 1 

However, all metrics related to that ligand are null in summary_confidences.json:

 "chain_pair_iptm": [
  [
   0.78,
...
   0.02,
   null,                            <=== diagonal for LIG_BT
   0.03,
...
  ],
 "chain_pair_pae_min": [
...
  [                                   <===  LIG_BT
   null,
   null,
...
   null,
   null
  ],
...
 ],
 "chain_ptm": [
  0.78,
...
  0.31,
  null,                           <===  LIG_BT
  0.33,
...
 ],
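With many ligand copies it can be tedious to spot the null entries by eye. The following is a minimal sketch, assuming only the two keys shown in the excerpt above (`chain_ptm` and `chain_pair_pae_min`) and using made-up toy values rather than real output. The helper names are hypothetical, and mapping the returned indices back to chain IDs such as `BT` assumes the arrays follow the chain order of the output structure.

```python
import json

def chains_without_confidence(summary: dict) -> list[int]:
    """Return indices of chains whose chain_ptm entry is null (None in Python)."""
    return [i for i, value in enumerate(summary["chain_ptm"]) if value is None]

def all_null_pae_rows(summary: dict) -> list[int]:
    """Return indices of chain_pair_pae_min rows that contain only nulls."""
    return [
        i for i, row in enumerate(summary["chain_pair_pae_min"])
        if all(value is None for value in row)
    ]

# Toy data shaped like the excerpt above; the numbers are made up.
toy = json.loads("""
{
  "chain_ptm": [0.78, null, 0.33],
  "chain_pair_pae_min": [
    [0.8, null, 5.1],
    [null, null, null],
    [4.9, null, 0.9]
  ]
}
""")

print(chains_without_confidence(toy))  # [1]
print(all_null_pae_rows(toy))          # [1]
```

For a real run, replace the toy string with the contents of `summary_confidences.json`, loaded with `json.load`.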

The number of problematic ligands varies between runs with different ligands, and sometimes between different seeds within the same run, e.g.:

# SEED 1
lrat_dhpc-11/gra1342-27320233.out:W1120 22:23:15.401729 46919536050176 features.py:1519] Failed to construct RDKit reference structure for: LIG_FI
lrat_dhpc-11/gra1342-27320233.out:W1120 22:23:17.371930 46919536050176 features.py:1519] Failed to construct RDKit reference structure for: LIG_DL
lrat_dhpc-11/gra1342-27320233.out:W1120 22:23:19.033136 46919536050176 features.py:1519] Failed to construct RDKit reference structure for: LIG_FN
lrat_dhpc-11/gra1342-27320233.out:W1120 22:23:23.795290 46919536050176 features.py:1519] Failed to construct RDKit reference structure for: LIG_GT
lrat_dhpc-11/gra1342-27320233.out:W1120 22:23:33.568039 46919536050176 features.py:1519] Failed to construct RDKit reference structure for: LIG_GF
lrat_dhpc-11/gra1342-27320233.out:W1120 22:23:38.974108 46919536050176 features.py:1519] Failed to construct RDKit reference structure for: LIG_GM

# SEED 2
lrat_dhpc-11/gra1342-27320233.out:W1120 22:24:17.748973 46919536050176 features.py:1519] Failed to construct RDKit reference structure for: LIG_GF
lrat_dhpc-11/gra1342-27320233.out:W1120 22:24:23.128701 46919536050176 features.py:1519] Failed to construct RDKit reference structure for: LIG_GM

In 30+ runs with N >= 40, every run gets at least one warning (with the associated null metrics). In all runs with N <= 30, there is no RDKit warning.
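To tally the affected ligands across many runs, the warning lines can be extracted from the log output. This is a minimal sketch that relies only on the message text seen in the excerpts above; the function name `failed_ligands` is hypothetical, and the sample text is copied from the first log excerpt.

```python
import re
from collections import Counter

# Message text copied from the log excerpts above.
FAILED = re.compile(r"Failed to construct RDKit reference structure for: (\S+)")

def failed_ligands(log_text: str) -> Counter:
    """Count RDKit reference-structure failures per ligand ID in a log."""
    return Counter(match.group(1) for match in FAILED.finditer(log_text))

sample = """\
I1120 02:11:55.964899 140432297644032 features.py:1499] Success constructing SMILES reference structure for: LIG_BT
W1120 02:11:55.997912 140432297644032 features.py:1519] Failed to construct RDKit reference structure for: LIG_BT
W1120 02:11:55.998116 140432297644032 features.py:1558] All ref positions unknown for: LIG_BT
I1120 02:11:56.001026 140432297644032 features.py:1499] Success constructing SMILES reference structure for: LIG_BU
"""

print(failed_ligands(sample))  # Counter({'LIG_BT': 1})
```

Because the pattern uses a search rather than a full-line match, it also works on `grep` output where each line is prefixed with a file name.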

The structure of the problematic ligand appears normal.

If it weren't for the null metrics associated with that ligand, I would not worry. Maybe all is fine, and it is just a problem in the metrics computation routine when something is "wrong" with that ligand at the start (i.e. "Found identical coordinates: Assigning as colinear.").

joshabramson commented 15 minutes ago

Thanks for the report. This happens if RDKit fails to generate a conformer for some random seeds, and there are no fallback idealised coordinates given in the CCD CIF defining the ligand input. You can work around this by adding idealised coordinates.
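To check offline whether a given SMILES string is prone to seed-dependent conformer failures, one can try several seeds with RDKit directly. The sketch below is an independent check, not the code path AlphaFold 3 uses: the helper names `first_successful_seed` and `rdkit_embed` are hypothetical, and default `EmbedMolecule` settings are assumed, which may differ from the pipeline's.

```python
from typing import Callable, Optional, Sequence

def first_successful_seed(
    embed: Callable[[int], bool], seeds: Sequence[int]
) -> Optional[int]:
    """Return the first seed for which embed(seed) succeeds, else None."""
    for seed in seeds:
        if embed(seed):
            return seed
    return None

def rdkit_embed(smiles: str) -> Callable[[int], bool]:
    """Build an embed(seed) callable backed by RDKit (imported lazily)."""
    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")
    mol = Chem.AddHs(mol)

    def embed(seed: int) -> bool:
        # EmbedMolecule returns the new conformer ID, or -1 on failure.
        return AllChem.EmbedMolecule(mol, randomSeed=seed) != -1

    return embed
```

With RDKit installed, `first_successful_seed(rdkit_embed(smiles), range(20))` returns a seed that yields a conformer, or `None` if all 20 fail. Coordinates from a successful embedding could then serve as the idealised coordinates mentioned above.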

When there are no conformer coordinates, we cannot generate frames for PAE, and without a frame we give up on generating a confidence. However, that is behavior we could change: we had single-atom ions in mind for that case (where there were no frames in training either). Full ligands should be fine at inference time, as the frames aren't actually used at inference time. But perhaps, given there are no reference coordinates, it's better to have NaNs here, so that users can see from the output that something is different in these cases (likely not as good a prediction).