Closed munick closed 8 years ago
This specific error seems to arise from an edge case where in a given shell, an atom is lying perfectly on top of the center atom, which should never happen but I guess could if conformer generation did something weird. For this molecule, it appears there are two free Cl ions lying on top of each other, and that's why the error is raised for all shells where the center atom is one of those two atoms. From looking at the downstream effect, the resulting bits are strongly affected by this error. I'll add code to handle this edge case.
I compared my conformers for this molecule generated with the same set of parameters, and they don't have the Cl ions in them. @elcaceres, do you have any insight into why your conformer for this molecule would have Chlorines? Did you do any pre-processing to the SMILES?
@munick I'll commit a fix that catches this error and logs a warning under this circumstance. When you get a chance, if you could pull that fix, remove your print statements, and check to make sure you're not still getting the numpy error warnings, that would be great. I remember seeing that some of the arrays that flew across your terminal were filled with almost entirely nans. Thanks!
@munick, I just pushed the commit that should address this bug.
@sdaxen No, I didn't do any preprocessing for the SMILES before sending them to the script. This was all done pre-standardisation flag though.
This is odd, because I'm not using the standardisation flag either. Could you direct me to your smiles file?
On Fri, Mar 18, 2016, 3:06 PM elcaceres notifications@github.com wrote:
@sdaxen https://github.com/sdaxen No, I didn't do any preprocessing for the SMILES before sending them to the script. This was all done pre-standardisation flag though.
— You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub https://github.com/keiserlab/e3fp/issues/9#issuecomment-198561985
Sure. It's at /srv/home/ecaceres/chembl20_minibatching/datasets/e3fp_conformers/all_chembl_smiles_mid_mwcutoff800.smi
From memory, these were the first SMILES pulled from ChEMBL for testing out e3fp. I don't think I modified them in any way. Definitely look like they have Chlorines though.
@sdaxen it does run now but there are a lot of warnings. I saved the output to: /srv/home/nmew/myprojects/e3fp/myscripts/parallel_fingerprints_out.txt
it looks like there are 1,995 compounds with overlapping atoms
grep "WARNING|Overlapping" /srv/home/nmew/myprojects/e3fp/myscripts/parallel_fingerprints_out.txt | cut -d "C" -f2 | sort -u | wc -l
1995
of those, most had overlapping atoms in all 3 conformers
[nmew@mk-1-a myscripts]$ grep "WARNING|Overlapping" /srv/home/nmew/myprojects/e3fp/myscripts/parallel_fingerprints_out.txt | cut -d "C" -f2 | sort | uniq -c | cut -d "H" -f1 | sort | uniq -c
248 1
243 2
1504 3
my googling skills aren't good enough to get me how many atoms were overlapping in each but eyeballing it, most are one atom pair but there are a few with a bunch:
2016-03-19 02:28:08,588|WARNING|Overlapping atoms (0, 1) in conformer 1 of molecule CHEMBL3216947. Fingerprinting will continue but is less reliable.
2016-03-19 02:28:08,633|WARNING|Overlapping atoms (0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3) in conformer 0 of molecule CHEMBL2364567. Fingerprinting will continue but is less reliable.
So it looks like in my molecules, around 300 raise this warning. From spot-checking, all of them seem to be due to floating ions. Since the placement of these is more or less arbitrary, I'll add a default flag to mask all atoms with no bonds prior to fingerprinting. That should handle all cases in this category.
@elcaceres, my CHEMBL20 SMILES came from the seaware_academic repo. They don't seem to be the same as yours. For the molecule above, your SMILES are
O.Cl.Cl.CN1CCN(CCN2CCN(CC2)C3CC(c4ccc(F)cc4)c5ccccc35)C1=O CHEMBL1237319
and mine are
CN1CCN(CCN2CCN(C3CC(c4ccc(F)cc4)c4ccccc43)CC2)C1=O CHEMBL1237319
I wonder if they were preprocessed somehow before being deposited there. @mjke, do you know?
Ah, very interesting! @mmysinger do you know anything about this either?
Hi guys, generally we filter out ions and unconnected smaller structures in the standardization & cleanup scripts used to build the seaware libraries. I'd recommend we make this part of our normal cleanup as well. Does the ChEMBL standardiser not do this?
On Mon, Mar 21, 2016 at 3:34 PM, elcaceres notifications@github.com wrote:
Ah, very interesting! @mmysinger https://github.com/mmysinger do you know anything about this either?
— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/keiserlab/e3fp/issues/9#issuecomment-199519103
Michael J Keiser, PhD Assistant Professor, UCSF
Dept Pharmaceutical Chemistry Dept Bioengineering & Therapeutic Sciences Institute for Neurodegenerative Diseases
p/1-415-886-7651 keiserlab.org http://www.keiserlab.org calendly.com/mjke
So I've been doing standardization during all phases of ECFP generation, but not before. I think the disconnect with these conformers is that they were generated with the SMILES pulled straight from ChEMBL without standardization (thus "without the new --standardise flag" thing I said earlier) because I had (wrongly) assumed it was implemented in E3FP at the time. Standardization has since been applied to E3FP but these conformers are out of date with respect to that. Long story short: Conformers need to be regenerated with standardization. Standardiser does this, but I don't think ChEMBL does
Thanks, @mjke. I went back and double checked, and the only molecules that had these warnings for me were molecules I'd added myself directly from CHEMBL. It looks like the prefiltering that's been done for SEA is sufficient to do away with this problem. I'll still add the flag in case some poor future soul falls into this trap.
@sdaxen Awesome! As a note: the standardise function removes the Chlorines as well:
In [12]: print smi
Cl.NCCNS(=O)(=O)c1cc(C(=O)Nc2sc3CCCCc3c2C#N)c(Cl)cc1Cl
In [13]: std_mol = standardise.apply(mol)
In [14]: Chem.MolToSmiles(std_mol)
Out[14]: 'N#Cc1c(NC(=O)c2cc(S(=O)(=O)NCCN)c(Cl)cc2Cl)sc2c1CCCC2'
Cool! Yeah I'm just thinking of the use case where someone runs conformer generation using some other method that leaves in the ions and then goes to do fingerprinting. We should just ignore those ions in the conformer.
Ah, sounds good. Good catch, @munick !
when running
native_tuples_from_sdf
I get a warning with certain sdf files.Here's the debug log (with our added print statements) when generating fingerprints for /srv/home/ecaceres/chembl20_minibatching/datasets/e3fp_conformers/all_chembl_mid_mwcutoff800_conformers_rms0.5_first10/CHEMBL1237319.sdf.bz2