keiserlab / e3fp

3D molecular fingerprints
GNU Lesser General Public License v3.0
121 stars 33 forks source link

RuntimeWarning: invalid value encountered in less #9

Closed munick closed 8 years ago

munick commented 8 years ago

when running native_tuples_from_sdf I get a warning with certain sdf files.

Here's the debug log (with our added print statements) when generating fingerprints for /srv/home/ecaceres/chembl20_minibatching/datasets/e3fp_conformers/all_chembl_mid_mwcutoff800_conformers_rms0.5_first10/CHEMBL1237319.sdf.bz2

2016-03-17 22:05:51,536|INFO|Generated 3 fingerprints for CHEMBL436273.
2016-03-17 22:05:51,541|DEBUG|Wrote native strings for molecule 61 to molecules file.
2016-03-17 22:05:51,549|DEBUG|Read 10 conformers from /srv/home/ecaceres/chembl20_minibatching/datasets/e3fp_conformers/all_chembl_mid_mwcutoff800_conformers_rms0.5_first10/CHEMBL1237319.sdf.bz2.
2016-03-17 22:05:51,549|INFO|Generating fingerprints for CHEMBL1237319.
/srv/home/nmew/myprojects/e3fp/e3fp/fingerprint/array_ops.py:36: RuntimeWarning: invalid value encountered in divide
  u /= mag
/srv/home/nmew/myprojects/e3fp/e3fp/fingerprint/fprinter.py:845: RuntimeWarning: invalid value encountered in less
  long_angle[np.fabs(long_angle) < EPS] = 0.  # perfect right angles
/srv/home/nmew/myprojects/e3fp/e3fp/fingerprint/fprinter.py:861: RuntimeWarning: invalid value encountered in less
  pole_inds = np.where(np.pi/2 - long_angle < POLAR_CONE_RAD)
[ 1.57079633  0.31248145         nan  0.45516691]
(array([0]),)
[1]
[ 1.57079633  0.31248145         nan  0.45516691]
(array([0]),)
[1]
[ 1.57079631  1.47906026  0.41370037  0.77740228  0.56549272  0.00828483
  0.64130602  1.07854162  0.45516691  0.98647732         nan]
(array([0]),)
[1]
[ 1.57079631  1.47906026  0.41370037  0.77740228  0.56549272  0.00828483
  0.64130602  1.07854162  0.45516691  0.98647732         nan]
(array([0]),)
[1]
[ 1.57079631  1.22271505  0.9585553          nan  0.63121438  1.47906026
  0.79885896  0.54478009  1.03766873  0.32548157  0.57498842  0.05658342
  1.18139355  1.05129018  0.68989266]
(array([0]),)
[1]
[ 1.57079631  1.22271505  0.9585553          nan  0.63121438  1.47906026
  0.79885896  0.54478009  1.03766873  0.32548157  0.57498842  0.05658342
  1.18139355  1.05129018  0.68989266]
(array([0]),)
[1]
[ 1.57079633  1.31577727  1.31500776         nan  1.28507744  1.39850439
  0.14917578  0.99059568  1.03766873  0.4107033   1.12352707  0.96839275
  0.69206276  1.29264908  1.07854162  1.47155832  0.2890973   1.10812961
  0.72396623  0.90273306  0.9588358   1.0925341 ]
(array([0]),)
[1]
[ 1.57079633  1.31577727  1.31500776         nan  1.28507744  1.39850439
  0.14917578  0.99059568  1.03766873  0.4107033   1.12352707  0.96839275
  0.69206276  1.29264908  1.07854162  1.47155832  0.2890973   1.10812961
  0.72396623  0.90273306  0.9588358   1.0925341 ]
(array([0]),)
[1]
[ 1.57079631  0.96839275  1.18032516  0.85513393  0.69042051  0.71215396
  0.11930842  0.80152987  0.7164222   1.02283212  0.87814927  1.05129018
  1.03592242  0.60440583  0.56687702  0.71753327  0.40101425  0.74127808
  0.7128208          nan  0.83519906  0.88550881  0.97731149  0.83604344
  0.66122576  0.93115759  0.97252186  1.00529038  0.98647732]
(array([0]),)
[1]
[ 1.57079631  0.96839275  1.18032516  0.85513393  0.69042051  0.71215396
  0.11930842  0.80152987  0.7164222   1.02283212  0.87814927  1.05129018
  1.03592242  0.60440583  0.56687702  0.71753327  0.40101425  0.74127808
  0.7128208          nan  0.83519906  0.88550881  0.97731149  0.83604344
  0.66122576  0.93115759  0.97252186  1.00529038  0.98647732]
(array([0]),)
[1]
2016-03-17 22:05:51,725|DEBUG|
2016-03-17 22:05:51,886|DEBUG|
2016-03-17 22:05:52,046|DEBUG|
2016-03-17 22:05:52,046|INFO|Generated 3 fingerprints for CHEMBL1237319.
2016-03-17 22:05:52,052|DEBUG|Wrote native strings for molecule 62 to molecules file.
2016-03-17 22:05:52,063|DEBUG|Read 10 conformers from /srv/home/ecaceres/chembl20_minibatching/datasets/e3fp_conformers/all_chembl_mid_mwcutoff800_conformers_rms0.5_first10/CHEMBL2018492.sdf.bz2.
2016-03-17 22:05:52,063|INFO|Generating fingerprints for CHEMBL2018492.
2016-03-17 22:05:52,261|DEBUG|
2016-03-17 22:05:52,457|DEBUG|
2016-03-17 22:05:52,653|DEBUG|
2016-03-17 22:05:52,653|INFO|Generated 3 fingerprints for CHEMBL2018492.
2
sethaxen commented 8 years ago

This specific error seems to arise from an edge case where in a given shell, an atom is lying perfectly on top of the center atom, which should never happen but I guess could if conformer generation did something weird. For this molecule, it appears there are two free Cl ions lying on top of each other, and that's why the error is raised for all shells where the center atom is one of those two atoms. From looking at the downstream effect, the resulting bits are strongly affected by this error. I'll add code to handle this edge case.

I compared my conformers for this molecule generated with the same set of parameters, and they don't have the Cl ions in them. @elcaceres, do you have any insight into why your conformer for this molecule would have Chlorines? Did you do any pre-processing to the SMILES?

@munick I'll commit a fix that catches this error and logs a warning under this circumstance. When you get a chance, if you could pull that fix, remove your print statements, and check to make sure you're not still getting the numpy error warnings, that would be great. I remember seeing that some of the arrays that flew across your terminal were filled with almost entirely nans. Thanks!

sethaxen commented 8 years ago

@munick, I just pushed the commit that should address this bug.

elcaceres commented 8 years ago

@sdaxen No, I didn't do any preprocessing for the SMILES before sending them to the script. This was all done pre-standardisation flag though.

sethaxen commented 8 years ago

This is odd, because I'm not using the standardisation flag either. Could you direct me to your smiles file?

On Fri, Mar 18, 2016, 3:06 PM elcaceres notifications@github.com wrote:

@sdaxen https://github.com/sdaxen No, I didn't do any preprocessing for the SMILES before sending them to the script. This was all done pre-standardisation flag though.

— You are receiving this because you were mentioned.

Reply to this email directly or view it on GitHub https://github.com/keiserlab/e3fp/issues/9#issuecomment-198561985

elcaceres commented 8 years ago

Sure. It's at /srv/home/ecaceres/chembl20_minibatching/datasets/e3fp_conformers/all_chembl_smiles_mid_mwcutoff800.smi

From memory, these were the first SMILES pulled from ChEMBL for testing out e3fp. I don't think I modified them in any way. Definitely look like they have Chlorines though.

munick commented 8 years ago

@sdaxen it does run now but there are a lot of warnings. I saved the output to: /srv/home/nmew/myprojects/e3fp/myscripts/parallel_fingerprints_out.txt

it looks like there are 1,995 compounds with overlapping atoms

grep "WARNING|Overlapping"  /srv/home/nmew/myprojects/e3fp/myscripts/parallel_fingerprints_out.txt | cut -d "C" -f2 | sort -u | wc -l
1995

of those, most had overlapping atoms in all 3 conformers

[nmew@mk-1-a myscripts]$ grep "WARNING|Overlapping"  /srv/home/nmew/myprojects/e3fp/myscripts/parallel_fingerprints_out.txt | cut -d "C" -f2 | sort | uniq -c | cut -d "H" -f1 | sort | uniq -c
    248       1 
    243       2 
   1504       3 

my googling skills aren't good enough to get me how many atoms were overlapping in each but eyeballing it, most are one atom pair but there are a few with a bunch:

2016-03-19 02:28:08,588|WARNING|Overlapping atoms (0, 1) in conformer 1 of molecule CHEMBL3216947. Fingerprinting will continue but is less reliable.
2016-03-19 02:28:08,633|WARNING|Overlapping atoms (0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3) in conformer 0 of molecule CHEMBL2364567. Fingerprinting will continue but is less reliable.
sethaxen commented 8 years ago

So it looks like in my molecules, around 300 raise this warning. From spot-checking, all of them seem to be due to floating ions. Since the placement of these is more or less arbitrary, I'll add a default flag to mask all atoms with no bonds prior to fingerprinting. That should handle all cases in this category.

@elcaceres, my CHEMBL20 SMILES came from the seaware_academic repo. They don't seem to be the same as yours. For the molecule above, your SMILES are O.Cl.Cl.CN1CCN(CCN2CCN(CC2)C3CC(c4ccc(F)cc4)c5ccccc35)C1=O CHEMBL1237319 and mine are CN1CCN(CCN2CCN(C3CC(c4ccc(F)cc4)c4ccccc43)CC2)C1=O CHEMBL1237319 I wonder if they were preprocessed somehow before being deposited there. @mjke, do you know?

elcaceres commented 8 years ago

Ah, very interesting! @mmysinger do you know anything about this either?

mjke commented 8 years ago

Hi guys, generally we filter out ions and unconnected smaller structures in the standardization & cleanup scripts used to build the seaware libraries. I'd recommend we make this part of our normal cleanup as well. Does the ChEMBL standardiser not do this?

On Mon, Mar 21, 2016 at 3:34 PM, elcaceres notifications@github.com wrote:

Ah, very interesting! @mmysinger https://github.com/mmysinger do you know anything about this either?

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/keiserlab/e3fp/issues/9#issuecomment-199519103

Michael J Keiser, PhD Assistant Professor, UCSF

Dept Pharmaceutical Chemistry Dept Bioengineering & Therapeutic Sciences Institute for Neurodegenerative Diseases

p/1-415-886-7651 keiserlab.org http://www.keiserlab.org calendly.com/mjke

elcaceres commented 8 years ago

So I've been doing standardization during all phases of ECFP generation, but not before. I think the disconnect with these conformers is that they were generated with the SMILES pulled straight from ChEMBL without standardization (thus "without the new --standardise flag" thing I said earlier) because I had (wrongly) assumed it was implemented in E3FP at the time. Standardization has since been applied to E3FP but these conformers are out of date with respect to that. Long story short: Conformers need to be regenerated with standardization. Standardiser does this, but I don't think ChEMBL does

sethaxen commented 8 years ago

Thanks, @mjke. I went back and double checked, and the only molecules that had these warnings for me were molecules I'd added myself directly from CHEMBL. It looks like the prefiltering that's been done for SEA is sufficient to do away with this problem. I'll still add the flag in case some poor future soul falls into this trap.

elcaceres commented 8 years ago

@sdaxen Awesome! As a note: the standardise function removes the Chlorines as well:

In [12]: print smi
Cl.NCCNS(=O)(=O)c1cc(C(=O)Nc2sc3CCCCc3c2C#N)c(Cl)cc1Cl

In [13]:  std_mol = standardise.apply(mol)

In [14]: Chem.MolToSmiles(std_mol)
Out[14]: 'N#Cc1c(NC(=O)c2cc(S(=O)(=O)NCCN)c(Cl)cc2Cl)sc2c1CCCC2'
sethaxen commented 8 years ago

Cool! Yeah I'm just thinking of the use case where someone runs conformer generation using some other method that leaves in the ions and then goes to do fingerprinting. We should just ignore those ions in the conformer.

elcaceres commented 8 years ago

Ah, sounds good. Good catch, @munick !