rdkit / rdkit

The official sources for the RDKit library
BSD 3-Clause "New" or "Revised" License

SmilesMolSupplier bails out of read loop early if bad SMILES on last line. #2479

Closed DavidACosgrove closed 5 years ago

DavidACosgrove commented 5 years ago

Description:

When reading a SMILES file with SmilesMolSupplier, a parse error on the last line is not easy to catch. It appears that the iterator ends the loop before the error-handling code in the loop body runs. In the code below, the second loop should print "Record 7 not read.", as the first loop does, but it doesn't. test2.smi is test1.smi with the final line removed, so its last record is the unparsable duff2.

#!/usr/bin/env python

from rdkit import Chem

suppl1 = Chem.SmilesMolSupplier('test1.smi', titleLine=False, nameColumn=1)
rec_num = 0
for mol in suppl1:
    rec_num += 1
    if not mol:
        print('Record {} not read.'.format(rec_num))
    else:
        print('Record {} read ok.'.format(rec_num))

suppl2 = Chem.SmilesMolSupplier('test2.smi', titleLine=False, nameColumn=1)
rec_num = 0
for mol in suppl2:
    rec_num += 1
    if not mol:
        print('Record {} not read.'.format(rec_num))
    else:
        print('Record {} read ok.'.format(rec_num))

test1.smi:

c1ccccc  duff
c1ccccc1 ok
c1ccncc1 pyridine
C(C garbage
C1CC1 ok2
C1C(Cl)C1 ok3
C1C(Cl)CCCC duff2
CCCCC pentane

test2.smi:

c1ccccc  duff
c1ccccc1 ok
c1ccncc1 pyridine
C(C garbage
C1CC1 ok2
C1C(Cl)C1 ok3
C1C(Cl)CCCC duff2

greglandrum commented 5 years ago

Confirmed. I will take a look.
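
Until the supplier itself is fixed, one possible stopgap is to compare the number of records the loop saw with the number of data lines in the file. The sketch below is plain Python and does not call RDKit; `count_data_lines` and `report_missing` are hypothetical helper names, and it assumes one record per non-blank line, which may not match every file the supplier accepts.

```python
import os
import tempfile


def count_data_lines(path, title_line=False):
    """Count non-blank lines, optionally skipping one header line.

    Assumption: each non-blank line is one record.
    """
    with open(path) as handle:
        lines = [line for line in handle if line.strip()]
    if title_line and lines:
        lines = lines[1:]
    return len(lines)


def report_missing(path, records_seen, title_line=False):
    """Return the 1-based record numbers the read loop never reached."""
    expected = count_data_lines(path, title_line)
    return list(range(records_seen + 1, expected + 1))


# Demo with the contents of test2.smi from the report above.
TEST2 = """c1ccccc  duff
c1ccccc1 ok
c1ccncc1 pyridine
C(C garbage
C1CC1 ok2
C1C(Cl)C1 ok3
C1C(Cl)CCCC duff2
"""

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test2.smi")
    with open(path, "w") as handle:
        handle.write(TEST2)
    # Per the report, the loop over this file ends after 6 records.
    for rec_num in report_missing(path, records_seen=6):
        print("Record {} not read.".format(rec_num))
```

In the script from the report, this would mean calling `report_missing('test2.smi', rec_num)` after the second loop and treating every returned number as a failed record.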

greglandrum commented 5 years ago

What do you know... there's even a FIX: comment in the code about this: https://github.com/rdkit/rdkit/blob/master/Code/GraphMol/Wrap/MolSupplier.h#L38
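
To illustrate how an edge case of this kind can arise, here is a toy model in plain Python. It is not RDKit's code: `ToySupplier`, `buggy_iter`, and `safe_iter` are hypothetical names, and the stop condition in `buggy_iter` is an assumption about the general shape of the problem. If an iterator wrapper treats "no molecule returned and supplier at end" as exhaustion, a failed final record looks the same as running out of input.

```python
class ToySupplier:
    """Toy stand-in for a molecule supplier; NOT RDKit's implementation."""

    def __init__(self, records):
        self._records = list(records)
        self._pos = 0

    def at_end(self):
        return self._pos >= len(self._records)

    def next(self):
        rec = self._records[self._pos]
        self._pos += 1
        # Pretend a record fails to parse when its name starts with "bad".
        return None if rec.startswith("bad") else rec


def buggy_iter(suppl):
    """Ends iteration when the result is None and the supplier is exhausted."""
    while True:
        if suppl.at_end():
            return
        res = suppl.next()
        if res is None and suppl.at_end():
            return  # a bad final record is silently dropped here
        yield res


def safe_iter(suppl):
    """Checks for exhaustion only before reading, so a bad last record is yielded as None."""
    while not suppl.at_end():
        yield suppl.next()


records = ["bad1", "ok1", "ok2", "bad2"]
print(list(buggy_iter(ToySupplier(records))))  # the trailing failure is missing
print(list(safe_iter(ToySupplier(records))))   # the trailing failure shows up as None
```

A bad record in the middle of the input survives both versions; only a bad record in the final position is lost, which matches the behaviour described in the report.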

DavidACosgrove commented 5 years ago

Hi Greg,

I guess you can set the prioritisation score by some inverse function of the time between your writing the FIX: comment and someone complaining!

Thanks, Dave


-- David Cosgrove Freelance computational chemistry and chemoinformatics developer http://cozchemix.co.uk