stsouko / CGRtools

CGRs, molecules and reactions manipulation
GNU Lesser General Public License v3.0

SMILESRead failing to return most SMILES #173

Closed JHucker closed 3 years ago

JHucker commented 3 years ago

With delimited data, I'm having issues with SMILESRead: calling read() returns only a fraction of the records. However, parsing the same lines one by one with smiles() generally returns them all. While I can use smiles() as a workaround, it would be great to use SMILESRead so that the other columns are parsed in as metadata.

See the example below (using Python 3.8.10 and CGRtools 4.1.20):

# This is an excerpt of 1976_Sep2016_USPTOgrants_smiles.rsmi
example_text = """ReactionSmiles    PatentNumber    ParagraphNum    Year    TextMinedYield  CalculatedYield
[Br:1][CH2:2][CH2:3][OH:4].[CH2:5]([S:7](Cl)(=[O:9])=[O:8])[CH3:6].CCOCC>C(N(CC)CC)C>[CH2:5]([S:7]([O:4][CH2:3][CH2:2][Br:1])(=[O:9])=[O:8])[CH3:6] US03930836      1976        
[Br:1][CH2:2][CH2:3][CH2:4][OH:5].[CH3:6][S:7](Cl)(=[O:9])=[O:8].CCOCC>C(N(CC)CC)C>[CH3:6][S:7]([O:5][CH2:4][CH2:3][CH2:2][Br:1])(=[O:9])=[O:8] US03930836      1976        
[CH2:1]([Cl:4])[CH2:2][OH:3].CCOCC.[CH2:10]([S:14](Cl)(=[O:16])=[O:15])[CH:11]([CH3:13])[CH3:12]>C(N(CC)CC)C>[CH2:10]([S:14]([O:3][CH2:2][CH2:1][Cl:4])(=[O:16])=[O:15])[CH:11]([CH3:13])[CH3:12]   US03930836      1976        
[Br:1][CH2:2][CH2:3][OH:4].[CH2:5]([S:7](Cl)(=[O:9])=[O:8])[CH3:6].CCOCC>C(N(CC)CC)C>[CH2:5]([S:7]([O:4][CH2:3][CH2:2][Br:1])(=[O:9])=[O:8])[CH3:6] US03930839      1976        
[Br:1][CH2:2][CH2:3][CH2:4][OH:5].[CH3:6][S:7](Cl)(=[O:9])=[O:8].CCOCC>C(N(CC)CC)C>[CH3:6][S:7]([O:5][CH2:4][CH2:3][CH2:2][Br:1])(=[O:9])=[O:8] US03930839      1976        
[CH2:1]([Cl:4])[CH2:2][OH:3].CCOCC.[CH2:10]([S:14](Cl)(=[O:16])=[O:15])[CH:11]([CH3:13])[CH3:12]>C(N(CC)CC)C>[CH2:10]([S:14]([O:3][CH2:2][CH2:1][Cl:4])(=[O:16])=[O:15])[CH:11]([CH3:13])[CH3:12]   US03930839      1976        
[Cl:1][C:2]1[N:3]=[CH:4][C:5]2[C:10]([CH:11]=1)=[C:9]([N+:12]([O-])=O)[CH:8]=[CH:7][CH:6]=2.O.[OH-].[Na+]>C(O)(=O)C.[Fe]>[Cl:1][C:2]1[N:3]=[CH:4][C:5]2[C:10]([CH:11]=1)=[C:9]([NH2:12])[CH:8]=[CH:7][CH:6]=2 |f:2.3|   US03930837      1976        
[CH3:1][C:2]1[N+:3]([O-])=[CH:4][C:5]2[C:10]([CH:11]=1)=[C:9]([N+:12]([O-:14])=[O:13])[CH:8]=[CH:7][CH:6]=2.P(Cl)(Cl)([Cl:18])=O>>[Cl:18][C:4]1[C:5]2[C:10](=[C:9]([N+:12]([O-:14])=[O:13])[CH:8]=[CH:7][CH:6]=2)[CH:11]=[C:2]([CH3:1])[N:3]=1  US03930837      1976        
[CH3:1][C:2]1[N:3]=[CH:4][C:5]2[C:10]([CH:11]=1)=[C:9]([N+:12]([O-:14])=[O:13])[CH:8]=[CH:7][CH:6]=2.[ClH:15]>>[ClH:15].[CH3:1][C:2]1[N:3]=[CH:4][C:5]2[C:10]([CH:11]=1)=[C:9]([N+:12]([O-:14])=[O:13])[CH:8]=[CH:7][CH:6]=2 |f:2.3|    US03930837      1976        
CC1N=CC2C(C=1)=C([N+]([O-])=O)C=CC=2.[Cl:15][C:16]1[C:25]2[C:20](=[CH:21][CH:22]=[CH:23][CH:24]=2)[CH:19]=[CH:18][N:17]=1>>[ClH:15].[Cl:15][C:16]1[C:25]2[C:20](=[CH:21][CH:22]=[CH:23][CH:24]=2)[CH:19]=[CH:18][N:17]=1 |f:2.3|    US03930837      1976        
CC1N=CC2C(C=1)=C([N+]([O-])=O)C=CC=2.[Cl:15][C:16]1[CH:25]=[CH:24][C:23]([N+:26]([O-:28])=[O:27])=[C:22]2[C:17]=1[CH:18]=[CH:19][N:20]=[CH:21]2.Cl.CC1N=CC2C(C=1)=C([N+]([O-])=O)C=CC=2.[IH:44]>>[IH:44].[Cl:15][C:16]1[CH:25]=[CH:24][C:23]([N+:26]([O-:28])=[O:27])=[C:22]2[C:17]=1[CH:18]=[CH:19][N:20]=[CH:21]2 |f:2.3,5.6| US03930837      1976        
[N+:1]([C:4]1[CH:13]=[CH:12][CH:11]=[C:10]2[C:5]=1[CH:6]=[CH:7][N:8]=[CH:9]2)([O-:3])=[O:2].[BrH:14]>C(O)C>[BrH:14].[N+:1]([C:4]1[CH:13]=[CH:12][CH:11]=[C:10]2[C:5]=1[CH:6]=[CH:7][N:8]=[CH:9]2)([O-:3])=[O:2] |f:3.4| US03930837      1976        
[N+](C1C=CC=C2C=1C=CN=C2)([O-])=O.[CH3:14][C:15]1[C:24]2[C:19](=[CH:20][CH:21]=[CH:22][CH:23]=2)[CH:18]=[CH:17][N:16]=1.Br.[Cl:26][C:27]1[C:32]([OH:33])=[C:31]([Cl:34])[C:30]([Cl:35])=[C:29]([Cl:36])[C:28]=1[Cl:37]>>[Cl:26][C:27]1[C:32]([O-:33])=[C:31]([Cl:34])[C:30]([Cl:35])=[C:29]([Cl:36])[C:28]=1[Cl:37].[CH3:14][C:15]1[C:24]2[C:19](=[CH:20][CH:21]=[CH:22][CH:23]=2)[CH:18]=[CH:17][NH+:16]=1 |f:4.5| US03930837      1976        
[N+:1]([C:4]1[CH:13]=[CH:12][CH:11]=[C:10]2[C:5]=1[CH:6]=[CH:7][N:8]=[CH:9]2)([O-])=O.NC1C=CC=C2C=1C=CN=C2.Br.[IH:26]>>[IH:26].[IH:26].[NH2:1][C:4]1[CH:13]=[CH:12][CH:11]=[C:10]2[C:5]=1[CH:6]=[CH:7][N:8]=[CH:9]2 |f:4.5.6|   US03930837      1976        
Cl.[OH:2][C@@H:3]([CH2:21][CH2:22][CH2:23][CH2:24][CH3:25])[CH:4]=[CH:5][CH:6]1[CH:10]=[CH:9][C:8](=[O:11])[CH:7]1[CH2:12][CH:13]=[CH:14][CH2:15][CH2:16][CH2:17][C:18]([OH:20])=[O:19]>C(O)C>[OH:2][C@@H:3]([CH2:21][CH2:22][CH2:23][CH2:24][CH3:25])[CH:4]=[CH:5][CH:6]1[CH2:10][CH2:9][C:8](=[O:11])[CH:7]1[CH2:12][CH:13]=[CH:14][CH2:15][CH2:16][CH2:17][C:18]([OH:20])=[O:19] US03930952      1976        
CC(O[CH2:5][C:6]1[CH2:28][S:27][C@@H:9]2[C@H:10]([NH:13]C(C(OC(C)=O)C3C=CC=CC=3)=O)[C:11](=[O:12])[N:8]2[C:7]=1[C:29]([OH:31])=[O:30])=O>O>[CH3:5][C:6]1[CH2:28][S:27][C@@H:9]2[C@H:10]([NH2:13])[C:11](=[O:12])[N:8]2[C:7]=1[C:29]([OH:31])=[O:30] US03930949      1976        
[S:1]([O-:5])([O-:4])(=[O:3])=[O:2].[NH4+:6].[NH4+]>O>[S:1](=[O:3])(=[O:2])([OH:5])[O-:4].[NH4+:6].[S:1]([O-:5])([O-:4])(=[O:3])=[O:2].[NH4+:6].[NH4+:6] |f:0.1.2,4.5,6.7.8|    US03930988      1976        
CO[C:3]1[CH:4]=[C:5]([C:9]2([CH2:12][C:13]([Cl:16])([Cl:15])[Cl:14])[CH2:11][O:10]2)[CH:6]=[CH:7][CH:8]=1.ClC1C=C(C2(CC(Cl)(Cl)Cl)CO2)C=CC=1.FC1C=C(C2(CC(Cl)(Cl)Cl)CO2)C=CC=1.ClC1C=C(C2(CC(Cl)(Cl)Cl)CO2)C=CC=1Cl.C(OC1C=C(C2(CC(Cl)(Cl)Cl)CO2)C=CC=1)C.C(OC1C=C(C2(CC(Cl)(Cl)Cl)CO2)C=CC=1)C1C=CC=CC=1.ClC1C=CC(C2(CC(Cl)(Cl)Cl)CO2)=CC=1.[Br:117]C1C=CC(C2(CC(Cl)(Cl)Cl)CO2)=CC=1>>[Br:117][C:3]1[CH:4]=[C:5]([C:9]2([CH2:12][C:13]([Cl:16])([Cl:15])[Cl:14])[CH2:11][O:10]2)[CH:6]=[CH:7][CH:8]=1  US03930835      1976        
[C:1]1(O)[CH:6]=[CH:5][CH:4]=[CH:3][CH:2]=1.[CH2:8]=[O:9].[S:10]([O-:13])([O-:12])=[O:11].[Na+:14].[Na+]>O>[OH:9][CH:8]([S:10]([O-:13])(=[O:12])=[O:11])[C:1]1[CH:6]=[CH:5][CH:4]=[CH:3][CH:2]=1.[Na+:14] |f:2.3.4,6.7| US03931083      1976        
[CH3:1][O:2][C:3]1[C:4]([C:13]([OH:15])=O)=[CH:5][C:6]2[C:11]([CH:12]=1)=[CH:10][CH:9]=[CH:8][CH:7]=2.S(Cl)([Cl:18])=O>C1C=CC=CC=1>[CH3:1][O:2][C:3]1[C:4]([C:13]([Cl:18])=[O:15])=[CH:5][C:6]2[C:11]([CH:12]=1)=[CH:10][CH:9]=[CH:8][CH:7]=2   US03931103      1976        
[C:1]1([O:7]C(Cl)=O)[CH:6]=[CH:5][CH:4]=[CH:3][CH:2]=1.C(Cl)Cl.[OH2:14].[OH-].[Na+].C(N([CH2:22][CH3:23])CC)C>>[CH:2]1[CH:3]=[C:22]([CH2:23][C:2]2[C:1]([OH:7])=[CH:6][CH:5]=[CH:4][CH:3]=2)[C:5]([OH:14])=[CH:6][CH:1]=1 |f:3.4|   US03931108      1976        
[CH3:1][C:2]1[C:3](=[CH:7][C:8](=[CH:12][CH:13]=1)[N:9]=[C:10]=[O:11])N=C=O.[NH2:14][C:15]([O:17]CC)=O>>[CH2:2]1[CH:3]([CH2:1][CH:2]2[CH2:13][CH2:12][CH:8]([N:9]=[C:10]=[O:11])[CH2:7][CH2:3]2)[CH2:7][CH2:8][CH:12]([N:14]=[C:15]=[O:17])[CH2:13]1    US03931113      1976        
C1CC[CH:4]([N:7]=C=[N:7][CH:4]2CCC[CH2:2][CH2:3]2)[CH2:3][CH2:2]1.[N:16]1([C:24]([O:26][CH2:27][C:28]2[CH:33]=[CH:32][CH:31]=[CH:30][CH:29]=2)=[O:25])[CH2:23][CH2:22][CH2:21][C@H:17]1[C:18]([OH:20])=[O:19].C1C=CC2N(O)N=NC=2C=1.C(N)CC>O1CCCC1>[N:16]1([C:24]([O:26][CH2:27][C:28]2[CH:29]=[CH:30][CH:31]=[CH:32][CH:33]=2)=[O:25])[CH2:23][CH2:22][CH2:21][C@H:17]1[C:18]([OH:20])=[O:19].[CH2:4]([NH-:7])[CH2:3][CH3:2] |f:5.6|    US03931139      1976        
[N:1]1([C:9]([O:11][CH2:12][C:13]2[CH:18]=[CH:17][CH:16]=[CH:15][CH:14]=2)=[O:10])[CH2:8][CH2:7][CH2:6][C@H:2]1[C:3]([OH:5])=[O:4].C(OC(Cl)=O)C.[CH2:25]([NH2:31])[CH2:26][CH2:27][CH2:28][CH2:29][CH3:30]>O1CCCC1>[N:1]1([C:9]([O:11][CH2:12][C:13]2[CH:14]=[CH:15][CH:16]=[CH:17][CH:18]=2)=[O:10])[CH2:8][CH2:7][CH2:6][C@H:2]1[C:3]([OH:5])=[O:4].[CH2:25]([NH-:31])[CH2:26][CH2:27][CH2:28][CH2:29][CH3:30] |f:4.5|    US03931139      1976        
[IH:1].CS[C:4]1[NH:5][CH2:6][CH2:7][CH2:8][CH2:9][N:10]=1.C(O)C.O.[NH2:15][NH2:16]>CCOCC>[IH:1].[NH:15]([C:4]1[NH:5][CH2:6][CH2:7][CH2:8][CH2:9][N:10]=1)[NH2:16] |f:0.1,3.4,6.7|   US03931152      1976        
C1C(=O)N([Br:8])C(=O)C1.[CH3:9][N:10]1[C:16]2[CH:17]=[CH:18][CH:19]=[CH:20][C:15]=2[C:14](=[O:21])[CH2:13][C:12]2[CH:22]=[CH:23][CH:24]=[CH:25][C:11]1=2>CN(C)C=O>[Br:8][C:19]1[CH:18]=[CH:17][C:16]2[N:10]([CH3:9])[C:11]3[CH:25]=[CH:24][CH:23]=[CH:22][C:12]=3[CH2:13][C:14](=[O:21])[C:15]=2[CH:20]=1   US03931151      1976        100.5%
[Br:1][C:2]1[CH:18]=[CH:17][C:5]2[N:6]([CH3:16])[C:7]3[CH:15]=[CH:14][CH:13]=[CH:12][C:8]=3[CH2:9][C:10](=[O:11])[C:4]=2[CH:3]=1.[CH2:19](O)[CH3:20].C([O-])([O-])OCC.C1(C)C=CC(S(O)(=O)=O)=CC=1>C(N(CC)CC)C>[Br:1][C:2]1[CH:18]=[CH:17][C:5]2[N:6]([CH3:16])[C:7]3[CH:15]=[CH:14][CH:13]=[CH:12][C:8]=3[CH:9]=[C:10]([O:11][CH2:19][CH3:20])[C:4]=2[CH:3]=1    US03931151      1976        
[CH2:1]([S:3][C:4]1[CH:26]=[CH:25][C:7]2[N:8]([CH3:24])[C:9]3[CH:23]=[CH:22][CH:21]=[CH:20][C:10]=3[CH2:11][C:12](O)([CH2:13][C:14]([O:16][CH2:17][CH3:18])=[O:15])[C:6]=2[CH:5]=1)[CH3:2].Cl>C(O)C>[CH2:1]([S:3][C:4]1[CH:26]=[CH:25][C:7]2[N:8]([CH3:24])[C:9]3[CH:23]=[CH:22][CH:21]=[CH:20][C:10]=3[CH2:11][C:12](=[CH:13][C:14]([O:16][CH2:17][CH3:18])=[O:15])[C:6]=2[CH:5]=1)[CH3:2] US03931151      1976        82.0%
[CH2:1]([S:3][C:4]1[CH:25]=[CH:24][C:7]2[N:8]([CH3:23])[C:9]3[CH:22]=[CH:21][CH:20]=[CH:19][C:10]=3[CH2:11][C:12](=[CH:13][C:14]([O:16]CC)=[O:15])[C:6]=2[CH:5]=1)[CH3:2].[OH-].[K+].Cl>C(O)C>[CH2:1]([S:3][C:4]1[CH:25]=[CH:24][C:7]2[N:8]([CH3:23])[C:9]3[CH:22]=[CH:21][CH:20]=[CH:19][C:10]=3[CH:11]=[C:12]([CH2:13][C:14]([OH:16])=[O:15])[C:6]=2[CH:5]=1)[CH3:2] |f:1.2|  US03931151      1976        78.1%
[CH2:1]([S:3][C:4]1[CH:23]=[CH:22][C:7]2[N:8]([CH3:21])[C:9]3[CH:20]=[CH:19][CH:18]=[CH:17][C:10]=3[CH:11]=C(CC(O)=O)[C:6]=2[CH:5]=1)[CH3:2].[CH:24]1[C:29]([N+:30]([O-:32])=[O:31])=[CH:28][CH:27]=[C:26]([OH:33])[CH:25]=1.[CH:34]1(N=C=NC2CCCCC2)CCCCC1.[C:49](OCC)(=[O:51])[CH3:50]>>[CH2:1]([S:3][C:4]1[CH:23]=[CH:22][C:7]2[N:8]([CH3:21])[C:9]3[CH:20]=[CH:19][CH:18]=[CH:17][C:10]=3[CH:11]=[C:50]([C:49]([O:33][C:26]3[CH:27]=[CH:28][C:29]([N+:30]([O-:32])=[O:31])=[CH:24][CH:25]=3)=[O:51])[C:6]=2[C:5]=1[CH3:34])[CH3:2]   US03931151      1976        
"""

from CGRtools.files import SMILESRead
from CGRtools import smiles

# Setup example ("w" rather than "a", so re-running does not append duplicate records)
fname = "first_30_USPTOgrants.rsmi"
with open(fname, "w") as f:
    f.write(example_text)

# Try SMILESRead
smi_reader = SMILESRead(fname, header=True)
reader_result = smi_reader.read()

# 7 SMILES retrieved
print(len(reader_result))
for smi in reader_result:
    print(smi)

# Read line-by-line with smiles, skip header
with open(fname, "r") as f:
    lines = f.readlines()
smiles_result = []
for line in lines[1:]:
    smi = line.split("\t")[0]
    parsed_smi = smiles(smi)
    smiles_result.append(parsed_smi)

# All 30 SMILES retrieved
print(len(smiles_result))
for smi in smiles_result:
    print(smi)

stsouko commented 3 years ago

Hi! By default, all file readers skip records with errors. SMILESRead also runs extra checks on each record.

You can skip these checks by passing the ignore argument:

SMILESRead(fname, header=True, ignore=True)

However, SMILESRead supports only simple cases of metadata parsing. For your data, the following pipeline is a better fit:

import csv
import io

for record in csv.DictReader(io.StringIO(example_text), delimiter='\t'):
    record['reaction'] = smiles(record['ReactionSmiles'])
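
To keep the parsed reaction together with the other columns, the loop above can be wrapped in a small helper that collects the records. This is only a sketch: read_reactions and its parse parameter are hypothetical names, the sample rows are made-up placeholders, and str stands in for the parser so the snippet runs without CGRtools installed. With CGRtools available, the smiles function imported earlier would be passed as parse.

```python
import csv
import io
from typing import Callable, Dict, List


def read_reactions(text: str, parse: Callable[[str], object]) -> List[Dict[str, object]]:
    """Parse tab-delimited text, keeping every column and adding a 'reaction' key."""
    records = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        record = dict(row)
        # `parse` is whatever turns a SMILES string into an object (assumption:
        # CGRtools' smiles() in real use; str here as a stand-in).
        record["reaction"] = parse(row["ReactionSmiles"])
        records.append(record)
    return records


# Placeholder input, not taken from the USPTO file.
sample = (
    "ReactionSmiles\tPatentNumber\tYear\n"
    "CCO>>CC=O\tEXAMPLE-1\t1976\n"
    "CC(=O)O.CO>>CC(=O)OC\tEXAMPLE-2\t1976\n"
)
records = read_reactions(sample, parse=str)
print(len(records), records[0]["PatentNumber"])
```

Each record is a plain dict, so the metadata columns (PatentNumber, Year and so on) stay available next to the parsed reaction.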

Note, however, that your example contains SMILES that are not covered by the OpenSMILES specification:

 [N+](C1C=CC=C2C=1C=CN=C2)([O-])=O.[CH3:14][C:15]1[C:24]2[C:19](=[CH:20][CH:21]=[CH:22][CH:23]=2)[CH:18]=[CH:17][N:16]=1.Br.[Cl:26][C:27]1[C:32]([OH:33])=[C:31]([Cl:34])[C:30]([Cl:35])=[C:29]([Cl:36])[C:28]=1[Cl:37]>>[Cl:26][C:27]1[C:32]([O-:33])=[C:31]([Cl:34])[C:30]([Cl:35])=[C:29]([Cl:36])[C:28]=1[Cl:37].[CH3:14][C:15]1[C:24]2[C:19](=[CH:20][CH:21]=[CH:22][CH:23]=2)[CH:18]=[CH:17][NH+:16]=1 |f:4.5|

|f:4.5| carries information about component contraction, i.e. which dot-separated fragments are grouped into one component. This data is not supported for now; the feature is on the to-do list.
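
If the grouping information is worth keeping as metadata in the meantime, one option is to split it off the SMILES column yourself before parsing. The sketch below is plain string handling and not part of CGRtools: split_extension is a hypothetical helper, it assumes the extension is a single trailing |...| block separated from the SMILES by whitespace (as in the excerpt above), and it only interprets the f: field. Whether the parser needs the block removed is not stated in this thread.

```python
import re
from typing import List, Tuple

# Plain SMILES (no whitespace), optionally followed by one "|...|" block.
_FIELD = re.compile(r"^(?P<core>\S+)(?:\s+\|(?P<ext>[^|]*)\|)?\s*$")
# Fragment groups such as "f:2.3,5.6".
_GROUPS = re.compile(r"f:\d+(?:\.\d+)*(?:,\d+(?:\.\d+)*)*")


def split_extension(field: str) -> Tuple[str, List[List[int]]]:
    """Return (plain SMILES, fragment groups) for a 'SMILES |f:a.b,c.d|' field."""
    match = _FIELD.match(field.strip())
    if match is None:
        raise ValueError(f"unrecognised SMILES field: {field!r}")
    ext = match.group("ext") or ""
    groups: List[List[int]] = []
    # Only a block consisting solely of an f: field is interpreted;
    # anything else is dropped without being parsed.
    if _GROUPS.fullmatch(ext):
        for group in ext[2:].split(","):
            groups.append([int(index) for index in group.split(".")])
    return match.group("core"), groups


# Made-up reaction: fragments 0 and 1, and 3 and 4, are grouped.
core, groups = split_extension("[Na+].[OH-].CC(=O)OC>>CC(=O)[O-].[Na+] |f:0.1,3.4|")
print(core)
print(groups)
```

The returned core string can then go to the parser, and groups can be stored alongside the other columns of the record.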

JHucker commented 3 years ago

Both of those solutions work great, and noted re the extended SMILES functionality. Thanks for your assistance.