Closed anushka255 closed 3 months ago
Thanks @anushka255 - with the parsing bug fixed, there seem to be 1091 valid SMILES in the PED dataset out of 1102 in the raw file. The others are being filtered out for different reasons. Does this match your numbers?
Here's what I'm seeing - 1102 -> 1101 (one has >3 heavy atoms) -> 1097 (4 have Hg, not a valid element in our list) -> 1091 with unique inchikeys. Let me know if this sounds right.
Yes, here's the breakdown of 11 that are still being parsed out:
remove_salts_solvents stage: ['COC(CNC(=O)C1CCC(C)(C(=O)O)C1(C)C)C[Hg]SCC(=O)O', 'COC(CNC(N)=O)C[Hg]Cl', 'COC(C[Hg+])CNC(=O)c1ccccc1OCC(=O)O', 'COC(C[Hg])CNC(=O)NC(=O)CCC(=O)O']
inchikeys, indices = np.unique(inchikeys, return_index=True)
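For reference, here is a minimal sketch of what that np.unique call does. The keys and SMILES below are made-up placeholders, not values from the PED file, and the surrounding variable names are assumptions for illustration only.

```python
import numpy as np

# Placeholder keys standing in for real InChIKeys (hypothetical values).
inchikeys = np.array(["KEY-B", "KEY-A", "KEY-B", "KEY-C", "KEY-A"])
# Placeholder SMILES, one per key, in the same order (hypothetical values).
smiles = np.array(["s0", "s1", "s2", "s3", "s4"])

# np.unique returns the sorted unique keys and, with return_index=True,
# the index of the first occurrence of each unique key in the input.
unique_keys, indices = np.unique(inchikeys, return_index=True)

# Keep one SMILES per unique key. The result follows sorted-key order,
# not the original file order.
deduped = smiles[indices]

# If the original order matters, sort the indices before selecting.
deduped_in_file_order = smiles[np.sort(indices)]

n_removed = len(smiles) - len(deduped)
```

Later rows that share a key with an earlier row are dropped, which is why the same compound listed in several forms can collapse to a single entry if the forms end up with the same key.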
Just closing the loop on this - I went through the six that were removed for redundancy. Five of them are duplicates of the same compound that appear in the CSV file in different charge states (so this is the code working as it should), and one was an erroneous structure in the CSV file, so good to catch it!
fixed in commit 2877a795fc4adc6a349552d5cd1abdde0367f2d5.
Currently we're reading all the input SMILES using the read_file function, which separates the SMILES from the rest of the items in a line by splitting on commas: https://github.com/skinniderlab/CLM/blob/ffe169a184f73ca105cd9999d8beb22215a85cec/src/clm/functions.py#L197
This seems to have incorrectly parsed some SMILES whose names contain commas.
For example, the line
"17-Hydroxy-18a-homo-19-nor-17ɑpregna-4,9,11-trien-3-one",C#CC1(O)CCC2C3CCC4=CC(=O)CCC4C3CCC21CC,WWYNJERNGUHSAO-UHFFFAOYSA-N
is parsed into 9
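A minimal sketch of the failure and one possible way around it, using the line above. It assumes the SMILES is expected in the second comma-separated field; the actual logic inside read_file may differ, and using Python's standard csv module here is only a suggestion, not necessarily what the fix commit does.

```python
import csv

line = (
    '"17-Hydroxy-18a-homo-19-nor-17ɑpregna-4,9,11-trien-3-one",'
    "C#CC1(O)CCC2C3CCC4=CC(=O)CCC4C3CCC21CC,"
    "WWYNJERNGUHSAO-UHFFFAOYSA-N"
)

# Naive approach: split on every comma, including the ones inside the
# quoted compound name. The name is broken into three pieces.
naive_fields = line.split(",")
naive_smiles = naive_fields[1]

# csv.reader respects the double quotes, so the name stays one field.
csv_fields = next(csv.reader([line]))
csv_smiles = csv_fields[1]

print(len(naive_fields), repr(naive_smiles))
print(len(csv_fields), repr(csv_smiles))
```

With the naive split the quoted name contributes three fields instead of one, so every later column shifts by two positions; a quote-aware parser keeps the three columns intact.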