skinniderlab / CLM

Parsing SMILES in preprocess #203

Closed anushka255 closed 3 months ago

anushka255 commented 4 months ago

Currently we're reading all the input SMILES with the read_file function, which separates the SMILES from the other items in a line by splitting on commas. https://github.com/skinniderlab/CLM/blob/ffe169a184f73ca105cd9999d8beb22215a85cec/src/clm/functions.py#L197

This seems to mis-parse lines where the compound name itself contains commas.

For example, the line "17-Hydroxy-18a-homo-19-nor-17ɑpregna-4,9,11-trien-3-one",C#CC1(O)CCC2C3CCC4=CC(=O)CCC4C3CCC21CC,WWYNJERNGUHSAO-UHFFFAOYSA-N is parsed into 9, because the commas inside the quoted name are treated as field separators.
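To illustrate the failure on the example line, here is a minimal sketch comparing a plain comma split with Python's standard csv module, which respects the double quotes around the name. The column order (name, SMILES, InChIKey) is taken from the example line only; this is not necessarily how read_file was changed in the actual fix.

```python
import csv
import io

# The problematic line from the raw file: the quoted name contains two commas.
line = (
    '"17-Hydroxy-18a-homo-19-nor-17ɑpregna-4,9,11-trien-3-one",'
    "C#CC1(O)CCC2C3CCC4=CC(=O)CCC4C3CCC21CC,"
    "WWYNJERNGUHSAO-UHFFFAOYSA-N"
)

# Naive approach: split on every comma, including those inside the quoted name.
naive_fields = line.split(",")

# csv.reader honours the quoting, so the name stays in a single field.
csv_fields = next(csv.reader(io.StringIO(line)))

print(len(naive_fields), naive_fields[1])  # five fields; the second one is "9"
print(len(csv_fields), csv_fields[1])      # three fields; the second one is the SMILES
```

With the quote-aware reader, the second field is the SMILES string rather than a fragment of the name.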

vineetbansal commented 4 months ago

Thanks @anushka255. With the parsing bug fixed, there seem to be 1091 valid SMILES in the PED dataset out of 1102 in the raw file. The others are being filtered out for different reasons. Does this match your numbers?

vineetbansal commented 4 months ago

Here's what I'm seeing: 1102 -> 1101 (one has fewer than 3 heavy atoms) -> 1097 (4 contain Hg, which is not in our list of valid elements) -> 1091 with unique InChIKeys. Let me know if this sounds right.
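A stage-by-stage count like the one above can be produced with a small filter funnel. The sketch below is illustrative only: the record fields, the toy records, the heavy-atom threshold, and the allowed-element set are all hypothetical stand-ins, not the values or code used in CLM.

```python
def run_funnel(records, stages, unique_key):
    """Apply each filter in order, then deduplicate, recording survivors per stage."""
    counts = [("raw", len(records))]
    for name, keep in stages:
        records = [r for r in records if keep(r)]
        counts.append((name, len(records)))

    # Keep the first record seen for each key (e.g. an InChIKey).
    seen = set()
    unique = []
    for r in records:
        if r[unique_key] not in seen:
            seen.add(r[unique_key])
            unique.append(r)
    counts.append((f"unique {unique_key}", len(unique)))
    return unique, counts


# Hypothetical toy records; the keys are placeholders, not real InChIKeys.
toy_records = [
    {"smiles": "CCO", "elements": {"C", "O"}, "heavy_atoms": 3, "inchikey": "KEY-A"},
    {"smiles": "C", "elements": {"C"}, "heavy_atoms": 1, "inchikey": "KEY-B"},
    {"smiles": "C[Hg]C", "elements": {"C", "Hg"}, "heavy_atoms": 3, "inchikey": "KEY-C"},
    {"smiles": "OCC", "elements": {"C", "O"}, "heavy_atoms": 3, "inchikey": "KEY-A"},
]

ALLOWED = {"C", "N", "O"}  # assumed allow-list for this example only

stages = [
    ("heavy atoms", lambda r: r["heavy_atoms"] >= 3),        # assumed threshold
    ("allowed elements", lambda r: r["elements"] <= ALLOWED),  # subset test
]

kept, counts = run_funnel(toy_records, stages, "inchikey")
for name, n in counts:
    print(f"{name}: {n}")
```

Printing the count after every stage makes it easy to compare two people's numbers and see which filter accounts for each removed record.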

anushka255 commented 4 months ago

Yes, here's the breakdown of the 11 that are still being filtered out:

skinnider commented 3 months ago

Just closing the loop on this: I went through the six that were removed for redundancy. Five of them are duplicates of the same compound that appear in the CSV file in different charge states (so this is the code working as it should), and one was an erroneous structure in the CSV file, so good to catch it!

vineetbansal commented 3 months ago

Fixed in commit 2877a795fc4adc6a349552d5cd1abdde0367f2d5.