neilswainston / FragGenie

MIT License

Should isomers have the same fragments? #3

Open adamoyoung opened 2 years ago

adamoyoung commented 2 years ago

If I have two molecules that are isomers according to org.openscience.cdk.isomorphism.UniversalIsomorphismTester, should they produce the same fragments?

neilswainston commented 2 years ago

I'm not sure. If these are true isomers, then yes. But my brief glance at org.openscience.cdk.isomorphism.UniversalIsomorphismTester suggests that this class also checks for sub-molecules, in which case the fragments will be different.

adamoyoung commented 2 years ago

Thanks for the response! I was using UniversalIsomorphismTester.isIsomorph() to check whether the two SMILES strings describe the same molecule. According to [the documentation](https://cdk.github.io/cdk/1.5/docs/api/org/openscience/cdk/isomorphism/UniversalIsomorphismTester.html#UniversalIsomorphismTester()), I think this should tell me if the resulting molecules have the same atoms and bonds.

One of the example strings that you give in test_input.csv (line 2) is caffeine (Cn1cnc2n(C)c(=O)n(C)c(=O)c12). I tried re-canonicalizing caffeine in CDK using the following code:

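// Parse the input SMILES string (the variable smiles) into a CDK atom container.
final SmilesParser parser = new SmilesParser(SilentChemObjectBuilder.getInstance());
final IAtomContainer molecule = parser.parseSmiles(smiles);
// Write it back out as a canonical ("unique") SMILES with lowercase aromatic symbols.
final SmilesGenerator smigen = new SmilesGenerator(SmiFlavor.Unique | SmiFlavor.UseAromaticSymbols);
final String newSmiles = smigen.create(molecule);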

This gave me the new string O=c1c2c(ncn2C)n(c(=O)n1C)C.

When I tried this new string with FragGenie, I got significantly different results! I tried debugging it myself but struggled a bit. This may be how FragGenie is supposed to work; I'm not sure.

To reproduce the bug (?), try running test.sh with a test_input.csv that contains the following lines:

smiles
Cn1cnc2n(C)c(=O)n(C)c(=O)c12
O=c1c2c(ncn2C)n(c(=O)n1C)C

This should give you the following result:

smiles,METFRAG_MZ
Cn1cnc2n(C)c(=O)n(C)c(=O)c12,"[86.02366, 87.055305, 94.01616, 95.047806, 100.026726, 100.02673, 115.05022, 123.04272, 150.0172, 151.03763, 152.06927, 180.06418, 195.08766]"
O=c1c2c(ncn2C)n(c(=O)n1C)C,"[94.01615, 95.047806, 100.026726, 123.04271, 123.04272, 150.0172, 152.06926, 152.06927, 165.04068, 195.08765]"
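
A quick way to see where the two rows diverge is to compare the m/z lists with a small tolerance, so that rounding differences such as 94.01616 vs 94.01615 are not counted as distinct fragments. The sketch below is not part of FragGenie; the class name FragmentDiff, the method missingFrom, and the 0.001 tolerance are all assumptions made for illustration. The two arrays are copied from the output above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of FragGenie: compares two m/z lists with a tolerance.
final class FragmentDiff {

    /** Returns the values of a that have no counterpart in b within tol. */
    static List<Double> missingFrom(double[] a, double[] b, double tol) {
        final List<Double> unmatched = new ArrayList<>();
        for (double x : a) {
            boolean found = false;
            for (double y : b) {
                if (Math.abs(x - y) <= tol) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                unmatched.add(x);
            }
        }
        return unmatched;
    }

    public static void main(String[] args) {
        // METFRAG_MZ values for Cn1cnc2n(C)c(=O)n(C)c(=O)c12, copied from the output above.
        final double[] original = {86.02366, 87.055305, 94.01616, 95.047806, 100.026726,
            100.02673, 115.05022, 123.04272, 150.0172, 151.03763, 152.06927, 180.06418,
            195.08766};
        // METFRAG_MZ values for O=c1c2c(ncn2C)n(c(=O)n1C)C, copied from the output above.
        final double[] recanonicalized = {94.01615, 95.047806, 100.026726, 123.04271,
            123.04272, 150.0172, 152.06926, 152.06927, 165.04068, 195.08765};
        final double tol = 1e-3; // assumed tolerance, chosen only to absorb rounding noise

        System.out.println("Only in original: " + missingFrom(original, recanonicalized, tol));
        System.out.println("Only in re-canonicalized: " + missingFrom(recanonicalized, original, tol));
    }
}
```

With the values above and this tolerance, five m/z values appear only for the original string (86.02366, 87.055305, 115.05022, 151.03763, 180.06418) and one appears only for the re-canonicalized string (165.04068), so the difference is more than floating-point rounding.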

Just as a sanity check, I used PubChem to confirm that these strings are indeed the same (so it's not just cdk being weird):

Cn1cnc2n(C)c(=O)n(C)c(=O)c12: https://pubchem.ncbi.nlm.nih.gov/#query=Cn1cnc2n(C)c(%3DO)n(C)c(%3DO)c12
O=c1c2c(ncn2C)n(c(=O)n1C)C: https://pubchem.ncbi.nlm.nih.gov/#query=O%3Dc1c2c(ncn2C)n(c(%3DO)n1C)C

adamoyoung commented 2 years ago

For what it's worth, PubChem Sketcher also thinks they are the same.