sirius-ms / sirius

SIRIUS is a software for discovering a landscape of de-novo identification of metabolites using tandem mass spectrometry. This repository contains the code of the SIRIUS Software (GUI and CLI)
GNU Affero General Public License v3.0

Sirius output files #12

Closed ghost closed 3 years ago

ghost commented 4 years ago

Hello! I'd like to ask some questions and make a few observations about the output files. I noticed a few inconsistencies in my data and couldn't find answers anywhere. Hope you can help me out. I am using the Sirius CLI (v.4.4.29) on Ubuntu/Linux. I've been running positive and negative mode data from environmental samples. Thanks in advance!

1 - Is there a way to find out which formulas/structures are predicted in silico or predicted from database matches?

2 - Why does Sirius output a partial InChIkey (InChIkey2D)?

3 - I've been using compound_identifications.tsv as the final/summary output file. However, in my datasets, I noticed that the molecular formula in the InChI and the one in the molecular formula column often don't match. I compared the structure_candidates.tsv and formula_candidates.tsv of each feature, and found that Sirius wasn't selecting the first-ranked structure and formula candidate. I was wondering if there is a reason for that. I updated my scripts to take the top-ranked structure and formula candidate of each feature instead of using that summary output file. With this adaptation, the molecular formula matches the structure (InChI column).

4 - I noticed a few inconsistencies with PubChem ids. Some PubChem ids returned by Sirius don't exist when I search online.

5 - I have also got PubChem ids containing a dot (.). Searching for the id with the dot returns nothing, and removing the dot yields different compounds depending on whether I keep or drop the digits after it.

6 - I am using a program that needs the full InChIkey. I wrote a script to get it from PubChem (for those features that have at least one PubChem id). Because of the issues mentioned above, I search PubChem by PubChem id and try to match the InChI, SMILES and partial InChIkey (InChIkey2D) given by Sirius against the PubChem properties returned for that id. In many instances, these three properties (InChI, SMILES, InChIkey) don't match what Sirius gives. This is especially complicated when there is more than one PubChem id for a single feature: sometimes two ids match on all three properties, sometimes none does. It would be great to get some insight into how to sort this out.
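For readers following point 3, the workaround described there (taking the top-ranked candidate of a feature directly from its per-feature TSV file) could be sketched roughly as below. This is only an illustration, not the original script: the function name `top_candidate` is hypothetical, and the column names `rank` and `molecularFormula` are assumptions about the file header that should be checked against the actual output of the SIRIUS version in use.

```python
import csv
import io


def top_candidate(tsv_text, rank_column="rank"):
    """Return the row with the smallest rank from a tab-separated table.

    `rank_column` is an assumed column name; adjust it to the real header.
    Returns None if the table has no data rows.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = list(reader)
    if not rows:
        return None
    # Ranks are read as text, so convert to int before comparing.
    return min(rows, key=lambda row: int(row[rank_column]))


# Made-up example table (hypothetical values, not real SIRIUS output).
example = (
    "rank\tmolecularFormula\n"
    "2\tC7H14O5\n"
    "1\tC6H12O6\n"
)
best = top_candidate(example)
print(best["molecularFormula"])
```

Reading the text of a real file with `open(path, encoding="utf-8").read()` and passing it to the function would work the same way.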

marcus-ludwig commented 4 years ago

Hi,

could you provide some examples for 4-6.

Maybe some PubChem IDs are now obsolete (PubChem changes its structure standardization every now and then, and if something changes, compounds might get new IDs). CSI:FingerID only identifies 2D structures (atoms plus connectivity). These can correspond to multiple 3D structures (having different stereochemistry).

Best, Marcus
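To illustrate the 2D-versus-3D point: in a standard InChIKey, the first 14-character block is derived from the connectivity part of the InChI, while the second block covers stereochemistry and isotopes. Stereoisomers therefore share the first block, and a full InChIKey from PubChem can only be compared with the InChIkey2D on that block. A minimal sketch, assuming InChIkey2D is exactly this first block; the helper names and the keys in the example are made up for illustration:

```python
def inchikey_first_block(inchikey):
    """Return the first (connectivity) block of an InChIKey.

    Also accepts a key that already consists of the first block only.
    """
    return inchikey.strip().upper().split("-")[0]


def matches_2d(key2d, full_inchikey):
    """True if a full InChIKey has the same first block as the 2D key."""
    return inchikey_first_block(key2d) == inchikey_first_block(full_inchikey)


def filter_by_2d(key2d, full_inchikeys):
    """Keep only the full InChIKeys whose first block equals the 2D key."""
    return [key for key in full_inchikeys if matches_2d(key2d, key)]


# Placeholder keys (NOT real compounds): two share a first block, one does not.
candidates = [
    "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "AAAAAAAAAAAAAA-CCCCCCCCSA-N",
    "ZZZZZZZZZZZZZZ-UHFFFAOYSA-N",
]
same_2d = filter_by_2d("AAAAAAAAAAAAAA", candidates)
print(same_2d)
```

Under this assumption, several PubChem ids matching one feature is expected rather than an error: each id may be a different stereoisomer of the same 2D structure.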

ghost commented 4 years ago

Hi Marcus,

I wrote a jupyter notebook with some examples: github_sirius_issue.zip I thought it would be easier this way. Let me know if you need more clarification.

Thanks a lot! Nathalia

marcus-ludwig commented 4 years ago

Hi Nathalia,

sorry it took me a while. I checked your data and to me it looks good.

For clarification:

Notes:

Best, Marcus

ghost commented 3 years ago

Hi Marcus,

Thank you so much for the explanation and for taking the time to review my issue. I am learning a lot about SIRIUS from this. When I opened this issue, I was using SIRIUS 4.4 and I mainly wanted to get the CANOPUS classification (negative mode data), and for that I needed the full InChIkey (I was using ClassyFireR). I have now updated to SIRIUS 4.5, so that problem got solved shortly after I posted my issue :)

Thank you for your time!

Nathalia

marcus-ludwig commented 3 years ago

Ok, perfect. So this issue has been resolved.

Best, Marcus