Hi,
could you provide some examples for 4-6.
Maybe some PubChem IDs are now obsolete (PubChem changes its structure standardization every now and then, and if something changes, compounds might get new IDs). CSI:FingerID only identifies 2D structures (atoms plus connectivity). These can correspond to multiple 3D structures (having different stereochemistry).
Best, Marcus
Hi Marcus,
I wrote a Jupyter notebook with some examples: github_sirius_issue.zip. I thought it would be easier this way. Let me know if you need more clarification.
Thanks a lot! Nathalia
Hi Nathalia,
sorry it took me a while. I checked your data and to me it looks good.
For clarification:
formula_identifications.tsv contains the best-scored molecular formula and compound_identifications.tsv contains the best-scored structure. CSI:FingerID usually tests multiple molecular formulas (based on a score threshold) and sometimes finds a better structure database hit for, say, the second-ranked molecular formula. In this case the formula in compound_identifications.tsv is different from the one in formula_identifications.tsv.
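A minimal sketch of how such mismatches could be spotted in a summary file: the formula is the first layer of an InChI string (right after the version prefix), so it can be compared with the formula column row by row. The column names `molecularFormula` and `InChI` and the two sample rows are assumptions made for illustration; check them against the header of your own tsv. The comparison is a plain string match, so it also assumes both formulas use the same element order.

```python
import csv
import io


def formula_from_inchi(inchi):
    """Return the formula layer of an InChI string, or None if it has none."""
    parts = inchi.split("/")
    # parts[0] is the version prefix (e.g. "InChI=1S"), parts[1] the formula layer.
    if len(parts) < 2 or not parts[0].startswith("InChI="):
        return None
    return parts[1]


def mismatched_rows(tsv_text, formula_col="molecularFormula", inchi_col="InChI"):
    """Return rows whose formula column differs from the formula in the InChI.

    The default column names are assumptions, not confirmed SIRIUS headers.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row for row in reader
        if formula_from_inchi(row[inchi_col]) != row[formula_col]
    ]


# Illustrative rows only (ethanol InChI; the second formula is deliberately wrong).
SAMPLE_TSV = (
    "id\tmolecularFormula\tInChI\n"
    "feature_1\tC2H6O\tInChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3\n"
    "feature_2\tC3H8O\tInChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3\n"
)

for row in mismatched_rows(SAMPLE_TSV):
    print(row["id"], row["molecularFormula"], formula_from_inchi(row["InChI"]))
```

Rows flagged this way are not necessarily errors. As described above, they can simply be features where the best structure hit belongs to a lower-ranked formula.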
in the tsv file. I found each ID from your example notebook in PubChem. The tsv file contains the 2D structure InChI and InChIKey. The 3 PubChem compounds have different 3D InChIs (the 3D information is at the end of the string, in the /t, /m and /b parts). Hence, the InChI in the tsv is a prefix of the InChIs in PubChem. Btw, all have the same canonical SMILES.
Notes:
Mine, it is artificial. But as noted, PubChem also contains somewhat "artificial" structures, because it just collects data from a bunch of sources.
Best, Marcus
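The prefix relationship between a 2D InChI and the stereo-specific InChIs can be checked with a small helper. This is only a sketch: it drops the stereo layers (/b, /t, /m, /s) by their one-letter prefix and does not handle every InChI layout (for example, isotopic or fixed-hydrogen layers are left untouched). The function name is hypothetical; L- and D-alanine serve as the example.

```python
STEREO_PREFIXES = ("b", "t", "m", "s")  # double bond, tetrahedral, and related stereo layers


def strip_stereo_layers(inchi):
    """Return the InChI without its stereo layers (a '2D' InChI)."""
    head, *layers = inchi.split("/")
    # Element symbols in the formula layer start with an uppercase letter,
    # so the lowercase prefixes only match the stereo layers.
    kept = [layer for layer in layers if not layer.startswith(STEREO_PREFIXES)]
    return "/".join([head] + kept)


L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"

inchi_2d = strip_stereo_layers(L_ALANINE)
print(inchi_2d)
print(inchi_2d == strip_stereo_layers(D_ALANINE))  # both stereoisomers share one 2D InChI
print(L_ALANINE.startswith(inchi_2d))              # the 2D InChI is a prefix of the full one
```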
Hi Marcus,
Thank you so much for the explanation and for taking the time to review my issue. I am learning a lot about SIRIUS from this. When I opened this issue, I was using SIRIUS 4.4 and I mainly wanted to get the CANOPUS classification (negative mode data), and for that I needed the full InChIKey (I was using ClassyFireR). I have now updated to SIRIUS 4.5, so that problem was solved shortly after I posted my issue :)
Thank you for your time!
Nathalia
Ok, perfect. So this issue has been resolved.
Best, Marcus
Hello! I'd like to ask some questions and make a few observations about the output files. I noticed a few inconsistencies with my data and couldn't find answers anywhere. I hope you can help me out. I am using the Sirius CLI (v.4.4.29) on Ubuntu/Linux. I've been running positive and negative mode data from environmental samples. Thanks in advance!
1 - Is there a way to find out which formulas/structures are predicted in silico or predicted from database matches?
2 - Why does Sirius output a partial InChIkey (InChIkey2D)?
3 - I've been using compound_identifications.tsv as the final/summary output file. However, in my datasets, I noticed that the molecular formulas in the InChI and in the molecular formula column often don't match. I inspected structure_candidates.tsv and formula_candidates.tsv for each feature, and found that Sirius wasn't selecting the first-ranked structure and formula candidate. I was wondering if there is a reason for that. I updated my scripts to obtain the top-ranked structure and formula candidate from each feature instead of using that summary output file. With this adaptation, the molecular formula matches the structure (InChI column).
4 - I noticed a few inconsistencies with PubChem ids. Some PubChem ids returned by Sirius don't exist when I search online.
5 - I also got PubChem ids containing a dot (.). Searching for the id with the dot returns nothing, and removing the dot, with or without the digits after it, yields different compounds.
6 - I am using a program that needs the full InChIkey. I wrote a script to get it from PubChem (for those features with at least one PubChem id available). Because of the issues mentioned above, I search PubChem using the PubChem id and try to match the InChI, Smiles and partial InChIkey (InChIkey2D) given by Sirius against the PubChem properties for that id. In many instances, these three properties (InChI, Smiles, InChIkey) don't match what Sirius gives. This is especially complicated when there is more than one PubChem id for a single feature. Sometimes I get two ids matching all three, or none. It would be great to get some insight into how to sort this out.
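For the matching problem in point 6, one possible approach is to compare only the first block of the full InChIKey, which encodes the connectivity, with the partial key. The sketch below assumes that InChIkey2D corresponds to that first 14-character block, and that the PubChem records have already been retrieved into plain dictionaries. The record labels and the `inchikey` field name are hypothetical placeholders, not real PubChem ids or API fields.

```python
def first_block(inchikey):
    """Return the first (connectivity) block of a full InChIKey."""
    return inchikey.split("-")[0]


def records_matching_2d(inchikey_2d, records):
    """Keep the records whose full InChIKey starts with the given 2D block."""
    return [r for r in records if first_block(r["inchikey"]) == inchikey_2d]


# Placeholder labels; the keys belong to L-alanine, D-alanine,
# alanine without defined stereochemistry, and glycine.
records = [
    {"label": "record_a", "inchikey": "QNAYBMKLOCPYGJ-REOHSZNKSA-N"},
    {"label": "record_b", "inchikey": "QNAYBMKLOCPYGJ-UWTATZPHSA-N"},
    {"label": "record_c", "inchikey": "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"},
    {"label": "record_d", "inchikey": "DHMQDGOQFOQNFH-UHFFFAOYSA-N"},
]

matches = records_matching_2d("QNAYBMKLOCPYGJ", records)
print([r["label"] for r in matches])
```

When several records match, they share the same connectivity and differ only in stereochemistry, so the 2D information alone cannot pick one of them. Whichever record is chosen for a tool that needs the full key is then a choice made outside Sirius.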