sirius-ms / sirius

SIRIUS is software for de novo identification of metabolites using tandem mass spectrometry. This repository contains the code of the SIRIUS software (GUI and CLI).
GNU Affero General Public License v3.0

How to access CANOPUS confidence scores? #32

Closed wkumler closed 3 years ago

wkumler commented 3 years ago

Hi again,

I've been using SIRIUS + CANOPUS from the command line to get potential compound classes for untargeted metabolomic data, but I can't figure out how to access the data that's available from the GUI. This kind of output is incredible and super informative, but I'd like to be able to access it programmatically:

image

Specifically, I'm interested in the posterior probability for each compound class. The classes themselves are available in the canopus_summary.tsv file that's written out for the project as a whole, but I'd like to filter out the low-confidence class estimations. I can't seem to find those values in the individual compound files either; the "canopus" folder contains only a .fpt file apparently containing raw floating-point values from an unknown process.

image

Any advice would be great!

kaibioinfo commented 3 years ago

In your project space directory there should be a canopus.tsv file. This file lists all compound classes with their metadata and their relative index. The relative index (starting at 0) tells you which line of each CANOPUS .fpt file belongs to which compound class.

Alternatively, you can use the canopus_treemap Python library, which contains code for parsing the compound classes from the project space.

wkumler commented 3 years ago

Ah, I think I understand! CANOPUS evaluates each compound's suitability for every compound class in ClassyFire, and the .fpt file gives the confidence associated with each class. So a 0.9999 in the very first line of my .fpt file corresponds to a 0.9999 match to "Organic compounds", which is the very first line of the canopus.tsv file? And similarly, a 0.0001 in the second line of my .fpt file corresponds to a 0.0001 match to "Inorganic compounds", which is the second line of the canopus.tsv?

kaibioinfo commented 3 years ago

Correct.
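
For anyone who wants to do this programmatically, here is a minimal Python sketch of the mapping described above. It assumes the .fpt file holds one probability per line and that canopus.tsv is tab-separated with a header row. The column names `relativeIndex` and `name`, the function names, and the 0.5 threshold are assumptions made for illustration, so check them against your own files.

```python
import csv


def read_class_probabilities(canopus_tsv, fpt_file,
                             index_col="relativeIndex", name_col="name"):
    """Pair each line of a CANOPUS .fpt file with its compound class name.

    index_col and name_col are assumed header names in canopus.tsv;
    adjust them if your file uses different ones.
    """
    # Build a lookup from relative index to class name.
    with open(canopus_tsv, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        names = {int(row[index_col]): row[name_col] for row in reader}

    # One probability per line; line i (0-based) belongs to relative index i.
    with open(fpt_file, encoding="utf-8") as handle:
        probabilities = [float(line) for line in handle if line.strip()]

    if len(probabilities) != len(names):
        raise ValueError(
            f"{len(probabilities)} probabilities but {len(names)} classes"
        )
    return [(names[i], p) for i, p in enumerate(probabilities)]


def confident_classes(class_probabilities, threshold=0.5):
    """Keep only the classes whose probability is at least the threshold."""
    return [(name, p) for name, p in class_probabilities if p >= threshold]
```

Calling `read_class_probabilities` with the path to canopus.tsv and the path to one compound's .fpt file returns a list of (class name, probability) pairs in file order, and `confident_classes` then drops the low-confidence ones.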

wkumler commented 3 years ago

Fantastic, thanks!