statisticalbiotechnology / triqler

Source and example code for triqler (TRansparent Identification-Quantification-linked Error Rates)
Apache License 2.0

FASTA DB requirements #23

Open tobiasko opened 1 year ago

tobiasko commented 1 year ago

In your manuscript entitled "Triqler for Protein Summarization of Data from Data-Independent Acquisition Mass Spectrometry" you state that:

"The pipeline generated decoys for FDR calculations, which were discarded after DIA-NN processing. To circumvent the lack of decoys in output for Triqler, we concatenated shuffled entrapment sequences in the FASTA database."

Could you explain what these shuffled entrapment sequences are? Is this something one needs to add if the DIA-NN reports should be useable for triqler?

patruong commented 1 year ago

Hi Tobias,

Triqler needs decoys to calculate q-values. However, the PSMs in the report.tsv output from DIA-NN are usually not mapped to decoy proteins. To circumvent this, DIA-NN can be run with a spectral library that includes shuffled entrapment sequences. To do this, you first add shuffled entrapment sequences to your FASTA file before constructing the spectral library. These shuffled entrapment sequences are basically shuffled amino acid sequences of the proteins in the FASTA file.
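A minimal sketch of the FASTA step described above, assuming a plain FASTA input and a hypothetical `decoy_` prefix (the prefix, the function names and the fixed seed are illustrative assumptions, not part of Triqler or DIA-NN). It keeps every target entry and appends one shuffled copy of it, which gives a 50/50 target-to-shuffled database:

```python
import random

# Hypothetical prefix for the shuffled entries; use whatever prefix your
# downstream decoy detection is configured to look for.
DECOY_PREFIX = "decoy_"


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def shuffled_entry(header, sequence, rng):
    """Return a prefixed header and a random permutation of the residues."""
    residues = list(sequence)
    rng.shuffle(residues)
    return DECOY_PREFIX + header, "".join(residues)


def add_shuffled_entries(lines, seed=0):
    """Return each target entry followed by its shuffled counterpart."""
    rng = random.Random(seed)  # fixed seed so the database is reproducible
    entries = []
    for header, sequence in read_fasta(lines):
        entries.append((header, sequence))
        entries.append(shuffled_entry(header, sequence, rng))
    return entries


def write_fasta(entries, handle):
    """Write (header, sequence) pairs in FASTA format."""
    for header, sequence in entries:
        handle.write(f">{header}\n{sequence}\n")
```

Whether the prefix ends up at the start of the protein ID in report.tsv depends on how the FASTA headers are parsed, so it is worth checking the output on a small database first.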

Alternatively, you could use OpenSwathDecoyGenerator to add decoys to your spectral library, but this method has crashed on a couple of the data sets I tried it on. I am not sure why.

Hope this clarifies.

tobiasko commented 1 year ago

Hmmm... How would I do this when DIA-NN is run in library-free mode? I thought DIA-NN already uses decoys internally, because it outputs a Decoy.Evidence and a Decoy.CScore for each feature in the main report. Can't these be used by triqler?

tobiasko commented 1 year ago

The library-free search starts the in silico digestion from a target-only FASTA database. I guess decoy generation happens at the peptide or library level. One can write the resulting spectral library to disk, and it contains a column Decoy. Hence I guess the library is supplemented with decoy targets/transitions.

patruong commented 1 year ago

Indeed, DIA-NN already uses decoy peptides internally to compute the FDRs. However, these decoy peptides cannot be written to the output report.tsv.

I am not entirely sure what Decoy.Evidence and Decoy.CScore are used for, but they are floats, whereas Triqler decides whether an entry is a decoy by parsing the prefix of the protein name, i.e. it needs a binary indicator.
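To illustrate the difference between a prefix-based binary indicator and a float column, here is a small sketch. The `decoy_` prefix and the `is_decoy` helper are hypothetical and only show the idea; they are not Triqler's actual implementation:

```python
# Assumed decoy prefix; the real value depends on how the FASTA was built
# and how the downstream tool is configured.
DECOY_PREFIX = "decoy_"


def is_decoy(protein_id, prefix=DECOY_PREFIX):
    """Binary decoy indicator derived from the protein identifier alone."""
    return protein_id.startswith(prefix)


# A float such as Decoy.Evidence carries no such label by itself; it would
# first need an interpretation (for example a threshold) to become binary.
labels = [is_decoy(p) for p in ["P12345", "decoy_P12345"]]
```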

See:
DIA-NN generates decoy peptides: https://github.com/vdemichev/DiaNN/issues/6
DIA-NN cannot output the internally generated decoys as decoy proteins: https://github.com/vdemichev/DiaNN/issues/117
DIA-NN cannot output the internally generated decoy peptides: https://github.com/vdemichev/DiaNN/issues/468

tobiasko commented 1 year ago

Well I guess those floats are the scores and evidence values of the corresponding decoy entry. Instead of adding a new line for each decoy, it just denotes how the decoy scored (skipping the details of how the decoy entity is structured).

patruong commented 1 year ago

Hmm, interesting... I thought about that too, but I could not find any information about how to threshold the scoring. Perhaps the same threshold as for Mass.Evidence applies, where values between 0.5 and 1.0 are considered decoys. Perhaps Decoy.Evidence could be mapped to a binary indicator for the decoy PSM, and then the proteins these peptides belong to could be marked as decoys. Let me think about this. Perhaps @MatthewThe can give some more feedback on this?

tobiasko commented 1 year ago

Let's ask Vadim what it really contains ;-) I also couldn't find any documentation on this.

tobiasko commented 1 year ago

Do I get Clemens's suggestion correctly? He generates a target + decoy FASTA DB with a specific decoy prefix (50% target + 50% decoy) and runs this through DIA-NN (which internally generates decoys of decoys), only to get explicit reporting? That sounds pretty wild! And if the decoy function uses sequence reversal, a decoy of a decoy turns into a target again.
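The last point can be checked with a tiny sketch. It assumes plain whole-sequence reversal as the decoy function, which is only a hypothesis here; how DIA-NN actually builds its internal decoys is not established in this thread, and the peptide is made up:

```python
def reverse_decoy(sequence):
    """Hypothetical decoy function: reverse the whole sequence."""
    return sequence[::-1]


target = "MKTAYIAKQR"                     # made-up example peptide
decoy = reverse_decoy(target)             # explicit decoy in the FASTA
decoy_of_decoy = reverse_decoy(decoy)     # what an internal reversal would give

# Reversal is its own inverse, so the second reversal restores the target.
restored = decoy_of_decoy == target
```

Shuffling does not have this property, which may be one reason to prefer shuffled over reversed sequences for the explicit entries.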

patruong commented 1 year ago

Hmmm.. seems like it is redundant information.

Having a FASTA file with a 50/50 target-decoy ratio is correct. However, you might need to generate a separate spectral library before running DIA-NN in library mode. I can't recall whether it worked with a FASTA file without a spectral library, but it will certainly work with a spectral library.

Hahaha that's just funny :D... However, I'm not sure it works that way when they generate their decoy peptides.