Closed timosachsenberg closed 1 year ago
Hi Timo,
IIRC percolator is case sensitive - I generally used Mokapot instead, which was a bit more flexible in terms of file format. (Percolator also requires "proteins" to be the final column, but that the list of proteins are tab delimited themselves. This obviously breaks compatibility with standard TSV semantics/parsers, which was another point in favor of Mokapot).
Its also possible that compatibility with Mokapot/Percolator has shifted over time ๐ - let me do some testing. I'm happy to add an additional output format that is Percolator compatible
@timosachsenberg Which version of percolator are you targeting, and what flags are you using? I have started working on this and want to make sure we're compatible. Is a CLI flag for Sage sufficient (e.g. sage config.json --write-pin
)?
Also, have you considered just skipping percolator and using the default LDA scoring from Sage? Initial testing (on 20-file IonStar dataset) shows that LDA outperforms percolator (and is much faster) - but perhaps I am not using the proper percolator flags
A command line flag is totally sufficient. I have to admit that I did not experiment a lot with the LDA option and I donโt know about the implementation details. e.g. did you test this on entrapment datasets to make sure it does not overfit? I would be a bit surprised/excited if it performs in general better than the C-SVM. I hope to be able to look into it soon.
I wouldn't get too excited - by "outperform" I mean within a percentage point or two (this is dataset or search-parameter dependent, and there are cases where percolator outperforms - actually, it does on the example below, with an extra +0.2% of peptides)
Here are the results from an entrapment search of a single file from PXD001468 (20742 human sequences, 100373 non-human sequences [plant, archae, bacteria, viruses])
And here's another entrapment search on the IonStar dataset (PXD003881, 20 files, human + ecoli, ~633k target PSMs q<0.01):
In calculate the false match rate, did you multiple something like #{human peptides in database} / #{ecoli peptides in database}
? If not, the false discovery rate is underestimated.
Best,
Fengchao
Hi Fengchao,
Thanks for the input - I am using the definition from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5374549/
You're correct about the need for correcting for entrapment database size (~5x in this case) for an accurate FMR - not sure if that is the correct formula. I think it's (cumulative # of entrapments/total hits) * (# of all database sequences/# of entrapment sequences)
(assuming false hits are iid among the database, they have a ~80% chance of being drawn from the entrapment sequences) - but don't hold me to it ๐
I'm also not sure that it is completely relevant to the question at hand, which is whether LDA is bringing through more entrapment sequences than SVM (given that LDA q-value ~ SVM q-value, it is not)
Here is the above plot corrected (just using # of proteins, rather than tryptic sequences, because this is just a heuristic after all)
I think it's (cumulative # of entrapments/total hits) * (# of all database sequences/# of entrapment sequences) (assuming false hits are iid among the database, they have a ~80% chance of being drawn from the entrapment sequences) - but don't hold me to it
I also find it hard to figure out the very precise multiplication factor because it relates to 1) how to perform the search (separated or concatnated); 2) how to calculate the FDR ( d/t
or 2d/(d+t)
); and 3) if excluding the entrapment hits in the calculation... I would rather not jump into that rabbit hole now, so I said "multiple something like..." ;)
I'm also not sure that it is completely relevant to the question at hand, which is whether LDA is bringing through more entrapment sequences than SVM (given that LDA q-value ~ SVM q-value, it is not)
Apologies for the diversion. I couldn't help but think of this question when I saw your appearance.
Best,
Fengchao
Thanks for the catch, definitely an oversight on my part!
@timosachsenberg, the new release build should have percolator support, please let me know if it works!
Thanks! Output looks good at first glance... I will probably next week find some time to complete the parser.
hmm Charge columns probably need to be 1-hot encoded. Did you try this in percolator?
With charge as an integer/categorical value, I get "418968 test set PSMs with q<0.01" (PXD001819). With charge one-hot encoded, I get "419171 test set PSMs with q<0.01".
I can push out another release with this change
@timosachsenberg new release is now available with one-hot encoding
Thanks. That got me one step further.
It now works in mokapot but in percolator I still get:
../THIRDPARTY/Linux/64bit/Percolator/percolator -U /home/sachsenb/results.sage.pin Hyperparameters: selectionFdr=0.01, Cpos=0, Cneg=0, maxNiter=10 Reading tab-delimited input from datafile /home/sachsenb/results.sage.pin Features:
Exception caught: ERROR: Reading tab file, too few features present.
Hi Timo, can you provide some more details? I can run both mokapot & percolator successfully on (sigh) my machine.
Percolator version 3.06.0, Build Date May 11 2022 12:45:02
$ percolator -U results.sage.pin
Here's what the first lines of the generated .pin file look like for me:
SpecId | Label | ScanNr | ExpMass | CalcMass | FileName | retentiontime | rank | z=2 | z=3 | z=4 | z=5 | z=6 | z=other | peptide_len | missed_cleavages | isotope_error | ln(precursor_ppm) | fragment_ppm | ln(hyperscore) | ln(delta_next) | ln(delta_best) | aligned_rt | predicted_rt | sqrt(delta_rt_model) | matched_peaks | longest_b | longest_y | longest_y_pct | ln(matched_intensity_pct) | scored_candidates | ln(-poisson) | Peptide | Proteins |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 113989 | 1903.3513 | 1909.8514 | LFQ_Orbitrap_AIF_Ecoli_02.mzML | 72.46999 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 17 | 0 | 0.0 | 8.132845 | 2.821103 | 3.490338882432969 | 2.8170563628745775 | 0.0 | 0.5245401 | 0.5254065 | 0.03162278 | 11 | 0 | 11 | 0.64705884 | 3.678523 | 855 | 2.222583976053586 | FSGNYGNM[+15.9949]TEVSYQVAK | sp|P76177|YDGH_ECOLI |
1 | 1 | 60950 | 1591.2094 | 1583.692 | LFQ_Orbitrap_AIF_Ecoli_02.mzML | 38.802895 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 15 | 0 | 0.0 | 8.465419 | 2.6180208 | 3.4369142808821342 | 3.4369142808821342 | 0.0 | 0.27822053 | 0.2784992 | 0.03162278 | 10 | 0 | 10 | 0.6666667 | 3.6709907 | 819 | 2.1075992628966396 | MTSVGSQDTTGPM[+15.9949]TR | sp|P36683|ACNB_ECOLI |
2 | 1 | 55136 | 1591.2094 | 1583.692 | LFQ_Orbitrap_AIF_Ecoli_03.mzML | 35.08831 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 15 | 0 | 0.0 | 8.465419 | 2.0656602 | 3.448836534267552 | 3.448836534267552 | 0.0 | 0.27726635 | 0.2784992 | 0.035111718 | 10 | 0 | 10 | 0.6666667 | 3.712279 | 954 | 2.107782023996463 | MTSVGSQDTTGPM[+15.9949]TR | sp|P36683|ACNB_ECOLI |
3 | 1 | 55442 | 1591.2094 | 1583.692 | LFQ_Orbitrap_AIF_Ecoli_03.mzML | 35.2825 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 15 | 0 | 0.0 | 8.465419 | 3.3769214 | 3.439945634965063 | 3.439945634965063 | 0.0 | 0.2786852 | 0.2784992 | 0.03162278 | 10 | 0 | 10 | 0.6666667 | 3.65579 | 776 | 2.108171525973436 | MTSVGSQDTTGPM[+15.9949]TR | sp|P36683|ACNB_ECOLI |
4 | 1 | 60797 | 1591.2094 | 1583.692 | LFQ_Orbitrap_AIF_Ecoli_01.mzML | 38.70561 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 15 | 0 | 0.0 | 8.465419 | 2.7467542 | 3.448061036230416 | 3.448061036230416 | 0.0 | 0.27678335 | 0.2784992 | 0.041422687 | 10 | 0 | 10 | 0.6666667 | 3.8230393 | 808 | 2.1033762962406617 | MTSVGSQDTTGPM[+15.9949]TR | sp|P36683|ACNB_ECOLI |
5 | 1 | 55289 | 1591.2094 | 1583.692 | LFQ_Orbitrap_AIF_Ecoli_03.mzML | 35.185417 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 15 | 0 | 0.0 | 8.465419 | 3.3549595 | 3.5957309136789215 | 2.465971354498191 | 0.0 | 0.27797586 | 0.2784992 | 0.03162278 | 12 | 0 | 10 | 0.6666667 | 3.7506115 | 1016 | 2.3303709236891867 | MTSVGSQDTTGPM[+15.9949]TR | sp|P36683|ACNB_ECOLI |
6 | 1 | 60950 | 1591.2094 | 1583.692 | LFQ_Orbitrap_AIF_Ecoli_01.mzML | 38.802654 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 15 | 0 | 0.0 | 8.465419 | 3.2451446 | 3.428673461542098 | 3.428673461542098 | 0.0 | 0.27749372 | 0.2784992 | 0.03170916 | 10 | 0 | 9 | 0.6 | 3.569311 | 728 | 2.1092152454161295 | MTSVGSQDTTGPM[+15.9949]TR | sp|P36683|ACNB_ECOLI |
7 | 1 | 60491 | 1591.2094 | 1583.692 | LFQ_Orbitrap_AIF_Ecoli_01.mzML | 38.51149 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 15 | 0 | 0.0 | 8.465419 | 2.5542252 | 3.4377434422488573 | 3.4377434422488573 | 0.0 | 0.27536252 | 0.2784992 | 0.056005932 | 10 | 0 | 10 | 0.6666667 | 3.6932373 | 640 | 2.099752337379419 | MTSVGSQDTTGPM[+15.9949]TR | sp|P36683|ACNB_ECOLI |
8 | 1 | 125383 | 1815.3113 | 1808.7776 | LFQ_Orbitrap_AIF_Ecoli_02.mzML | 79.70125 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 17 | 0 | 0.0 | 8.192353 | 2.5355973 | 3.415371823795108 | 3.415371823795108 | 0.0 | 0.5774464 | 0.5740991 | 0.057855662 | 10 | 0 | 9 | 0.5294118 | 3.0685194 | 1210 | 2.1146302576038103 | ASEGFGEDTYLTSTM[+15.9949]GK | sp|P33363|BGLX_ECOLI |
will try with 3.06 (currently on 3.05) and report back
I currently get a segfault for one of my files in percolator. See https://github.com/percolator/percolator/issues/352 could be a problem in 3.06 if I did not mess anything up...
Are there any unresolved issues left? Seems like the previous issue was due to running without decoys, and isn't an issue with Sage's output
I will circle back soon... got side tracked. I think it is solved...
Hi, thanks for your great tool and for making it open-source! I just ran into a small issue. If I call percolator on the sage output I get an error:
could it be percolator is case-sensitive and doesn't recognize "scannr"? Probably same for other columns?