lazear / sage

Proteomics search & quantification so fast that it feels like magic
https://sage-docs.vercel.app
MIT License
210 stars 39 forks source link

Running sage output in percolator #53

Closed timosachsenberg closed 1 year ago

timosachsenberg commented 1 year ago

Hi, thanks for your great tool and for making it open-source! I just ran into a small issue. If I call percolator on the sage output I get an error:

Started Mon Mar 20 10:45:47 2023
Hyperparameters: selectionFdr=0.01, Cpos=0, Cneg=0, maxNiter=10
Reading tab-delimited input from datafile results.sage.tsv

WARNING: Tab delimited input does not contain ScanNr column,
         scan numbers will be assigned automatically.

Features:
num_proteins
Exception caught: ERROR: Reading tab file, error reading PSM on line 2. Could not read label.

could it be percolator is case-sensitive and doesn't recognize "scannr"? Probably same for other columns?

lazear commented 1 year ago

Hi Timo,

IIRC percolator is case sensitive - I generally used Mokapot instead, which was a bit more flexible in terms of file format. (Percolator also requires "proteins" to be the final column, but that the list of proteins are tab delimited themselves. This obviously breaks compatibility with standard TSV semantics/parsers, which was another point in favor of Mokapot).

Its also possible that compatibility with Mokapot/Percolator has shifted over time ๐Ÿ™ƒ - let me do some testing. I'm happy to add an additional output format that is Percolator compatible

lazear commented 1 year ago

@timosachsenberg Which version of percolator are you targeting, and what flags are you using? I have started working on this and want to make sure we're compatible. Is a CLI flag for Sage sufficient (e.g. sage config.json --write-pin)?

Also, have you considered just skipping percolator and using the default LDA scoring from Sage? Initial testing (on 20-file IonStar dataset) shows that LDA outperforms percolator (and is much faster) - but perhaps I am not using the proper percolator flags

timosachsenberg commented 1 year ago

A command line flag is totally sufficient. I have to admit that I did not experiment a lot with the LDA option and I donโ€™t know about the implementation details. e.g. did you test this on entrapment datasets to make sure it does not overfit? I would be a bit surprised/excited if it performs in general better than the C-SVM. I hope to be able to look into it soon.

lazear commented 1 year ago

I wouldn't get too excited - by "outperform" I mean within a percentage point or two (this is dataset or search-parameter dependent, and there are cases where percolator outperforms - actually, it does on the example below, with an extra +0.2% of peptides)

Here are the results from an entrapment search of a single file from PXD001468 (20742 human sequences, 100373 non-human sequences [plant, archae, bacteria, viruses])

image

image

lazear commented 1 year ago

And here's another entrapment search on the IonStar dataset (PXD003881, 20 files, human + ecoli, ~633k target PSMs q<0.01): image

fcyu commented 1 year ago

In calculate the false match rate, did you multiple something like #{human peptides in database} / #{ecoli peptides in database}? If not, the false discovery rate is underestimated.

Best,

Fengchao

lazear commented 1 year ago

Hi Fengchao,

Thanks for the input - I am using the definition from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5374549/

You're correct about the need for correcting for entrapment database size (~5x in this case) for an accurate FMR - not sure if that is the correct formula. I think it's (cumulative # of entrapments/total hits) * (# of all database sequences/# of entrapment sequences) (assuming false hits are iid among the database, they have a ~80% chance of being drawn from the entrapment sequences) - but don't hold me to it ๐Ÿ˜„

I'm also not sure that it is completely relevant to the question at hand, which is whether LDA is bringing through more entrapment sequences than SVM (given that LDA q-value ~ SVM q-value, it is not)

Here is the above plot corrected (just using # of proteins, rather than tryptic sequences, because this is just a heuristic after all) image

fcyu commented 1 year ago

I think it's (cumulative # of entrapments/total hits) * (# of all database sequences/# of entrapment sequences) (assuming false hits are iid among the database, they have a ~80% chance of being drawn from the entrapment sequences) - but don't hold me to it

I also find it hard to figure out the very precise multiplication factor because it relates to 1) how to perform the search (separated or concatnated); 2) how to calculate the FDR ( d/t or 2d/(d+t)); and 3) if excluding the entrapment hits in the calculation... I would rather not jump into that rabbit hole now, so I said "multiple something like..." ;)

I'm also not sure that it is completely relevant to the question at hand, which is whether LDA is bringing through more entrapment sequences than SVM (given that LDA q-value ~ SVM q-value, it is not)

Apologies for the diversion. I couldn't help but think of this question when I saw your appearance.

Best,

Fengchao

lazear commented 1 year ago

Thanks for the catch, definitely an oversight on my part!

lazear commented 1 year ago

@timosachsenberg, the new release build should have percolator support, please let me know if it works!

timosachsenberg commented 1 year ago

Thanks! Output looks good at first glance... I will probably next week find some time to complete the parser.

timosachsenberg commented 1 year ago

hmm Charge columns probably need to be 1-hot encoded. Did you try this in percolator?

lazear commented 1 year ago

With charge as an integer/categorical value, I get "418968 test set PSMs with q<0.01" (PXD001819). With charge one-hot encoded, I get "419171 test set PSMs with q<0.01".

I can push out another release with this change

lazear commented 1 year ago

@timosachsenberg new release is now available with one-hot encoding

timosachsenberg commented 1 year ago

Thanks. That got me one step further.

It now works in mokapot but in percolator I still get:

../THIRDPARTY/Linux/64bit/Percolator/percolator -U /home/sachsenb/results.sage.pin Hyperparameters: selectionFdr=0.01, Cpos=0, Cneg=0, maxNiter=10 Reading tab-delimited input from datafile /home/sachsenb/results.sage.pin Features:

Exception caught: ERROR: Reading tab file, too few features present.

thrown in https://github.com/percolator/percolator/blob/c34cafbb2f0a96b921ab86b4626980cbf25719da/src/SetHandler.cpp#L449

lazear commented 1 year ago

Hi Timo, can you provide some more details? I can run both mokapot & percolator successfully on (sigh) my machine.

Percolator version 3.06.0, Build Date May 11 2022 12:45:02
$ percolator -U results.sage.pin

Here's what the first lines of the generated .pin file look like for me:

SpecId Label ScanNr ExpMass CalcMass FileName retentiontime rank z=2 z=3 z=4 z=5 z=6 z=other peptide_len missed_cleavages isotope_error ln(precursor_ppm) fragment_ppm ln(hyperscore) ln(delta_next) ln(delta_best) aligned_rt predicted_rt sqrt(delta_rt_model) matched_peaks longest_b longest_y longest_y_pct ln(matched_intensity_pct) scored_candidates ln(-poisson) Peptide Proteins
0 1 113989 1903.3513 1909.8514 LFQ_Orbitrap_AIF_Ecoli_02.mzML 72.46999 1 1 0 0 0 0 0 17 0 0.0 8.132845 2.821103 3.490338882432969 2.8170563628745775 0.0 0.5245401 0.5254065 0.03162278 11 0 11 0.64705884 3.678523 855 2.222583976053586 FSGNYGNM[+15.9949]TEVSYQVAK sp|P76177|YDGH_ECOLI
1 1 60950 1591.2094 1583.692 LFQ_Orbitrap_AIF_Ecoli_02.mzML 38.802895 1 1 0 0 0 0 0 15 0 0.0 8.465419 2.6180208 3.4369142808821342 3.4369142808821342 0.0 0.27822053 0.2784992 0.03162278 10 0 10 0.6666667 3.6709907 819 2.1075992628966396 MTSVGSQDTTGPM[+15.9949]TR sp|P36683|ACNB_ECOLI
2 1 55136 1591.2094 1583.692 LFQ_Orbitrap_AIF_Ecoli_03.mzML 35.08831 1 1 0 0 0 0 0 15 0 0.0 8.465419 2.0656602 3.448836534267552 3.448836534267552 0.0 0.27726635 0.2784992 0.035111718 10 0 10 0.6666667 3.712279 954 2.107782023996463 MTSVGSQDTTGPM[+15.9949]TR sp|P36683|ACNB_ECOLI
3 1 55442 1591.2094 1583.692 LFQ_Orbitrap_AIF_Ecoli_03.mzML 35.2825 1 1 0 0 0 0 0 15 0 0.0 8.465419 3.3769214 3.439945634965063 3.439945634965063 0.0 0.2786852 0.2784992 0.03162278 10 0 10 0.6666667 3.65579 776 2.108171525973436 MTSVGSQDTTGPM[+15.9949]TR sp|P36683|ACNB_ECOLI
4 1 60797 1591.2094 1583.692 LFQ_Orbitrap_AIF_Ecoli_01.mzML 38.70561 1 1 0 0 0 0 0 15 0 0.0 8.465419 2.7467542 3.448061036230416 3.448061036230416 0.0 0.27678335 0.2784992 0.041422687 10 0 10 0.6666667 3.8230393 808 2.1033762962406617 MTSVGSQDTTGPM[+15.9949]TR sp|P36683|ACNB_ECOLI
5 1 55289 1591.2094 1583.692 LFQ_Orbitrap_AIF_Ecoli_03.mzML 35.185417 1 1 0 0 0 0 0 15 0 0.0 8.465419 3.3549595 3.5957309136789215 2.465971354498191 0.0 0.27797586 0.2784992 0.03162278 12 0 10 0.6666667 3.7506115 1016 2.3303709236891867 MTSVGSQDTTGPM[+15.9949]TR sp|P36683|ACNB_ECOLI
6 1 60950 1591.2094 1583.692 LFQ_Orbitrap_AIF_Ecoli_01.mzML 38.802654 1 1 0 0 0 0 0 15 0 0.0 8.465419 3.2451446 3.428673461542098 3.428673461542098 0.0 0.27749372 0.2784992 0.03170916 10 0 9 0.6 3.569311 728 2.1092152454161295 MTSVGSQDTTGPM[+15.9949]TR sp|P36683|ACNB_ECOLI
7 1 60491 1591.2094 1583.692 LFQ_Orbitrap_AIF_Ecoli_01.mzML 38.51149 1 1 0 0 0 0 0 15 0 0.0 8.465419 2.5542252 3.4377434422488573 3.4377434422488573 0.0 0.27536252 0.2784992 0.056005932 10 0 10 0.6666667 3.6932373 640 2.099752337379419 MTSVGSQDTTGPM[+15.9949]TR sp|P36683|ACNB_ECOLI
8 1 125383 1815.3113 1808.7776 LFQ_Orbitrap_AIF_Ecoli_02.mzML 79.70125 1 1 0 0 0 0 0 17 0 0.0 8.192353 2.5355973 3.415371823795108 3.415371823795108 0.0 0.5774464 0.5740991 0.057855662 10 0 9 0.5294118 3.0685194 1210 2.1146302576038103 ASEGFGEDTYLTSTM[+15.9949]GK sp|P33363|BGLX_ECOLI
timosachsenberg commented 1 year ago

will try with 3.06 (currently on 3.05) and report back

timosachsenberg commented 1 year ago

I currently get a segfault for one of my files in percolator. See https://github.com/percolator/percolator/issues/352 could be a problem in 3.06 if I did not mess anything up...

lazear commented 1 year ago

Are there any unresolved issues left? Seems like the previous issue was due to running without decoys, and isn't an issue with Sage's output

timosachsenberg commented 1 year ago

I will circle back soon... got side tracked. I think it is solved...