Running sage output in percolator

timosachsenberg commented 1 year ago

Hi, thanks for your great tool and for making it open-source! I just ran into a small issue. If I call percolator on the sage output I get an error:

Started Mon Mar 20 10:45:47 2023
Hyperparameters: selectionFdr=0.01, Cpos=0, Cneg=0, maxNiter=10
Reading tab-delimited input from datafile results.sage.tsv

WARNING: Tab delimited input does not contain ScanNr column,
         scan numbers will be assigned automatically.

Features:
num_proteins
Exception caught: ERROR: Reading tab file, error reading PSM on line 2. Could not read label.

could it be percolator is case-sensitive and doesn't recognize "scannr"? Probably same for other columns?

lazear commented 1 year ago

Hi Timo,

IIRC percolator is case sensitive - I generally used Mokapot instead, which was a bit more flexible in terms of file format. (Percolator also requires "proteins" to be the final column, but that the list of proteins are tab delimited themselves. This obviously breaks compatibility with standard TSV semantics/parsers, which was another point in favor of Mokapot).

Its also possible that compatibility with Mokapot/Percolator has shifted over time 🙃 - let me do some testing. I'm happy to add an additional output format that is Percolator compatible

lazear commented 1 year ago

@timosachsenberg Which version of percolator are you targeting, and what flags are you using? I have started working on this and want to make sure we're compatible. Is a CLI flag for Sage sufficient (e.g. sage config.json --write-pin)?

Also, have you considered just skipping percolator and using the default LDA scoring from Sage? Initial testing (on 20-file IonStar dataset) shows that LDA outperforms percolator (and is much faster) - but perhaps I am not using the proper percolator flags

timosachsenberg commented 1 year ago

A command line flag is totally sufficient. I have to admit that I did not experiment a lot with the LDA option and I don’t know about the implementation details. e.g. did you test this on entrapment datasets to make sure it does not overfit? I would be a bit surprised/excited if it performs in general better than the C-SVM. I hope to be able to look into it soon.

lazear commented 1 year ago

I wouldn't get too excited - by "outperform" I mean within a percentage point or two (this is dataset or search-parameter dependent, and there are cases where percolator outperforms - actually, it does on the example below, with an extra +0.2% of peptides)

Here are the results from an entrapment search of a single file from PXD001468 (20742 human sequences, 100373 non-human sequences [plant, archae, bacteria, viruses])

lazear commented 1 year ago

And here's another entrapment search on the IonStar dataset (PXD003881, 20 files, human + ecoli, ~633k target PSMs q<0.01):

fcyu commented 1 year ago

In calculate the false match rate, did you multiple something like #{human peptides in database} / #{ecoli peptides in database}? If not, the false discovery rate is underestimated.

Best,

Fengchao

lazear commented 1 year ago

Hi Fengchao,

Thanks for the input - I am using the definition from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5374549/

You're correct about the need for correcting for entrapment database size (~5x in this case) for an accurate FMR - not sure if that is the correct formula. I think it's (cumulative # of entrapments/total hits) * (# of all database sequences/# of entrapment sequences) (assuming false hits are iid among the database, they have a ~80% chance of being drawn from the entrapment sequences) - but don't hold me to it 😄

I'm also not sure that it is completely relevant to the question at hand, which is whether LDA is bringing through more entrapment sequences than SVM (given that LDA q-value ~ SVM q-value, it is not)

Here is the above plot corrected (just using # of proteins, rather than tryptic sequences, because this is just a heuristic after all)

fcyu commented 1 year ago

I think it's (cumulative # of entrapments/total hits) * (# of all database sequences/# of entrapment sequences) (assuming false hits are iid among the database, they have a ~80% chance of being drawn from the entrapment sequences) - but don't hold me to it

I also find it hard to figure out the very precise multiplication factor because it relates to 1) how to perform the search (separated or concatnated); 2) how to calculate the FDR ( d/t or 2d/(d+t)); and 3) if excluding the entrapment hits in the calculation... I would rather not jump into that rabbit hole now, so I said "multiple something like..." ;)

I'm also not sure that it is completely relevant to the question at hand, which is whether LDA is bringing through more entrapment sequences than SVM (given that LDA q-value ~ SVM q-value, it is not)

Apologies for the diversion. I couldn't help but think of this question when I saw your appearance.

Best,

Fengchao

lazear commented 1 year ago

Thanks for the catch, definitely an oversight on my part!

lazear commented 1 year ago

@timosachsenberg, the new release build should have percolator support, please let me know if it works!

timosachsenberg commented 1 year ago

Thanks! Output looks good at first glance... I will probably next week find some time to complete the parser.

timosachsenberg commented 1 year ago

hmm Charge columns probably need to be 1-hot encoded. Did you try this in percolator?

lazear commented 1 year ago

With charge as an integer/categorical value, I get "418968 test set PSMs with q<0.01" (PXD001819). With charge one-hot encoded, I get "419171 test set PSMs with q<0.01".

I can push out another release with this change

lazear commented 1 year ago

@timosachsenberg new release is now available with one-hot encoding

timosachsenberg commented 1 year ago

Thanks. That got me one step further.

It now works in mokapot but in percolator I still get:

../THIRDPARTY/Linux/64bit/Percolator/percolator -U /home/sachsenb/results.sage.pin Hyperparameters: selectionFdr=0.01, Cpos=0, Cneg=0, maxNiter=10 Reading tab-delimited input from datafile /home/sachsenb/results.sage.pin Features:

Exception caught: ERROR: Reading tab file, too few features present.

thrown in https://github.com/percolator/percolator/blob/c34cafbb2f0a96b921ab86b4626980cbf25719da/src/SetHandler.cpp#L449

lazear commented 1 year ago

Hi Timo, can you provide some more details? I can run both mokapot & percolator successfully on (sigh) my machine.

Percolator version 3.06.0, Build Date May 11 2022 12:45:02
$ percolator -U results.sage.pin

Here's what the first lines of the generated .pin file look like for me:

SpecId	Label	ScanNr	ExpMass	CalcMass	FileName	retentiontime	rank	z=2	peptide_len	ln(precursor_ppm)	fragment_ppm	ln(hyperscore)	ln(delta_next)	aligned_rt	predicted_rt	sqrt(delta_rt_model)	matched_peaks	longest_y	longest_y_pct	ln(matched_intensity_pct)	scored_candidates	ln(-poisson)	Peptide	Proteins
0	1	113989	1903.3513	1909.8514	LFQ_Orbitrap_AIF_Ecoli_02.mzML	72.46999	1	1	17	8.132845	2.821103	3.490338882432969	2.8170563628745775	0.5245401	0.5254065	0.03162278	11	11	0.64705884	3.678523	855	2.222583976053586	FSGNYGNM[+15.9949]TEVSYQVAK	sp\|P76177\|YDGH_ECOLI
1	1	60950	1591.2094	1583.692	LFQ_Orbitrap_AIF_Ecoli_02.mzML	38.802895	1	1	15	8.465419	2.6180208	3.4369142808821342	3.4369142808821342	0.27822053	0.2784992	0.03162278	10	10	0.6666667	3.6709907	819	2.1075992628966396	MTSVGSQDTTGPM[+15.9949]TR	sp\|P36683\|ACNB_ECOLI
2	1	55136	1591.2094	1583.692	LFQ_Orbitrap_AIF_Ecoli_03.mzML	35.08831	1	1	15	8.465419	2.0656602	3.448836534267552	3.448836534267552	0.27726635	0.2784992	0.035111718	10	10	0.6666667	3.712279	954	2.107782023996463	MTSVGSQDTTGPM[+15.9949]TR	sp\|P36683\|ACNB_ECOLI
3	1	55442	1591.2094	1583.692	LFQ_Orbitrap_AIF_Ecoli_03.mzML	35.2825	1	1	15	8.465419	3.3769214	3.439945634965063	3.439945634965063	0.2786852	0.2784992	0.03162278	10	10	0.6666667	3.65579	776	2.108171525973436	MTSVGSQDTTGPM[+15.9949]TR	sp\|P36683\|ACNB_ECOLI
4	1	60797	1591.2094	1583.692	LFQ_Orbitrap_AIF_Ecoli_01.mzML	38.70561	1	1	15	8.465419	2.7467542	3.448061036230416	3.448061036230416	0.27678335	0.2784992	0.041422687	10	10	0.6666667	3.8230393	808	2.1033762962406617	MTSVGSQDTTGPM[+15.9949]TR	sp\|P36683\|ACNB_ECOLI
5	1	55289	1591.2094	1583.692	LFQ_Orbitrap_AIF_Ecoli_03.mzML	35.185417	1	1	15	8.465419	3.3549595	3.5957309136789215	2.465971354498191	0.27797586	0.2784992	0.03162278	12	10	0.6666667	3.7506115	1016	2.3303709236891867	MTSVGSQDTTGPM[+15.9949]TR	sp\|P36683\|ACNB_ECOLI
6	1	60950	1591.2094	1583.692	LFQ_Orbitrap_AIF_Ecoli_01.mzML	38.802654	1	1	15	8.465419	3.2451446	3.428673461542098	3.428673461542098	0.27749372	0.2784992	0.03170916	10	9	0.6	3.569311	728	2.1092152454161295	MTSVGSQDTTGPM[+15.9949]TR	sp\|P36683\|ACNB_ECOLI
7	1	60491	1591.2094	1583.692	LFQ_Orbitrap_AIF_Ecoli_01.mzML	38.51149	1	1	15	8.465419	2.5542252	3.4377434422488573	3.4377434422488573	0.27536252	0.2784992	0.056005932	10	10	0.6666667	3.6932373	640	2.099752337379419	MTSVGSQDTTGPM[+15.9949]TR	sp\|P36683\|ACNB_ECOLI
8	1	125383	1815.3113	1808.7776	LFQ_Orbitrap_AIF_Ecoli_02.mzML	79.70125	1	1	17	8.192353	2.5355973	3.415371823795108	3.415371823795108	0.5774464	0.5740991	0.057855662	10	9	0.5294118	3.0685194	1210	2.1146302576038103	ASEGFGEDTYLTSTM[+15.9949]GK	sp\|P33363\|BGLX_ECOLI

timosachsenberg commented 1 year ago

will try with 3.06 (currently on 3.05) and report back

timosachsenberg commented 1 year ago

I currently get a segfault for one of my files in percolator. See https://github.com/percolator/percolator/issues/352 could be a problem in 3.06 if I did not mess anything up...

lazear commented 1 year ago

Are there any unresolved issues left? Seems like the previous issue was due to running without decoys, and isn't an issue with Sage's output

timosachsenberg commented 1 year ago

I will circle back soon... got side tracked. I think it is solved...

lazear / sage

Running sage output in percolator #53