const-ae opened this issue 4 years ago
We just added some converters for MaxQuant: https://github.com/statisticalbiotechnology/triqler/wiki/Converters
It uses a MaxQuant output file called evidence.txt and it's best to raise the FDR threshold to include more decoys. We have run some tests ourselves, but it's still a bit experimental so it would be interesting to see your results!
Thank you for your quick reply.
I did see the converter for MaxQuant, which is very helpful. But the problem is that the protein names in my evidence file never contain the prefix decoy_. Thus, after conversion triqler throws an error that no decoy peptides were found.
My question now is: is there a specific setting in MaxQuant that I need to turn on to get those decoy peptides?
If not, I do find some proteins that have the prefix REV__, so do I have to call triqler like this:
triqler --decoy_pattern REV__ triqler_input.tsv
That's correct, the REV__ proteins correspond to the decoys and you can use the command you suggested. The only thing is that at the default FDR of 1% you will get very few decoys, so we recommend setting the FDR threshold to e.g. 100%. This will both allow triqler to potentially use more peptides and also results in better calibration of the error estimates.
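Before rerunning at 100% FDR, it can help to verify that the converted file actually contains decoys. A minimal sketch, assuming the standard tab-separated Triqler input with a "proteins" column (the function name and file layout here are illustrative, not part of Triqler):

```python
# Hypothetical sanity check: count decoy entries in a Triqler input file.
# Assumes a tab-separated file with a "proteins" column, as in the
# standard Triqler input format.
import csv

def count_decoys(path, decoy_pattern="REV__"):
    total = decoys = 0
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        for row in reader:
            total += 1
            # A row counts as a decoy if any of its proteins match the pattern.
            if decoy_pattern in row["proteins"]:
                decoys += 1
    return total, decoys
```

If this reports zero decoys, the --decoy_pattern value likely does not match the prefix MaxQuant used.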
Thank you very much, that information is very helpful already.
Just to make sure I run the program correctly: you mean the PSM FDR in the Identification tab of MaxQuant?
And a second question: what options are there for normalizing the intensities between samples? The example data that I am trying to analyze has a fairly large shift in one direction (1/3 of all proteins are upregulated in one condition), which causes a bias in the remaining proteins. Is there any option in triqler to correct for this if I want to identify differentially abundant proteins?
You have to set both the PSM and peptide level FDR to 100%.
EDIT: Apparently one also has to set the protein level FDR to 100% to get all the decoys in MaxQuant.
There is currently only one option for normalization, which assumes that the majority of proteins do not change between samples. In your case, this might indeed not be ideal and it would be wise to add the --skip_normalization flag when you use the MaxQuant->Triqler converter. We have actually been struggling a bit ourselves with how to handle datasets like yours without creating biases, so if you have any suggestions they would be very welcome.
You have to set both the PSM and peptide level FDR to 100%.
Thanks, that is good to know.
if you have any suggestions they would be very welcome
Yes, I agree that proper normalization is quite challenging, especially if the effect is so large in one direction. In this particular case, I use the fact that the experiment contains UPS standard protein spike-ins in addition to the human and E. coli proteins. I do a simple median normalization on those proteins. And to check if the normalization was successful, I check that the p-value histogram for the human proteins is uniform.
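The spike-in approach described above can be sketched as follows. This is a toy illustration of median normalization on a known reference protein set, not code from the actual analysis; the DataFrame layout (proteins as rows, samples as columns, log2 intensities) and protein names are assumptions:

```python
# Sketch of median normalization on a reference (spike-in) protein set.
# Rows are proteins, columns are samples, values are log2 intensities.
import pandas as pd

def normalize_on_reference(intensities: pd.DataFrame, reference_proteins) -> pd.DataFrame:
    ref = intensities.loc[intensities.index.isin(reference_proteins)]
    # Shift each sample so the reference proteins have the same median everywhere.
    offsets = ref.median() - ref.median().mean()
    return intensities - offsets
```

After normalizing, one can check (as described) that the p-value histogram of the background proteins looks uniform.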
I just ran some tests and it seems that you also need to set the protein level FDR to 100% to get all the decoys in MaxQuant, sorry for the inconvenience.
Regarding the normalization, in the same spirit as your approach, it would be possible to implement a function that only uses peptides that do not vary "too much" for normalization, e.g. less than
Thank you again for helping me get the program running :)
Do you think that is a reasonable approach?
Yes, if there is a set of peptides for which we a priori know that they shouldn't differ between samples, then one should use those peptides for the normalization.
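The "only use peptides that do not vary too much" idea suggested above could be sketched like this. The cutoff value and the data layout (peptides as rows, samples as columns, log2 intensities) are purely illustrative:

```python
# Sketch of the "stable peptides" normalization idea: keep peptides whose
# log2 intensities vary less than a chosen cutoff across samples, then
# derive per-sample median offsets from that subset only.
import pandas as pd

def stable_peptide_offsets(log_intensities: pd.DataFrame, max_range=0.5) -> pd.Series:
    # Per-peptide spread across samples.
    spread = log_intensities.max(axis=1) - log_intensities.min(axis=1)
    stable = log_intensities[spread < max_range]
    # Offsets to subtract from each sample so the stable peptides align.
    return stable.median() - stable.median().mean()
```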
I have now managed to run MaxQuant with the right settings and also do the downstream analysis with triqler.
The problem is that on this dataset triqler seems to have problems controlling the false discovery rate (FDR). Which, I guess, is not too surprising, because without a good normalization all human proteins will look shifted. But just to make sure that I am not missing something obvious: the q_value in the output table is the local FDR, defined by Storey, 2003, right? And the posterior_error_prob is a Bayesian version of the p-value?
Alright, I'll look into this alternative normalization procedure. Is it correct that you're currently skipping the normalization in the converter? I also realized that the set you're looking at is fractionated, which would probably completely mess up our current normalization procedure anyway.
Our q-value is the one from Storey 2003, though it is not the local FDR, as this term is usually reserved for the derivative of the FDR, which happens to be what we report as posterior_error_prob, see e.g. Efron 2005.
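The relation between the two quantities can be made concrete with the standard identity that the q-value at a given rank equals the average posterior error probability (PEP) of all hits at or above that rank. This is a generic illustration, not Triqler's actual implementation:

```python
# Illustration of the PEP -> q-value relation: sort by PEP and take the
# running mean; the q-value at rank k is the average PEP of the k best hits.
import numpy as np

def qvalues_from_peps(peps):
    peps = np.sort(np.asarray(peps, dtype=float))
    # Running mean of an already sorted sequence is non-decreasing,
    # so no extra monotonicity correction is needed here.
    return np.cumsum(peps) / np.arange(1, len(peps) + 1)
```

The PEP is thus the "local" error at a single hit, while the q-value is the accumulated error over everything accepted so far.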
Also, are you using matches-between-runs? So far we have tested without MBR, as we expect there to be extra errors coming from this process which are not yet included in Triqler.
[...] is it correct that you're currently skipping the normalization in the converter?
Yes, exactly.
I also realized that the set you're looking at is fractionated, which would probably completely mess up our current normalization procedure anyway.
Okay, thanks, that is good to know.
Our q-value is the one from Storey 2003, though it is not the local FDR, [...] which happens to be what we report as posterior_error_prob
Thank you for the correction. I will go through those papers again to make sure that I don't confuse the terms. Though, the FDR problem persists.
Also, are you using matches-between-runs?
I ran MaxQuant with MBR, but filtered out all rows in the evidence file that were quantified with MBR rather than directly measured, because I got parsing errors on the complete file. But thank you for the link to your paper, I will also add that one to my reading list :)
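For reference, the MBR filtering step described above can be done on the evidence file's Type column, where match-between-runs rows are (to my understanding of MaxQuant's output tables) labeled "MULTI-MATCH". A sketch with illustrative file paths:

```python
# Sketch of dropping match-between-runs rows from a MaxQuant evidence file.
# Assumes the evidence.txt "Type" column, where MBR-quantified rows are
# labeled "MULTI-MATCH" and directly identified rows e.g. "MULTI-MSMS".
import pandas as pd

def drop_mbr_rows(evidence_path, out_path):
    ev = pd.read_csv(evidence_path, sep="\t", low_memory=False)
    kept = ev[ev["Type"] != "MULTI-MATCH"]
    kept.to_csv(out_path, sep="\t", index=False)
    return len(ev), len(kept)
```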
Alright, that all sounds good! I'm still working on the alternative normalization procedure.
Very interesting to see your results, how far off are the FDR estimates? Would you be able to share your Triqler log file and perhaps even the Triqler input/output? We have previously had some problems estimating hyperparameters on mixed species datasets, due to the bi/multimodality of the distributions.
Great.
Well, quite a bit, but that is what I observe for all tools if the data is not normalized. Here is the log file from triqler (slurm.bx06-17.39308207.log), which I ran with the following command:
triqler \
--fold_change_eval 0 \
--decoy_pattern REV__ \
--out_file MS_results_tmp3/triqler_result.tsv \
triqler_input_cond_fixed.tsv
For the input/output I will try to find a good platform to share them, because the files are quite big (evidence: 1.1 GB, triqler_input: 265 MB, triqler_result: 15 MB).
I uploaded the files to the owncloud of my institute: https://oc.embl.de/index.php/s/Ag9smCQYuyR8DDS (The link will stop working in two weeks).
Thanks a lot! I downloaded the files and will take a look tomorrow.
From the logs I already noticed that something is going wrong with calculating the peptide and protein identification error probabilities. Also, --fold-change-eval is meant to be set to a value higher than 0, usually somewhere in the range between 0.8 and 1.2.
From the logs I already noticed that something is going wrong with calculating the peptide and protein identification error probabilities
Okay interesting. Thanks for letting me know.
--fold-change-eval is meant to be set to a value higher than 0, usually somewhere in the range between 0.8 and 1.2.
From the documentation, I thought that it would be a filter on the fold change and everything with a smaller fold change would be ignored. Is that wrong? I agree that for a biological analysis it can make sense to ignore significant proteins with a very small fold change, but for the benchmarking it should be fine to keep them.
The --fold-change-eval flag does not filter out proteins, but provides a 'soft threshold' relative to which we calculate the probability that a given protein has a fold change greater than the specified value. This is indeed a bit different from how fold change thresholds are usually applied, but it is far more natural in the Bayesian framework.
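The 'soft threshold' idea can be illustrated with a toy calculation: given posterior samples of a protein's log2 fold change, report the probability that its magnitude exceeds the evaluation threshold, instead of hard-filtering on a point estimate. This is a made-up illustration, not Triqler's internal code:

```python
# Toy illustration of a Bayesian 'soft threshold' on fold change:
# estimate P(|log2 fold change| > fold_change_eval) from posterior samples.
import numpy as np

def prob_fold_change_exceeds(posterior_samples, fold_change_eval=0.8):
    samples = np.abs(np.asarray(posterior_samples, dtype=float))
    # Fraction of posterior mass beyond the threshold.
    return float(np.mean(samples > fold_change_eval))
```

A protein with most of its posterior mass beyond the threshold gets a high probability even if no hard cutoff is ever applied.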
I looked at your data and there are indeed some issues that cause Triqler some problems: the hyperparameter fit plot (python -m triqler.distribution.plot_hyperparameter_fits) indeed showed that this distribution looks bimodal due to the mixed-species setup. Triqler defaults to a set of hyperparameters based on previously analyzed datasets, but there are unfortunately no guarantees that these are reasonable.
Hi Matthew,
I am currently trying to run triqler on an existing dataset (PXD000279) for a benchmarking experiment.
My problem is that I can't get the input into a format that is acceptable to triqler. As far as I understand, triqler needs the peptides from decoy proteins to estimate the error rate, but I cannot figure out how to make MaxQuant output the decoy hits. I do see that MaxQuant provides a column called Reverse, is that one equivalent?

Best Regards, Constantin