statisticalbiotechnology / triqler

Source and example code for triqler (TRansparent Identification-Quantification-linked Error Rates)
Apache License 2.0

MaxQuant Input #10

Open const-ae opened 4 years ago

const-ae commented 4 years ago

Hi Matthew,

I am currently trying to run triqler on an existing dataset (PXD000279) for a benchmarking experiment.

My problem is that I can't get the input into a format that triqler accepts. As far as I understand, triqler needs the peptides from decoy proteins to estimate the error rate, but I cannot figure out how to make MaxQuant output the decoy hits. I do see that MaxQuant provides a column called Reverse; is that one equivalent?

Best Regards, Constantin

MatthewThe commented 4 years ago

We just added some converters for MaxQuant: https://github.com/statisticalbiotechnology/triqler/wiki/Converters

It uses a MaxQuant output file called evidence.txt and it's best to raise the FDR threshold to include more decoys. We have run some tests ourselves, but it's still a bit experimental so it would be interesting to see your results!

const-ae commented 4 years ago

Thank you for your quick reply.

I did see the converter for MaxQuant, which is very helpful. But the problem is that the protein names in my evidence file never contain the prefix decoy_. Thus, after conversion triqler throws an error that no decoy peptides were found.

My question now is: is there a specific setting in MaxQuant that I need to turn on to get those decoy peptides?

If not, I do find some proteins that have the prefix REV__, so do I have to call triqler like this:

triqler --decoy_pattern REV__ triqler_input.tsv

MatthewThe commented 4 years ago

That's correct, the REV__ proteins correspond to the decoys and you can use the command you suggested. The only thing is that at the default FDR of 1% you will get very few decoys, so we recommend setting the FDR threshold to e.g. 100%. This will both allow triqler to potentially use more peptides and also results in better calibration of the error estimates.

const-ae commented 4 years ago

Thank you very much, that information is very helpful already. Just to make sure I run the program correctly: do you mean the PSM FDR in the Identification tab of MaxQuant?

And a second question: what options are there for normalizing the intensities between samples? The example data that I am trying to analyze shows a fairly large effect in one direction (1/3 of all proteins are upregulated in one condition), which causes a bias in the remaining proteins. Is there any option in triqler to correct for this if I want to identify differentially abundant proteins?

MatthewThe commented 4 years ago

You have to set both the PSM and peptide level FDR to 100%.

EDIT: Apparently one also has to set the protein level FDR to 100% to get all the decoys in MaxQuant.

There is currently only one option for normalization, which assumes that the majority of proteins do not change between samples. In your case, this might indeed not be ideal and it would be wise to add the --skip_normalization flag when you use the MaxQuant->Triqler converter. We have actually been struggling a bit ourselves with how to handle datasets like yours without creating biases, so if you have any suggestions they would be very welcome.

const-ae commented 4 years ago

> You have to set both the PSM and peptide level FDR to 100%.

Thanks, that is good to know.

> if you have any suggestions they would be very welcome

Yes, I agree that proper normalization is quite challenging, especially if the effect is so large in one direction. In this particular case, I use the fact that the experiment contains UPS standard protein spike-ins in addition to the human and E. coli proteins. I do a simple median normalization on those proteins. And to check if the normalization was successful, I check that the p-value histogram for the human proteins is uniform.
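The spike-in approach above could be sketched roughly as follows. This is a minimal illustration with made-up data, not code from either tool; the `is_spike_in` mask marking the UPS proteins is an assumption about how the annotation would be available:

```python
import numpy as np
import pandas as pd

# Made-up log2 intensity matrix: proteins (rows) x samples (columns).
rng = np.random.default_rng(0)
intensities = pd.DataFrame(
    rng.normal(20, 2, size=(100, 6)),
    columns=[f"sample_{i}" for i in range(6)],
)
# Pretend the first 10 proteins are the UPS spike-ins, which are
# assumed constant across samples.
is_spike_in = np.zeros(100, dtype=bool)
is_spike_in[:10] = True

# Shift each sample so the spike-in medians line up.
spike_medians = intensities[is_spike_in].median(axis=0)
offsets = spike_medians - spike_medians.median()
normalized = intensities - offsets

# After normalization the spike-in medians agree across all samples.
print(normalized[is_spike_in].median(axis=0))
```

The uniformity check on the human (null) proteins' p-value histogram would then be a separate diagnostic on the downstream test results.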

MatthewThe commented 4 years ago

I just ran some tests and it seems that you also need to set the protein level FDR to 100% to get all the decoys in MaxQuant, sorry for the inconvenience.

Regarding the normalization, in the same spirit as your approach, it would be possible to implement a function that only uses peptides that do not vary "too much" for normalization, e.g. peptides with a log2 fold change below some threshold, where the threshold can be set as a parameter. Do you think that is a reasonable approach?
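That proposal could be sketched as below. This is a hypothetical helper, not part of Triqler, and the parameter name `max_log2_fc` is made up:

```python
import numpy as np

def stable_peptide_offsets(log_intensities, max_log2_fc=1.0):
    """Per-sample median offsets computed only from peptides whose
    between-sample log2 spread stays below max_log2_fc."""
    spread = log_intensities.max(axis=1) - log_intensities.min(axis=1)
    stable = log_intensities[spread < max_log2_fc]  # peptides that barely change
    col_medians = np.median(stable, axis=0)
    return col_medians - col_medians.mean()  # shifts to subtract per sample

# Made-up data: 50 peptides x 4 samples, with a systematic shift in sample 3.
rng = np.random.default_rng(1)
data = rng.normal(20, 0.2, size=(50, 4))
data[:, 2] += 0.5
offsets = stable_peptide_offsets(data)
print(offsets)  # the offset for sample 3 is the largest
```

The caveat, as noted above, is that a strong one-sided effect can leak into the "stable" set unless the threshold is chosen carefully or an a priori invariant set is used instead.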

const-ae commented 4 years ago

Thank you again for helping me getting the program running :)

> Do you think that is a reasonable approach?

Yes, if there is a set of peptides for which we know a priori that they shouldn't differ between samples, then one should use those peptides for the normalization.

I have now managed to run MaxQuant with the right settings and also to do the downstream analysis with triqler. The problem is that on this dataset triqler seems to have problems controlling the false discovery rate (FDR). Which, I guess, is not too surprising, because without a good normalization all human proteins will look shifted. But just to make sure that I am not missing something obvious: the q_value in the output table is the local FDR, as defined by Storey, 2003, right? And the posterior_error_prob is a Bayesian version of the p-value?

MatthewThe commented 4 years ago

Alright, I'll look into this alternative normalization procedure, is it correct that you're currently skipping the normalization in the converter? I also realized that the set you're looking at is fractionated, which would probably completely mess up our current normalization procedure anyway.

Our q-value is the one from Storey 2003, though it is not the local FDR, as this term is usually reserved for the derivative of the FDR, which happens to be what we report as posterior_error_prob, see e.g. Efron 2005
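To make the distinction concrete, here is a toy illustration (not Triqler's actual code) of the relation: the PEP is the local error rate of an individual item, and a q-value can be obtained as the running mean of the sorted PEPs, so each q-value is never larger than the corresponding PEP:

```python
import numpy as np

def qvalues_from_peps(peps):
    """q-value of each item = mean PEP of all items with a PEP at least
    as small, i.e. the running mean over the sorted PEPs."""
    peps = np.asarray(peps, dtype=float)
    order = np.argsort(peps)
    running_mean = np.cumsum(peps[order]) / np.arange(1, len(peps) + 1)
    qvals = np.empty_like(running_mean)
    qvals[order] = running_mean  # map back to the original item order
    return qvals

peps = [0.01, 0.5, 0.002, 0.2]
print(qvalues_from_peps(peps))
```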

Also, are you using matches-between-runs? So far we have tested without MBR, as we expect there to be extra errors coming from this process which are not yet included in Triqler.

const-ae commented 4 years ago

> [...] is it correct that you're currently skipping the normalization in the converter?

Yes, exactly.

> I also realized that the set you're looking at is fractionated, which would probably completely mess up our current normalization procedure anyway.

Okay, thanks that is good to know.

> Our q-value is the one from Storey 2003, though it is not the local FDR, [...] which happens to be what we report as posterior_error_prob

Thank you for the correction. I will go through those papers again to make sure that I don't confuse the terms. Though, the FDR problem persists.

> Also, are you using matches-between-runs?

I run MaxQuant with MBR, but filter out all rows in the evidence file that were quantified with MBR and not directly measured, because I got parsing errors on the complete file. But thank you for the link to your paper, I will also add that one to my reading list :)
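That filtering step might look like the sketch below. Note that the "Type" value "MULTI-MATCH" marking MBR-transferred identifications is my assumption about the MaxQuant evidence.txt format and should be verified against your own file:

```python
import io
import pandas as pd

# Tiny stand-in for a MaxQuant evidence.txt (tab-separated).
evidence_txt = io.StringIO(
    "Sequence\tType\tIntensity\n"
    "PEPTIDEA\tMULTI-MSMS\t1000\n"
    "PEPTIDEB\tMULTI-MATCH\t2000\n"  # transferred by MBR, no direct MS/MS
    "PEPTIDEC\tMSMS\t1500\n"
)
evidence = pd.read_csv(evidence_txt, sep="\t")

# Keep only rows with a direct identification, dropping MBR transfers.
directly_measured = evidence[evidence["Type"] != "MULTI-MATCH"]
print(len(directly_measured))
```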

MatthewThe commented 4 years ago

Alright, that all sounds good! I'm still working on the alternative normalization procedure.

It would be very interesting to see your results: how far off are the FDR estimates? Would you be able to share your Triqler log file and perhaps even the Triqler input/output? We have previously had some problems estimating hyperparameters on mixed species datasets, due to the bi/multimodality of the distributions.

const-ae commented 4 years ago

Great.

Well, quite a bit, but that is what I observe for all tools when the data is not normalized. Here is the log file from triqler (slurm.bx06-17.39308207.log), which I ran with the following command:

triqler \
  --fold_change_eval 0 \
  --decoy_pattern REV__  \
  --out_file MS_results_tmp3/triqler_result.tsv \
  triqler_input_cond_fixed.tsv

For the input / output I will try to see if I find a good platform to share, because they are quite big (evidence: 1.1 GB, triqler_input: 265 MB, triqler_result: 15 MB).

const-ae commented 4 years ago

I uploaded the files to the owncloud of my institute: https://oc.embl.de/index.php/s/Ag9smCQYuyR8DDS (The link will stop working in two weeks).

MatthewThe commented 4 years ago

Thanks a lot! I downloaded the files and will take a look tomorrow.

From the logs I already noticed that something is going wrong with calculating the peptide and protein identification error probabilities. Also, --fold_change_eval is meant to be set to a value higher than 0, usually somewhere in the range between 0.8 and 1.2.

const-ae commented 4 years ago

> From the logs I already noticed that something is going wrong with calculating the peptide and protein identification error probabilities

Okay interesting. Thanks for letting me know.

> --fold_change_eval is meant to be set to a value higher than 0, usually somewhere in the range between 0.8 and 1.2.

From the documentation, I thought that it would be a filter on the fold change and everything with a smaller fold change would be ignored. Is that wrong? I agree that for a biological analysis it can make sense to ignore significant proteins with a very small fold change, but for the benchmarking it should be fine to keep them.

MatthewThe commented 4 years ago

The --fold_change_eval flag does not filter out proteins, but provides a 'soft threshold' relative to which we calculate the probability that a given protein has a fold change greater than the specified value. This is indeed a bit different from how fold change thresholds are usually applied, but it is far more natural in the Bayesian framework.
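As a toy illustration of that Bayesian reading (made-up posterior draws, not Triqler internals): instead of filtering on a point-estimate fold change, one evaluates the posterior probability that the absolute log2 fold change exceeds the chosen threshold.

```python
import numpy as np

# Pretend these are posterior samples of a protein's log2 fold change.
rng = np.random.default_rng(42)
posterior_log2fc = rng.normal(loc=1.1, scale=0.3, size=10_000)

# The 'soft threshold': P(|log2 FC| > fold_change_eval) under the posterior.
for fold_change_eval in (0.8, 1.0, 1.2):
    p = np.mean(np.abs(posterior_log2fc) > fold_change_eval)
    print(f"P(|log2 FC| > {fold_change_eval}) = {p:.3f}")
```

A higher threshold gives a lower probability, so no protein is hard-filtered; the evidence for exceeding the threshold is simply quantified.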

I looked at your data and there are indeed some issues that cause Triqler some problems: