statOmics / MSqRob

Robust statistical inference for quantitative LC-MS proteomics
Other
33 stars 11 forks source link

Argument "useful_properties" must only contain column names of the featureData slot. #49

Closed antortjim closed 6 years ago

antortjim commented 6 years ago

Hi there

First of all, thanks for developing such a useful tool and supporting moFF with Import2MSnSet.

I am trying to set up a quantification pipeline using the Compomics workflow (searchGUI, peptideShaker and moFF). I obtained a moFF file output for the MaxLFQ dataset (available https://www.ebi.ac.uk/pride/archive/projects/PXD000279) and I implemented a Levenberg-Marquandt minimisation similar to that explained in the paper, to normalize the XICs. That way, I get a moff file with 2+6 columns. However, when I try to preprocess the MSnSet

peptides <- import2MSnSet(file = file.path(output_moff, "peptide_summary_intensity_moFF_run_xic_norm.tab"),
                          filetype = "moFF")

runs <- paste0(rep(c("H", "L"), times=3), "_", rep(1:3, each = 2))
replicate <- rep(1:3, each = 2)  
exp_annotation <- data.frame(runs = runs, replicate = replicate)

peptides_proc <- preprocess_MSnSet(MSnSet = peptides, accession = "prot",
                                   exp_annotation = exp_annot, smallestUniqueGroups = T, split = ", ",
                                   useful_properties = c("prot", "peptide"))

I get this error

Error in preprocess_MSnSet(MSnSet = peptides, accession = "prot", exp_annotation = exp_annot, : Argument "useful_properties" must only contain column names of the featureData slot.

I set useful_properties to the name of the first two columns in the file, storing the protein group and the peptide sequence respectivally. Defaulting to NULL also gives the error. These are the only 2 slots under FeatureData > data in the peptides object. I saw https://github.com/statOmics/MSqRob/issues/32 and I realised I had stripped sumIntensity_ from the column names, but still adding it again did not fix it either.

What could be the problem? I can share my file if needed. Moreover, does it make sense to preprocess the files, or can actually the sample fractionation be considered an extra effect? In that case, would it be fixed or random? Run effects are considered random, so I guess fractions should be too. I don't know why though.

Thanks beforehand!

Best regards Antonio

ludgergoeminne commented 6 years ago

Hi Antonio

Thank you for your interest in MSqRob. The peptides object is an object of class MSnSet (from the MSnbase package). It has three slots: (1) exprs containing the peptide intensities, (2) pData, containing the experimental design and (3) fData containing all the other info.

The error you get says that at least one of the inputs for useful_properties is not in colnames(Biobase::fData(peptides)). Can you check if prot and peptide are in there? Maybe they are in there in a different name? If they are, then it might be a bug in MSqRob and it would be best to share your file or a part of it so I can reproduce the error.

Fractionated data is always more difficult to analyze because some peptides might be seen in multiple fractions. There are multiple ways to deal with this, but I never did an in-depth comparison of what works best, so I can only give a few suggestions:

Probably there are other, and maybe better ways to deal with fractions, but as I said, I never really dug very deeply into that.

I hope this already helps. Let me know if my suggestions worked and if you have any further problems.

Best regards

Ludger

antortjim commented 6 years ago

Dear Ludger

Thanks for your fast answer!

I fixed my issue by making sure that the columns indicating the peptide sequence and the protein group were called exactly as in the script (peptide and prot) AND by making sure that the remaining column names started all with sumIntensity_. That is the standard moFF format , however, I got rid of this prefix because I tried doing some post-processing of moFF output to aggregate the fractions. So now it looks like this:

peptide prot    sumIntensity_H1 sumIntensity_L1 sumIntensity_H2 sumIntensity_L2 sumIntensity_H3 sumIntensity_L3
AAAAAAAAAPAAAATAPTTAATTAATAAQ   H0YLW0, P37108  1573337.079219784       1994662.005153791       1603094.9510681275      1248147.6472639325      955686.467827684        2624950.4267368615
AAAAAAAAAVSR    A0A0A6YYC7, Q96JP5      97522.0954266896        39615.435497852 57855.730039537804      42243.945208599995      55982.8299799038        78311.825593076
AAAAAAALQAK     H3BM89, H3BU31, P36578  349789.54148773977      218380.9192745797       157244.3415556377       29060.7322931258        116076.67504581221      63499.19497407581
AAAAAADPNAAWAAYYSH      A0A087WTP3, M0R3J3, Q92945      0.0     16388.5031340625        0.0     1775.6499351178002 

I tried dealing with fractionation by implementing the Levenberg-Marquandt algorithm described in the MaxLFQ paper (http://www.mcponline.org/content/13/9/2513.long) in Python (with scipy.optimize.root). However I didn't have much success, as this is how the log2 fold change distribution looks for E. coli proteins (which should be around log2(3) = 1.58) and Homo sapiens, (which should be around log2(1)=0).

histogram_fraction_normalized

If I do what you propose, which is inputting the moFF file directly and considering fraction a random effect, I get a much better looking result, since the separation between the 2 organisms is much clearer.

histogram

However I still find many E. coli proteins with an estimate of 0. This could be due to not just an error in the quantification, but also missing identifications in the upstream analysis. Also there's a few Homo sapiens proteins with positive estimates, albeit they are really few. Finally, I think MSqRob's tendency to assign estimates around 0 is actually "cheating" when benchmarking the tool with this dataset since one of the target estimates (for HS) is 0.

So I would say the best way to go when dealing with fractionated data for now is considering it a random effect, as you said. Nevertheless, I think it is improvable since the results from the original MaxLFQ paper (Figure 3 I) show cleaner normal distributions.

Also please let me know if you have any ideas on how to improve the result, or if you need clarification on how the dataset is.

Thanks once again for your time

Cheers

Antonio

ludgergoeminne commented 6 years ago

Hi Antonio

I'm happy you were able to solve the issue.

The ridge regression in MSqRob, which is causing the tendency to assign estimates around 0, is there to avoid declaring proteins with very little evidence towards differential abundance wrongly as differentially abundant. Instead of "wildly" assigning an estimate to a protein with very little data, MSqRob recognizes that there is not enough information in the data and sets the estimate close to 0.

This is conceptually equivalent to putting a normally distributed "prior" around 0 on the data. You then actually assume, before you have seen the data, that the estimate must be around zero, and only if there is enough evidence for differential abundance in the data, will your estimate be further away from 0.

This will inevitably introduce a bit of bias in the data. E.g. if your true differential abundance is at 1.58, MSqRob might give on average an estimate of 1.5. However, the advantage of this estimate is that this estimate will be much less variable (to give an extreme example: if you would profile 1000 protein that have a true log fold change of 5, what would you prefer: an unbiased method that assigns them estimates between -995 and 1005, but has an average estimate of 5; or a biased method which would assign them estimates between 3.5 and 5.5, with an average of 4.5?) Of course you would then prefer the last one, because if you pick a single estimate, changes are much higher that it will be closer to the true value of 5 with the last method.

I agree that if you want to benchmark MSqRob on its FC estimates, you cannot use it for the Homo sapiens proteins, but you can still do it for the E. coli proteins. I personally find it very difficult to compare the plot you made here to the histograms in the paper, also because in the paper, replicate groups were filtered for 2 out of 3 valid values in the proteinGroups file. I also see no reason why the histogram of the estimates for all the proteins should follow a normal distribution. A normal distribution can be expected if you sample observations of a phenomenon where there is purely random noise. However, there are many processes that make the log fold change estimates non random: proteins interact with eachother and can be translated or degraded in response to up- or downregulation of other proteins, there is ionization competition between co-eluting peptides, and so on. So in my opinion the shape of the estimates' distribution doesn't say anything about the quality of the fold change estimates.

I think it might be clearer if you would redo the analysis with the same preprocessing protocol and make for example boxplots or violin plots of the estimates for each method and draw a horizontal line for the true fold change. If you then compare MSqRob to MaxLFQ, I would expect the E. coli estimates of MSqRob to be a bit below the true value on average, but much less variable. (although there might also be the impact of moFF; if you only want to assess MSqRob, you should use the peptides.txt file from the original author's dataset PXD000279)

Finally, I personally think the best way to benchmark MSqRob is to compare the rankings in terms of p-value. This is because the p-value is the measure of statistical significance on which the proteins should be ranked (i.e. the lower the p-value, the higher the probability that the protein is truly differentially abundant if you have no prior knowledge).

Cheers

Ludger

antortjim commented 6 years ago

Hi Ludger

Thousand thanks for such a comprehensive answer!

Now I clearly understand the meaning of the high abundance of 0 estimates. I agree that the distributions don't necessarily have to be normally distributed as you said. However, how could one correct for the E. coli proteins with estimates of 0? What do you mean that the replicate groups were filtered for 2 out of 3 valid values in the proteinGroups file? Do you mean that only protein groups with known intensities in at least 2 of the 3 replicates in each condition were considered?

I will do what you said over the weekend and share it here :+1: I agree that the p-value information should be taken into account.

I have a couple more questions regarding MSqRob but maybe I should open a new issue :) Thanks!

ghost commented 6 years ago

Hi Antonio

You're welcome!

I don't think it is necessary to correct for E coli. proteins with estimates of 0. The estimate of 0 just indicates that there is not enough information in the data to declare the protein differentially abundant. In frequentist statistics, one proposes a null hypothesis that is clearly defined (e.g. the log fold change is equal to zero). Then, we collect data in order to find evidence against the null hypothesis. If the evidence is very unlikely to be observed under the null hypothesis (e.g. if we would theoretically redo our experiment many times, then in less than 5% of our experiments we would find a result that is as extreme or more extreme than the one we just observed), we don't believe anymore that this initial assumption was true and we call our result statistically significant (at e.g. the 5% significance level).

However, when the result is not statistically significant, we cannot reject our null hypothesis. This is a much weaker conclusion because it can mean that the null hypothesis is indeed true, but it can also mean that the null hypothesis is not true, but we just didn't collect enough evidence against it (unless we e.g. did some statistical power calculations upfront, but these also rely on certain assumptions, e.g. regarding the minimal effect size).

A nice analogy would be that I would like to prove that there are gnomes living in my garden. My null hypothesis would be that there are 0 gnomes in my garden and my alternative hypothesis would be that there would be at least 1 gnome living in my garden. If I go out and try to collect gnomes, and I find a gnome, I will be very happy (hello Nobel Prize!). However, if I can't find any gnome, it could be that there are indeed none, but I could also have used a wrong way to look for them, they could have been on holiday, etc... This is why it is so difficult to prove that something does not exist.

Mostly, in science, we always try to disprove the null hypothesis, or in our case: we want to find proteins that are different from 0. Thus, I think it is not needed to correct for the zeros in the E. coli proteins. The zeros indicate that there is not enough evidence in the data to reject the null hypothesis for these proteins.

With "the replicate groups were filtered for 2 out of 3 valid values in the proteinGroups file" I indeed meant that the authors in the MaxLFQ paper only retained the peptides for which per group in 2 of the 3 replicates an intensity was observed, or at least, this is how I interpreted this sentence in their paper: "Replicate groups were filtered for two out of three valid values and averaged". This means that they very likely removed a large part of their dataset (as in my experience, the largest part of the dataset has more missing values than that). I didn't look into detail what kind of preprocessing they did exactly, but when you do a comparison, you should make sure to use the same preprocessing, so ideally you should try to reproduce their preprocessing pipeline and apply it to both methods or use a preprocessing pipeline on your own for both methods.

Of course you could also argue that the preprocessing is an inherent part of a method, but I think this is not a valid argument for this kind of comparison because you don't look at which proteins you find and which you don't, you look at how nicely the distributions are separated. And of course, the distributions will be much more separated if you only use proteins for which there is a lot of data.