ChristinaSchmidt1 opened 1 year ago
As it is now, we check each metabolite one at a time and report the percentage of metabolites that have "high" variance with respect to the arbitrary measure we applied (CV > 1). The problem is that the CV (relative variability) is a relative measure, and both the measure and the threshold are somewhat arbitrary.
What we want to do is to identify whether the blank samples are "the same" or not.
What we could do is pairwise t-tests or ANOVA to find out if any samples differ from the rest. But we can't really use a t-test on the samples, as the means are the same because the data are TIC normalized.
So, to check if the samples are the same or if any is different we can do the following:
This results in a dataframe with a p-value for each sample against the rest, for each metabolite.
This results in a value for each blank sample, representing the percentage of metabolites of this sample that are statistically different from the rest of the samples. But again we have the same problem as with the CV: where to put the threshold.
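A minimal sketch of this idea in Python (the package itself is in R; the data, the shift, and the exact per-sample test are illustrative assumptions on my part): for each metabolite, each blank sample's value is compared against the remaining samples with a one-sample t-test, and we then count, per sample, the percentage of metabolites with p < 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# toy matrix: 6 blank samples x 50 metabolites; sample 5 is deliberately shifted
data = rng.normal(1.0, 0.05, size=(6, 50))
data[5] += 0.5

pvals = np.empty_like(data)
for j in range(data.shape[1]):          # each metabolite
    for i in range(data.shape[0]):      # each sample against the rest
        rest = np.delete(data[:, j], i)
        # one-sample t-test: is the left-out value plausible under the rest?
        pvals[i, j] = stats.ttest_1samp(rest, popmean=data[i, j]).pvalue

# percentage of metabolites per sample that differ from the rest
pct_diff = (pvals < 0.05).mean(axis=1) * 100
print(pct_diff.round(1))
```

The shifted sample ends up with a far higher percentage than the unshifted ones, which is the pattern described for MS51-06 below.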
In this example I have 5 blank samples and 1 sample which is not. We can see that the % of metabolites that are different for MS51-06 is 80%, while for the rest it is ~10-20%. The final t-test shows the p-value of MS51-06 to be smaller than 0.05, while the rest are higher.
What do you think? Does this make sense?
Another thought is that we could again use Hotelling's T2 here.
About the "We can't really use a t-test on the samples as the means are the same because the data are TIC normalized":
I tested it again using the code in preprocessing and I do get the same row means for data_TIC (for each sample after TIC normalization).
I am not sure what went wrong before; I think in Excel they were different because Excel only has the geometric mean, which uses the product instead of the sum.
Ok, thanks for checking! In this case we should probably just calculate the standard error (SE) of the replicates for each metabolite. I am not sure if we could set a good threshold here (?)
After today's meeting we decided to do:
To add for 3.: With 2 or 3 we can do variance testing, to understand if there are features (metabolites) that appear to be really variable. If so, we can report those. Otherwise, I completely agree we cannot do any outlier testing (as in whether a media sample is an outlier).
There is an issue.
It can happen (and it does happen in the toy data) that, because of 1, we keep a metabolite that has NA values for all the blank samples. Then in 2 we get Inf and things don't work.
What do we do here? I think we should use the half minimum of the whole dataset to fill these NAs in the blank samples.
Just to be sure, blank samples as in the media control samples for the CoRe:
I agree, that's something we should change. The problem is that the user can of course choose not to do any feature filtering or MVI, for many reasons. I think here we should just treat the blank samples in the best way possible, irrespective of what the user chooses to do for the rest of the matrix. Meaning:
Would this make sense to you?
Yes, blank samples are the media control in CoRe. I haven't changed the name yet. The "blank" samples will be "Media_Control" in the Experimental_design, which will also be renamed to InputSettingsFile as stated in issue 47.
If I understood correctly:
If this is correct then yes I totally agree.
Yes this is exactly what I had in mind, thanks for writing the bullet points, thats much clearer than my text :)
This is done except for the "Outlier testing". ANOVA or pairwise tests of the samples don't work here because they look for a difference in the mean, and the data are normalized.
Yes, that's correct. Would it be an option to do something similar as with the pool samples? Maybe we can calculate the std. dev., as TIC normalises per sample, and now we would aim to check the across-sample variability for one feature.
We do the same checks as in the pool samples, meaning we check for high-variance metabolites among samples. But here an update I did there is still missing, which I plan to also add: in the pool samples I also added the quartile coefficient of variation (https://www.nature.com/articles/s41598-023-31711-8). I am reading some papers on this and will make an update.
Since with the above tests we measure which metabolites have high variance, I am thinking to somehow identify which samples are responsible for this and then make a test (I guess a categorical test) to conclude if these differences are enough to say that the samples are different.
Ok, that sounds sensible. Let me know what you decide once you have finished your reading :) And as with the other messages in the function, you can also add the paper into the message for this test.
This will be a kind of big post discussing 2 things: which methods to use to measure the dispersion of samples, and finding the "outlier samples" responsible for the high variation observed.
A) About the use of the standard error as a dispersion measure. Based on this paper (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3487226/) I would suggest not to use the standard error in this case, as it is not a direct measure of dispersion. The SEM quantifies the uncertainty in the estimate of the mean, i.e. how close to the real mean we are (the SD of the mean) given the sample mean we have. I think we could use the SEM if we did something like leave-one-out cross-validation (where we would leave out one sample at a time and remeasure the SEM) and look for a "big" drop in the SEM. This would mean that the sample that was left out was introducing the high uncertainty in finding the real mean and thus might be potentially problematic.
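A sketch of that leave-one-out idea with toy values (just to make the reasoning concrete; not the package's code): dropping the aberrant replicate produces a much larger drop in the SEM than dropping any other.

```python
import numpy as np

x = np.array([1.00, 1.01, 0.99, 1.02, 1.80])   # one aberrant blank replicate

def sem(v):
    """Standard error of the mean: sample sd / sqrt(n)."""
    return np.std(v, ddof=1) / np.sqrt(len(v))

full_sem = sem(x)
# drop in SEM when each sample is left out in turn
drops = [full_sem - sem(np.delete(x, i)) for i in range(len(x))]
print([round(d, 3) for d in drops])
```

The largest drop points at the aberrant replicate (index 4 here); for the well-behaved replicates the SEM barely changes or even increases.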
B) I was reading some papers (this one in particular: https://www.frontiersin.org/articles/10.3389/fams.2019.00043/full#B31) when I realized I was making a mistake. I was searching for a robust measure of dispersion across different distributions, so as to choose one measure when the data are normally distributed and another if not. The CV works best for normally distributed data, while other measures, using quartiles for example, are better for other distributions. The issue is: yes, there are measures better than the CV with respect to expressing the true dispersion of the data. But we are not trying to find the best approximation of the true/real dispersion. What we want to find is the possible outliers in a specific set of samples which are supposed to be the same (with very low variation between them). Also, the more robust measures are less sensitive to outliers, and we use the CV exactly because it is sensitive to outliers. Thus, I propose to use only the coefficient of variation as a measure of dispersion.
| Outliers / samples | Sample A | Sample B | Sample C | Sample D | Sum |
| --- | --- | --- | --- | --- | --- |
| Yes | 15 | 0 | 2 | 4 | 21 |
| No | 85 | 100 | 98 | 96 | 379 |
| Sum | 100 | 100 | 100 | 100 | 400 |
f. We perform Fisher's exact test to find if and which sample is an outlier.
(Instead of the z-score exclusion we could do leave-one-out cross-validation, but I think that goes too far.) To me this seems like a solid approach. What do you think about all this?
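As an illustration of step f, a hedged sketch using the contingency table above: one 2x2 Fisher's exact test per sample against the pooled remaining samples. Using a one-sided test (over-representation among high-CV calls) is my assumption here; the thread does not state the sidedness.

```python
from scipy.stats import fisher_exact

# high-CV counts per blank sample, taken from the table above (100 metabolites each)
high = {"A": 15, "B": 0, "C": 2, "D": 4}
n_metab = 100
total_high = sum(high.values())          # 21
rest_total = n_metab * (len(high) - 1)   # 300 metabolite calls in the other samples

pvals = {}
for sample, h in high.items():
    table = [[h, total_high - h],
             [n_metab - h, rest_total - (total_high - h)]]
    # one-sided: is this sample over-represented among the high-CV metabolites?
    pvals[sample] = fisher_exact(table, alternative="greater")[1]

print({s: round(p, 4) for s, p in pvals.items()})
```

With these counts only Sample A (15/100 high-CV vs. 6/300 in the rest) comes out significant.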
Thank you very much for the detailed summary, you really did a great job dissecting this issue.
A) This makes sense, especially in regard to the issue we discussed before: often there are no more than 3 blank samples (and it is even more likely that people only process 2). The option you describe, leaving out a sample and remeasuring, would make sense if we always had a higher number of blank samples, but given that this will most likely not be the case, I would not proceed down this route. Yet, this is a very valuable stream of discussion, which can be helpful for us to explain our choices.
B) If I remember correctly, the CV is the ratio between the standard deviation and the mean. So it is still connected to the mean, but this is fine, as we really want to find outliers (the mean is more sensitive to outliers than the median). So yes, we can go ahead and measure the CV.
Here I have some questions:
2.b.: You mean you would calculate the z-score for the metabolite across the blank samples (not all samples)? In the end the z-score just tells us how many standard deviations a data value is away from the mean, so I agree that we could use it to find the value that is furthest away (2c), yet we need at least 3 blank samples here.
2.e.: I would suggest we make an automated decision, meaning we remove the blank that is off in most of the metabolites. If the user doesn't want this, they can pass the mean blank sample and we won't do anything.
One point: as you would do this for each metabolite, if I understand correctly, what happens if for different metabolites different blanks are suggested to be outliers?
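For 2.b/2.c, a small sketch (toy values, not real data) of the z-score of one metabolite across the blank samples; with the sample SD in the denominator this indeed needs at least 3 blanks to be meaningful.

```python
import numpy as np

# intensities of one metabolite across 5 blank samples (toy values)
x = np.array([1.02, 0.98, 1.01, 0.99, 1.60])

# z-score: how many sample standard deviations each value is from the mean
z = (x - x.mean()) / x.std(ddof=1)
furthest = int(np.argmax(np.abs(z)))   # candidate value for step 2c
print(furthest, z.round(2))
```

Here the inflated fifth value is flagged as furthest from the mean.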
Yes, the CV is sd/mean. You can multiply that by 100 to get the relative standard deviation (RSD), which is the same as a percentage.
Do you have 5 min to meet on Zoom, to briefly discuss the points?
Ok, this is done! If exportQCplot == TRUE it returns the CoRe_media_contingency_table.
Here I added a sample as a blank to see if it is identified as different, and it seems to be working as expected.
Amazing, well done! We can check in the vignette for CoRe how this performs and add some of the thoughts/discussions of this issue into the explanation of the vignette.
Here is a 'short' summary of how we process the CoRe media samples. I am not sure what to put in the vignette from all of this, because it seems a lot, so I am putting it here to have it all in one place.
Preprocessing of CoRe-Media_samples
At least 1 Media_sample has to be present in the data. Otherwise we return an error.
For the CoRe we use a CoRe_norm_factor for normalisation, which has to be added as a column in the Input_SettingsFile. The CoRe_norm_factor must have the same length as the number of samples. If no CoRe_norm_factor is found, we give a warning and set it to 1 for all samples.
Media_samples are ignored in the feature filtering step, as we expect most metabolites not to be present there.
For MVI we take the Media_samples, calculate and report whether NAs were found in the Media_samples, and also, separately, the metabolites with NAs in more than 20% of the samples. If all values are NA, we set the values to 0. If at least 1 value was observed, we impute using the half minimum per feature. We do the above regardless of whether the user has selected MVI=TRUE/FALSE.
After TIC normalization we process the Media_samples.
Some discussion:
Media_samples: We do a CoRe experiment in order to find out which metabolites are absorbed by or excreted from the cells, and how much. When we run LC-MS and measure our samples, we actually measure the initial abundance of a metabolite in the environment plus the change in its abundance that occurred because of the cells. To find out what our cells have actually done, we must remove this initial abundance of the metabolites. To do this we must also measure the Media_samples, which are samples without any cells, and subtract their values from the samples.
CoRe_norm_factor: As a CoRe experiment is underway, the number of cells tends to change. This has an effect on the abundance of the absorbed/excreted metabolites. To account for that, we use the CoRe_norm_factor, a numerical value accounting for the growth rate that has been obtained from a growth curve, or a growth factor computed as the ratio of cell count/protein quantification at the start point to cell count/protein quantification at the end point.
Feature filtering: Media_samples are ignored in the feature filtering step. Cells synthesise and excrete metabolites that are different from the ones they need to survive (metabolites included in the growth media). So we expect that many metabolites would not be present in the Media_samples (their values would be NA), which would result in removing them. This would lead to losing valuable information.
MVI: We always impute missing values in the Media_samples, as we need values to subtract from the actual samples. For a metabolite that has at least 1 value, we impute using HM per feature, as we expect that the missing values are due to low abundance, but the metabolite is there, since it was picked up by one Media_sample. If all values are NA, we assume the metabolite is not present, so we replace the NAs with 0.
CV: The accurate measurement of the initial metabolite abundance is crucial, as it affects the whole experiment. Media samples, since they consist of only the growth media, are expected to be identical. This translates into the variation of a metabolite being very small. To check if this is indeed the case, we measure the coefficient of variation for each metabolite. A metabolite with almost identical values will have a very small CV, while a metabolite with very different abundances will have a high CV. Also, because of the nature of LC-MS, it can happen that there are outlier values, and the CV is more sensitive to outliers than other, more robust variation measures. This enables us to identify if a metabolite is unstable and which sample is responsible for it, and to remove it prior to the Media_sample mean calculation.
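The MVI rule for the Media_samples can be sketched as follows (a hypothetical helper written in Python for illustration; the package's R implementation may differ):

```python
import numpy as np

def impute_media(values):
    """Half-minimum imputation per feature for media/blank samples:
    all-NA features become 0, otherwise NAs get half the observed minimum."""
    v = np.asarray(values, dtype=float)
    if np.all(np.isnan(v)):
        # metabolite assumed absent from the media: treat as true zero
        return np.zeros_like(v)
    hm = np.nanmin(v) / 2.0          # half minimum of the observed values
    return np.where(np.isnan(v), hm, v)

print(impute_media([np.nan, 4.0, 6.0]))   # NA replaced by half minimum
print(impute_media([np.nan, np.nan]))     # all NA replaced by 0
```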
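A minimal sketch of the CV check with made-up numbers (the CV > 1 threshold is the one discussed in this thread):

```python
import numpy as np

def cv(values):
    """Coefficient of variation (sample sd / mean) of one metabolite across blanks."""
    v = np.asarray(values, dtype=float)
    return np.std(v, ddof=1) / np.mean(v)

stable   = [1.00, 1.02, 0.98, 1.01]   # near-identical media replicates
unstable = [1.00, 1.02, 0.98, 6.00]   # one outlier value inflates the CV
print(round(cv(stable), 3), round(cv(unstable), 3))
```

The single outlier value pushes the CV above the threshold of 1, which is exactly the sensitivity argued for above.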
There is 1 issue which I would like to discuss, because I am not sure. Maybe you can discuss it with the people there, because it's important for the correctness of the outlier procedure. It's about Fisher's test.
We use Fisher's exact test as we have 2x2 categorical data, generally a low number of samples, and we expect very sparse data, so Fisher's exact test seems appropriate. However, Fisher's test does not work if we have structural zeros (see https://www.quality-control-plan.com/StatGuide/conting_anal_ass_viol.htm#Structural%20zeroes). In that guide they give as an example of structural zeros males with ovarian cancer, which is something that cannot happen, and Fisher's test cannot handle these cases. In our case we expect that all the metabolites will have a low CV, thus a zero in high_var, but it is not "mandatory" to have a zero, as it can happen that they do not.
My question is: are our zeros structural zeros or not?
If they are not, all is fine. If they are, then we cannot use Fisher's exact test and we have to find another way to find the "outlier samples". I thought of another possible way, which is to do something like GSEA: we start at zero, check for each sample each metabolite one at a time, give +1 for each high_var or -1 for each low_var, and then use a Kolmogorov-Smirnov test. But first we have to find out if we can use Fisher's test or not.
Thanks for the great summary, I really like it. I can move some of the points into the vignette this week, as I will try to finish this up.
In your comment Some discussion:
For the MVI point, this is correct, and we can also add some biological explanation. Indeed, depending on the culture media used and the concentration of supplemented FBS, some metabolites will not be present or will have a low concentration that misses the detection limit. This is why we will treat missing values as true zeros, whilst in the case that we detected a metabolite in one of the media blank samples, we assume it is close to the detection limit and hence perform the missing value imputation (missing not at random).
About the last comment on Fisher's exact test: I will discuss this today and add my response afterwards. One question here: you refer to the outlier detection of the media blank control?!
Great, thanks! Yeah, about Fisher's exact test, I indeed mean the media control outlier detection!
After today's meeting we should proceed with the following:
I had a discussion on this with Aurelien and there is some feedback here:
Ah nice! When you say blank, you mean the media_control samples, right?
Because then I would also add the media samples in a PCA along with all the other samples in one QC PCA. I say this because I have tested this with the pool samples, and it only makes sense using all the samples. I think the same goes for the media samples, even though I would expect (not totally sure here) that they would seem like an outlier group compared to the samples. Which means that they might drive the separation in one axis, which would mean that the interesting variation might be hidden because we added the media_samples in the PCA, which makes me think we might indeed have to add another PCA plot. Let me check how things look and I'll come back to it.
Yes, for the CV values we print them, but I agree we can also export them in a csv or excel file.
Hmm, ok. Again, if we plot only the media_samples in a PCA, then we will see a separation, since PCA will try to find the directions with the most variance. So we will see some kind of separation even though the axes will have values of 0.1. If we plot the media along with the actual samples, then we will almost always get a good clustering (since we expect the media samples to be a lot different from the samples), except when something went very wrong in a media_sample, but we would also pick this up from the CVs. So I wouldn't go with PCA here. Also because that would indeed be difficult to automate, and I think we would go back to Hotelling's, which again I don't think would be able to pick up such small differences, except if we used a confidence of 0.9999.
Yes, exactly, the media control samples :) And yes, the PCA plot with the added media_control samples (= blank samples): we can name it QC blank samples and colour-code the blank samples in red and all the other samples in black (or something like this). Also yes, the media samples usually seem like an outlier group, whilst the pool samples would be expected to be in the middle of all samples (as they are the pool of all samples).
About your last point: yes, I agree Hotelling's is not a solution, and the PCA in itself at least gives us a visual representation in case something went really wrong, so it is a good control in any case. For the automation we still need a solution, though.
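A sketch of the proposed QC PCA with the media_control samples included (simulated intensities with made-up group means; the package would do this in R, and this numpy-SVD PCA is just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(10, 1, size=(12, 40))     # 12 experimental samples
media = rng.normal(2, 0.1, size=(3, 40))       # 3 media_control blanks
X = np.vstack([samples, media])
labels = ["Sample"] * 12 + ["CoRe_media"] * 3  # for colour-coding in the plot

# PCA via SVD on the mean-centred matrix
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U[:, :2] * S[:2]                      # PC1/PC2 scores

pc1_samples, pc1_media = scores[:12, 0], scores[12:, 0]
print(pc1_samples.round(1), pc1_media.round(1))
```

As discussed above, the media blanks form a clearly separated group along PC1, which is why this plot is a visual control rather than an automated test.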
One more thing: shall we report the CV in percent? (Or is this already converted? Sorry, haven't checked.)
No, the CV is a dataframe with metabolites as rows and 2 columns: the numerical CVs and High_var = TRUE/FALSE depending on the threshold (which is 1). We print how many metabolites were found with a high CV, but a percentage here would be better.
Ok, so maybe we can multiply the numerical CVs by 100 and export CV [%] instead.
Ok, but then we will export 1 value. Wouldn't it also make sense to export at least the table of the highly variable metabolites, or their names, so that the user can do something with it? If it's just the percentage of high-CV metabolites, meaning this one value, I think printing it in a message would be more appropriate than saving it in a file. On the other hand, saving it keeps it there forever, but there is still not much you can do with it.
I think we are talking about different things. I mean that if we have a CV of 0.1, this is 10% of variation, but I think I mixed this up with the SD. So never mind.
So here we had to do 5 things.
1) Test outliers with different scenarios:
Make some tests using different contingency tables, i.e. 00006, 11116, 121326. Each number is the number of high-variance metabolites for each of the 5 samples measured:
- 00004 with 70 metabolites in total, so only 1 sample with ~5%: this is found as an outlier
- 00001 with 70 metabolites in total: no outlier
- 00002 with 70 metabolites in total: sample 5 is found as an outlier
- 11115: sample 5 is found as an outlier
- 11114: no outliers
- 22225: no outliers
- 22226: sample 5
- 32036: sample 5 is an outlier
- 32035: no outliers
- 32136: no outliers
- 32137: sample 5
To my eyes it works as I would expect it to work.
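Several of the scenarios above can be reproduced with a sketch like the following. Running one one-sided 2x2 Fisher's exact test per sample (each sample's high/low-CV counts against the pooled remaining samples) is my assumed reading of the procedure, since the thread does not show the exact code; with that reading, 00002 and 11115 flag sample 5 while 00001 and 11114 do not.

```python
from scipy.stats import fisher_exact

def outlier_p(high_counts, n_metab):
    """One-sided Fisher's exact test per sample: is this sample
    over-represented among the high-CV metabolite calls?"""
    total_high = sum(high_counts)
    rest_total = (len(high_counts) - 1) * n_metab
    ps = []
    for h in high_counts:
        table = [[h, total_high - h],
                 [n_metab - h, rest_total - (total_high - h)]]
        ps.append(fisher_exact(table, alternative="greater")[1])
    return ps

for counts in ([0, 0, 0, 0, 2], [0, 0, 0, 0, 1], [1, 1, 1, 1, 5], [1, 1, 1, 1, 4]):
    ps = outlier_p(counts, 70)
    print(counts, [round(p, 3) for p in ps])
```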
So, about the relationship of MVI/NAs and CV: I think the CV increases with more NAs, but at some point it stabilizes. This is because: from 0 to more NAs the CV increases, as the imputed values have a high sd. However, at some point the overall mean moves closer to the imputed values and they contribute a lower sd. Then, having more NAs should decrease the CV, as the mean moves closer to the imputed values and thus the overall sd should decrease. However, as this happens, the overall mean also decreases, which would make the CV increase. These 2 effects would make the CV "stabilize".
I guess there could be a perfect analogy of missing values and recorded values around 50-50, which would cause the mean to be in the middle of both, making the sd high while also decreasing the mean, and thus making the CV high. But I am not sure. I think I could try to actually model what would happen, but it might take a while.
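A quick way to model this (a simulation sketch with made-up parameters, not the actual benchmark code): draw one well-behaved metabolite, knock out increasing fractions of values, half-minimum impute, and watch the CV.

```python
import numpy as np

rng = np.random.default_rng(1)

def cv_after_hm(v):
    """Half-minimum impute the NAs, then return sample sd / mean."""
    v = v.copy()
    v[np.isnan(v)] = np.nanmin(v) / 2.0
    return np.std(v, ddof=1) / np.mean(v)

base = rng.normal(100, 5, size=1000)   # one metabolite, many replicates
cvs = []
for frac in (0.0, 0.2, 0.4, 0.6):
    v = base.copy()
    idx = rng.choice(v.size, int(frac * v.size), replace=False)
    v[idx] = np.nan
    cvs.append(cv_after_hm(v))
    print(f"{int(frac * 100):2d}% NA -> CV = {cvs[-1]:.2f}")
```

In this setup the CV grows with the NA fraction but flattens out well below 1, consistent with the benchmark observation further down in the thread.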
2) Make the CV loop stop when we reach a certain number of samples (3)
This cannot be done, as it is not a loop. If we have more than 3 samples, we do the CVs, then Fisher's test, and then we remove the samples found as outliers from the media_samples.
3) Check: should we remove the samples responsible for a high CV from the mean calculation when they are not outliers? So if metab1 has a high CV and sample A is the furthest from the mean, but sample A is not an outlier, should the value of sample A for metab1 be excluded from the mean calculation?
This, as I see it, is more of a theoretical argument, and I went a bit more general. Our goal here is to be as precise as possible and as close as possible to the 'real' mean, which we don't know. So we measure a number of samples and go for reproducibility/consistency of the values. I would say that we completely remove a sample when it is found as an outlier by Fisher's test (due to the high number of high CVs caused by this sample), because we have a reason to do it. For example, one could argue that even though this sample is found to have only 5 or 10 different values for metabolites compared to the rest, there might still be extremities in other metabolites of smaller magnitude that we are not able to detect. For example, the metabolite abundances of this sample might always be higher than in the other samples, yet the total CV might still be <1, so all the means might be a little higher. My point here is that if Fisher's test found this sample as not related to the rest, we have enough evidence to remove it as not being consistent with the rest of the samples.
Now, about removing a specific value from the mean calculation because of a high CV: again I would vote for removing it. (Here I am not 100% sure of the first step in my reasoning.) Because of the nature of metabolomics, we can have random effects that might change the value of a metabolite, thus also resulting in a different abundance when compared to the group, and thus a high CV. If Fisher's test did not find the sample to be an outlier, then the sample is considered fine and this difference in abundance is considered a random effect. So we remove this value from the mean calculation but keep the sample.
4) Add a QC plot specific for the blank samples, meaning adding them onto the PCA to see if they cluster together. So we could do another QC PCA plot with the blank samples (of course only if there is more than 1 sample). A QC_Media_samples plot is exported into the QC folder and also added to the QC_plots list, colored by Sample_type = 'CoRe_media' or 'Sample'.
5) Check about Fisher's exact test and structural zeros. Based on this (how they showcase their structural zeros), I don't think our zeros are structural, so Fisher's test should be fine.
Thanks for doing the tests (point 1). This tells us some important things:
Towards point 3:
Towards point 4: Yes thats perfect :)
Towards point 2 and 5: Ok. I also think the structural zeros are not happening for us from how I understand the explanation, yet I can not be sure.
So, action points: maybe now you can add the removal of a whole sample if it causes a high CV for a certain percentage of metabolites, and the extra PCA QC plot. For the second point of 3 (blank sample is an outlier in only one metabolite and hence removed only in this one and not completely), we need to check if we really can just remove the sample from one metabolite and not the whole sample. (If this is something we are allowed to do, you could add the removal with a TRUE/FALSE parameter.)
Let me know what you think about this. Also, is there anything from the testing you have done that you think we should add into the manuscript methods section, like a benchmarking?
Ok, based on your comment, I would say we do: MVI, PCA, CV, contingency tables, Fisher's, and then removing the samples identified as outliers by Fisher's, and just that.
The reasoning is that we remove the samples that were found as outliers by Fisher's test on the tables of CVs. With this we remove the samples that were found to have a disproportional percentage of "being responsible for a metabolite's high CV" when compared to the rest of the Media_control samples, and we have evidence for it from Fisher's test. We don't do anything else, as it could fall into the category of "overfitting".
Also, about some benchmarking: I could maybe plot the function of MVIs and resulting CVs.
As for point 3b: we could add a parameter called Remove_Media_test = "fishers" or "CV". "fishers" would be the default, with the above reasoning, and "CV" would remove all sample values causing a high CV, with the reasoning of consistency against random effects; but indeed this needs a lot more investigation to see if it is a substantial argument. For now I would say leave this for later.
Sounds good to me. So I would suggest you do the points above: "MVI, PCA, CV, contingency tables, Fisher's, and then removing the samples identified as outliers by Fisher's and just that".
For the benchmarking, we could start a benchmarking .rmd file in the vignette folder. You can move your tests regarding the pre-processing in there and we can extend this with other benchmarking-like things we did or do for other functions.
For the last point (3b): I agree, let's leave this for now, and once the above is done, we can rename the issue and just add this as the remaining point for the future.
So the "MVI, PCA, CV, contingency tables, Fisher's" part is done. I am adding the test of NA and CV in a Benchmark.Rmd and will put it in the vignette.
As an upgrade: we should decide whether to remove single metabolite values that cause a high CV when the sample responsible is not identified as an outlier by Fisher's test.
Thanks!
To do's:
Something I noticed here by looking at this plot (the benchmark plot for the effect of the NA percentage on the CV): on the x-axis we see the different percentages of added NAs, and on the y-axis the resulting CV. Here we see that the max NA percentage is 50-60%, and for 60% the max CV is ~90% (i.e. 0.9). Assuming that our data are good, I would say that CV = 1 as a cut-off seems good, because we see that the effect of NAs on the CV isn't enough to make a metabolite have a high CV. So for a metabolite to have CV > 1, something else besides having NAs has to have happened.
We measure the mean and sd of the blank samples and return a message with the percentage of highly variable metabolites in the blank samples.
I am not sure what would be a decent cut-off for the SD to take action, meaning giving an error message asking the user to check certain blanks versus a warning message. You are right that there really should not be any variance.
You said: "for now I used the coefficient of variation CV>1 to give a warning of high variance based on this. I will come back to this."