ChristinaSchmidt1 opened 1 year ago
As it is now, we check each metabolite one at a time and report the percentage of metabolites that have "high" variance with respect to the arbitrary measure we applied (CV > 1). The problem is that the CV (relative variability) is a relative measure, and both the measure and the threshold are somewhat arbitrary.
What we want to do is to identify whether the blank samples are "the same" or not.
What we could do is pairwise t-tests or ANOVA to find out if any samples differ from the rest. But we can't really use a t-test on the samples, as the means are the same because the data are TIC normalized.
So, to check if the samples are the same or if any is different we can do the following:
This results in a dataframe with a p-value for each sample against the rest, for each metabolite.
This results in a value for each blank sample, representing the percentage of metabolites of this sample that are statistically different from the rest of the samples. But again we have the same problem as with the CV: where to put the threshold.
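A minimal sketch of this idea in Python (the package itself is in R; the data, the shift, and the exact per-sample test are illustrative assumptions on my part): for each metabolite, each blank sample's value is compared against the remaining samples with a one-sample t-test, and we then count, per sample, the percentage of metabolites with p < 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# toy matrix: 6 blank samples x 50 metabolites; sample 5 is deliberately shifted
data = rng.normal(1.0, 0.05, size=(6, 50))
data[5] += 0.5

pvals = np.empty_like(data)
for j in range(data.shape[1]):          # each metabolite
    for i in range(data.shape[0]):      # each sample against the rest
        rest = np.delete(data[:, j], i)
        # one-sample t-test: is the left-out value plausible under the rest?
        pvals[i, j] = stats.ttest_1samp(rest, popmean=data[i, j]).pvalue

# percentage of metabolites per sample that differ from the rest
pct_diff = (pvals < 0.05).mean(axis=1) * 100
print(pct_diff.round(1))
```

The shifted sample ends up with a far higher percentage than the unshifted ones, which is the pattern described for MS51-06 below.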
In this example I have 5 blank samples and 1 sample which is not. We can see that the % of metabolites that are different for MS51-06 is 80%, while for the rest it is ~10-20%. The final t-test shows the p-value of MS51-06 to be smaller than 0.05, while the rest are higher.
What do you think? Does this make sense?
Another thought is that we could again use Hotelling's T2 here.
About the "We can't really use a t-test on the samples as the means are the same because the data are TIC normalized":
I tested it again using the code in preprocessing and I do get the same row means for data_TIC (for each sample after TIC normalization).
I am not sure what went wrong before; I think in Excel they were different because Excel only has the geometric mean, which uses the product instead of the sum.
Ok, thanks for checking! In this case we should probably just calculate the standard error (SE) of the replicates for each metabolite. I am not sure if we could set a good threshold here (?)
After today's meeting we decided to do:
To add for 3.: With 2 or 3 we can do variance testing, to understand if there are features (metabolites) that appear to be really variable. If so, we can report those. Otherwise, I completely agree we cannot do any outlier testing (as in whether a media sample is an outlier).
There is an issue.
It can happen (and it does happen in the toy data) that, because of 1, we keep a metabolite that has NA values for all the blank samples. Then in 2 we get Inf and things don't work.
What do we do here? I think we should use the half minimum of the whole dataset to fill these NAs in the blank samples.
Just to be sure, blank samples as in the media control samples for the CoRe:
I agree, that's something we should change. The problem is that the user can of course choose not to do any feature filtering or MVI, for many reasons. I think here we should just treat the blank samples in the best way possible, irrespective of what the user chooses to do for the rest of the matrix. Meaning:
Would this make sense to you?
Yes, blank samples are the media control in CoRe. I haven't changed the name yet. The "blank" samples will be "Media_Control" in the Experimental_design, which will also be renamed to InputSettingsFile as stated in issue 47.
If I understood correctly:
If this is correct then yes I totally agree.
Yes this is exactly what I had in mind, thanks for writing the bullet points, thats much clearer than my text :)
This is done except for the "Outlier testing". ANOVA or pairwise tests of the samples don't work here because they look for a difference in the mean, and the data are normalized.
Yes, that's correct. Would it be an option to do something similar as with the pool samples? Maybe we can calculate the std. dev., as TIC normalises per sample, and now we would aim to check the across-sample variability for one feature.
We do the same checks as in the pool samples, meaning we check for high-variance metabolites among samples. But here an update I did there is still missing, which I plan to also add: in the pool samples I also added the quartile coefficient of variation (https://www.nature.com/articles/s41598-023-31711-8). I am reading some papers on this and will make an update.
Since with the above tests we measure which metabolites have high variance, I am thinking to somehow identify which samples are responsible for this and then make a test (I guess a categorical test) to conclude if these differences are enough to say that the samples are different.
Ok, that sounds sensible. Let me know what you decide once you have finished your reading :) And as with the other messages in the function, you can also add the paper into the message for this test.
This will be a kind of big post discussing 2 things: which methods to use to measure the dispersion of samples, and finding the "outlier samples" responsible for the high variation observed.
A) About the use of the standard error as a dispersion measure. Based on this paper (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3487226/) I would suggest not to use the standard error in this case, as it is not a direct measure of dispersion. The SEM quantifies the uncertainty in the estimate of the mean, i.e. how close to the real mean we are (the SD of the mean) given the sample mean we have. I think we could use the SEM if we did something like leave-one-out cross-validation (where we would leave out one sample at a time and remeasure the SEM) and look for a "big" drop in the SEM. This would mean that the sample that was left out was introducing the high uncertainty in finding the real mean and thus might be potentially problematic.
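A sketch of that leave-one-out idea with toy values (just to make the reasoning concrete; not the package's code): dropping the aberrant replicate produces a much larger drop in the SEM than dropping any other.

```python
import numpy as np

x = np.array([1.00, 1.01, 0.99, 1.02, 1.80])   # one aberrant blank replicate

def sem(v):
    """Standard error of the mean: sample sd / sqrt(n)."""
    return np.std(v, ddof=1) / np.sqrt(len(v))

full_sem = sem(x)
# drop in SEM when each sample is left out in turn
drops = [full_sem - sem(np.delete(x, i)) for i in range(len(x))]
print([round(d, 3) for d in drops])
```

The largest drop points at the aberrant replicate (index 4 here); for the well-behaved replicates the SEM barely changes or even increases.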
B) I was reading some papers (this one in particular: https://www.frontiersin.org/articles/10.3389/fams.2019.00043/full#B31) when I realized I was making a mistake. I was searching for a robust measure of dispersion across different distributions, so as to choose one measure when the data are normally distributed and another if not. The CV works best for normally distributed data, while other measures, using quartiles for example, are better for other distributions. The issue is: yes, there are measures better than the CV with respect to expressing the true dispersion of the data. But we are not trying to find the best approximation of the true/real dispersion. What we want to find is the possible outliers in a specific set of samples which are supposed to be the same (with very low variation between them). Also, the more robust measures are less sensitive to outliers, and we use the CV exactly because it is sensitive to outliers. Thus, I propose to use only the coefficient of variation as a measure of dispersion.
| Outliers / samples | Sample A | Sample B | Sample C | Sample D | Sum |
| --- | --- | --- | --- | --- | --- |
| Yes | 15 | 0 | 2 | 4 | 21 |
| No | 85 | 100 | 98 | 96 | 379 |
| Sum | 100 | 100 | 100 | 100 | 400 |
f. We perform Fisher's exact test to find if and which sample is an outlier.
(Instead of the z-score exclusion we could do leave-one-out cross-validation, but I think that goes too far.) To me this seems like a solid approach. What do you think about all this?
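As an illustration of step f, a hedged sketch using the contingency table above: one 2x2 Fisher's exact test per sample against the pooled remaining samples. Using a one-sided test (over-representation among high-CV calls) is my assumption here; the thread does not state the sidedness.

```python
from scipy.stats import fisher_exact

# high-CV counts per blank sample, taken from the table above (100 metabolites each)
high = {"A": 15, "B": 0, "C": 2, "D": 4}
n_metab = 100
total_high = sum(high.values())          # 21
rest_total = n_metab * (len(high) - 1)   # 300 metabolite calls in the other samples

pvals = {}
for sample, h in high.items():
    table = [[h, total_high - h],
             [n_metab - h, rest_total - (total_high - h)]]
    # one-sided: is this sample over-represented among the high-CV metabolites?
    pvals[sample] = fisher_exact(table, alternative="greater")[1]

print({s: round(p, 4) for s, p in pvals.items()})
```

With these counts only Sample A (15/100 high-CV vs. 6/300 in the rest) comes out significant.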
Thank you very much for the detailed summary, you really did a great job dissecting this issue.
A) This makes sense, especially in regard to the issue we discussed before: often there are no more than 3 blank samples (and it is even more likely that people only process 2). The option you describe, leaving out a sample and remeasuring, would make sense if we always had a higher number of blank samples, but given that this will most likely not be the case, I would not proceed down this route. Yet, this is a very valuable stream of discussion, which can be helpful for us to explain our choices.
B) If I remember correctly, the CV is the ratio between the standard deviation and the mean. So it is still connected to the mean, but this is fine, as we really want to find outliers (the mean is more sensitive to outliers than the median). So yes, we can go ahead and measure the CV.
Here I have some questions:
2.b.: You mean you would calculate the z-score for the metabolite across the blank samples (not all samples)? In the end the z-score just tells us how many standard deviations a data value is away from the mean, so I agree that we could use it to find the value that is furthest away (2c), yet we need at least 3 blank samples here.
2.e.: I would suggest we make an automated decision, meaning we remove the blank that is off in most of the metabolites. If the user doesn't want this, they can pass the mean blank sample and we won't do anything.
One point: as you would do this for each metabolite, if I understand correctly, what happens if for different metabolites different blanks are suggested to be outliers?
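For 2.b/2.c, a small sketch (toy values, not real data) of the z-score of one metabolite across the blank samples; with the sample SD in the denominator this indeed needs at least 3 blanks to be meaningful.

```python
import numpy as np

# intensities of one metabolite across 5 blank samples (toy values)
x = np.array([1.02, 0.98, 1.01, 0.99, 1.60])

# z-score: how many sample standard deviations each value is from the mean
z = (x - x.mean()) / x.std(ddof=1)
furthest = int(np.argmax(np.abs(z)))   # candidate value for step 2c
print(furthest, z.round(2))
```

Here the inflated fifth value is flagged as furthest from the mean.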
Yes, the CV is sd/mean. You can multiply that by 100 to get the relative standard deviation (RSD), which is the same as a percentage.
Do you have 5 min to meet on Zoom, to briefly discuss the points?
Ok, this is done! If exportQCplot == TRUE it returns the CoRe_media_contingency_table.
Here I added a sample as a blank to see if it is identified as different, and it seems to be working as expected.
Amazing, well done! We can check in the vignette for CoRe how this performs and add some of the thoughts/discussions of this issue into the explanation of the vignette.
Here is a 'short' summary of how we process the CoRe media samples. I am not sure what to put in the vignette from all of this, because it seems a lot, so I am putting it here to have it all in one place.
Preprocessing of CoRe-Media_samples
At least 1 Media_sample has to be present in the data. Otherwise we return an error.
For the CoRe we use a CoRe_norm_factor for normalisation, which has to be added as a column in the Input_SettingsFile. The CoRe_norm_factor must have the same length as the number of samples. If no CoRe_norm_factor is found, we give a warning and set it to 1 for all samples.
Media_samples are ignored in the feature filtering step, as we expect most metabolites not to be present there.
For MVI we take the Media_samples, calculate and report whether NAs were found in the Media_samples, and also, separately, the metabolites with NAs in more than 20% of the samples. If all values are NA, we set the values to 0. If at least 1 value was observed, we impute using the half minimum per feature. We do the above regardless of whether the user has selected MVI=TRUE/FALSE.
After TIC normalization we process the Media_samples.
Some discussion:
Media_samples: We do a CoRe experiment in order to find out which metabolites are absorbed by or excreted from the cells, and how much. When we run LC-MS and measure our samples, we actually measure the initial abundance of a metabolite in the environment plus the change in its abundance that occurred because of the cells. To find out what our cells have actually done, we must remove this initial abundance of the metabolites. To do this we must also measure the Media_samples, which are samples without any cells, and subtract their values from the samples.
CoRe_norm_factor: As a CoRe experiment is underway, the number of cells tends to change. This has an effect on the abundance of the absorbed/excreted metabolites. To account for that, we use the CoRe_norm_factor, a numerical value accounting for the growth rate that has been obtained from a growth curve, or a growth factor computed as the ratio of cell count/protein quantification at the start point to cell count/protein quantification at the end point.
Feature filtering: Media_samples are ignored in the feature filtering step. Cells synthesise and excrete metabolites that are different from the ones they need to survive (metabolites included in the growth media). So we expect that many metabolites would not be present in the Media_samples (their values would be NA), which would result in removing them. This would lead to losing valuable information.
MVI: We always impute missing values in the Media_samples, as we need values to subtract from the actual samples. For a metabolite that has at least 1 value, we impute using HM per feature, as we expect that the missing values are due to low abundance, but the metabolite is there, since it was picked up by one Media_sample. If all values are NA, we assume the metabolite is not present, so we replace the NAs with 0.
CV: The accurate measurement of the initial metabolite abundance is crucial, as it affects the whole experiment. Media samples, since they consist of only the growth media, are expected to be identical. This translates into the variation of a metabolite being very small. To check if this is indeed the case, we measure the coefficient of variation for each metabolite. A metabolite with almost identical values will have a very small CV, while a metabolite with very different abundances will have a high CV. Also, because of the nature of LC-MS, it can happen that there are outlier values, and the CV is more sensitive to outliers than other, more robust variation measures. This enables us to identify if a metabolite is unstable and which sample is responsible for it, and to remove it prior to the Media_sample mean calculation.
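The MVI rule for the Media_samples can be sketched as follows (a hypothetical helper written in Python for illustration; the package's R implementation may differ):

```python
import numpy as np

def impute_media(values):
    """Half-minimum imputation per feature for media/blank samples:
    all-NA features become 0, otherwise NAs get half the observed minimum."""
    v = np.asarray(values, dtype=float)
    if np.all(np.isnan(v)):
        # metabolite assumed absent from the media: treat as true zero
        return np.zeros_like(v)
    hm = np.nanmin(v) / 2.0          # half minimum of the observed values
    return np.where(np.isnan(v), hm, v)

print(impute_media([np.nan, 4.0, 6.0]))   # NA replaced by half minimum
print(impute_media([np.nan, np.nan]))     # all NA replaced by 0
```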
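A minimal sketch of the CV check with made-up numbers (the CV > 1 threshold is the one discussed in this thread):

```python
import numpy as np

def cv(values):
    """Coefficient of variation (sample sd / mean) of one metabolite across blanks."""
    v = np.asarray(values, dtype=float)
    return np.std(v, ddof=1) / np.mean(v)

stable   = [1.00, 1.02, 0.98, 1.01]   # near-identical media replicates
unstable = [1.00, 1.02, 0.98, 6.00]   # one outlier value inflates the CV
print(round(cv(stable), 3), round(cv(unstable), 3))
```

The single outlier value pushes the CV above the threshold of 1, which is exactly the sensitivity argued for above.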
There is 1 issue which I would like to discuss, because I am not sure. Maybe you can discuss it with the people there, because it's important for the correctness of the outlier procedure. It's about Fisher's test.
We use Fisher's exact test as we have 2x2 categorical data, generally a low number of samples, and we expect very sparse data, so Fisher's exact test seems appropriate. However, Fisher's test does not work if we have structural zeros (see https://www.quality-control-plan.com/StatGuide/conting_anal_ass_viol.htm#Structural%20zeroes). In that guide they give as an example of structural zeros males with ovarian cancer, which is something that cannot happen, and Fisher's test cannot handle these cases. In our case we expect that all the metabolites will have a low CV, thus a zero in high_var, but it is not "mandatory" to have a zero, as it can happen that they do not.
My question is: are our zeros structural zeros or not?
If they are not, all is fine. If they are, then we cannot use Fisher's exact test and we have to find another way to find the "outlier samples". I thought of another possible way, which is to do something like GSEA: we start at zero, check for each sample each metabolite one at a time, give +1 for each high_var or -1 for each low_var, and then use a Kolmogorov-Smirnov test. But first we have to find out if we can use Fisher's test or not.
Thanks for the great summary, I really like it. I can move some of the points into the vignette this week, as I will try to finish this up.
In your comment Some discussion:
For the MVI point, this is correct, and we can also add some biological explanation. Indeed, depending on the culture media used and the concentration of supplemented FBS, some metabolites will not be present or will have a low concentration that misses the detection limit. This is why we will treat missing values as true zeros, whilst in the case that we detected a metabolite in one of the media blank samples, we assume it is close to the detection limit and hence perform the missing value imputation (missing not at random).
About the last comment on Fisher's exact test: I will discuss this today and add my response afterwards. One question here: you refer to the outlier detection of the media blank control?!
Great, thanks! Yeah, about Fisher's exact test, I indeed mean the media control outlier detection!
After today's meeting we should proceed with the following:
I had a discussion on this with Aurelien and there is some feedback here:
Ah nice! When you say blank, you mean the media_control samples, right?
Because then I would also add the media samples in a PCA along with all the other samples in one QC PCA. I say this because I have tested this with the pool samples, and it only makes sense using all the samples. I think the same goes for the media samples, even though I would expect (not totally sure here) that they would seem like an outlier group compared to the samples. Which means that they might drive the separation in one axis, which would mean that the interesting variation might be hidden because we added the media_samples in the PCA, which makes me think we might indeed have to add another PCA plot. Let me check how things look and I'll come back to it.
Yes, for the CV values we print them, but I agree we can also export them in a csv or excel file.
Hmm, ok. Again, if we plot only the media_samples in a PCA, then we will see a separation, since PCA will try to find the directions with the most variance. So we will see some kind of separation even though the axes will have values of 0.1. If we plot the media along with the actual samples, then we will almost always get a good clustering (since we expect the media samples to be a lot different from the samples), except when something went very wrong in a media_sample, but we would also pick this up from the CVs. So I wouldn't go with PCA here. Also because that would indeed be difficult to automate, and I think we would go back to Hotelling's, which again I don't think would be able to pick up such small differences, except if we used a confidence of 0.9999.
Yes, exactly, the media control samples :) And yes, the PCA plot with the added media_control samples (= blank samples): we can name it QC blank samples and colour-code the blank samples in red and all the other samples in black (or something like this). Also yes, the media samples usually seem like an outlier group, whilst the pool samples would be expected to be in the middle of all samples (as they are the pool of all samples).
About your last point: yes, I agree Hotelling's is not a solution, and the PCA in itself at least gives us a visual representation in case something went really wrong, so it is a good control in any case. For the automation we still need a solution, though.
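A sketch of the proposed QC PCA with the media_control samples included (simulated intensities with made-up group means; the package would do this in R, and this numpy-SVD PCA is just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(10, 1, size=(12, 40))     # 12 experimental samples
media = rng.normal(2, 0.1, size=(3, 40))       # 3 media_control blanks
X = np.vstack([samples, media])
labels = ["Sample"] * 12 + ["CoRe_media"] * 3  # for colour-coding in the plot

# PCA via SVD on the mean-centred matrix
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U[:, :2] * S[:2]                      # PC1/PC2 scores

pc1_samples, pc1_media = scores[:12, 0], scores[12:, 0]
print(pc1_samples.round(1), pc1_media.round(1))
```

As discussed above, the media blanks form a clearly separated group along PC1, which is why this plot is a visual control rather than an automated test.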
One more thing: shall we report the CV in percent? (Or is this already converted? Sorry, haven't checked.)
No, the CV is a dataframe with metabolites as rows and 2 columns: the numerical CVs and High_var = TRUE/FALSE depending on the threshold (which is 1). We print how many metabolites were found with a high CV, but a percentage here would be better.
Ok, so maybe we can multiply the numerical CVs by 100 and export CV [%] instead.
Ok, but then we will export 1 value. Wouldn't it also make sense to export at least the table of the highly variable metabolites, or their names, so that the user can do something with it? If it's just the percentage of high-CV metabolites, meaning this one value, I think printing it in a message would be more appropriate than saving it in a file. On the other hand, saving it keeps it there forever, but there is still not much you can do with it.
I think we are talking about different things. I mean that if we have a CV of 0.1, this is 10% of variation, but I think I mixed this up with the SD. So never mind.
So here we had to do 5 things.
1) Test outliers with different scenarios:
Make some tests using different contingency tables, i.e. 00006, 11116, 121326. Each number is the number of high-variance metabolites for each of the 5 samples measured:
- 00004 with 70 metabolites in total, so only 1 sample with ~5%: this is found as an outlier
- 00001 with 70 metabolites in total: no outlier
- 00002 with 70 metabolites in total: sample 5 is found as an outlier
- 11115: sample 5 is found as an outlier
- 11114: no outliers
- 22225: no outliers
- 22226: sample 5
- 32036: sample 5 is an outlier
- 32035: no outliers
- 32136: no outliers
- 32137: sample 5
To my eyes it works as I would expect it to work.
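Several of the scenarios above can be reproduced with a sketch like the following. Running one one-sided 2x2 Fisher's exact test per sample (each sample's high/low-CV counts against the pooled remaining samples) is my assumed reading of the procedure, since the thread does not show the exact code; with that reading, 00002 and 11115 flag sample 5 while 00001 and 11114 do not.

```python
from scipy.stats import fisher_exact

def outlier_p(high_counts, n_metab):
    """One-sided Fisher's exact test per sample: is this sample
    over-represented among the high-CV metabolite calls?"""
    total_high = sum(high_counts)
    rest_total = (len(high_counts) - 1) * n_metab
    ps = []
    for h in high_counts:
        table = [[h, total_high - h],
                 [n_metab - h, rest_total - (total_high - h)]]
        ps.append(fisher_exact(table, alternative="greater")[1])
    return ps

for counts in ([0, 0, 0, 0, 2], [0, 0, 0, 0, 1], [1, 1, 1, 1, 5], [1, 1, 1, 1, 4]):
    ps = outlier_p(counts, 70)
    print(counts, [round(p, 3) for p in ps])
```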
So, about the relationship of MVI/NAs and CV: I think the CV increases with more NAs, but at some point it stabilizes. This is because: from 0 to more NAs the CV increases, as the imputed values have a high sd. However, at some point the overall mean moves closer to the imputed values and they contribute a lower sd. Then, having more NAs should decrease the CV, as the mean moves closer to the imputed values and thus the overall sd should decrease. However, as this happens, the overall mean also decreases, which would make the CV increase. These 2 effects would make the CV "stabilize".
I guess there could be a perfect analogy of missing values and recorded values around 50-50, which would cause the mean to be in the middle of both, making the sd high while also decreasing the mean, and thus making the CV high. But I am not sure. I think I could try to actually model what would happen, but it might take a while.
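A quick way to model this (a simulation sketch with made-up parameters, not the actual benchmark code): draw one well-behaved metabolite, knock out increasing fractions of values, half-minimum impute, and watch the CV.

```python
import numpy as np

rng = np.random.default_rng(1)

def cv_after_hm(v):
    """Half-minimum impute the NAs, then return sample sd / mean."""
    v = v.copy()
    v[np.isnan(v)] = np.nanmin(v) / 2.0
    return np.std(v, ddof=1) / np.mean(v)

base = rng.normal(100, 5, size=1000)   # one metabolite, many replicates
cvs = []
for frac in (0.0, 0.2, 0.4, 0.6):
    v = base.copy()
    idx = rng.choice(v.size, int(frac * v.size), replace=False)
    v[idx] = np.nan
    cvs.append(cv_after_hm(v))
    print(f"{int(frac * 100):2d}% NA -> CV = {cvs[-1]:.2f}")
```

In this setup the CV grows with the NA fraction but flattens out well below 1, consistent with the benchmark observation further down in the thread.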
2) Make the CV loop stop when we reach a certain number of samples (3)
This cannot be done, as it is not a loop. If we have more than 3 samples, we do the CVs, then Fisher's test, and then we remove the samples found as outliers from the media_samples.
3) Check: should we remove the samples responsible for a high CV from the mean calculation when they are not outliers? So if metab1 has a high CV and sample A is the furthest from the mean, but sample A is not an outlier, should the value of sample A for metab1 be excluded from the mean calculation?
This, as I see it, is more of a theoretical argument, and I went a bit more general. Our goal here is to be as precise as possible and as close as possible to the 'real' mean, which we don't know. So we measure a number of samples and go for reproducibility/consistency of the values. I would say that we completely remove a sample when it is found as an outlier by Fisher's test (due to the high number of high CVs caused by this sample), because we have a reason to do it. For example, one could argue that even though this sample is found to have only 5 or 10 different values for metabolites compared to the rest, there might still be extremities in other metabolites of smaller magnitude that we are not able to detect. For example, the metabolite abundances of this sample might always be higher than in the other samples, yet the total CV might still be <1, so all the means might be a little higher. My point here is that if Fisher's test found this sample as not related to the rest, we have enough evidence to remove it as not being consistent with the rest of the samples.
Now, about removing a specific value from the mean calculation because of a high CV: again I would vote for removing it. (Here I am not 100% sure of the first step in my reasoning.) Because of the nature of metabolomics, we can have random effects that might change the value of a metabolite, thus also resulting in a different abundance when compared to the group, and thus a high CV. If Fisher's test did not find the sample to be an outlier, then the sample is considered fine and this difference in abundance is considered a random effect. So we remove this value from the mean calculation but keep the sample.
4) Add a QC plot specific for the blank samples, meaning adding them onto the PCA to see if they cluster together. So we could do another QC PCA plot with the blank samples (of course only if there is more than 1 sample). A QC_Media_samples plot is exported into the QC folder and also added to the QC_plots list, colored by Sample_type = 'CoRe_media' or 'Sample'.
5) Check about Fisher's exact test and structural zeros. Based on this (how they showcase their structural zeros), I don't think our zeros are structural, so Fisher's test should be fine.
Thanks for doing the tests (point 1). This tells us some important things:
Towards point 3:
Towards point 4: Yes thats perfect :)
Towards point 2 and 5: Ok. I also think the structural zeros are not happening for us from how I understand the explanation, yet I can not be sure.
So, action points: maybe now you can add the removal of a whole sample if it causes a high CV for a certain percentage of metabolites, and the extra PCA QC plot. For the second point of 3 (blank sample is an outlier in only one metabolite and hence removed only in this one and not completely), we need to check if we really can just remove the sample from one metabolite and not the whole sample. (If this is something we are allowed to do, you could add the removal with a TRUE/FALSE parameter.)
Let me know what you think about this. Also, is there anything from the testing you have done that you think we should add into the manuscript methods section, like a benchmarking?
Ok, based on your comment, I would say we do: MVI, PCA, CV, contingency tables, Fisher's, and then removing the samples identified as outliers by Fisher's, and just that.
The reasoning is that we remove the samples that were found as outliers by Fisher's test on the tables of CVs. With this we remove the samples that were found to have a disproportional percentage of "being responsible for a metabolite's high CV" when compared to the rest of the Media_control samples, and we have evidence for it from Fisher's test. We don't do anything else, as it could fall into the category of "overfitting".
Also, about some benchmarking: I could maybe plot the function of MVIs and resulting CVs.
As for point 3b: we could add a parameter called Remove_Media_test = "fishers" or "CV". "fishers" would be the default, with the above reasoning, and "CV" would remove all sample values causing a high CV, with the reasoning of consistency against random effects; but indeed this needs a lot more investigation to see if it is a substantial argument. For now I would say leave this for later.
Sounds good to me. So I would suggest you do the points above: "MVI, PCA, CV, contingency tables, Fisher's, and then removing the samples identified as outliers by Fisher's and just that".
For the benchmarking, we could start a benchmarking .rmd file in the vignette folder. You can move your tests regarding the pre-processing in there and we can extend this with other benchmarking-like things we did or do for other functions.
For the last point (3b): I agree, let's leave this for now, and once the above is done, we can rename the issue and just add this as the remaining point for the future.
So the "MVI, PCA, CV, contingency tables, Fisher's" part is done. I am adding the test of NA and CV in a Benchmark.Rmd and will put it in the vignette.
As an upgrade: we should decide whether to remove single metabolite values that cause a high CV when the sample responsible is not identified as an outlier by Fisher's test.
Thanks!
To do's:
Something I noticed here by looking at this plot (the benchmark plot for the effect of the NA percentage on the CV): on the x-axis we see the different percentages of added NAs, and on the y-axis the resulting CV. Here we see that the max NA percentage is 50-60%, and for 60% the max CV is ~90% (i.e. 0.9). Assuming that our data are good, I would say that CV = 1 as a cut-off seems good, because we see that the effect of NAs on the CV isn't enough to make a metabolite have a high CV. So for a metabolite to have CV > 1, something else besides having NAs has to have happened.
We measure the mean and sd of the blank samples and return a message with the percentage of highly variable metabolites in the blank samples.
I am not sure what would be a decent cut-off for the SD to take action, meaning giving an error message asking the user to check certain blanks versus a warning message. You are right that there really should not be any variance.
You said: "for now I used the coefficient of variation CV>1 to give a warning of high variance based on this. I will come back to this."