saezlab / MetaProViz

R-package to perform metabolomics pre-processing, differential metabolite analysis, metabolite clustering and custom visualisations.
https://saezlab.github.io/MetaProViz/
GNU General Public License v3.0

Preprocessing - Potential outliers (95%) and PCA #10

Closed ChristinaSchmidt1 closed 1 year ago

ChristinaSchmidt1 commented 1 year ago

Today we said to only add one column with the outliers, keep it as simple as possible, and leave the final choice with the user.

You mentioned in your meeting that you found an issue with the PCA after Hotelling's that you wanted to fix. Can you specify again what this was about?

dprymidis commented 1 year ago

Yes, I will do so with an example. We have 10 samples. Samples 1 to 8 have 0 for a metabolite A, and samples 9 and 10 have some value. In the first round everything works correctly. In the 2nd round, Hotelling's for whatever reason identifies samples 9 and 10 as outliers and removes them. In the 3rd round PCA cannot be performed because metabolite A now has zero variance. Thus, in each outlier identification round, before the PCA, we have to check that all metabolites have non-zero variance. If one happens to have zero variance, we have to remove it. In this case:

  1. We should give a warning that this metabolite (metabolite A) was removed in that filtering round (for example 2) because of zero variance.
  2. Export a text file listing the metabolite and the round in which it was removed.

In the ProcessedData: we keep all the samples and have the outlier column. Here, we would also keep that metabolite. The user could then remove the metabolite using the exported text if they also want to filter out the outlier samples of that round (for example the 2nd), or keep the metabolite (do nothing) if they decide to keep the samples identified as outliers in the 2nd round.
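The check described above can be sketched as follows. This is only an illustration in Python using the standard library, not the package's R code: `drop_zero_variance`, the toy values and the log format are hypothetical, and the outlier removal of round 2 is simulated by hand rather than computed with Hotelling's.

```python
from statistics import pvariance

def drop_zero_variance(data, outlier_round, log):
    """Return only the metabolites with non-zero variance.

    data: dict mapping metabolite name -> list of values (one per sample).
    outlier_round: the outlier-removal round that preceded this check.
    log: list collecting (metabolite, round) pairs for a later text export.
    """
    kept = {}
    for metabolite, values in data.items():
        if pvariance(values) == 0:
            log.append((metabolite, outlier_round))
        else:
            kept[metabolite] = values
    return kept

# 10 samples: metabolite A is 0 in samples 1-8 and non-zero in samples 9 and 10.
data = {
    "A": [0, 0, 0, 0, 0, 0, 0, 0, 5, 7],
    "B": [3, 4, 2, 5, 6, 3, 4, 5, 9, 8],
}
removal_log = []

# Assumed outcome of round 2: samples 9 and 10 are flagged and removed.
remaining = {m: v[:8] for m, v in data.items()}

# Check before the next PCA: A is now constant and is left out of the PCA input.
pca_input = drop_zero_variance(remaining, outlier_round=2, log=removal_log)

for metabolite, rnd in removal_log:
    print(f"Warning: {metabolite} removed after round {rnd} (zero variance)")
```

Note that `data` itself is not modified, which mirrors the idea of keeping the metabolite in the processed data and leaving the final decision to the user.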

What do you think?

ChristinaSchmidt1 commented 1 year ago

I think that sounds reasonable. We should also include in the warning that if they had chosen to do missing value imputation, they shouldn't have this issue in the first place, as no value can be 0.

dprymidis commented 1 year ago

With the MVI we change the zero values to the half minimum. It's the same value for all zeros, so the issue still occurs because the variance remains zero. I fixed this the way proposed above in the latest commit.
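A small numeric illustration of this point, written in Python with the standard library only. It assumes that the half minimum is taken per metabolite over its non-zero values; `half_min_impute` and the numbers are hypothetical and not taken from the package.

```python
from statistics import pvariance

def half_min_impute(values):
    """Replace every zero with half of the smallest non-zero value."""
    half_min = min(v for v in values if v > 0) / 2
    return [half_min if v == 0 else v for v in values]

# Metabolite A: zero in samples 1-8, measured in samples 9 and 10.
metabolite_a = [0, 0, 0, 0, 0, 0, 0, 0, 5, 7]

imputed = half_min_impute(metabolite_a)   # every zero becomes the same constant
without_outliers = imputed[:8]            # samples 9 and 10 removed as outliers

variance_all = pvariance(imputed)             # non-zero: samples 9 and 10 differ
variance_rest = pvariance(without_outliers)   # zero: only the constant remains
print(variance_all, variance_rest)
```

Imputation changes the constant from 0 to the half minimum, but once the two measured samples are gone the remaining values are still identical, so the variance check is needed either way.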

ChristinaSchmidt1 commented 1 year ago

I have been checking the "zero variance metabolites" part of the preprocessing function (lines 237-246). Here the metabolites that have zero variance are removed, so they won't cause the issues you mentioned above. I would do this after the TIC normalisation, so those metabolites remain in the TIC-normalised DF. I would then modify the message to state that the metabolites with zero variance were not taken into account for the outlier detection or shown on the PCA plots, due to the issues explained above.

Ultimately, I would return all metabolites (including the zero-variance ones), ideally MVI and TIC normalised (if the user chooses the standard settings), but give messages stating which metabolites were not included in the outlier detection/PCA due to zero variance.

dprymidis commented 1 year ago

The initial zero-variance metabolite identification is moved after TIC normalisation and before outlier detection (it is actually the first step in every loop of the outlier detection). All the metabolites are normalised and present in the processed data. A warning stating "Metabolites with zero variance have been identified in the data. As scaling in PCA cannot be applied when features have zero variance, these metabolites are not taken into account for the outlier detection and the PCA plots." is shown, and a text file with the zero-variance metabolites df is also produced.
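The resulting order of steps can be sketched roughly as below, again in Python with the standard library rather than the package's R code. The TIC normalisation shown here (dividing each value by the total signal of its sample) is a simplifying assumption, and `tic_normalise`, `split_for_pca` and the toy data are hypothetical names and values.

```python
import warnings
from statistics import pvariance

def tic_normalise(data):
    """Divide every value by the total signal of its sample (simplified TIC)."""
    n_samples = len(next(iter(data.values())))
    totals = [sum(values[i] for values in data.values()) for i in range(n_samples)]
    return {m: [v / t for v, t in zip(values, totals)] for m, values in data.items()}

def split_for_pca(data):
    """Normalise all metabolites, then exclude zero-variance ones from the PCA input only."""
    normalised = tic_normalise(data)
    zero_var = [m for m, values in normalised.items() if pvariance(values) == 0]
    if zero_var:
        warnings.warn(
            "Zero-variance metabolites not used for outlier detection/PCA: "
            + ", ".join(zero_var)
        )
    pca_input = {m: v for m, v in normalised.items() if m not in zero_var}
    return normalised, pca_input, zero_var

# Toy data: 4 samples, metabolite A is zero everywhere.
data = {
    "A": [0, 0, 0, 0],
    "B": [2, 4, 6, 8],
    "C": [8, 6, 4, 2],
}
processed, pca_input, zero_var = split_for_pca(data)
```

Here `processed` still contains every metabolite, while `pca_input` omits the zero-variance ones; the `zero_var` list is what would be written to the text file.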

ChristinaSchmidt1 commented 1 year ago

Amazing job, thanks! I will close the issue then :)