saezlab / MetaProViz

R-package to perform metabolomics pre-processing, differential metabolite analysis, metabolite clustering and custom visualisations.
https://saezlab.github.io/MetaProViz/
GNU General Public License v3.0

Using DMA function on data in natural log scale #76

Closed nikvan closed 7 months ago

nikvan commented 7 months ago

Hello, I really like the DMA function you guys have put together! I have data that has already undergone batch correction, imputation, and natural log transformation. Is it suitable to use the DMA function on this data? Are some algorithms more suitable than others? Thank you!

ChristinaSchmidt1 commented 7 months ago

Hi, great to hear you like the function!

For STAT_pval: Because your data have been log transformed, they should be closer to a normal distribution, so t.test (if you have two conditions to compare) or aov (ANOVA, if you have multiple conditions to compare) would be the standard test to use. You can also use lmFit (limma) for multiple comparisons, but in that case you would need to undo the log transformation first, because the function applies a log2 transformation itself before running limma. As a side note, limma is also useful if your data are not imputed.
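As a rough illustration of the two options above, here is a minimal base R sketch. The data (`raw`, `ln_data`) and the group labels are made up for the example and are not from this thread. It shows how natural-log values could be back-transformed with `exp()` before handing them to a function that log2-transforms internally, and how a two-group t-test can be run directly on the log-scale values of one metabolite.

```r
# Hypothetical example data: 6 samples x 3 metabolites on the natural log scale
set.seed(1)
raw <- matrix(rlnorm(18, meanlog = 10, sdlog = 1), nrow = 6,
              dimnames = list(paste0("Sample", 1:6), paste0("Metab", 1:3)))
ln_data <- as.data.frame(log(raw))

# Undo the natural log so that a function which log2-transforms internally
# does not end up transforming the data twice
linear_data <- exp(ln_data)

# If log2 values are needed directly: log2(x) = ln(x) / ln(2)
log2_data <- ln_data / log(2)

# Two-group t-test (Welch by default) on the log-scale values of one metabolite
groups <- rep(c("A", "B"), each = 3)
tt <- t.test(ln_data$Metab1[groups == "A"], ln_data$Metab1[groups == "B"])
tt$p.value
```

Whether the back-transformed or the log-scale table is the right input depends on the chosen STAT_pval option, as described above.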

For STAT_padj: This depends a bit on your feature space (i.e. how many metabolites you have detected). For low numbers of features (<100), I would use BH (Benjamini-Hochberg). That said, many papers compare the different adjustment methods across feature spaces, and other methods are also applicable.
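To get a feel for what the adjustment does, here is a small sketch using base R's `p.adjust()`. The p-values are invented for illustration, and it is only an assumption that the STAT_padj value is handled the same way inside DMA. In `p.adjust()` itself, "fdr" is an alias for "BH".

```r
# Hypothetical raw p-values for five metabolites
p_raw <- c(0.001, 0.012, 0.030, 0.040, 0.200)

# Benjamini-Hochberg adjustment; in base R, "fdr" is an alias for "BH"
p_bh  <- p.adjust(p_raw, method = "BH")
p_fdr <- p.adjust(p_raw, method = "fdr")
identical(p_bh, p_fdr)

# A more conservative alternative, shown for comparison
p_bonf <- p.adjust(p_raw, method = "bonferroni")
data.frame(p_raw, p_bh, p_bonf)
```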

Does this address your question sufficiently?

nikvan commented 7 months ago

Thank you very much @ChristinaSchmidt1! Super helpful.

I did run into an issue when trying what you recommend.

DMA_1 <- DMA(Input_data,
             Input_SettingsFile_Sample,
             Input_SettingsInfo = c(conditions = "GROUP_NAME_2",
                                    numerator = "CMPD1_BSL_H",
                                    denominator = "CMPD1_BSL_L"),
             STAT_pval = "t.test",
             STAT_padj = "fdr",
             Input_SettingsFile_Metab = NULL,
             OutputName = "",
             CoRe = FALSE,
             VST = FALSE,
             Save_as_Plot = "svg",
             Save_as_Results = "csv",
             Plot = TRUE,
             Folder_Name = NULL)

My input data and settings data should conform to the correct format.

I get an error when running the DMA command above that: "Error in MetaProViz:::Shapiro(Input_data = Input_data, Input_SettingsFile_Sample = Input_SettingsFile_Sample, : The GROUP_NAME_2 column selected as Conditions in Input_SettingsInfo was not found in Input_SettingsFile_Sample. Please check your input"

Everything seems to be good at first glance. Anything I am missing here?

[Screenshot 2024-01-29 at 6.41.36 PM]

Thank you!

ChristinaSchmidt1 commented 7 months ago

Thanks for pointing this out! You can rename your column GROUP_NAME_2 to "Conditions". I have also pushed a small change to the DMA function, so if you reinstall the package (now v1.0.1), it should also work as you described above.

Let me know if that fixes your issue and thanks again for bringing this to my attention.
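For reference, renaming the column can be done in base R along these lines. The `meta` data frame here is a made-up stand-in for the sample settings file, not the actual data from this thread.

```r
# Hypothetical sample metadata with the grouping column used in this thread
meta <- data.frame(
  GROUP_NAME_2 = rep(c("CMPD1_BSL_H", "CMPD1_BSL_L"), each = 3),
  row.names = paste0("Sample", 1:6),
  stringsAsFactors = FALSE
)

# Rename the grouping column to "Conditions"
colnames(meta)[colnames(meta) == "GROUP_NAME_2"] <- "Conditions"

# Confirm the column is present and holds the expected groups
"Conditions" %in% colnames(meta)
table(meta$Conditions)
```

After renaming, Input_SettingsInfo would need conditions = "Conditions" to match.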

nikvan commented 7 months ago

Thanks, @ChristinaSchmidt1! Changing the column name from GROUP_NAME_2 to Conditions got me past that error.

I am encountering another error, which I am not too sure what to make of.

[Screenshot 2024-01-30 at 11.54.34 AM]

ChristinaSchmidt1 commented 7 months ago

OK, let's briefly check whether your input files are in the correct format. Below is an example. Could you check whether your input is formatted in the same way and let me know?

# Set seed for reproducibility
set.seed(123)

# Number of samples and features
num_samples <- 20
num_features <- 666

# Create rawdata with random values
data <- data.frame(matrix(rnorm(num_samples * num_features), ncol = num_features))
rownames(data) <- paste0("Sample", 1:num_samples)

# Display dimensions of rawdata
dim(data)

# Create meta data with condition column
meta <- data.frame(
  Conditions = rep(c("CMPD1_BSL_H", "CMPD1_BSL_L", "Condition3", "Condition4"), each = 5),
  stringsAsFactors = FALSE
)

rownames(meta) <- rownames(data)

# Display dimensions of meta
dim(meta)

Afterwards you can use this as input:

DMA_1 <- DMA( Input_data=data, 
              Input_SettingsFile_Sample=meta, 
              Input_SettingsInfo = c(conditions = "Conditions", 
                                     numerator = "CMPD1_BSL_H", 
                                     denominator = "CMPD1_BSL_L"),
              STAT_pval = "t.test", 
              STAT_padj = "fdr", 
              Input_SettingsFile_Metab = NULL, 
              OutputName = "", 
              CoRe = FALSE, 
              VST = FALSE,
              Save_as_Plot = "svg",
              Save_as_Results = "csv",
              Plot = TRUE,
              Folder_Name = NULL)

nikvan commented 7 months ago

Hi @ChristinaSchmidt1, thanks very much for the example. I was able to get that to run successfully. I am going through my input data frames to find what might be causing the issue. So far, I have checked:

  1. Making sure the row names in Input_data are the same as Input_SettingsFile_Sample
  2. Deleting columns in Input_SettingsFile_Sample so that there is just the column "Conditions"
  3. Making sure rownames and column names in both are valid object names (using make.names for this)
  4. Checking to make sure all the columns in Input_data are numeric.
  5. Grouping my conditions together in the Input_SettingsFile_Sample.

Before, it looked like this:

[Screenshot 2024-01-31 at 9.53.27 AM]

Now it looks like this:

[Screenshot 2024-01-31 at 9.54.39 AM]
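The checks listed above can be sketched in base R roughly as follows. The two data frames are small made-up stand-ins in the same layout as the example earlier in this thread (samples in rows, metabolites in columns), and the set of checks is only an illustration, not something MetaProViz provides.

```r
# Hypothetical inputs: samples in rows, metabolites in columns,
# and one "Conditions" column in the sample metadata
set.seed(123)
Input_data <- data.frame(matrix(rnorm(8 * 5), ncol = 5))
rownames(Input_data) <- paste0("Sample", 1:8)
Input_SettingsFile_Sample <- data.frame(
  Conditions = rep(c("CMPD1_BSL_H", "CMPD1_BSL_L"), each = 4),
  row.names = rownames(Input_data),
  stringsAsFactors = FALSE
)

checks <- c(
  # Row names match between data and sample metadata, in the same order
  rownames_match = identical(rownames(Input_data),
                             rownames(Input_SettingsFile_Sample)),
  # The grouping column exists
  has_conditions = "Conditions" %in% colnames(Input_SettingsFile_Sample),
  # Row and column names are already syntactically valid
  valid_names = identical(make.names(colnames(Input_data)), colnames(Input_data)) &&
    identical(make.names(rownames(Input_data)), rownames(Input_data)),
  # Every data column is numeric
  all_numeric = all(vapply(Input_data, is.numeric, logical(1))),
  # Numerator and denominator both occur as condition labels
  groups_present = all(c("CMPD1_BSL_H", "CMPD1_BSL_L") %in%
                         Input_SettingsFile_Sample$Conditions)
)
checks
```

Any entry of `checks` that comes back FALSE points to the part of the input worth inspecting first.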

Anything else I should check? Thank you!

ChristinaSchmidt1 commented 7 months ago

Hi Nikvan,

I am glad the code runs on your machine. About the points to check:

Then it should work. If it still does not work, could you please send me the error message you are now getting after all those changes?

nikvan commented 7 months ago

Thanks! Sorry about the delay. Seems to be working!