kreutz-lab / DIMAR

Data-driven selection of an imputation algorithm in R

Large Datasets Speed #6

Open abadgerw opened 5 months ago

abadgerw commented 5 months ago

Thank you for the great tool! I'm trying to apply it to a large dataset of 97 samples and >3,000 proteins. However, it seems to indefinitely hang up at the following step:

[1] "Features with less than 2 data points are removed." Number of proteins with empty entries: 0

Do you have any insight into why it is hanging up and how I can get things to progress forward? It looks like I have adequate memory.

clemenskreutz commented 5 months ago

Dear abadgerw. Can you provide the data and code somehow for testing purposes?

abadgerw commented 5 months ago

Apologies for the delay. I did some further testing and it was a memory issue on my local computer. I am able to run on a cluster computer. However, I did run into a couple things (attaching an example pruned file):

Test.csv

I ran the following code:


mtx<-read.csv("Test.csv",row.names=1)

mtx<-as.matrix(mtx)

Imp<-dimar(mtx, pattern = NULL)

This works but I wanted to have all the intermediate steps saved (including the rank order of the imputation methods)

Therefore, I ran the following:


mtx<-read.csv("Test.csv",row.names=1)

mtx<-as.matrix(mtx)

mtx <- dimarMatrixPreparation(mtx, logflag=FALSE)

methods=c('impSeqRob','impSeq','missForest','imputePCA','ppca','MIPCA','bpca','SVDImpute','kNNImpute','regression','aregImpute','softImpute','MinDet','amelia','SVTImpute','irmi','knn','QRILC','nipals','MinProb','rf','sample','pmm','svdImpute','norm','cart','midastouch','mean','ri')

coef <- dimarLearnPattern(mtx)
ref <- dimarConstructReferenceData(mtx)
sim <- dimarAssignPattern(ref, coef, mtx)

Imputations <- dimarDoImputations(sim, methods)
Performance <- dimarEvaluatePerformance(Imputations, ref, sim, rankby = 'RMSE', RMSEttest = TRUE, group)

Imp <- dimarDoOptimalImputation(mtx, rownames(Performance))

I get the following error during the dimarEvaluatePerformance step:

Error in ttesti[!is.finite(ttesti)] <- NULL : replacement has length zero

Lastly, can I easily test an imputation method that isn't already pre-existing using the pipeline?

abadgerw commented 5 months ago

@clemenskreutz I looked a bit deeper and oddly the pruned file I sent above works without error during the dimarEvaluatePerformance step but the full dataset provides the error mentioned above: Error in ttesti[!is.finite(ttesti)] <- NULL : replacement has length zero

I've attached the larger file here for comparison:

Test2.csv

abadgerw commented 5 months ago

@clemenskreutz just wanted to see if you were able to reproduce the error with the datasets above?

abadgerw commented 3 months ago

@clemenskreutz I just wanted to see if you were able to reproduce this issue and whether you had some insights into how to address the issue?

abadgerw commented 3 months ago

@clemenskreutz I just want to circle back to see if you had made any headway in potentially assisting with this so I can move forward with using your tool?

mengerj commented 2 months ago

@abadgerw Thanks for wanting to use the package. I think I fixed the issue; please check whether it works for you. It seems that in your data set the t-test returns infinite values for some proteins, at least after certain imputation methods. If you want to look more closely at why this happens, you could clone the repository, load it with devtools::load_all(), and, within this loop in dimarEvaluatePerformance.R: for (a in 1:length(Imputations)) { # loop over imputation algorithms ... } add print(im) as well as if(!is.finite(htesti$statistic)) print(t) after ttesti[t] <- htesti$statistic, to check which proteins and which imputation methods return these values. I assume the standard deviation of these proteins is zero after imputation due to a certain mechanism.
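A minimal sketch of why the reported error message appears, using a hypothetical vector of t statistics and not DIMAR's internal code: in R, assigning NULL to selected elements of an atomic vector fails with "replacement has length zero", whereas subsetting with is.finite() simply drops the offending entries. How the package itself was fixed is not shown in this thread.

```r
# Hypothetical t statistics; Inf and NaN stand in for the problematic proteins
ttesti <- c(2.1, Inf, -0.4, NaN)

# Assigning NULL into an atomic vector raises an error when entries are selected
res <- tryCatch({
  ttesti[!is.finite(ttesti)] <- NULL
  "no error"
}, error = function(e) conditionMessage(e))
print(res)

# One possible alternative: keep only the finite statistics
finite_only <- ttesti[is.finite(ttesti)]
print(finite_only)

# Indices of the non-finite entries, useful for tracing them back to proteins
print(which(!is.finite(ttesti)))
```

Whether non-finite statistics should be dropped or set to NA depends on how the ranking is computed downstream, so treat the filtering step as one option among several.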

Regarding your question whether you can add your own methods: you should be able to, by adding one to the dimarDoImputationR.R file. If you have trouble, let me know which method you want to add and I can implement it.

Hope it works, have a good day!

abadgerw commented 2 months ago

@mengerj Thank you so much! I have gone ahead and tried and am running into the following issue. I ran the same code as above to save the intermediate steps that the DIMAR function runs:


mtx<-read.csv("Test2.csv",row.names=1)

mtx<-as.matrix(mtx)

mtx <- dimarMatrixPreparation(mtx, logflag=FALSE)

methods=c('impSeqRob','impSeq','missForest','imputePCA','ppca','MIPCA','bpca','SVDImpute','kNNImpute','regression','aregImpute','softImpute','MinDet','amelia','SVTImpute','irmi','knn','QRILC','nipals','MinProb','rf','sample','pmm','svdImpute','norm','cart','midastouch','mean','ri')

coef <- dimarLearnPattern(mtx)
ref <- dimarConstructReferenceData(mtx)
sim <- dimarAssignPattern(ref, coef, mtx)

Imputations <- dimarDoImputations(sim, methods)
Performance <- dimarEvaluatePerformance(Imputations, ref, sim, rankby = 'RMSE', RMSEttest = TRUE, group)

Imp <- dimarDoOptimalImputation(mtx, rownames(Performance))

When running the dimarLearnPattern step, I get the following warning that I didn't get beforehand: By default DIMAR is assuming that the first half of the columns belong to group 1 and the second half to group 2. If this is not the case, please provide the group vector to the dimarConstructDesignMatrix function.

Given I wasn't using the dimarConstructDesignMatrix function but rather the dimarMatrixPreparation function, how can I specify the required groupings whilst also being able to use the above code to save all the intermediate steps (including the rank order of the imputation methods)?

Thanks for all your help!

abadgerw commented 2 months ago

@mengerj I forgot to mention that the methods I was interested in implementing were:

  1. https://pubmed.ncbi.nlm.nih.gov/31199438/
  2. https://pubmed.ncbi.nlm.nih.gov/29385130/
  3. https://pubmed.ncbi.nlm.nih.gov/15333461/

mengerj commented 2 months ago

@abadgerw Thanks for noting. You can now pass a group vector to the dimarLearnPattern function, which calls the dimarConstructDesignMatrix function. If you don't pass anything, the default is now hierarchical clustering, which I changed to ensure consistency across the package. Let me know if it works now. I'm going to try to implement the requested methods in the next few days. Have a good day!
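To illustrate what a clustering-based default of this kind does, here is a rough sketch that splits samples into two groups with base R's stats::hclust() and cutree(). This is a stand-in: the warning quoted later in the thread names amap::hcluster(), and the toy matrix, its dimensions, and the linkage method are assumptions made for illustration only.

```r
set.seed(1)

# Hypothetical toy matrix: 20 proteins (rows) x 6 samples (columns),
# with the last three samples shifted to form a clearly separate group
mtx <- cbind(matrix(rnorm(60, mean = 0), nrow = 20),
             matrix(rnorm(60, mean = 10), nrow = 20))
colnames(mtx) <- paste0("S", 1:6)

# Samples are in columns, so transpose before computing pairwise distances
tree <- hclust(dist(t(mtx)), method = "complete")

# Cut the dendrogram into two clusters: one group label per sample
group <- cutree(tree, k = 2)
print(group)
```

If the true group assignments are known, passing them explicitly is preferable, since clustering recovers groups only when they dominate the variation between samples.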

abadgerw commented 2 months ago

Thanks, @mengerj! I just tested and when I run the following step:


sim <- dimarAssignPattern(ref, coef, mtx)

I get the following warning even though I provided a group vector during the dimarLearnPattern step:

By default DIMAR is using hierarchical clustering (amap::hcluster()) to assign samples to two groups. If group assigments are known, provide a vector of group assigments to be passed to the dimarConstructDesignMatrix function.

Any insights into whether I need to provide the group vector as part of any of the other steps?

mengerj commented 2 months ago

Hi, you should now be able to provide the group option also to the dimarAssignPattern function.
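A hedged sketch of how such a group vector might be built for the 97 samples mentioned at the top of the thread. The sample names, the 50/47 split, and the 1/2 coding are hypothetical, and the argument name `group` in the commented-out DIMAR calls is inferred from the replies above, not checked against the package.

```r
# Hypothetical sample names; in practice these would come from colnames(mtx)
sample_names <- c(paste0("ctrl_", 1:50), paste0("case_", 1:47))

# One label per sample (column), in the same order as the columns of mtx
group <- ifelse(grepl("^ctrl_", sample_names), 1, 2)

# The vector must have exactly one entry per sample
stopifnot(length(group) == length(sample_names))
print(table(group))

# Assumed usage, based on the replies above (argument name not verified):
# coef <- dimarLearnPattern(mtx, group = group)
# sim  <- dimarAssignPattern(ref, coef, mtx, group = group)
```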

mengerj commented 2 months ago

Regarding the methods you would like to integrate: the GMSimpute method is already available. But when trying to use it I found a problem for which I haven't found a great solution. Internally, the function removes all rows with more than 50% missing values. When using parallel processing to impute several simulated matrices at once, more rows are removed from some matrices than from others, and therefore the resulting matrices can't be joined into a new 3D array. I tried to install the package locally through the CRAN GitHub repository and had some issues. Other attempts at changing the relevant code and overwriting the function also led to problems due to parallel processing. For now, the only solutions I can give you are to insert a function I wrote that removes all rows that have more than 50% MV in ANY of the patterns (3rd dimension of sim), or to set npat = 1 in dimarAssignPattern. The first option alters the simulated matrix and should therefore be applied before running any imputation, to ensure comparability of the methods. Also, in this case the strengths and weaknesses of the different imputation methods with regard to handling rows with more than 50% MV cannot be judged. Here is the modified code you can use, based on your example:

#' Remove rows whose fraction of missing values reaches cutOff (here 50%) in ANY of the patterns.
#' Otherwise GMSimpute won't work.
cutSim <- function(sim, cutOff) {
  rowIdx_50pNA <- c()
  for (i in 1:dim(sim)[3]) {
    input_data <- sim[, , i]
    row.missing <- rowSums(is.na(input_data))
    rowIdx <- which(row.missing >= cutOff * ncol(input_data))
    rowIdx_50pNA <- c(rowIdx_50pNA, rowIdx)
  }
  # Remove these rows from all simulated matrices to ensure equal size.
  # Skip the removal if no row was flagged: sim[-integer(0), , ] would drop every row.
  rowIdx_50pNA <- unique(rowIdx_50pNA)
  if (length(rowIdx_50pNA) > 0) {
    sim <- sim[-rowIdx_50pNA, , , drop = FALSE]
  }
  sim
}

mtx <- read.csv("Test.csv", row.names = 1)
mtx <- as.matrix(mtx)
mtx <- dimarMatrixPreparation(mtx, logflag = FALSE)

methods <- c('GMSimpute')

coef <- dimarLearnPattern(mtx)
ref <- dimarConstructReferenceData(mtx)
sim <- dimarAssignPattern(ref, coef, mtx)

if ("GMSimpute" %in% methods) {
  sim <- cutSim(sim, 0.5)
}

Imputations <- dimarDoImputations(sim, methods)
Performance <- dimarEvaluatePerformance(Imputations, ref, sim, rankby = 'RMSE', RMSEttest = TRUE, group)

Imp <- dimarDoOptimalImputation(mtx, rownames(Performance))
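The row-removal idea can be checked on a toy array without installing anything. The sketch below uses hypothetical data (4 features, 4 samples, 2 simulated patterns) and a vectorised variant of the same rule: a row is dropped from every pattern if it reaches the cutoff in any one of them. It uses logical indexing because negative indexing with an empty index vector selects no rows at all.

```r
# Toy 3D array: 4 features x 4 samples x 2 simulated patterns (hypothetical data)
sim <- array(as.numeric(1:32), dim = c(4, 4, 2))
sim[1, 1:2, 1] <- NA   # row 1: 50% missing in pattern 1 only
sim[3, 1:3, 2] <- NA   # row 3: 75% missing in pattern 2 only
sim[2, 1, 1]   <- NA   # row 2: 25% missing, stays below the cutoff

# Fraction of missing values per row and pattern (rows x patterns matrix)
na_frac <- apply(is.na(sim), c(1, 3), mean)

# Flag a row if it reaches the cutoff in ANY pattern
drop_rows <- which(rowSums(na_frac >= 0.5) > 0)

# Logical indexing stays correct even when no row is flagged
keep <- !(seq_len(dim(sim)[1]) %in% drop_rows)
sim_cut <- sim[keep, , , drop = FALSE]

print(drop_rows)
print(dim(sim_cut))
```

All patterns keep the same rows afterwards, which is what allows the imputed matrices to be stacked back into one 3D array.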

mengerj commented 2 months ago

I can't install the second method from GitHub (if you find a way, let me know). Also, for the third method, the link provided with the publication leads nowhere. If you find ways to install these methods through the remotes package (they don't seem to be available through CRAN or Bioconductor), let me know and I can try again.

abadgerw commented 2 months ago

Thanks, @mengerj!

I think the third method can be implemented via the pcaMethods package: https://www.rdocumentation.org/packages/pcaMethods/versions/1.64.0/topics/llsImpute

I will have a look at the other two and see if I can think of anything.

Thanks again for your help!