nfj1380 / mrIML

Multivariate (multi-response) ensemble learning
https://nfj1380.github.io/mrIML/

Error in data.frame(sp, mod_name, rmse, rsq): arguments imply differing number of rows: 0, 1 #4

Open dosshra opened 2 years ago

dosshra commented 2 years ago

Hello,

I tried to run mrIML using this script:

    cl <- parallel::makeCluster(20)
    future::plan(cluster, workers=cl)
    X <- as.data.frame(gfdf[,2:13])
    Y <- gfdf[,seq(14,61822,100)]
    model_rf <- rand_forest(trees = 10, mode = "regression", mtry = tune(), min_n = tune()) %>%
      set_engine("randomForest")
    yhats_rf <- mrIMLpredicts(X=X, Y=Y, Model=model_rf, balance_data='no', mode='regression',
                              tune_grid_size=5, seed = sample.int(1e8, 1))

I get a lot of warnings during the run:

    ! Fold05: preprocessor 1/1, model 1/5: The response has five or fewer unique value
    ! Fold04: internal: A correlation computation is required, but truth is constant

When I run the following code:

    ModelPerf <- mrIMLperformance(yhats_rf, Model=model_rf, Y, mode='regression')

I get this error:

    Error in data.frame(sp, mod_name, rmse, rsq): arguments imply differing number of rows: 0, 1

Traceback:

  1. mrIMLperformance(yhats_rf, Model = model_rf, Y, mode = "regression")
  2. data.frame(sp, mod_name, rmse, rsq)
  3. stop(gettextf("arguments imply differing number of rows: %s", paste(unique(nrows), collapse = ", ")), domain = NA)

Thank you
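
For context, this message is what base R's `data.frame()` raises when one argument has length zero and another has length one, since a zero-length vector cannot be recycled. The sketch below reproduces the same message with made-up stand-ins for `sp` and `rmse`. It does not show mrIML's internals, and which input was empty in the original run is not known from the thread.

```r
# Hypothetical stand-ins: an empty vector of response names and a
# single performance value. Neither comes from mrIML itself.
sp   <- character(0)   # zero-length, e.g. no response names found
rmse <- 0.12           # length one

# Capture the error message instead of stopping the script.
err <- tryCatch(
  data.frame(sp, rmse),
  error = function(e) conditionMessage(e)
)
print(err)
```

If something similar is suspected, checking `length()` or `dim()` of each input before calling the performance function is a quick way to spot an empty object.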

nfj1380 commented 2 years ago

Hi there,

Can you provide a reproducible example? I suspect it is to do with the Y variables here. The warnings you get are common during the tuning process - it means for some tuning parameters in the models there wasn't enough signal to calculate r2 properly. Usually these warnings aren't a huge issue.
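
One way to check whether the Y variables are behind warnings like these is to screen each response column for very few unique values or zero variance before fitting. The following is a minimal base R sketch on a toy data frame; `Y_toy` and its column names are invented for illustration, and the threshold of five simply mirrors the wording of the warning.

```r
# Toy response table standing in for Y (hypothetical values).
set.seed(1)
Y_toy <- data.frame(
  snp_ok       = runif(20),                          # many distinct values
  snp_constant = rep(0, 20),                         # no variation at all
  snp_sparse   = rep(c(0, 0.5), times = c(18, 2))    # only two distinct values
)

# Count distinct values and compute the standard deviation per column.
n_unique <- vapply(Y_toy, function(col) length(unique(col)), integer(1))
col_sd   <- vapply(Y_toy, sd, numeric(1))

# Flag columns that would likely trigger the warnings.
flagged <- names(Y_toy)[n_unique <= 5 | col_sd == 0]
print(flagged)

# A constant response makes a correlation (and hence R-squared) undefined.
r <- suppressWarnings(cor(Y_toy$snp_constant, Y_toy$snp_ok))
print(is.na(r))
```

Within cross-validation the same thing can happen fold by fold: a column that varies overall may still be constant inside a single fold.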

Cheers,

Nick

dosshra commented 2 years ago

Thank you for the reply. Please see the attached RData file with X and Y. Y is a matrix of minor allele proportions in 110 populations, and X holds the environmental variables for those populations. mrIML.zip (https://github.com/nfj1380/mrIML/files/8281359/mrIML.zip)

nfj1380 commented 2 years ago

I couldn't recreate your error - mrIMLperformance worked fine for me with your data. When did you install the package?

dosshra commented 2 years ago

I found a problem with my Y matrix. It is running OK now. Thank you

dosshra commented 2 years ago

Hi,

Using the data that I uploaded and the code in the original message, the run takes 4 hours with 20 CPUs to complete. Y is only a 110x618 matrix, X includes 12 variables, and I run only 10 trees.
The mrIML installation is from last week, on R version 3.6.3. Trying with R 4.1.2 and 7 CPUs also took a very long time. What could be the problem?
Thank you

nfj1380 commented 2 years ago

You could try reducing the tuning grid size (i.e. not testing as many hyperparameter combinations). If you are comparing to gradient forests, this is why it takes longer: GF doesn't do any model tuning. That said, it took overnight on my machine using 6 cores, which is longer than I thought it would given the data dimensions. You could try other algorithms too; SVM could be faster if you are on a time crunch. See the parsnip list of models: https://parsnip.tidymodels.org/articles/articles/Models.html
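
To see why the grid size matters, a rough count of model fits helps. The sketch below multiplies the number of responses by the grid size and the number of resampling folds. The 618 responses and grid size of 5 come from the thread; the fold count of 10 is an assumption, since the thread does not say how many folds mrIML uses, and the count ignores any final refit per response.

```r
# Rough number of tuning fits: one model per response, per grid point, per fold.
count_fits <- function(n_responses, grid_size, n_folds) {
  n_responses * grid_size * n_folds
}

n_responses <- 618  # columns of Y reported in the thread
n_folds     <- 10   # ASSUMPTION: fold count is not stated in the thread

fits_grid5 <- count_fits(n_responses, grid_size = 5, n_folds = n_folds)
fits_grid2 <- count_fits(n_responses, grid_size = 2, n_folds = n_folds)

print(fits_grid5)               # 30900
print(fits_grid2)               # 12360
print(fits_grid5 / fits_grid2)  # 2.5
```

Under these assumptions, dropping `tune_grid_size` from 5 to 2 cuts the number of fits by a factor of 2.5, so run time should fall roughly in proportion.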




nfj1380 commented 2 years ago

I just tried with xgb and it ran much faster than the rf model for some reason. It might also be worth a try.

Try this:

    model_xgb <- boost_tree(
      trees = 100,
      tree_depth = tune(), min_n = tune(), loss_reduction = tune(),  ## first three: model complexity
      sample_size = tune(), mtry = tune(),                           ## randomness
      learn_rate = tune()                                            ## step size
    ) %>%
      set_engine("xgboost") %>%
      set_mode("regression")



dosshra commented 2 years ago

Thank you for the suggestion; xgb does run faster. However, trying to continue exploring the data using this code:

    yhats_rf <- mrIMLpredicts(X=X, Y=Y, Model=model_rf, balance_data='no', mode='regression',
                              tune_grid_size=5, seed = sample.int(1e8, 1))
    ModelPerf <- mrIMLperformance(yhats_rf, Model=model_rf, Y, mode='regression')
    VI <- mrVip(yhats=yhats_rf, X=X)

resulted in an error.

dosshra commented 2 years ago

Hi,

I was able to run the RF regression for the whole dataset of >60k SNPs using mtry=4, trees = 10. It took 6 hours with 20 CPUs and more than 130 GB of RAM. However, the flashlight analysis:

    flashlightObj <- mrFlashlight(yhats_rf, X, Y, response = "multi", mode='regression')
    profileData_pd <- light_profile(flashlightObj, v = "november")  # partial dependencies
    mrProfileplot(profileData_pd, sdthresh = 0.1)

did not finish after 24 hours.

Thank you

nfj1380 commented 1 year ago

I think that would be close to a record for the number of SNPs in a mrIML analysis. I'm glad it worked, even if it was intensive.

For the VI problem, can you provide more information? Also, I have updated that code recently, so try reinstalling from GitHub.

Cheers,

Nick

