Open shlid007 opened 1 year ago
Hi, Could you provide more information? What do you mean by saying "The optimal training dataset doesn't resolve into a data frame"? What's the error message?
Hi - if the dataset is too large, it doesn't render from simple functions like `ncol()`; instead they return NULL.
I was unable to determine the exact dataset-size threshold at which this happens. As a result, I was advised to replicate the study using the original data. I have the MIMIC-III dataset, so I am in the process of replicating the study to make sure my copy of the code is sound. If you have any other suggestions, feel free to share. Thank you.
I also ran into issues if I didn't use complete cases: trying to impute resulted in non-numeric errors. Using complete cases or removing columns with missing values solved the issue, which is strange because the AutoScore documentation says missing values should be OK.
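For reference, the complete-cases workaround described above can be done like this (a minimal sketch; `my_data` is a placeholder, not a variable from the actual dataset):

```r
# Toy data frame standing in for the real dataset
my_data <- data.frame(a = c(1, 2, 3, 4),
                      b = c(NA, 2, 3, 4))

# Option 1: keep only rows with no missing values (complete cases)
complete_data <- my_data[complete.cases(my_data), ]

# Option 2: drop columns that contain any missing values
no_na_cols <- my_data[, colSums(is.na(my_data)) == 0, drop = FALSE]
```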
Hi - I posted some more on GitHub: https://github.com/nliulab/AutoScore/issues/5#issuecomment-1581168121
Thank you!
-Sue
I have looked into your dataset and run it on my end.
There are two modifications you need to make:

1. Your current training and validation datasets do not share the same variables. You can run the code below to make them consistent:
```r
# Load the saved objects into the environment
load("ForHan060723.Rdata")

# Keep only the variables common to both datasets
feature_intersect <- Reduce(intersect, list(colnames(train_setsample4),
                                            colnames(validation_setsample4)))

new_data_train      <- train_setsample4[, feature_intersect]
new_data_test       <- validation_setsample4[, feature_intersect]
new_data_validation <- validation_setsample4[, feature_intersect]

# Drop the first and third columns from each set
new_data_train      <- new_data_train[, -c(1, 3)]
new_data_test       <- new_data_test[, -c(1, 3)]
new_data_validation <- new_data_validation[, -c(1, 3)]
```
2. Your current feature dimension is too high (more than 400 variables), so generalized linear regression and shallow random forests cannot handle it and occasionally report bugs. Consider doing some feature selection first, such as a chi-square test for categorical variables and a t-test for continuous variables.
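The univariate screening suggested above can be sketched as follows (the data frame and column names here are illustrative, not from the actual dataset):

```r
# Univariate feature screening against a binary label:
# t-test for continuous variables, chi-square test for categorical ones.
set.seed(1)
df <- data.frame(
  label = factor(rep(c(0, 1), each = 50)),
  cont1 = c(rnorm(50, mean = 0), rnorm(50, mean = 2)),  # informative feature
  cont2 = rnorm(100),                                   # pure noise
  cat1  = factor(sample(c("a", "b"), 100, replace = TRUE))
)

p_cont1 <- t.test(cont1 ~ label, data = df)$p.value
p_cont2 <- t.test(cont2 ~ label, data = df)$p.value
p_cat1  <- chisq.test(table(df$cat1, df$label))$p.value

# Keep features whose p-value falls below a chosen threshold, e.g. 0.05
keep <- c("cont1", "cont2", "cat1")[c(p_cont1, p_cont2, p_cat1) < 0.05]
```

In practice you would loop this over all 400+ columns and pass only the surviving features to AutoScore.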
Another suggestion, after the initial feature selection, is data normalization. I noticed that some of your variables range from 0 to 1 while others are larger than 1,000. Normalization will help AutoScore-dependent packages like random forest converge better, and you can transform the data back after you obtain the scoring table.
The optimal training dataset doesn't resolve into a data frame if it is too large. My dataset had 20 variables and 400k rows; I had to reduce it to 200k rows.