nliulab / AutoScore

AutoScore: An Interpretable Machine Learning-Based Automatic Clinical Score Generator

Optimal Training Dataset doesn't render if too large #5

Open shlid007 opened 1 year ago

shlid007 commented 1 year ago

The optimal training dataset doesn't resolve into a data frame if too large. My dataset had 20 variables and 400k rows. I had to reduce to 200k rows.

fengx13 commented 1 year ago

Hi, Could you provide more information? What do you mean by saying "The optimal training dataset doesn't resolve into a data frame"? What's the error message?

shlid007 commented 1 year ago

Hi - if the dataset is too large, it no longer behaves as a data frame: simple functions like `ncol()` return NULL instead of the expected value.
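The symptom can be reproduced in plain R (a hypothetical illustration, not AutoScore-specific): dimension helpers like `ncol()` quietly return NULL when the object is not actually a data frame or matrix.

```r
# If an object silently stops being a data frame (e.g. after a failed
# processing step), ncol() returns NULL rather than raising an error.
x <- 1:10                          # a plain vector, not a data frame
ncol(x)                            # NULL

df <- data.frame(a = 1:3, b = 4:6)
ncol(df)                           # 2
```

Checking `is.data.frame()` on the training set right before the failing step would confirm whether this is what is happening.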

shlid007 commented 1 year ago

I was unable to determine the exact tipping point for dataset size. As a result, I was advised to replicate the study using the original data. I have the MIMIC-III dataset, so I am in the process of replicating the study to make sure my copy of the code is sound. If you have any other suggestions, feel free to share. Thank you.

shlid007 commented 1 year ago

I also ran into issues when I didn't use complete cases: trying to impute resulted in non-numeric errors. Using complete cases, or removing columns with missing values, solved the issue. This is strange, because the AutoScore documentation says missing values should be OK.
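A minimal sketch of the complete-cases workaround described above (the column names and data here are hypothetical):

```r
df <- data.frame(age = c(34, 50, 61), sbp = c(120, 135, NA))

# keep only rows with no missing values before passing data to AutoScore
df_complete <- df[complete.cases(df), ]
nrow(df_complete)                  # 2

# or instead drop any column that contains an NA
df_no_na_cols <- df[, colSums(is.na(df)) == 0, drop = FALSE]
ncol(df_no_na_cols)                # 1
```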

shlid007 commented 1 year ago

Hi - I posted some more details on GitHub: https://github.com/nliulab/AutoScore/issues/5#issuecomment-1581168121

Thank you!

-Sue



Han-Yuan-Med commented 1 year ago

I have looked into your dataset and run it on my end.

There are two modifications you need to make:

1. Your current training and validation datasets do not share the same variables. You can run the code below to make them consistent:

```r
# load() restores the saved objects (the train/validation sets) into the workspace
original_dataset = load("ForHan060723.Rdata")

# keep only the variables present in both sets
feature_intersect = Reduce(intersect, list(colnames(train_setsample4), colnames(validation_setsample4)))

new_data_train <- train_setsample4[, feature_intersect]
new_data_test <- validation_setsample4[, feature_intersect]
new_data_validation <- validation_setsample4[, feature_intersect]

# delete irrelevant features according to my review
new_data_train <- new_data_train[, -c(1, 3)]
new_data_test <- new_data_test[, -c(1, 3)]
new_data_validation <- new_data_validation[, -c(1, 3)]
```

2. Your current feature dimension is too high (more than 400), so generalized linear regression and shallow random forests cannot handle it and occasionally report errors. Consider doing some feature selection first, such as a chi-square test for categorical variables and a t-test for continuous variables.
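A univariate filter along those lines might look like this (a sketch; `label`, the example data, and the 0.05 cutoff are all illustrative assumptions, not part of AutoScore):

```r
set.seed(1)
df <- data.frame(
  label    = rep(c(0, 1), each = 50),
  num_feat = c(rnorm(50, mean = 0), rnorm(50, mean = 3)),      # informative
  cat_feat = factor(sample(c("a", "b"), 100, replace = TRUE))  # noise
)

# t-test for continuous features, chi-square test for categorical ones
p_vals <- sapply(setdiff(names(df), "label"), function(v) {
  if (is.numeric(df[[v]])) t.test(df[[v]] ~ df$label)$p.value
  else chisq.test(table(df[[v]], df$label))$p.value
})

keep <- names(p_vals)[p_vals < 0.05]   # retain features that pass the filter
```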

Han-Yuan-Med commented 1 year ago

Another suggestion after the features' initial selection is data normalization. I noticed that some of your variables range from 0 to 1 while some features are larger than 1,000. Data normalization will help AutoScore-dependent packages like random forest converge better and you can transform the data back after you obtain the scoring table.
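For example, a min-max normalization that can be inverted once the scoring table is obtained (the data here is hypothetical, and AutoScore does not mandate this exact scheme):

```r
df <- data.frame(prob = c(0.1, 0.5, 0.9), lab_value = c(100, 500, 1000))

mins <- sapply(df, min)
maxs <- sapply(df, max)

# scale every column to [0, 1] before fitting
df_scaled <- as.data.frame(Map(function(x, lo, hi) (x - lo) / (hi - lo),
                               df, mins, maxs))

# invert the transform after the scoring table has been derived
df_back <- as.data.frame(Map(function(x, lo, hi) x * (hi - lo) + lo,
                             df_scaled, mins, maxs))
all.equal(df_back, df)             # TRUE
```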