juliasilge / juliasilge.com

My blog, built with blogdown and Hugo :link:
https://juliasilge.com/

Handle class imbalance in #TidyTuesday climbing expedition data with tidymodels | Julia Silge #68

Open utterances-bot opened 2 years ago

utterances-bot commented 2 years ago

Handle class imbalance in #TidyTuesday climbing expedition data with tidymodels | Julia Silge

A data science blog

https://juliasilge.com/blog/himalayan-climbing/

conlelevn commented 2 years ago

Hi Julia,

I'm a little bit confused about how to understand this result:

    glm_rs %>% conf_mat_resampled()

    # A tibble: 4 x 3
      Prediction Truth     Freq
    1 died       died      55.5
    2 died       survived 2157.
    3 survived   died      26.5
    4 survived   survived 3499.

Does the Freq column represent relative or absolute values? How can we interpret this table?

juliasilge commented 2 years ago

You can read more about confusion matrices to learn about this; Freq is a count of observations. You have non-integer values because this is a resampled confusion matrix, where the cell counts are averaged across resamples.
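A minimal illustration of where the fractional counts come from, assuming `glm_rs` holds the resampling results from the post:

    # conf_mat_resampled() averages each confusion-matrix cell across the
    # resamples, so the averaged counts need not be integers:
    mean(c(54, 57, 55, 56))
    # [1] 55.5

    # tidy = FALSE returns a conf_mat object instead of a tibble
    glm_rs %>% conf_mat_resampled(tidy = FALSE)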

Steviey commented 2 years ago

Using a recipe with themis::step_smote() produces this error during cross-validation:

    x Fold6: preprocessor 1/1:
      Error in smote_impl():
      ! Not enough observations of '4' to perform SMOTE.

How can I avoid it? The data is already stratified with:

    theSplit <- df %>% initial_split(prop = 0.8, strata = value)
    myFolds  <- vfold_cv(df_train, strata = value)

juliasilge commented 2 years ago

@Steviey Do you get this error with the example data here from the blog post, or with your own data? It sounds like the dataset is perhaps too small for SMOTE?

Steviey commented 2 years ago

Hi Julia, nice to hear from you. I'm new to tidymodels multiclass predictions; I get this error with my own data. I thought the whole point of SMOTE was to handle underrepresented minority classes. If I filter out the underrepresented minority class (4), SMOTE works. Is it good practice to do so?

    # inspect the class balance of the training set
    balanceInfo <- df_train %>% count(value)
    print('balanceInfo:')
    print(balanceInfo)
    stop()  # halt here to look at the counts

Unfiltered, SMOTE fails:

      value   n
    1     1 318
    2     2 113
    3     3  45
    4     4   4

Filtered, SMOTE works:

      value   n
    1     1 323
    2     2 111
    3     3  38

juliasilge commented 2 years ago

@Steviey It looks like there are only 4 examples of that class? Trying to use SMOTE with that kind of data doesn't sound like a good idea to me; that's probably why there are protections against it. You will need to think through how realistic it is to build a multiclass model where one class has only 4 observations.
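For context, a sketch of why the error appears, assuming `df_train` and the factor outcome `value` from Steviey's code: SMOTE synthesizes new points by interpolating between a minority-class point and its nearest same-class neighbors, and themis::step_smote() uses neighbors = 5 by default, so each class needs more observations than that.

    library(recipes)
    library(themis)

    # with the default neighbors = 5, a class with only 4 rows cannot supply
    # enough same-class neighbors; lowering neighbors avoids the error
    # mechanically...
    rec <- recipe(value ~ ., data = df_train) %>%
      step_smote(value, neighbors = 3)

    # ...but, per the advice above, interpolating among 4 real points is
    # unlikely to produce a trustworthy synthetic class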

Steviey commented 2 years ago

Ah, OK, thank you. That was my idea too. I'm restructuring the data to get a slight imbalance with enough observations in each class.

Steviey commented 2 years ago

I'm always confused about how to predict in the future (in production). Should we use the test set as 'leave one out', or should we rather produce a future frame, like with timetk::tk_make_future_timeseries()?

juliasilge commented 2 years ago

@Steviey Most predictive models are not time series models, even though you take the trained model and predict on new data after the model was fitted. You might find the 3rd paragraph there especially useful for understanding.

If you are not dealing with a forecasting model, then you will use predict() for production with your trained (non-time-series) model. The features might not involve any time components at all.
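A minimal sketch of that production pattern, assuming the workflow and data names from the post (`members_wf`, `members`) plus a hypothetical `new_members` tibble of incoming observations:

    library(tidymodels)

    # once modeling decisions are final, fit the workflow on all available data
    final_fitted <- fit(members_wf, data = members)

    # in production, score new observations as they arrive; new_data just needs
    # the same predictor columns the model was trained on
    predict(final_fitted, new_data = new_members)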

Steviey commented 2 years ago

Thank you Julia. As I understand it, for non-time-series models in production, I fit a model with all the data I have and then predict() with the fitted model. But what should we pass to the newdata or new_data parameter of the predict() function? Does this mean I need at least one out-of-sample dataset to make a production prediction for classification and regression? And is the outcome of that prediction then a projection into the future or into the present?

juliasilge commented 2 years ago

You generally fit a model to use in a predictive way when you will get new data in the future; you use the existing pool of historical data for building and evaluating your model (training and testing), and then, after you are done, you predict on new examples/observations. You might find these resources helpful:

Steviey commented 2 years ago

So if the partitions training + testing = all the data I have (historical + present), what would I feed to the newdata parameter of stats::predict() when doing classification or regression?

juliasilge commented 2 years ago

You would use the new data you get moving forward; people typically categorize these kinds of predictions as batch or real-time/online. For general discussion of ML like this, you may have a better experience posting on RStudio Community, which is a great forum for getting perspective on these kinds of modeling questions.

Steviey commented 2 years ago

I do my best... https://community.rstudio.com/t/stats-predict/145815

DanielYooCDC commented 1 year ago

Hi Julia, thanks for the post. This is really helpful. I am also a little confused interpreting the confusion matrix. Shouldn't the confusion matrix below be balanced, since you put an upsampling procedure in your recipe? I was expecting the sum of the row 1 and row 3 frequencies to be similar to the sum of the row 2 and row 4 frequencies.

    glm_rs %>% conf_mat_resampled()

    # A tibble: 4 x 3
      Prediction Truth     Freq
    1 died       died      55.5
    2 died       survived 2157.
    3 survived   died      26.5
    4 survived   survived 3499.

Thank you!

juliasilge commented 1 year ago

@DanielYooCDC When you subsample (upsample or downsample), it's very important to only do that for the training data, not the testing data. This also applies within resampling, for what we call the analysis and assessment sets: only subsample the analysis set. The tidymodels functions take care of this for you, and you can read more about this here and here.
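A short sketch of what that looks like in practice, assuming the workflow and resample names from the post (`members_wf`, `members_folds`):

    library(tidymodels)

    # the recipe (including step_smote) is prepped on each analysis set only;
    # assessment sets keep their original class balance, which is why the
    # resampled confusion matrix above is still imbalanced
    glm_rs <- fit_resamples(
      members_wf,
      resamples = members_folds,
      control = control_resamples(save_pred = TRUE)
    )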

DanielYooCDC commented 1 year ago

Hi Julia, thank you for the response! That makes a whole lot of sense. I have an additional question that has been hovering around my head. I understand that step_smote only takes numeric variables, which is why you converted the categorical variables to dummy variables. After upsampling, you'd get values between 0 and 1 for the dummy variables (for example, an upsampled observation of the season_autumn variable might be 0.67). In reality, a dummy variable should be either 0 or 1. How should we justify a model that has been trained on upsampled data whose values are far from reality? I noticed there are other upsampling methods like step_smotenc, which takes both categorical and continuous variables as input. When I tested step_smotenc without creating dummy variables against creating dummy variables before running step_smote, the results were comparable. Thank you so much for your time!

juliasilge commented 1 year ago

@DanielYooCDC The new synthetic observations being created via the SMOTE algorithm aren't real anyway, so it's not a problem that they can end up with a value that is not 0 or 1. I would point out that 0.67 is not "far from reality" at all, but nicely between 0 and 1. I would expect (or at least hope!) that the various implementations of upsampling with SMOTE give you about the same results.
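A hedged sketch of the two approaches being compared (simplified recipe; the post's real recipe has more predictors and steps, and step_smotenc assumes a recent version of themis):

    library(recipes)
    library(themis)

    # option 1: dummy-encode first, then SMOTE on an all-numeric design matrix;
    # synthetic rows can end up with fractional dummy values like 0.67
    rec_smote <- recipe(died ~ ., data = members_train) %>%
      step_dummy(all_nominal_predictors()) %>%
      step_smote(died)

    # option 2: SMOTE-NC handles nominal and continuous predictors together,
    # so synthetic rows keep valid factor levels
    rec_smotenc <- recipe(died ~ ., data = members_train) %>%
      step_smotenc(died)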

bnagelson commented 8 months ago

Hi Julia, thanks again for this helpful material. I understand that subsampling should not be applied to the testing set, but I am confused about how we can use the same workflow that we applied to our training set (members_wf) in combination with last_fit() without applying the step_smote() contained within the workflow. Is there something inherent within last_fit() that prevents this from happening?

juliasilge commented 8 months ago

@bnagelson Yep, you can read more about this here, but what controls that behavior is the skip argument of each recipe step.
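A minimal sketch of the skip mechanic (simplified recipe; skip = TRUE is already the default for the themis subsampling steps, written out here only for visibility):

    library(recipes)
    library(themis)

    # skip = TRUE means the step runs when the recipe is prepped on training
    # (analysis) data, but is skipped when the recipe is applied to new data,
    # including the test set inside last_fit()
    members_rec <- recipe(died ~ ., data = members_train) %>%
      step_dummy(all_nominal_predictors()) %>%
      step_smote(died, skip = TRUE)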

bnagelson commented 8 months ago

Excellent, thank you very much!


NizePetcharat commented 4 months ago

Hi Julia, thank you so much for your #TidyTuesday contributions. They are incredibly useful for learning from real datasets. I have a question regarding your latest visualization: I noticed the variables with the largest estimates in your model. Could you clarify whether these variables are predicting the status "died" or "survived"? Additionally, how can we set the binary outcome to specify which status we want to predict in the model? Thank you!

juliasilge commented 4 months ago

@NizePetcharat In this case, the model coefficients are for predicting "survived" compared to "died". You can specify that by setting your factor levels by hand and/or by using the event_level argument in yardstick metrics, as shown here for sensitivity.
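A small sketch of both options, assuming a hypothetical `preds` tibble of predictions containing the truth column `died` and the predicted classes `.pred_class`:

    library(dplyr)
    library(yardstick)

    # option 1: reorder the factor levels by hand; yardstick treats the
    # first level as the event by default
    members <- members %>%
      mutate(died = factor(died, levels = c("survived", "died")))

    # option 2: keep the levels as-is and tell the metric which level
    # counts as the event
    preds %>%
      sens(truth = died, estimate = .pred_class, event_level = "second")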

NizePetcharat commented 3 months ago

Hi Julia, thanks again for the helpful material. I am curious about one thing: when you checked after baking members_rec, the numbers of died and survived were equal (56K and 56K). However, the resampling results in each fold show a total of 51.6K/5.4K (57K), which matches the original outcome totals, not the upsampled ones. If I am wrong, please correct me. My question is: even though we confirmed in the workflow that step_smote was applied, how can we ensure that all the upsampled data is included in the training process? Thank you for your time and assistance.

juliasilge commented 3 months ago

@NizePetcharat You can read more about how subsampling is handled in these links:

When you upsample data, the synthetic observations are included during training but never when evaluating, testing, or estimating performance; you don't want to evaluate performance on upsampled data, but on data with the original class proportions.