juliasilge / juliasilge.com

My blog, built with blogdown and Hugo
https://juliasilge.com/

Class imbalance and classification metrics with aircraft wildlife strikes | Julia Silge #34

utterances-bot commented 3 years ago

Class imbalance and classification metrics with aircraft wildlife strikes | Julia Silge

Handling class imbalance in modeling affects classification metrics in different ways. Learn how to use tidymodels to subsample for class imbalance, and how to estimate model performance using resampling.

https://juliasilge.com/blog/sliced-aircraft/

gunnergalactico commented 3 years ago

Hi Dr. Silge, thanks for the analysis. I do have a question about the bag tree engine argument "times". How did you settle on 25 as the number of times to run the bagged tree model? Is there more documentation you can link to that would help me understand this better? In some of your other analyses you've used different numbers.

Can you please explain that a little further? Is the times argument also used with other tree models? Thanks.

juliasilge commented 3 years ago

@gunnergalactico Using times = 25 is probably a bit low for really good performance with a bagged tree model. You can read this section of the excellent Hands-On Machine Learning with R (HOML) for more background on it.
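
For context, here is a minimal sketch of a bagged tree specification along the lines of the one in the post (min_n = 10 and the rpart engine follow the post's setup). The key point is that times is an engine argument, so it goes in set_engine() rather than bag_tree():

```r
library(tidymodels)
library(baguette)  # provides bag_tree()

# `times` sets how many bootstrap samples (and thus how many trees)
# the bagged ensemble averages over; larger values generally give more
# stable performance at a linear cost in compute time.
bag_spec <-
  bag_tree(min_n = 10) %>%
  set_engine("rpart", times = 25) %>%
  set_mode("classification")
```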

daver787 commented 3 years ago

How did the Mac mini perform? I am thinking of getting one but was hesitant because I thought the new Mac chips were not compatible with a lot of data science tools.

juliasilge commented 3 years ago

I am having a really nice time with my Mac mini @daver787, and things are FAST. I have even gotten TensorFlow working. My main pain points right now: a few reticulate packages pass data back and forth between R (running natively on ARM) and Python (running under Rosetta emulation), which can be painfully slow when you have a lot of resampling folds, and I can't get catboost installed natively. If I am working all in R, I am quite happy. My take is that native support in R is better than in Python as of right now.

Ji-square commented 3 years ago

done

harris-yh-wong commented 3 years ago

Hello, may I ask whether step_zv should be the last preprocessing step? Should it go after step_dummy, or after step_smote, or is the current order okay? I ask because when I try another model, like logistic regression, warnings about rank deficiency are thrown.

juliasilge commented 3 years ago

@harris-yh-wong We outline some advice on ordering of recipe steps here that may be helpful, but it doesn't talk about subsampling to address class imbalance. In general, a subsampling step should be last in your feature engineering; I think I'd do it after step_zv() (which should also be pretty late).
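
A minimal sketch of that ordering, assuming the bird_df data and damaged outcome from the post: handle factor levels first, dummy encode, filter zero-variance columns late, and subsample last so SMOTE sees the final feature set.

```r
library(tidymodels)
library(themis)  # provides step_smote()

bird_rec <-
  recipe(damaged ~ ., data = bird_df) %>%
  step_novel(all_nominal_predictors()) %>%    # guard against unseen levels
  step_unknown(all_nominal_predictors()) %>%  # recode NA factor values
  step_dummy(all_nominal_predictors()) %>%    # encode before filtering
  step_zv(all_predictors()) %>%               # drop zero-variance columns late
  step_smote(damaged)                         # subsampling goes last
```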

conlelevn commented 2 years ago

@juliasilge Hi Julia, in the preprocessing step you used a few steps to handle problematic values in the factor variables of the training set. As far as I understand, step_novel assigns factor levels that appear in the testing set but not the training set to a new level, and step_unknown assigns missing values to an "unknown" class (also a new level). Are these two steps similar to each other, and can we use just one of them to preprocess the data?

juliasilge commented 2 years ago

You can read more about these two steps: step_novel() handles new levels (levels that appear at prediction time or in the test data, not in the training data), while step_unknown() handles missing values.
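
To make the distinction concrete, here is a tiny self-contained sketch (the operator column and its values are hypothetical, not from the post):

```r
library(recipes)

# a toy factor column with an NA value
df <- data.frame(
  damaged  = factor(c("damage", "no damage", "no damage")),
  operator = factor(c("AA", "BB", NA))
)

rec <-
  recipe(damaged ~ operator, data = df) %>%
  step_novel(operator) %>%   # adds a "new" level for values unseen in training
  step_unknown(operator)     # recodes NA to an "unknown" level

prep(rec) %>% bake(new_data = NULL)
```

So the two steps are complementary rather than interchangeable: one protects against levels you have never seen, the other against levels that are missing.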

jrosell commented 5 months ago

@juliasilge I guess that instead of: bird_folds <- vfold_cv(train_raw, v = 5, strata = damaged)

it should be: bird_folds <- vfold_cv(bird_df, v = 5, strata = damaged)

juliasilge commented 5 months ago

@jrosell Ah yep, looks like I intended to not carry some of those other variables around through the rest of the modeling. 👍