juliasilge / juliasilge.com

My blog, built with blogdown and Hugo :link:
https://juliasilge.com/

To downsample imbalanced data or not, with #TidyTuesday bird feeders | Julia Silge #82

utterances-bot opened 1 year ago

utterances-bot commented 1 year ago

To downsample imbalanced data or not, with #TidyTuesday bird feeders | Julia Silge

A data science blog

https://juliasilge.com/blog/project-feederwatch/

gunnergalactico commented 1 year ago

Hi Julia,

Thanks for the video! I have a couple of questions:

  1. workflow_map() has a seed parameter; does setting the seed outside of workflow_map() override the internal seed parameter?

  2. I noticed you combined classification metrics and probability metrics in your metric set (accuracy, mn_log_loss, sensitivity, specificity). I have had errors in the past when combining both in a metric set. Is this now possible with tidymodels?

  3. Are there plans to have a metric output similar to classification_report in scikit-learn?

Thanks!

juliasilge commented 1 year ago

Thanks for the great questions @gunnergalactico!

workflow_map() has a seed parameter; does setting the seed outside of workflow_map() override the internal seed parameter?

The way this would work is that you could pass a specific seed in as an argument, or if you set the seed outside the call to workflow_map(), then that seed will be used to pull from the RNG stream. In the second option, it's like you're doing this:

set.seed(123)
sample.int(10^4, 1)
#> [1] 2463

Created on 2023-01-19 with reprex v2.0.2

If you do that over and over, you'll see that you get the same thing every time.
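
To make that concrete, here is a minimal sketch of the two options (the workflow set wf_set and resamples folds are hypothetical placeholders):

# Option 1: pass a seed directly to workflow_map()
res1 <- workflow_map(wf_set, "tune_grid", resamples = folds, seed = 123)

# Option 2: set the seed outside; workflow_map() then draws its seed from the RNG stream
set.seed(123)
res2 <- workflow_map(wf_set, "tune_grid", resamples = folds)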

I have had errors in the past when combining both in a metric set. Is this now possible with tidymodels?

It was always possible, but you will get errors if you use a model that only produces class predictions and does not produce class probability predictions. An example of a model that can only produce class predictions is the LiblineaR SVM model.
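
As a minimal sketch, a mixed metric set looks like this (the object name is arbitrary):

library(yardstick)

feeder_metrics <- metric_set(accuracy, mn_log_loss, sensitivity, specificity)

You can then pass that metric set to the metrics argument of tune_grid() or fit_resamples().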

Are there plans to have a metric output similar to classification_report in scikit-learn?

There's an issue open in yardstick about making a similar report (tidymodels/yardstick#308), so I encourage you to chime in there with your use case and what you're looking for.

NatarajanLalgudi commented 1 year ago

How much of a difference will it make to model performance to replace the missing values using a simple mean-based imputation, as opposed to a KNN-based imputer, given that 37 of the 62 variables are 1/0 and another 10-12 are ordinal with between 3 and 9 levels?

juliasilge commented 1 year ago

@NatarajanLalgudi That is not something I would dare to guess at, since it depends so much on the relationships in your data. The way to find out is to try both approaches and evaluate the results using careful resampling. In tidymodels, you could do this with a workflow set (use whatever model you intend to implement, plus one recipe that uses step_impute_mean() and a second recipe that uses step_impute_knn()).
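
A minimal sketch of that comparison, where the data, outcome, model, and resamples are placeholders:

library(tidymodels)

rec_mean <- recipe(outcome ~ ., data = train_data) %>%
  step_impute_mean(all_numeric_predictors())

rec_knn <- recipe(outcome ~ ., data = train_data) %>%
  step_impute_knn(all_predictors())

impute_set <- workflow_set(
  preproc = list(mean = rec_mean, knn = rec_knn),
  models = list(lm = linear_reg())
)

impute_res <- workflow_map(impute_set, "fit_resamples", resamples = folds)
collect_metrics(impute_res)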

sweiner123 commented 1 year ago

Hi Julia, Thank you for this great post! I have been wondering about the independence of the observations in this data set. Since some bird feeders might be in the data multiple times, would that create an issue for this analysis? If so, should we aggregate the squirrel sightings to the average sightings at a site? I'm assuming that would change this analysis to a linear regression problem if that is the case. Please correct me if I am wrong. And if independence is not an issue, would you be able to give some insight as to why? Again, I want to emphasize how much I appreciate your posts!

juliasilge commented 1 year ago

That's a great point @sweiner123! I am already treating this as a linear regression problem, so I don't think that would change, but dealing with some of the observations being from the same bird baths could be a great option. You could either aggregate the data for each bird bath before starting modeling, or you could use a resampling strategy that makes sure the same sites stay together in a resample (to reduce overly optimistic performance estimates). I would lean toward the second, and you can read more about this kind of resampling here.
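
A minimal sketch of that kind of grouped resampling with rsample (assuming a site identifier column such as loc_id):

library(rsample)

set.seed(123)
grouped_folds <- group_vfold_cv(train_data, group = loc_id, v = 10)

Each site then appears in only one assessment set, so repeated observations from the same site can't leak between the analysis and assessment sets.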

sweiner123 commented 1 year ago

That's some really cool resampling! Thanks for pointing that out!

bmreiniger commented 1 year ago

Nice!

The tradeoff between sensitivity and specificity could maybe be more naturally explored by varying the decision threshold. I'd be curious how the log-loss looks for the resampled cases after applying the adjustment from King and Zeng, e.g. https://stats.stackexchange.com/a/611132/232706
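
For anyone wanting to explore that in tidymodels, the probably package can sweep decision thresholds. A minimal sketch, where the results object feeder_res and the column names are assumptions:

library(tidymodels)
library(probably)

collect_predictions(feeder_res) %>%
  threshold_perf(squirrels, .pred_squirrels, thresholds = seq(0.05, 0.95, by = 0.05))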

bnagelson commented 11 months ago

Hi Julia, and thanks for all these great posts. I have a question about setting up folds for tuning hyperparameters using cross-validation. Here, you create your folds using the training data set, and each fold in feeder_folds has the same total number of data points (across the analysis and assessment set combined) as the training data set. This seems logical when using feeder_folds for resampling the "basic" workflow because no downsampling was used in the recipe. However, the "downsampling" workflow has fewer data points because some data points belonging to the majority class were removed. I am confused why you can use feeder_folds to resample the downsampled workflow when there is seemingly a mismatch in the number of data points between the folds and the recipe. Thanks!

juliasilge commented 11 months ago

@bnagelson I think you are understanding correctly that when you use downsampling, fewer observations are used for training than when you don't. If you use something like feeder_folds, the downsampling is applied to the analysis set of each resample, and then fewer observations make it into the model itself. You can read more about how subsampling is applied in these two spots:
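
The key detail is that subsampling steps like themis::step_downsample() have skip = TRUE by default, so the downsampling happens only when the model is fit on each analysis set and never when predicting on an assessment set; performance is still estimated on the full, imbalanced assessment data. A minimal sketch (the data and outcome names are placeholders):

library(tidymodels)
library(themis)

rec_downsample <- recipe(squirrels ~ ., data = feeder_train) %>%
  step_downsample(squirrels)   # skip = TRUE by default: applied at fit time, not at predict time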

jlecornu3 commented 6 months ago

Hi Julia,

Does tidymodels offer us a way to tweak the cost function for misclassification across the minority and majority classes?

I'm working on a series of classification models that survey a range of model engines, as you do in the textbook chapter, and I'm looking for an alternative to explore rather than upsampling or downsampling.

Thanks, Joshua

juliasilge commented 6 months ago

@jlecornu3 Yep! You can check out the classification_cost() function from yardstick for that.
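
For reference, classification_cost() takes a cost matrix so that different kinds of misclassification can be penalized unequally. A minimal sketch, where the results object, column names, class levels, and costs are assumptions:

library(tidymodels)

costs <- tribble(
  ~truth,          ~estimate,       ~cost,
  "squirrels",     "no squirrels",  5,
  "no squirrels",  "squirrels",     1
)

collect_predictions(feeder_res) %>%
  classification_cost(squirrels, .pred_squirrels, costs = costs)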

gunnergalactico commented 6 months ago


Hello Julia, to piggyback on this question, is there a way to apply class weights in tidymodels? The documentation on the website says it’s experimental but doesn’t have an example of how it would be done to counter class imbalance.

For example, in sklearn I would do something like this:

from sklearn.utils.class_weight import compute_class_weight
import numpy as np

target = train.target_column
class_weights = compute_class_weight(class_weight="balanced", classes=np.unique(target), y=target)

class_weight = dict(zip(np.unique(target), class_weights))

I can then pass that into the model.

Thanks a bunch!

juliasilge commented 6 months ago

@gunnergalactico I can't quite tell from a brief look at the scikit-learn documentation whether it is more like case weights or more like subsampling for class imbalance. Take a look at these two articles to see which one is what you're looking for:
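
If it is case weights you're after, a minimal sketch in tidymodels looks roughly like this (the data, outcome, class levels, and weight values are placeholders):

library(tidymodels)

train_wts <- feeder_train %>%
  mutate(
    case_wts = ifelse(squirrels == "no squirrels", 5, 1),   # placeholder: larger weight for the rarer class
    case_wts = hardhat::importance_weights(case_wts)
  )

wf_wts <- workflow() %>%
  add_case_weights(case_wts) %>%
  add_formula(squirrels ~ .) %>%
  add_model(logistic_reg())

fit(wf_wts, data = train_wts)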