juliasilge / juliasilge.com

My blog, built with blogdown and Hugo :link:
https://juliasilge.com/

Get started with tidymodels and #TidyTuesday Palmer penguins | Julia Silge #28

Open utterances-bot opened 3 years ago

utterances-bot commented 3 years ago

Get started with tidymodels and #TidyTuesday Palmer penguins | Julia Silge

Build two kinds of classification models and evaluate them using resampling.

https://juliasilge.com/blog/palmer-penguins/

ghost commented 3 years ago

Hi Julia! Thank you so much for the wonderful help you provide people like me in this and your other tidymodels videos/blog posts! I have no idea what I would do if I hadn't found this. I LOVE what I'm learning about tidymodels!

I have one question about the logistic regression results. When it says that the ORs for Gentoo and Chinstrap are nearly zero, doesn't that mean that the odds of being male are about 99% lower if a penguin is Gentoo or Chinstrap than if the penguin is Adelie? I got the same results as you, and I'm just really puzzled by that, since the male/female ratios in each species are very similar (all around 50-50). There should be an odds ratio of about 1 when comparing any species to another, correct?

Another thing that struck me as odd is that the std.error values are too big for the OR estimates of species to be significant, but the p-values are still significant. This is not the case for the other coefficients.

juliasilge commented 3 years ago

@jcutler79 I think it's to do with the intercept and how many Adelie penguins there are:

library(tidyverse)
library(palmerpenguins)
na.omit(penguins) %>%
  count(sex, species) %>%
  pivot_wider(names_from = sex, values_from = n)
#> # A tibble: 3 x 3
#>   species   female  male
#>   <fct>      <int> <int>
#> 1 Adelie        73    73
#> 2 Chinstrap     34    34
#> 3 Gentoo        58    61

Created on 2021-05-10 by the reprex package (v2.0.0)

It's just that the odds of being a Chinstrap (or Gentoo, less so) are a lot lower overall.

The standard errors, unfortunately, do not get put on the exponentiated scale; they are still on the scale of the model coefficients. 😔 Look at the difference between tidy() and tidy(exponentiate = TRUE).
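For reference, a minimal sketch of that difference (this is an illustrative glm on the same data, not the exact model from the post):

library(broom)
library(palmerpenguins)

fit <- glm(sex ~ species + bill_length_mm + body_mass_g,
           data = na.omit(penguins), family = binomial())

tidy(fit)                       # estimate and std.error on the log-odds scale
tidy(fit, exponentiate = TRUE)  # estimate becomes an odds ratio; std.error stays on the log-odds scale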

RafaelEduardoDiaz commented 2 years ago

Hello Julia, great tutorial! How could you save the final model in .rds format and use it to make new predictions with the predict() function?

juliasilge commented 2 years ago

You can use something like readr::write_rds() or saveRDS(), like I show in this post.
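Roughly, something like this (the object and file names here are placeholders, with final_res standing in for the result of last_fit()):

library(tidymodels)

fitted_wf <- extract_workflow(final_res)           # pull out the fitted workflow
readr::write_rds(fitted_wf, "penguin_model.rds")   # save it to disk

## later, possibly in a new R session:
loaded_wf <- readr::read_rds("penguin_model.rds")
predict(loaded_wf, new_data = new_penguins)        # new_penguins is hypothetical new data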

venkatpgi commented 2 years ago

Hi Julia, I have been regularly reading your blogs and they have been helping me significantly (statistically, of course). While running a logreg model, I have hit a small road block. In the data preprocessing stage, I used the step_log function (base = 10) to normalise the skewed data. Now, at the end of the model fitting, I wish to back-transform the estimate and std.error and also calculate the OR with its 95% CI. This is essential to make the model parameters interpretable. Does this mean I would use tidy(exponentiate = TRUE) once to get the OR from the estimate and then back-transform with mutate(estimate = exp(estimate)) once again to get the back-transformed OR? If that is correct, is there a shortcut using some sort of dplyr verb? Second, what about the standard error? How do I get the 95% CI? I would really appreciate your response. Thanks in advance.

venkatpgi commented 2 years ago

I have attempted it like this:

final_model <- final_model$.workflow[[1]] %>%
  tidy(exponentiate = TRUE) %>%              # this step is to get OR
  mutate(Odds = exp(estimate),               # this step is to back transform
         CI = 1.96 * std.error) %>%
  mutate(uci = exp(estimate + CI),           # deduce the upper and lower 95% CI
         lci = exp(estimate - CI)) %>%
  select(term, estimate, Odds, lci, uci, p.value) %>%
  filter(p.value < 0.05)

Please guide me further.

juliasilge commented 2 years ago

@venkatpgi If you transformed with a log before modeling, I would definitely suggest using tidy(exponentiate = TRUE) to get more interpretable results. I suggest that you check out:

I'm not sure if it will be possible for you to do some of the kinds of transformations you are suggesting.

venkatpgi commented 2 years ago

Thanks for the prompt response, Julia. I have gone through the links that you shared, and they were very useful for understanding the concepts. I would like to simplify my query. Since the numerical variables have undergone a log transformation as part of the pre-processing, and the "logit" link itself works on the log odds as part of the logistic regression, should I exponentiate the estimate twice instead of once (which I would have done in any case to get the OR from the estimate using tidy(exponentiate = TRUE))? I hope I have asked my question more clearly...

juliasilge commented 2 years ago

@venkatpgi Ah, thanks for the clarification. I think the SO question I linked is probably most helpful for that, and I would say that you should not; instead, by using a logged predictor, the results you have give you the OR for an x-fold increase in that predictor (with a base-10 log, the exponentiated coefficient is the OR associated with a 10-fold increase in the original variable).

conlelevn commented 2 years ago

Hi Julia, great screencast as usual!

Just one question: I can see that you don't transform species into dummy variables using recipes, but it still appears in the fitted model. Does the glm model automatically do that for us? If so, then generally, the next time we fit a glm model, do we need to take care of the dummy transformation ourselves, or can we just leave it for the model to handle?

juliasilge commented 2 years ago

@conlelevn You can read about how a workflow() uses a formula in this section of our book. In the cases in this blog post, a glm model will create dummy/indicator variables for factors so that is what workflows will do. The ranger model does not create dummy/indicator variables because it can handle categorical data natively. If you use a recipe instead of a formula preprocessor, then you are the boss of creating any needed dummy/indicator variables.
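As a rough sketch of that last option, here is a recipe that makes the dummy variables explicit for the glm model (the formula is illustrative, not the exact one from the post):

library(tidymodels)
library(palmerpenguins)

penguins_df <- na.omit(penguins)

glm_rec <- recipe(sex ~ species + bill_length_mm + body_mass_g, data = penguins_df) %>%
  step_dummy(all_nominal_predictors())    # you create the indicator variables yourself

glm_wf <- workflow() %>%
  add_recipe(glm_rec) %>%
  add_model(logistic_reg())               # the glm engine then receives those dummy variables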

fdepaolis commented 1 year ago

It seems that the dataset "penguins" doesn't have the variable "year" anymore (I installed the library from GitHub). I guess one could get it from the variable 'Date Egg' in the dataset "penguins_raw", but since 'year' is not used, I just removed it from the script. Thank you for a wonderful example of TidyModels!!

mohamedelhilaltek commented 1 year ago

Hi Julia, I have train and test data and I want to create a split object based on those parts. Is that possible?

juliasilge commented 1 year ago

@mohamedelhilaltek Yep, you can do that with make_splits():

library(rsample)

training <- mtcars[1:20,]
testing <- mtcars[21:32,]

make_splits(training, testing)
#> <Analysis/Assess/Total>
#> <20/12/32>

Created on 2023-05-31 with reprex v2.0.2

mohamedelhilaltek commented 1 year ago

And I have imbalanced data. Should I balance the entire dataset or just the train part? Thanks a lot, Julia!

juliasilge commented 1 year ago

@mohamedelhilaltek Only the training data -- you can read more about when to skip feature engineering steps on new or testing data. I recommend you read that chapter as a whole, actually.

venkatpgi commented 1 year ago

Hi Julia, I was closely watching the discussions about transformation of features. Imputation of missing data is one such important transformation, as many model algorithms are averse to missing data. A challenge in missing data imputation is "one variable with missing data conditional on another variable". For example, a question like "age when became pregnant" (in a survey) is conditional on gender, as a male cannot become pregnant (as of today!). In such conditions, the age variable will have missing data for the male-gender rows, and imputing it would be erroneous. If one cannot drop such variables from modeling, how does one go about imputing?

Warm regards Venkat


Hamza-Gouaref commented 1 year ago

Hi Julia! I hope you're doing well. I have a question: I have unbalanced data (339,000 obs vs 7,000 obs) and I wonder if the right method to balance it is downsampling, since I have a lot of observations. What do you think 🤔?

juliasilge commented 1 year ago

@venkatpgi In a situation like that, I would probably try a model that can learn from data with missing values, like xgboost. It sounds like a missing value may be meaningful in this case and not imputing could be a good option. If I did need to impute, I would probably use something very simple like the median, hoping that what is meaningful for the outcome are the more extreme values (like very high or low ages at pregnancy). A tricky question! This is an area where domain knowledge and feature engineering are really important.
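If you did go the simple imputation route, a very rough sketch might look like this (outcome, age_at_pregnancy, and survey_df are hypothetical names for illustration):

library(recipes)

impute_rec <- recipe(outcome ~ ., data = survey_df) %>%
  step_indicate_na(age_at_pregnancy) %>%    # keep a flag for missingness, since it may be meaningful
  step_impute_median(age_at_pregnancy)      # fill NAs with the training-set median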

juliasilge commented 1 year ago

@Hamza-Gouaref Take a look at this article on downsampling/upsampling and the themis documentation. With that much data, I would probably try downsampling first and see how it goes.
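A minimal sketch with themis (the column and object names are placeholders):

library(tidymodels)
library(themis)

rec <- recipe(class ~ ., data = train_data) %>%
  step_downsample(class)   # skip = TRUE by default, so only the training data are downsampled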

JimdareNZ commented 3 months ago

Hi Julia,

I continue to learn so much from your blog posts! One quick question: is there any way to provide a metric of certainty for a classification model? Say, for example, I wanted to predict penguin sex but also provide a metric of how certain the model is about that prediction.

Regards, James

juliasilge commented 3 months ago

@JimdareNZ Yes, what you are looking for are the probabilities that your model can generate. If the model predicts 0.9 probability of being a male penguin, that is very different from 0.6. You'll want to check out class probability metrics such as log loss to estimate how right/wrong your model is.
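A quick sketch of what that can look like, assuming a fitted classification workflow fitted_wf and held-out data penguin_test with the true sex column (those names are placeholders):

library(tidymodels)

preds <- augment(fitted_wf, new_data = penguin_test)   # adds .pred_class plus .pred_female / .pred_male

preds %>% mn_log_loss(truth = sex, .pred_female)   # class probability metric (log loss)
preds %>% roc_auc(truth = sex, .pred_female)       # another probability-based metric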

NizePetcharat commented 1 month ago

Hi Julia,

Thank you for your prompt and continuous responses to questions from learners and users.

I've noticed that in many tutorials, the seed is always set during the data splitting and v-fold cv steps. However, some tutorials set the seed when performing fit_resamples or tuning parameters, and I haven't seen it set during the last_fit step.

My question is: which steps are essential for setting the seed to ensure reproducibility, especially when we want to publish the code? Additionally, should the same seed number be used for every step, or does it depend on the specific design of the analysis? Thank you!

juliasilge commented 1 month ago

@NizePetcharat You definitely do not need to use the same seed in different steps; any seed at all will give you reproducible results. Strictly speaking, you are only required to set the seed once in an analysis, as long as you always run the code from top to bottom. I tend to set the seed before I do something that involves randomness, like splitting/resampling data or fitting models that involve randomness (like random forest), in case I end up needing to re-run some code, for example if something is wrong and it errors.
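For example, a small sketch of where I would set seeds (the particular seed values don't matter and don't need to match):

library(tidymodels)
library(palmerpenguins)

set.seed(123)    # before the random split
penguin_split <- initial_split(na.omit(penguins), strata = sex)
penguin_train <- training(penguin_split)

set.seed(234)    # again before resampling, in case this chunk gets re-run on its own
penguin_folds <- vfold_cv(penguin_train, strata = sex)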