juliasilge / juliasilge.com

My blog, built with blogdown and Hugo :link:
https://juliasilge.com/

Modeling #TidyTuesday GDPR violations with tidymodels | Julia Silge #66

Open utterances-bot opened 2 years ago

utterances-bot commented 2 years ago

A data science blog

https://juliasilge.com/blog/gdpr-violations/

conlelevn commented 2 years ago

Hi Julia,

I tried to extract the model's error term to check its distribution. I used tidy() and extract_fit_parsnip(), but they only gave me the predictors and the intercept. Could you please tell me how to extract the error term? Thanks

conlelevn commented 2 years ago

BTW, don't you think we could add asterisks (*) beside the significant p-values, like in the old classic output, so the results are easier to read at a glance?

juliasilge commented 2 years ago

@conlelevn If you would like residuals for your data points, try using augment(). If you like seeing the output of summary(), there's no harm in calling that first to read it as a human, and then using functions like tidy() and augment() for output that you can use for visualization, summarization, etc. I often do that!
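
A minimal sketch of that workflow, assuming a plain lm() fit on the built-in mtcars data as a stand-in (it is not the GDPR model from the post). For a parsnip or workflow fit, augment() also needs the data passed as new_data.

```r
library(broom)

# Stand-in model: mtcars is illustrative only, not the GDPR data from the post
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Human-readable summary, including the significance stars
summary(fit)

# One row per coefficient: estimate, std.error, statistic, p.value
tidy(fit)

# One row per observation, with fitted values (.fitted) and residuals (.resid)
fit_aug <- augment(fit)

# Quick look at the distribution of the residuals
# (hist(fit_aug$.resid) would give the graphical version)
summary(fit_aug$.resid)
```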

conlelevn commented 2 years ago

Thanks Julia, it's very helpful indeed.

I have a few more questions:

  1. In step_log() you have the argument skip = TRUE. What is it used for?
  2. I have seen you use step_zv() quite regularly, but I still don't get what it does.
  3. In these lines of code:

     gdpr_tidy <- gdpr_raw %>%
       transmute(id,
         price,
         country = name,
         article_violated,
         articles = str_extract_all(article_violated, "Art.[:digit:]+|Art. [:digit:]+")
       ) %>%
       mutate(total_articles = map_int(articles, length)) %>%
       unnest(articles) %>%
       add_count(articles) %>%
       filter(n > 10) %>%
       select(-n)

     you applied length() to articles (which is a chr variable), and that is quite counterintuitive to me: I expected it to return the number of characters in each cell of the articles column, but it turns out it doesn't. So what is length() actually doing in this case?

juliasilge commented 2 years ago

@conlelevn Since I wrote this blog post, we have decided that preprocessing the outcome like this isn't a great idea. You can read more about how to use skip = TRUE here.
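
As a rough illustration of what skip = TRUE does, here is a small sketch on the built-in mtcars data (a stand-in, not the GDPR data). It logs the outcome only to show the mechanics, which is the pattern the comment above no longer recommends. A skipped step is still applied to the training data during prep(), but bake() leaves new data untouched by it.

```r
library(recipes)

# Stand-in data and variables; mpg plays the role of the outcome
rec <- recipe(mpg ~ disp, data = mtcars)
rec <- step_log(rec, mpg, skip = TRUE)

rec_prepped <- prep(rec, training = mtcars)

# Training data as processed during prep(): the log step was applied
train_baked <- bake(rec_prepped, new_data = NULL)

# New data passed to bake(): the skipped step is not applied
new_baked <- bake(rec_prepped, new_data = mtcars)

head(train_baked$mpg)  # logged values
head(new_baked$mpg)    # original scale
```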

The recipe step step_zv() removes variables that contain only a single value, like if you have a whole column with the same value. This can be good to use after transformations (like step_dummy()) that may result in columns with just one value.
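
A minimal sketch of step_zv() on a made-up data frame (the names toy, x, and constant are hypothetical, chosen only for illustration):

```r
library(recipes)

# Toy data: `constant` holds a single value in every row
toy <- data.frame(
  y = c(1.2, 3.4, 2.2, 5.1),
  x = c(10, 20, 30, 40),
  constant = c(1, 1, 1, 1)
)

rec <- recipe(y ~ ., data = toy)
rec <- step_zv(rec, all_predictors())

# prep() learns which columns have zero variance; bake() drops them
baked <- bake(prep(rec, training = toy), new_data = NULL)

names(baked)  # `constant` is no longer among the columns
```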

The str_extract_all() function outputs a list with one character vector per input string, so applying length() to each element (which is what map_int(articles, length) does) gives you how many elements there are in that row's character vector. You can use nchar() to find the number of characters.

my_vector <- c("tidymodels is", "fun")
length(my_vector)
#> [1] 2
nchar(my_vector)
#> [1] 13  3

Created on 2022-05-12 by the reprex package (v2.0.1)
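
To connect this back to the pipeline in question 3, here is a small sketch using two made-up strings in place of the real article_violated column. It assumes the stringr and purrr packages and reuses the regular expression from the post.

```r
library(stringr)
library(purrr)

# Made-up examples standing in for the article_violated column
article_violated <- c("Art. 5 GDPR|Art. 6 GDPR", "Art. 32 GDPR")

# A list with one character vector of matches per input string
articles <- str_extract_all(article_violated, "Art.[:digit:]+|Art. [:digit:]+")

# length() is applied to each list element, so it counts matches per row
total_articles <- map_int(articles, length)
total_articles
#> [1] 2 1
```

Base R's lengths(articles) returns the same counts without purrr.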