juliasilge / juliasilge.com

My blog, built with blogdown and Hugo
https://juliasilge.com/

Multiclass predictive modeling for #TidyTuesday NBER papers | Julia Silge #52

Open utterances-bot opened 2 years ago

utterances-bot commented 2 years ago

Multiclass predictive modeling for #TidyTuesday NBER papers | Julia Silge

Tune and evaluate a multiclass model with lasso regularization for economics working papers.

https://juliasilge.com/blog/nber-papers/

kamaulindhardt commented 2 years ago

Thank you once again Julia for an excellent screencast!

I recently stumbled upon the DALEX package for model-agnostic exploration, and I was wondering whether the tidymodels team has any plans to incorporate some of the DALEX functionality into the tidymodels meta-package framework. I like DALEX, but one thing I feel is missing is the possibility of using tidyverse syntax in plotting, such as ggplot2. This would be really cool, as it is often very important for data scientists to communicate their models effectively, and preferably in a visually appealing format.

All the best, Kamau

kamaulindhardt commented 2 years ago

https://github.com/ModelOriented/DALEX

juliasilge commented 2 years ago

@kamaulindhardt Yep, we have a chapter in TMwR on how to use DALEX for explainability with tidymodels.

Woprates commented 2 years ago

Great blog as always, Julia. I applied this code to my data and everything went well until the last part, when I tried to run final_fitted <- extract_workflow(final_rs). The following issue showed up: "Error in UseMethod("predict") : no applicable method for 'predict' applied to an object of class "c('last_fit', 'resample_results', 'tune_results', 'tbl_df', 'tbl', 'data.frame')"

Do you know how to solve it? Thanks

Woprates commented 2 years ago

Actually, the issue was: Error in UseMethod("extract_workflow") : no applicable method for 'extract_workflow' applied to an object of class "c('last_fit', 'resample_results', 'tune_results', 'tbl_df', 'tbl', 'data.frame')"

juliasilge commented 2 years ago

@Woprates Looks like you need to update some packages, probably tune and maybe workflows? I was using just CRAN versions here, I believe.

Woprates commented 2 years ago

Yep, I have no idea what to do, because I did exactly the same thing and used the same packages that you did. My information about R is:

platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 4
minor 0.3
year 2020
month 10
day 10
svn rev 79318
language R
version.string R version 4.0.3 (2020-10-10)
nickname Bunny-Wunnies Freak Out

juliasilge commented 2 years ago

@Woprates I'd make sure that you have version 0.1.6 of tune, the latest version from back in July or so. You can check that in a couple of different ways, such as sessioninfo::session_info().

Woprates commented 2 years ago

Thanks Julia for your attention. Actually, the version is 0.1.3:

textrecipes 0.4.1 2021-07-11 [1] CRAN (R 4.0.5)
themis 0.1.4 2021-06-12 [1] CRAN (R 4.0.5)
tidylo 0.1.0 2020-05-25 [1] CRAN (R 4.0.5)
tidymodels 0.1.3 2021-04-19 [1] CRAN (R 4.0.5)
tidyr 1.1.3 2021-03-03 [1] CRAN (R 4.0.5)
tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.0.5)
tidytext 0.3.1 2021-04-10 [1] CRAN (R 4.0.5)
tidyverse 1.3.1 2021-04-15 [1] CRAN (R 4.0.5)
tokenizers 0.2.1 2018-03-29 [1] CRAN (R 4.0.3)
tune 0.1.3 2021-02-28 [1] CRAN (R 4.0.4)
tzdb 0.1.2 2021-07-20 [1] CRAN (R 4.0.5)
unbalanced 2.0 2015-06-26 [1] CRAN (R 4.0.5)
workflows 0.2.3 2021-07-16 [1] CRAN (R 4.0.5)
workflowsets 0.1.0 2021-07-22 [1] CRAN (R 4.0.5)

juliasilge commented 2 years ago

@Woprates Yep, looks like you need to update.packages()
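
For anyone hitting the same extract_workflow() error, here is a minimal sketch of the version check discussed above, using only base R. The helper name is_new_enough() is made up for illustration, and the 0.1.6 minimum is simply the tune version mentioned in this thread, not an official requirement.

```r
# Hypothetical helper: TRUE when `installed` is at least `minimum`.
is_new_enough <- function(installed, minimum) {
  utils::compareVersion(installed, minimum) >= 0
}

# Versions quoted in this thread: 0.1.3 was installed, 0.1.6 was suggested.
is_new_enough("0.1.3", "0.1.6")
is_new_enough("0.1.6", "0.1.6")

# Check whatever is installed locally, if tune is present at all.
if (requireNamespace("tune", quietly = TRUE)) {
  installed <- as.character(utils::packageVersion("tune"))
  status <- if (is_new_enough(installed, "0.1.6")) "new enough" else "run update.packages()"
  message("tune ", installed, ": ", status)
}
```

If the check reports an older version, update.packages() (or reinstalling tune alone) should bring it up to date.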

Woprates commented 2 years ago

Thanks a lot Julia...now it's working....you rock...:)

SIRIYAK commented 2 years ago

When I run these two lines, autoplot(nber_rs) and show_best(nber_rs), RStudio crashes (I am using 16 GB of RAM on 64-bit Windows).

juliasilge commented 2 years ago

@SIRIYAK That sounds frustrating! Can you create a reprex (https://reprex.tidyverse.org/), a minimal reproducible example, for this and post your problem on the repo for tune (https://github.com/tidymodels/tune/issues)? The goal of a reprex is to make it easier for us to recreate your problem so that we can understand it and/or fix it.

If you've never heard of a reprex before, you may want to start with the tidyverse.org help page (https://www.tidyverse.org/help/). You may already have reprex installed (it comes with the tidyverse package), but if not you can install it with:

install.packages("reprex")

SIRIYAK commented 2 years ago

Also, further down, running final_fitted <- extract_workflow(final_rs) gives: Error in extract_workflow(final_rs) : could not find function "extract_workflow"

SIRIYAK commented 2 years ago

thanks a lot

juliasilge commented 2 years ago

@SIRIYAK If you notice the comments above, I think you need to do the same thing -- update to more recent package versions via update.packages().

SIRIYAK commented 2 years ago

I am using an encrypted SSD, so updating things is a nightmare. I tried to update a few packages and still got the same error, and I get the same error in R cloud too. Maybe I will have to try on a different machine. Anyway, thanks 👍

IvanDesuo commented 2 years ago

Julia, your blog and YouTube tutorials are incredible. I have been learning so much since I found your blog and channel. I was trying a more traditional approach using multinom and really struggling to get all the metrics. The yardstick package was a godsend; it worked like a charm. Thanks so much!

SN4AI commented 2 years ago

Hello Julia. First of all, I want to congratulate you and, above all, thank you for the impressive work you do. I have followed several of your tutorials, which have boosted my productivity. I would like to ask a question about model training. I have a very large database, and as a result it is very difficult to train my model, since I only have 8 GB of RAM. Do you have a parallel computing method or another method for dealing with data volume problems? Thank you!

juliasilge commented 2 years ago

@SN4AI Unfortunately parallel processing won't solve any problems with running out of memory. Running in parallel typically requires more memory than running sequentially (but it is of course faster). So if you are very, very low on RAM, I recommend only running sequentially and probably using just a subsample of your data for training your model. If your data is in a database, see if you can do any summarizing in the database itself before bringing just the minimum of data into memory in R locally. Also, if you are using a database, look into whether something like tidypredict will work for you.
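
As a rough illustration of the subsampling suggestion above, here is a minimal base R sketch. The data frame full_data, its columns, and the 10% fraction are all made up for the example; in practice the right fraction depends on the data and the model.

```r
set.seed(123)

# Hypothetical stand-in for a large dataset.
n <- 100000
full_data <- data.frame(
  id = seq_len(n),
  x = rnorm(n),
  class = sample(c("a", "b", "c"), n, replace = TRUE)
)

# Keep a random 10% of the rows for model training.
keep <- sample(nrow(full_data), size = n %/% 10)
train_small <- full_data[keep, ]

nrow(train_small)

# Compare the in-memory size of the full data and the subsample.
format(object.size(full_data), units = "MB")
format(object.size(train_small), units = "MB")
```

When the data lives in a database, the same idea applies earlier in the pipeline: filter or summarize there, and only pull the reduced table into R.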

SN4AI commented 2 years ago

Hello Julia, thank you very much for answering my question. Your answer is more than clear. Also, do you think that increasing my RAM from 8 GB to 16 GB would have an impact on my ability to train models? Finally, thank you for suggesting tidypredict to me; I am reading up on it. Thank you!!

juliasilge commented 2 years ago

@SN4AI I have 16 GB of RAM on my main computer and I seldom have problems running out of memory when training models, but this really depends on the particulars of your datasets. The other option you can consider is moving to the cloud (like AWS or similar) for training models.

fvr1210 commented 2 years ago

Hi Julia, thanks a lot for this interesting blog!

Would it be possible to use the weighted log odds instead of tf-idf as features?

juliasilge commented 2 years ago

@fvr1210 You definitely could! We haven't implemented that in recipes or textrecipes yet but you could either create a custom recipe step or submit an issue on one of those repos.

Woprates commented 2 years ago

Hey Julia, sorry to bother you, but I ran into an issue with my model to predict a class from form comments. When I run

nber_rs <- tune_grid(nber_wf, nber_folds, grid = nber_grid)

I get:

Warning message: All models failed. See the .notes column.

The .notes errors are:

[[8]]
# A tibble: 1 x 1
.notes
1 preprocessor 1/1, model 1/1: Error in lognet(xd, is.sparse, ix, jx, ~

[[9]]
# A tibble: 1 x 1
.notes
1 preprocessor 1/1, model 1/1: Error in lognet(xd, is.sparse, ix, jx, ~

Do you have any idea how to solve this?

juliasilge commented 2 years ago

@Woprates Hmmm, that error looks like glmnet did not get all numeric values. Do you still have some factor/string variables? You'll need to convert those into indicator variables, probably using step_dummy(). Or maybe the text hasn't been tokenized and prepared?

Woprates commented 2 years ago

Actually I have factor variables (the outcome + 2 predictors) and one predictor is a text (character).

I am trying to predict the class of the outcome based on those 2 factor variables and 1 character variable (the text).

juliasilge commented 2 years ago

@Woprates A glmnet model needs all predictors to be in a numeric format so double check that you are converting your factor to a dummy variable and that your text is all tokenized and converted.
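
A minimal sketch of what that preprocessing could look like for the setup described above (a factor outcome, two factor predictors, one text column). The toy data and column names are invented for illustration, and max_tokens = 20 is an arbitrary choice; this is not the recipe from the blog post.

```r
library(recipes)
library(textrecipes)

# Hypothetical toy data: factor outcome, two factor predictors, one text column.
comments <- data.frame(
  outcome = factor(c("yes", "no", "yes", "no")),
  region  = factor(c("north", "south", "north", "east")),
  channel = factor(c("web", "phone", "web", "web")),
  text    = c("great service and fast", "slow reply",
              "fast and friendly", "no reply at all"),
  stringsAsFactors = FALSE
)

rec <- recipe(outcome ~ ., data = comments) %>%
  step_tokenize(text) %>%                     # split the text into tokens
  step_tokenfilter(text, max_tokens = 20) %>% # keep the most frequent tokens
  step_tfidf(text) %>%                        # tokens become numeric tf-idf columns
  step_dummy(all_nominal_predictors())        # factors become indicator columns

baked <- bake(prep(rec), new_data = NULL)

# Every predictor should now be numeric, which is what glmnet needs.
predictors <- baked[setdiff(names(baked), "outcome")]
all(vapply(predictors, is.numeric, logical(1)))
```

The text steps come before step_dummy() so that the text column has already been replaced by numeric columns when the remaining nominal predictors are converted.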

Ji-square commented 2 years ago

Done

mosesotieno commented 2 years ago

Great job as always. I use these videos to provide a solid background for my modelling, especially using the tidymodels meta-package. Is there a video on inferential statistics, where one may want to see the effect of a variable on a given outcome?

juliasilge commented 2 years ago

@mosesotieno I have two things to look at that are along those lines:

mosesotieno commented 2 years ago

@juliasilge thank you. I will have a look at them. Thanks again for very helpful and detailed tutorials.

FelixZhao123 commented 2 years ago

@juliasilge Hi Julia, I am working on a project trying to model whether a product is a best seller or not based on the product description. Unlike the papers dataset you are demonstrating, the product descriptions are quite similar to each other; the only difference may be color or other specs. Therefore, it is not the most common words that are important, but rather the low-frequency words. Can you share some thoughts on this? By the way, thank you for your work! I am really looking forward to your new book being available in China.

juliasilge commented 2 years ago

@FelixZhao123 You might try an approach like this one (https://juliasilge.com/blog/austin-housing/) to find words associated with/not with your outcome and build features for that.

FelixZhao123 commented 2 years ago

@juliasilge Thank you so much for your prompt reply!! I will look into the file. Thanks!

Woprates commented 2 years ago

Hey Julia, have you already done any work or blog posts on the Hierarchical Dirichlet Process for topic modeling in R, or do you know where I can find some?

juliasilge commented 2 years ago

@Woprates Hmmmm, not off the top of my head right now

nalinichintalapudi commented 2 years ago

@Woprates Hi, I am facing the same problem you had with "Warning message: All models failed. See the .notes column." If you solved this problem, can you please help me?

juliasilge commented 2 years ago

@nalinichintalapudi Can you create a reprex (a minimal reproducible example) for your problem? The goal of a reprex is to make it easier for people to recreate your problem so that we can understand it and/or fix it. Once you have created a reprex, I recommend that you post on RStudio Community to get help.

nalinichintalapudi commented 2 years ago

@juliasilge I am really glad to have received a reply from you. Your work is very helpful in my daily working life. Thank you so much.

Coming to my problem, here is what a sample of the data I am working on looks like:

year  whocode               text
2019  Respiratory Diseases  keep at rest. Light diet with large intake of liquids.
2020  Injury                keep at rest.

year, whocode, and text are the three attributes in my data. Where you used program_category in your work, I used the whocode attribute as the category, and I used the text attribute in place of title.

There are 37906 records in total and 22 whocode categories.

I got the weighted log odds for the 22 whocode categories.

But the error I got is the following:

nber_rs <- tune_grid(nber_wf, nber_folds, grid = nber_grid)

Warning message: All models failed. See the .notes column.

nber_rs

# Tuning results
# 10-fold cross-validation using stratification
# A tibble: 10 × 4
   splits               id     .metrics .notes
 1 <split [23539/2617]> Fold01 NULL     <tibble [0 × 1]>
 2 <split [23539/2617]> Fold02 NULL     <tibble [0 × 1]>
 3 <split [23540/2616]> Fold03 NULL     <tibble [0 × 1]>
 4 <split [23540/2616]> Fold04 NULL     <tibble [0 × 1]>
 5 <split [23540/2616]> Fold05 NULL     <tibble [0 × 1]>
 6 <split [23541/2615]> Fold06 NULL     <tibble [0 × 1]>
 7 <split [23541/2615]> Fold07 NULL     <tibble [0 × 1]>
 8 <split [23541/2615]> Fold08 NULL     <tibble [0 × 1]>
 9 <split [23541/2615]> Fold09 NULL     <tibble [0 × 1]>
10 <split [23542/2614]> Fold10 NULL     <tibble [0 × 1]>

.metrics is showing NULL. Can you please help me fix this error and understand why it occurs?

One more question: is it possible to visualize the ROC curves for the 22 classes of whocode?

juliasilge commented 2 years ago

@nalinichintalapudi Unfortunately I can't tell from this error alone what has gone wrong. If you can read more about how to use reprex, you can create a small reproducible example that shows people what has happened. The goal is to make it possible for people to recreate your problem so that we can understand it and/or fix it. I also recommend that you post on RStudio Community because it is a better forum for getting help with code problems than something like blog comments.

nalinichintalapudi commented 2 years ago

@juliasilge I will do as you recommend and post on RStudio Community. Thank you so much.

mwilson19 commented 2 years ago

I've been using the "probably" package and ran into an issue with fit_resamples: for whatever reason, even when I save predictions for a classification problem, I'm not getting the probabilities, only the class predictions. Do you know why?

juliasilge commented 2 years ago

@mwilson19 Hard to say without knowing more about your setup! Can you create a reprex (a minimal reproducible example) for this, and then probably post on RStudio Community? The goal of a reprex is to make it easier for us to recreate your problem so that we can understand it and/or fix it.

If you've never heard of a reprex before, you may want to start with the tidyverse.org help page. You may already have reprex installed (it comes with the tidyverse package), but if not you can install it with:

install.packages("reprex")

Thanks! 🙌

RaymondBalise commented 1 year ago

Hello Julia. As always, your work is an invaluable (and enjoyable) resource. I have one small question/concern. In the video dialog and in the text on the website above you wrote:

How did our final model perform on the training data?

collect_metrics(final_rs)

when processing the results returned by last_fit().

I thought that if you call collect_metrics() on the results of a last_fit() object, it returns the performance on the testing data. Do I have that wrong, or does the blog text above need a slight tweak?

juliasilge commented 1 year ago

Thank you so much @RaymondBalise! I have fixed this in 69b6566659d1009f23ae03e58b5e72d56c56bbbc.
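
To make the point above concrete, here is a small sketch showing that the metrics returned for a last_fit() result come from the testing set. It uses mtcars and a plain linear regression purely as stand-ins (assumptions for the example, not the NBER model from the post).

```r
library(tidymodels)

set.seed(123)
car_split <- initial_split(mtcars)

car_wf <- workflow() %>%
  add_formula(mpg ~ wt + hp) %>%
  add_model(linear_reg())

# last_fit() fits on training(car_split) and evaluates on testing(car_split).
final_rs <- last_fit(car_wf, car_split)

# Metrics computed on the held-out testing data.
collect_metrics(final_rs)

# One prediction per testing row confirms where the metrics come from.
nrow(collect_predictions(final_rs)) == nrow(testing(car_split))

# Needs a tune version that provides extract_workflow(),
# as discussed earlier in this thread.
final_fitted <- extract_workflow(final_rs)
```

The extracted workflow is the model fitted on the training data, and can be used with predict() on new data.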