juliasilge / juliasilge.com

My blog, built with blogdown and Hugo :link:
https://juliasilge.com/

Which #TidyTuesday Netflix titles are movies and which are TV shows? | Julia Silge #23

Open utterances-bot opened 3 years ago

utterances-bot commented 3 years ago

Which #TidyTuesday Netflix titles are movies and which are TV shows? | Julia Silge

Use tidymodels to build features for modeling from Netflix description text, then fit and evaluate a support vector machine model.

https://juliasilge.com/blog/netflix-titles/

mandarpriya commented 3 years ago

Dear Ma'am, I am trying to practice the code, but I got this error message: Error in svm_linear(): could not find function "svm_linear". Is this because an additional package needs to be installed? I would be eagerly waiting for your reply.

rnnh commented 3 years ago

> Dear Ma'am, I am trying to practice the code, but I got this error message: Error in svm_linear(): could not find function "svm_linear". Is this because an additional package needs to be installed? I would be eagerly waiting for your reply.

I think you need to install the development version of parsnip @mandarpriya. You can do so with the command devtools::install_github("tidymodels/parsnip"). If you do not have devtools, you can install it using install.packages("devtools"). Once you have installed parsnip from GitHub, restart your R session and try again.
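
For reference, a minimal sketch of those installation steps might look like this:

    # Install devtools if you don't already have it
    install.packages("devtools")

    # Install the development version of parsnip from GitHub
    devtools::install_github("tidymodels/parsnip")

    # Restart your R session, then check that svm_linear() is available
    library(parsnip)
    svm_linear()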

rnnh commented 3 years ago

Thanks for posting this, it's informative and well-written. I hadn't seen machine learning applied to text in R before!

You mentioned that svm_linear() is currently available on the development version of parsnip. Out of interest, how long does it usually take to add a new model to the CRAN release? What sort of bottlenecks do you encounter when adding a model (e.g. writing tests, resolving conflicts with existing functions, testing on different platforms)?

juliasilge commented 3 years ago

@rnnh We have documented the process of adding a new model to tidymodels here, if you are interested in that. We've scoped the tidymodels packages in a limited enough way that we don't have any bottlenecks that make a CRAN release too onerous at this point, but all those pieces you mention do have to be dealt with! For svm_linear() the biggest challenge was asking the LiblineaR maintainers to make some changes to their package, but they were super responsive and helpful. 👍

mandarpriya commented 3 years ago

Dear Ma'am, I was able to overcome the error by installing the package. Now I am encountering an error when executing svm_rs %>% conf_mat_resampled(tidy = FALSE) %>% autoplot(): R shows a fatal error. What could be the cause of this error?

mandarpriya commented 3 years ago

Thanks Ma'am, I was able to complete the code in Google Colab, and it was really amazing. But RStudio doesn't do the same; I am still encountering a fatal error.

juliasilge commented 3 years ago

@mandarpriya If you are able to share the specific error on RStudio Community together with the code that generated it (a reprex is best) I think we will be able to find the problem!
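
For anyone unsure how to make a reprex, a minimal sketch with the reprex package (the code inside the braces is only a placeholder for whatever triggers the error):

    # install.packages("reprex")
    library(reprex)

    reprex({
      library(tidymodels)
      # ... the code that produces the fatal error goes here ...
    })

This renders the code together with its output (or error) in a format that is easy to paste into a forum post.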

msevi commented 3 years ago

@mandarpriya try updating the tune package.
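
For example, something along these lines (installing the current CRAN release):

    install.packages("tune")
    # Restart your R session after updating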

BehnamCA commented 3 years ago

Thanks Julia, it was impressive. At the end, when you use if_else() to classify estimates based on the classes of the target variable (TV shows vs. movies), how do you know which sign of the estimate should be assigned to TV shows? I mean this line: sign = if_else(sign, "More from TV shows", "More from movies"). Or perhaps I should ask: what is the order of your target variable? In my data I have Healthy vs. Sick. Thanks again, it is truly appreciated.

hardin47 commented 3 years ago

@juliasilge thanks for the great post! How did you remove the stop words from the svm_linear() results? When I followed your analysis, my final results had a lot of stop words in them. It seems like that information should go somewhere in here:

  step_tokenize(description) %>%
  step_tokenfilter(description, max_tokens = 1e3) %>%
  step_tfidf(description) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_smote(type)

thanks for any suggestions.

juliasilge commented 3 years ago

@BehnamCA In this case, "Movie" is the first level because alphabetically it comes first before "TV Show"; it looks like the first level for you should be "Healthy", unless you have manually changed the order of your factor.
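
A quick way to check is to look at the factor levels directly; factor() sorts levels alphabetically by default (illustrative values, not the post's data):

    levels(factor(c("TV Show", "Movie")))
    #> [1] "Movie"   "TV Show"

    levels(factor(c("Sick", "Healthy")))
    #> [1] "Healthy" "Sick"

    # To put a different level first, relevel explicitly, e.g.
    # forcats::fct_relevel(your_factor, "Sick")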

juliasilge commented 3 years ago

@hardin47 I did not remove the stop words, actually! I think I tried it both ways and I got better results in terms of model performance by including stop words. (This is not uncommon.) The reason I don't have stop words in my visualizations is that those focus on the terms with the biggest coefficients, which all turn out to not be stop-word-type tokens.
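
As a rough sketch of that filtering idea (assuming a fitted workflow, here hypothetically called svm_wf_fit, whose underlying LiblineaR fit has a tidy() method):

    library(tidymodels)

    svm_wf_fit %>%
      extract_fit_parsnip() %>%          # pull out the underlying parsnip model
      tidy() %>%                         # one row per term with its coefficient
      filter(term != "Bias") %>%         # drop the bias/intercept term
      slice_max(abs(estimate), n = 20)   # keep the 20 largest coefficients by magnitude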

BehnamCA commented 3 years ago

@hardin47 You could add step_stopwords(description) %>% right after step_tokenize(description). I also suggest watching @juliasilge's other tutorial on [sentiment analysis for classifying Animal Crossing reviews](https://juliasilge.com/blog/animal-crossing/).
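
Put differently, the recipe from the question above would look roughly like this (a sketch: step_stopwords() comes from textrecipes, step_smote() from themis, and the recipe() call is only illustrative):

    library(textrecipes)
    library(themis)

    recipe(type ~ description, data = netflix_train) %>%
      step_tokenize(description) %>%
      step_stopwords(description) %>%    # remove stop words right after tokenizing
      step_tokenfilter(description, max_tokens = 1e3) %>%
      step_tfidf(description) %>%
      step_normalize(all_numeric_predictors()) %>%
      step_smote(type)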

BehnamCA commented 2 years ago

@juliasilge I used this great piece in one of my projects and it turned out decent results. Now I have received some more data besides the text data for prediction. My new data are all numerical. I wonder if you could advise me whether there is an approach to using text as one column along with some other numerical columns for predicting the outcome of interest? I truly appreciate your extensive knowledge and help.

Thanks!

juliasilge commented 2 years ago

@BehnamCA Yep, you sure can! You might check out this section and that whole chapter as well for context.
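
A rough sketch of how that might look in a recipe that mixes one text column with numeric columns (all column and object names here are hypothetical):

    library(textrecipes)

    recipe(outcome ~ text_column + numeric_1 + numeric_2, data = train_data) %>%
      step_tokenize(text_column) %>%
      step_tokenfilter(text_column, max_tokens = 500) %>%
      step_tfidf(text_column) %>%
      # the tf-idf features and the original numeric columns are all
      # numeric predictors now, so they can be normalized together
      step_normalize(all_numeric_predictors())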

BehnamCA commented 2 years ago

@juliasilge, my apologies, I just saw this. Thank you so much for your help.

conlelevn commented 2 years ago

@juliasilge In this code line: sign = if_else(sign, "More from TV shows", "More from movies") I don't see you include any condition here, so why can R still compare and classify it correctly?

juliasilge commented 2 years ago

@conlelevn Notice that earlier in that same chunk we use sign = estimate > 0, which means sign is a logical, all trues and falses.
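
A tiny illustration of that idiom with made-up values (not the post's exact code):

    library(dplyr)

    coefs <- data.frame(term = c("love", "documentary"),
                        estimate = c(1.2, -0.8))

    coefs %>%
      mutate(sign = estimate > 0,   # logical: TRUE when the coefficient is positive
             sign = if_else(sign, "More from TV shows", "More from movies"))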