juliasilge / juliasilge.com

My blog, built with blogdown and Hugo :link:
https://juliasilge.com/

Sentiment analysis with tidymodels and #TidyTuesday Animal Crossing reviews | Julia Silge #31

utterances-bot opened 3 years ago

utterances-bot commented 3 years ago

Sentiment analysis with tidymodels and #TidyTuesday Animal Crossing reviews | Julia Silge

A lot has been happening in the tidymodels ecosystem lately! There are many possible projects we on the tidymodels team could focus on next; we are interested in gathering community feedback to inform our priorities.

https://juliasilge.com/blog/animal-crossing/

BehnamCA commented 3 years ago

@juliasilge, this is the second piece of your work I have watched and I am very impressed. Thank you so much for all the great work. I already used your linear SVM model from the TV shows vs. series video. But after watching this video, I noticed I can develop this model too, where I have Healthy (positive) vs. Sick (negative). My question is that, as far as I understood, you basically use a sort of logistic regression for classification, whereas I was expecting to see some use of sentiment analysis packages such as `AFINN`, `bing`, or `nrc` to score the texts and then cross-validate them against the grades in the data. Could you please explain a bit what the drawbacks of the default sentiment packages are that made you develop a lasso regression for scoring (penalties)? Again, I truly appreciate you sharing your extensive knowledge with us.

juliasilge commented 3 years ago

@BehnamCA In this blog post, we build a model to learn which tokens are predictive of text being scored positively or negatively; this is a great approach when you have labeled text data and almost always better than using sentiment lexicons. If you have unlabeled text data and want to estimate the affect/sentiment content of the text, sentiment lexicons can be a good option to use.
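For unlabeled text, a lexicon-based score might look something like this rough sketch with tidytext (the tiny `reviews` tibble here is made-up example data, not from the post, and the AFINN lexicon is downloaded via the textdata package):

```r
library(tidyverse)
library(tidytext)

# made-up example reviews with no labels
reviews <- tibble(
  review_id = 1:2,
  text = c(
    "I love this game, it is relaxing and fun",
    "Terrible experience, the online play is broken"
  )
)

reviews %>%
  unnest_tokens(word, text) %>%
  # join each word to its AFINN value (-5 to 5); may prompt a textdata download
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(review_id) %>%
  summarise(sentiment = sum(value))
```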

BehnamCA commented 3 years ago

@juliasilge, thanks a lot. It was very helpful as always.

nguyenlovesrpy commented 2 years ago

I have learnt a lot from your tutorials. Why should we center and scale the data for this model? I tried searching on Google, but I still can't find the answer. Could you explain this?

juliasilge commented 2 years ago

In our book we have a set of recommended preprocessing steps for different models. For glmnet in particular, check out the "Preprocessing requirements" in the parsnip docs to see what happens there.
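As a rough sketch (not the exact recipe from the post; the names `review_train`, `rating`, and `text` are assumptions), the normalization typically shows up as the last preprocessing step, because glmnet applies a single penalty to all coefficients and so the predictors need to be on a comparable scale:

```r
library(recipes)
library(textrecipes)

review_rec <- recipe(rating ~ text, data = review_train) %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens = 500) %>%
  step_tfidf(text) %>%
  # put all tf-idf features on the same scale for the regularized model
  step_normalize(all_predictors())
```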

conlelevn commented 2 years ago

Although text mining is not my major field, it is always very helpful to watch your model fitting process. One question about this screencast: you created the test dataset but did not use it when you called the last_fit() function... what's the reason for that? Did the workflow handle it already?

juliasilge commented 2 years ago

@conlelevn The last_fit() function takes the split as its input, which contains the info on both the training and testing data sets. This function will fit one final time to the training data and evaluate one final time on the testing data.
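The basic pattern looks roughly like this (object names such as `final_wf` and `review_split` are stand-ins for the final workflow and the initial split):

```r
library(tidymodels)

# fits once on the training portion of review_split and
# evaluates once on the testing portion
final_fit <- last_fit(final_wf, review_split)

collect_metrics(final_fit)      # metrics computed on the test set
collect_predictions(final_fit)  # test-set predictions
```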

mohamedelhilaltek commented 11 months ago

Thank you Julia for your blogs, I learn a lot from you. I have a question: why did you say, when you plotted a histogram of the number of words in each review, that it is a weird distribution?

juliasilge commented 11 months ago

@mohamedelhilaltek I believe you are referring to when I was looking at this plot:

[image: histogram of the number of words per review]

Notice that sharp drop/dip around 100 words or so? Almost certainly that can't be a "real", natural distribution for the number of words that people use in their reviews; it is instead an artifact of how this data was collected.
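The kind of check being discussed could be sketched like this (assuming a `reviews` data frame with a `text` column; this is not the exact code from the post):

```r
library(tidyverse)

reviews %>%
  mutate(n_words = tokenizers::count_words(text)) %>%
  ggplot(aes(n_words)) +
  # a sharp dip around ~100 words would show up in this histogram
  geom_histogram(binwidth = 5)
```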

acarpignani commented 7 months ago

Hi @juliasilge, sorry for asking. Is it possible that the vip package has changed some specifications between 2020 and 2023? I have done exactly the same steps that you did, and even though the predictions are comparable, the Importance values are really different: yours go from 0 to 15 and mine from 0 to 0.5. In fairness, I don't really know what you do when you use vi(), so I am following along without really knowing what I'm doing in that part, but the difference in these numbers is astonishing and I can't figure out why. Any help getting to the bottom of this would be greatly appreciated.

Thank you again for all your hard work with these videos and for the interesting tutorials that you have posted on youtube. I've gained a wealth of knowledge thanks to you.

juliasilge commented 7 months ago

@acarpignani No need to apologize! I think the vip documentation is pretty good at explaining what it's doing for different models. In the case here, we are using model-specific variable importance (i.e. getting the variable importance from the structure of the model itself), and you can read about what that means for different kinds of models.

I'm using glmnet here and it looks like there have definitely been some fixes for glmnet since I wrote this blog post. The differences are probably due to that. 👍
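For what it's worth, the basic pattern for model-specific importance with a fitted glmnet model looks roughly like this (object names like `final_fit` and `best_penalty` are assumptions, and the scale of the values can shift across glmnet/vip versions):

```r
library(tidymodels)
library(vip)

final_fit %>%
  extract_fit_parsnip() %>%
  # for glmnet, importance comes from the coefficients at the chosen penalty
  vi(lambda = best_penalty) %>%
  mutate(Importance = abs(Importance)) %>%
  slice_max(Importance, n = 20)
```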

acarpignani commented 7 months ago

Thank you ever so much, @juliasilge. It must be because of that. It's a shame, because now the importance bar chart gives me far fewer variables, most of them being zero. But I got the overall idea of this tutorial. By the way, I wholly adore the way you reshape the data sets to make the data set you need. I wish I were 1% as good at that as you are 😍