juliasilge / juliasilge.com

My blog, built with blogdown and Hugo
https://juliasilge.com/

Predict #TidyTuesday NYT bestsellers | Julia Silge #71

Open utterances-bot opened 2 years ago

utterances-bot commented 2 years ago

A data science blog

https://juliasilge.com/blog/nyt-bestsellers/

gunnergalactico commented 2 years ago

Hello Julia,

I am working my way through your SMLTAR book, and I have a question about max_tokens. How do you decide what is an appropriate number to use in the model? In some of your other videos, you've gone as high as 1000 and as low as 100. In a real-world problem, what are some of the best tips for picking the correct number of tokens?

Thanks!

juliasilge commented 2 years ago

@gunnergalactico For something like "regular" natural language, I start on the higher side (in the thousands) because the vocabulary is larger. For some of the examples I work through that have very constrained vocabularies, like this example of names, going with a smaller number of tokens is better. Overall, though, it's good to realize that the number of tokens is really a hyperparameter of the model and you can tune it.
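
To make the "it's a hyperparameter" point concrete, here is a minimal Python sketch. It is a stand-in for illustration, not the textrecipes code used in the blog post. It assumes max_tokens means "keep the N most frequent tokens", and the toy titles and the helper names (tokenize, top_tokens, coverage) are all hypothetical. It only checks how much held-out text each candidate vocabulary size covers; when actually tuning, you would compare model performance across resamples for each candidate value instead.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def top_tokens(docs, max_tokens):
    """Return the set of the `max_tokens` most frequent tokens in `docs`."""
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    return {tok for tok, _ in counts.most_common(max_tokens)}


def coverage(vocab, docs):
    """Fraction of token occurrences in `docs` that fall inside `vocab`."""
    tokens = [tok for doc in docs for tok in tokenize(doc)]
    if not tokens:
        return 0.0
    return sum(tok in vocab for tok in tokens) / len(tokens)


# Made-up toy data, for illustration only.
train = [
    "the girl on the train",
    "the girl with the dragon tattoo",
    "gone girl",
    "the silent patient",
]
heldout = [
    "the girl in the window",
    "the last girl",
]

# Treat the vocabulary size as a tunable value: try several candidates
# and look at how the held-out coverage changes.
for n in (1, 2, 10):
    vocab = top_tokens(train, n)
    print(n, coverage(vocab, heldout))
```

With a tiny, constrained vocabulary like this one, coverage stops improving after a handful of tokens, which matches the intuition that small vocabularies call for a small max_tokens, while ordinary prose keeps benefiting from larger values.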