juliasilge / juliasilge.com

My blog, built with blogdown and Hugo
https://juliasilge.com/

Word Vectors with tidy data principles | Julia Silge #22

Open utterances-bot opened 3 years ago

utterances-bot commented 3 years ago

Word Vectors with tidy data principles | Julia Silge

Last week I saw Chris Moody’s post on the Stitch Fix blog about calculating word vectors from a corpus of text using word counts and matrix factorization, and I was so excited!

https://juliasilge.com/blog/tidy-word-vectors/

lijah11b commented 3 years ago

Hello! Would you mind specifying which versions of Python and RStudio you were running in this example?

juliasilge commented 3 years ago

No Python in this blog post, just R! I unfortunately did not note that at the time of publication, but you can see similar work in our almost-completed book that does include versioning information.

epeksoy commented 3 years ago

Hello. Thanks for this great tutorial. I am getting an error after running the search_synonyms function and cannot run a search: "Error: Can't rename columns that don't exist. x Column .row.names doesn't exist." I tried on two different PCs and got the same result. Would you mind telling me what I am doing wrong?

juliasilge commented 3 years ago

@epeksoy Unfortunately I can't tell from just that information what has gone wrong, but you might take a look at this more up-to-date version of finding word vectors in my soon-to-be-published book and see if that is more helpful. If you are still having trouble, I would try to create a reprex demonstrating your problem and post it on RStudio Community to get some help.

tiagomramalho commented 1 year ago

Hi! Thank you so much for this and other tutorials. I am commenting because I believe I got an error similar to the one @epeksoy reported. I am not sure (as I am pretty much a beginner at this), but after googling a little, it seems that there is some kind of conflict between the function search_synonyms and more recent releases of the broom package. Something about tidy.numeric being deprecated... (?) Anyway, hope this helps :)

Just to add, here is the error I get:

Error in chr_as_locations():
! Can't rename columns that don't exist.
✖ Column .rownames doesn't exist.
Run rlang::last_error() to see where the error occurred.
Warning message:
'tidy.numeric' is deprecated. See help("Deprecated")

juliasilge commented 1 year ago

@tiagomramalho Instead of the approach in this older blog post, take a look at how we outlined a very similar task in our recent book. The function nearest_neighbors() in that chapter finds the nearest synonyms, in basically the same way as search_synonyms() from this blog post.

PatoLocos commented 1 year ago

Hi Julia, maybe my Google skills are weak, but I haven't been able to find any of your articles about sentence embeddings. OpenAI has a very cheap model that can spit out embeddings, but I have a few massive text corpora (mainly call center data) that are domain-specific to finance and banking, so I don't find any of these massive embeddings appropriate.

Do you have an R-based method for sentence embeddings? If so, token length and computational cost will surely be something to talk about.

juliasilge commented 1 year ago

@PatoLocos I did mention looking into word and document vectors, like in the Stitch Fix post, but I have not actually spent time with that. I think the best I can point you to is the more polished work I did on word vectors in our book last year. You might take a look at that and see how tough it is to extend it beyond only words.