marcdotson opened 1 year ago
@marcdotson Apologies for not having this when I said I would. I ended up running the BERT model overnight, only to discover in the morning that there was a typo. I also went overkill and cobbled together three different models; I wanted to really explore what was out there, and as it turns out, there is a lot. Unfortunately, it's not quite as plug-and-play as I would've liked. The word2vec tutorial I used had a painfully slow custom function to clean up the data, and even running it on a sliver (1,000 obs) of the actual dataset took ~10 minutes. I uploaded a CSV called `python_data.csv`, which is just the `word_tokens.rds` I tinkered with a bit: most of the punctuation and stop words should be removed, and the text should all be lowercase. For word2vec and TensorFlow, using `python_data.csv` would probably be best. I'm still trying to understand BERT, as it is more complex than the others, but if we wanted to use it, I think we would want the original transcript.
If you'd like, I can split them into three different .py files instead of one .ipynb, since it would be more cohesive with the structure of the repository. For now, however, my brain needs a bit of a break from this.
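Since the word2vec step was the slow part, it may help to see what the model actually consumes. Here is a stdlib-only sketch of the (target, context) skip-gram pairs that word2vec trains on; the tokens and window size are illustrative, not taken from `python_data.csv`.

```python
# Minimal sketch: extract the (target, context) skip-gram pairs that
# word2vec trains on. Tokens and window size are toy values.

def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs within +/- window of each position."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["revenue", "grew", "this", "quarter"]
print(skipgram_pairs(tokens, window=1))
```

In practice a library like gensim would handle this internally; the sketch just shows why the window size and the cleaned token stream matter.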
@wtrumanrose great conversation with Carly. Here are her recommendations when we have time to jump back into this:
In addition to k-means, some highlights from Carly in the recent email chain, moving here to preserve:
We should compare k-means to a topic model.
If you're looking to do clustering for groupings of terms then that does make sense to leave them as word embeddings. Is there a reason you aren't doing a topic model? It seems like you could achieve something similar by running a topic model to create topics and then examine the top features of those topics as well as the breakdown of topics within a document.
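To make the "examine the top features of those topics" step concrete, here is a stdlib-only sketch. The topic-term weights are toy values; in practice they would come from a fitted topic model (e.g. scikit-learn's `LatentDirichletAllocation.components_`), and the vocabulary here is purely illustrative.

```python
# Sketch: given a topic-term weight matrix from a fitted topic model,
# pull the top features per topic. The matrix and vocab are toy values.

vocab = ["margin", "growth", "cloud", "churn", "guidance"]
topic_term = [
    [0.40, 0.35, 0.05, 0.05, 0.15],  # topic 0 weights over vocab
    [0.05, 0.10, 0.45, 0.30, 0.10],  # topic 1 weights over vocab
]

def top_terms(weights, vocab, n=2):
    """Return the n vocabulary terms with the largest weights."""
    order = sorted(range(len(vocab)), key=lambda i: weights[i], reverse=True)
    return [vocab[i] for i in order[:n]]

for k, row in enumerate(topic_term):
    print(k, top_terms(row, vocab))
```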
And to affinity propagation.
And affinity propagation is just another centroid-based clustering algorithm. The main difference between KMeans and Affinity Prop is that you don't have to predefine a set number of clusters, and it also identifies an exemplar observation for each cluster instead of describing the cluster by its average characteristics.
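To make the exemplar idea concrete, here is a stdlib-only toy sketch: the exemplar is the actual observation with the highest total similarity to the rest of its cluster, using the negative squared Euclidean distance similarity that affinity propagation typically uses. The full algorithm's responsibility/availability message passing is omitted, and the points are toy values.

```python
# Toy illustration of the "exemplar" idea: unlike a k-means centroid
# (an average), the exemplar is an actual observation. Similarity is
# negative squared Euclidean distance, as in affinity propagation.

def exemplar(points):
    """Return the point with the highest total similarity to all others."""
    def total_sim(p):
        return sum(-((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) for q in points)
    return max(points, key=total_sim)

cluster = [(0.0, 0.0), (1.0, 0.0), (0.9, 0.1), (5.0, 5.0)]
print(exemplar(cluster))
```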
Hi, @docsfox. Let's try using this issue?
Using the randomly sampled subset of 1 million word tokens and the 50-dimensional word embeddings, I've compared a range of possible topics and clusters. Tuning for the number of topics produces a wacky bend in the log-likelihood, but comparing both, I'm going to go ahead and look for a marketing topic and cluster where k = 25. I just have this in R currently (see `/code/04_dictionary-identification.R`), but I'm interested in the Python comparison.
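For the Python side, the tuning loop can be sketched with the stdlib alone: run k-means for several k and watch where the inertia (within-cluster sum of squares) bends. The 2-d toy points stand in for the 50-d embeddings, and the deterministic first-k-points init is only for reproducibility of the sketch; in practice scikit-learn's `KMeans` would do this.

```python
# Stdlib sketch of tuning k: the "bend" is where inertia stops dropping
# quickly. Toy 2-d points stand in for the 50-d word embeddings, and the
# first-k-points init keeps the example deterministic.

def kmeans(points, k, iters=20):
    centers = points[:k]  # deterministic init for the sketch only
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            groups[j].append(p)
        centers = [tuple(sum(x) / len(g) for x in zip(*g)) if g else centers[j]
                   for j, g in enumerate(groups)]
    return sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
               for p in points)

pts = [(0.0, 0.0), (5.0, 5.0), (9.0, 0.0), (0.1, 0.2), (5.1, 4.9), (9.2, 0.1)]
for k in (1, 2, 3):
    print(k, round(kmeans(pts, k), 3))
```

With three tight toy clusters, inertia drops sharply up to k = 3 and then flattens; on the embeddings, the same curve is what the log-likelihood/inertia comparison above is reading the bend off.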
Hey Marc,
Great, I started running with Fast Kmeans on Friday, which was actually pretty quick, breaking up the data into 1,000 mini-batches. I'll be working on it this afternoon and will let you know if I get similar results with the 50d word embeddings!
Best, Carly
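For reference, the mini-batch trick behind fast k-means (as in Sculley's web-scale k-means, which scikit-learn's `MiniBatchKMeans` implements) can be sketched in a few lines of stdlib Python: each batch point pulls its nearest center toward it with a per-center learning rate of 1/count. The centers, counts, and batch below are toy values.

```python
# Stdlib sketch of one mini-batch k-means step: each batch point nudges
# its nearest center with a per-center learning rate of 1/count, so
# centers move a lot early and settle as counts grow. Toy values only.

def minibatch_step(centers, counts, batch):
    for x in batch:
        j = min(range(len(centers)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centers[c])))
        counts[j] += 1
        eta = 1.0 / counts[j]  # per-center learning rate
        centers[j] = tuple(c + eta * (a - c) for a, c in zip(x, centers[j]))
    return centers, counts

centers = [(0.0, 0.0), (5.0, 5.0)]
counts = [0, 0]
batch = [(0.2, 0.0), (4.8, 5.2), (0.0, 0.4)]
centers, counts = minibatch_step(centers, counts, batch)
print(centers, counts)
```

Processing the data as 1,000 such batches is why the run finishes so much faster than full-batch Lloyd iterations over all 1 million tokens.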
![clustering-km_tune](https://user-images.githubusercontent.com/29615257/272153340-bb31dbfd-d81f-47b2-b91b-656d78a3ba2c.png)
![clustering-lda_tune](https://user-images.githubusercontent.com/29615257/272153377-35b2e95b-e3ef-4e76-a4a8-3de83e4d2365.png)