marcdotson / earnings-calls

Exploration of relationships between marketing and valence term use and quarterly firm performance.
MIT License

Word embeddings, dictionary creation, and validation #12

Open marcdotson opened 1 year ago

wtrumanrose commented 1 year ago

@marcdotson Apologies for not having this when I said I would. I ended up running the BERT model overnight, only to discover in the morning that there was a typo. I also went overboard and cobbled together three different models: I wanted to really explore what was out there, and it turns out there is a lot. Unfortunately, it's not as plug-and-play as I would've liked. The word2vec tutorial I used had a painfully slow custom function to clean up the data, and even running it on a sliver (1,000 obs) of the actual dataset took ~10 minutes.

I uploaded a CSV called python_data.csv, which is just word_tokens.rds after some tinkering: most of the punctuation and stop words should be removed, and the text should all be lowercase. For word2vec and TensorFlow, using python_data.csv would probably be best. I'm still trying to understand BERT, as it's more complex than the others, but if we want to use BERT, I think we'd want to use the original transcripts.
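For reference, the kind of cleanup described here (lowercasing, stripping punctuation and stop words) can be sketched in a few lines of plain Python. The `clean_tokens` helper and the tiny stop-word list below are made up for illustration; they are not the tutorial's actual function, which presumably used a fuller stop-word list (e.g., NLTK's):

```python
import string

# Toy stop-word list for illustration only; the real prep likely used
# a standard English stop-word list.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "was"}

def clean_tokens(text: str) -> list[str]:
    """Lowercase, strip punctuation, and drop stop words,
    approximating the prep behind python_data.csv."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok and tok not in STOP_WORDS]

print(clean_tokens("Revenue grew 12% in the quarter, and margins improved."))
# → ['revenue', 'grew', '12', 'quarter', 'margins', 'improved']
```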

If you'd like, I can split them into three different .py files instead of one .ipynb, since that would be more consistent with the structure of the repository. For now, however, my brain needs a bit of a break from this.

marcdotson commented 1 year ago

@wtrumanrose great conversation with Carly. Here are her recommendations when we have time to jump back into this:

  1. Start with general pre-trained word embeddings. She prefers GloVe (have you used this before?) but word2vec may work as well. So it sounds like we were on the right path.
  2. There is also the possibility of just using LLM word embeddings (i.e., a pre-trained transformer). We should especially look at the Bloomberg LLM.
  3. The final step up would be to use transfer learning. Here we take an LLM's word embeddings and modify them to our specific context. We'd likely need a GPU cluster to do it still, and the Hugging Face library would provide access to the relevant transformer network's word embeddings.
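On recommendation 1: pretrained GloVe vectors ship as plain text, one token per line followed by its space-separated embedding values, so loading them into Python is straightforward. The `load_glove` helper and the two-line sample below are illustrative only (the real files, e.g. glove.6B.50d.txt, come from the Stanford NLP site):

```python
def load_glove(lines):
    """Parse GloVe's plain-text format: one token per line,
    followed by its space-separated embedding values."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings

# Made-up two-line sample in GloVe's format; real vectors are 50-300d.
sample = [
    "marketing 0.12 -0.03 0.55",
    "revenue 0.48 0.20 -0.11",
]
vectors = load_glove(sample)
print(len(vectors["marketing"]))  # 3 dimensions in this toy sample
```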
marcdotson commented 11 months ago

In addition to k-means, some highlights from Carly in the recent email chain, moving here to preserve:

We should compare k-means to a topic model.

If you're looking to do clustering for groupings of terms then that does make sense to leave them as word embeddings. Is there a reason you aren't doing a topic model? It seems like you could achieve something similar by running a topic model to create topics and then examine the top features of those topics as well as the breakdown of topics within a document.
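Carly's suggestion can be sketched with scikit-learn's LDA; the toy `docs` list below stands in for the transcript tokens, and two topics are chosen only to keep the example small:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy "documents" standing in for earnings-call transcripts.
docs = [
    "marketing campaign brand advertising spend",
    "brand marketing advertising customer campaign",
    "revenue earnings profit margin guidance",
    "earnings revenue guidance profit quarter",
]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=42).fit(counts)

# Topic breakdown within each document, as Carly describes.
doc_topics = lda.transform(counts)
print(doc_topics.shape)  # (4, 2)
```

The top features of each topic come from sorting `lda.components_` per row, mirroring the "examine the top features" step.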

And to affinity propagation.

And affinity propagation is just another centroid-based clustering algorithm. The main difference between k-means and affinity propagation is that you don't have to predefine a set number of clusters, and it also identifies an exemplar observation for each cluster instead of describing the cluster by its average characteristics.

marcdotson commented 9 months ago

Hi, @docsfox. Let's give using this issue a try?

Using the randomly sampled subset of 1 million word tokens and the 50-dimensional word embeddings, I've compared a range of possible topics and clusters. Tuning for the number of topics produces a wacky bend in the log-likelihood, but comparing both I'm going to go ahead and look for a marketing topic and cluster where k = 25. I just have this in R currently (see /code/04_dictionary-identification.R), but I'm interested in the Python comparison.

clustering-km_tune

clustering-lda_tune
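A rough Python analogue of the k-means side of that tuning, for the comparison mentioned above. This is a sketch only: random vectors stand in for the actual 50-d embeddings, and the k grid is arbitrary (the R code presumably sweeps a finer range around k = 25):

```python
import numpy as np
from sklearn.cluster import KMeans

# Random 50-d vectors standing in for the sampled word embeddings.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 50))

# Compare candidate cluster counts by within-cluster sum of squares
# (inertia), the usual elbow-style diagnostic.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=1).fit(X).inertia_
            for k in (5, 10, 25)}
print(inertias)
```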

docsfox commented 9 months ago

Hey Marc,

Great, I started running Fast KMeans on Friday, which was actually pretty quick, breaking the data up into 1,000 mini-batches. I'll be working on it this afternoon and will let you know if I get similar results with the 50d word embeddings!

Best, Carly
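Assuming "Fast Kmeans" here means mini-batch k-means (e.g., scikit-learn's MiniBatchKMeans; that mapping, and the batch size, are guesses), the setup might look like:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Random 50-d vectors standing in for the sampled word embeddings.
rng = np.random.default_rng(2)
X = rng.normal(size=(10_000, 50))

# Mini-batch k-means updates centroids from small batches of the data,
# which is why it scales to the full embedding matrix much faster than
# vanilla k-means.
mbk = MiniBatchKMeans(n_clusters=25, batch_size=1_000, n_init=3,
                      random_state=2).fit(X)
print(mbk.cluster_centers_.shape)  # (25, 50)
```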
