@marcdotson Latest push is a function I've been cooking up to get some summary info (Company, Quarter, Word Count, and Date). This may or may not be redundant, but it's a start.
As for dictionaries, there's a package called 'edgar' which is highly useful. It includes the Loughran-McDonald main dictionary, as well as a bunch of tools to use it. Here's the link: https://search.r-project.org/CRAN/refmans/edgar/html/00Index.html
I'm going to work on importing the stop word lists as well as searching for other useful packages with relevant dictionaries.
@wtrumanrose I think I fixed the issue with the line returns. Line returns are special characters (`\n`) that `unnest_tokens()` removes like punctuation or HTML, so I first replace them with spaces so the words at the end and beginning of each line don't get concatenated. Try running `01_import-data` again and see if that addresses it for your visualizations.
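For reference, here is a minimal sketch of the fix described above, assuming a `transcripts` tibble with a `text` column (these names are illustrative, not necessarily what `01_import-data` uses):

```r
library(dplyr)
library(stringr)
library(tidytext)

tidy_calls <- transcripts |>
  # replace line returns with spaces so the last word of one line and the
  # first word of the next don't get concatenated during tokenization
  mutate(text = str_replace_all(text, "\n", " ")) |>
  unnest_tokens(word, text)
```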
@marcdotson It works!
On another note, here are some updates. The code currently is super memory intensive, mostly because I'm running the data through a variety of different stop word lexicons. Currently, there are 8 variations of the data, one per stop word list (e.g., from the `stopwords` package, which also has SMART and Snowball). I initially ran the word embeddings with all the stop words removed, which may be more informative but excludes a ton of words. I've updated it to remove none of the stop words, though neither extreme is ideal for what we will do. If you want, I can do an embedding for each stop word list, but that would really tax the memory, so I figured we might as well just figure out what stop words we want to use first and go from there.
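For concreteness, a minimal sketch of producing one version of the tokenized data per stop word lexicon, assuming a `tidy_calls` tibble with a `word` column (names and lexicon choices are illustrative):

```r
library(dplyr)
library(purrr)
library(tidytext)

# one tokenized data frame per stop word lexicon
lexicons <- c("snowball", "smart", "stopwords-iso")

tidy_by_lexicon <- map(
  set_names(lexicons),
  ~ anti_join(tidy_calls, get_stopwords(source = .x), by = "word")
)
```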
Also, I need to keep looking for a list of business/marketing related words. I don't think Loughran-McDonald has a list of those specific words, but I need to double check.
My brain hurts from trying to get `ggplot2` to work, but everything is mostly ready to plot. I'm not sure how fancy you want to get with it; I was just going to do a whole mess of word counts.
@wtrumanrose stick with no stop words for now and produce the visualizations. Fancy is the enemy at the moment, so let's just get some word counts put together. It's time we start producing the report as well, so I'll outline that.
@wtrumanrose added a Data section to the renamed paper template right after the Introduction. Let's just work on it there: `earnings-calls.Rmd` in `Writing`.
@marcdotson Remember how I said the main dictionary didn't have what we needed? Turns out I'm a liar. The dictionary from 2011 is contained within the main dictionary, although the CSV labels it as 2009. The dictionary is already in the `edgar` package; all we need to do is filter by 2009, and we get the finance words they used in the Manager Sentiment paper. I need to go through the 2011 Loughran and McDonald paper again, but we should at least have finance/business terms. We still need marketing-specific terms, however.
@wtrumanrose good. See if there are marketing terms as a subset of those finance/business terms.
@marcdotson will do. This paper also discusses three other word lists in section 3.2, as well as examples of where they were used. I'll look into those as well, but it seems Loughran-McDonald is the best place to start for now.
@wtrumanrose okay, I've pushed changes to `01_import-data`:

- `quarter` and `year` are extracted from the title so the join with the firm performance data is exactly correct.
- `date` is now `call_date` to distinguish it from `quarter` and `year`, which may not match.
- A `call_data.rds` that has everything.

@marcdotson In `01_import-data`, when you're cleaning up the year/quarter, should there be a "d" at the end of `(S|s)econ`?

Yep, thanks.
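For reference, a hedged sketch of the kind of year/quarter cleanup being discussed, using a hypothetical `title` column (the actual patterns in `01_import-data` may differ):

```r
library(dplyr)
library(stringr)

calls <- calls |>
  mutate(
    quarter = case_when(
      str_detect(title, "Q1|(F|f)irst")  ~ 1L,
      str_detect(title, "Q2|(S|s)econd") ~ 2L,  # note the "d" at the end
      str_detect(title, "Q3|(T|t)hird")  ~ 3L,
      str_detect(title, "Q4|(F|f)ourth") ~ 4L
    ),
    year = as.integer(str_extract(title, "(19|20)\\d{2}"))
  )
```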
@marcdotson The most recent push imports the Loughran-McDonald word list and makes it binary--I was attempting to make some visualizations, but nothing really came of it. The positive/negative columns are orthogonal, but the uncertainty, litigious, and modal columns aren't, which means there is some overlap. As far as word lists go, I have yet to find any that are explicitly tailored to marketing.
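As an illustration of what "making it binary" can look like, here is a minimal sketch using the Loughran-McDonald lexicon exposed through tidytext's `get_sentiments("loughran")` (requires the textdata package); the push itself pulls the dictionary from the `edgar` package, so treat the source here as a stand-in:

```r
library(dplyr)
library(tidyr)
library(tidytext)

# word-by-category indicator matrix; categories can overlap for a given word
lm_binary <- get_sentiments("loughran") |>   # columns: word, sentiment
  mutate(flag = 1L) |>
  pivot_wider(
    names_from  = sentiment,                 # positive, negative, uncertainty, litigious, ...
    values_from = flag,
    values_fill = 0L
  )

# tag each token with its (possibly overlapping) categories
# tagged_tokens <- word_tokens |> left_join(lm_binary, by = "word")
```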
@wtrumanrose let's not close this until we're done with the initial EDA and have something to report.
I'll be gone this week for a conference, but I've started the process of working with the entire dataset, so hopefully what you've set up for visualizations and word embeddings is ready to scale.
@marcdotson whoops--I didn't realize I hit close with comment, my bad. Are there any other groupings I should consider for visualizations (e.g., by company, by quarter)? Or just by year?
Also, when I run the import data code, I only get the Apple earnings calls. Is this intended?
@wtrumanrose I'm not done with it yet, but I've replaced the text in the shared folder with a single `transcripts.rds` that has everything.
@marcdotson I made a dashboard in the latest push--let me know what you think. The `transcripts.rds` file took quite a while to load, so I wasn't able to clear my environment and rerun everything like I normally do--there might be some errors. Tomorrow I'm going to make sure everything is able to be scaled up and see if I can clean up some of the code.
@wtrumanrose awesome use of a dashboard! A few things:
@marcdotson Academic papers seem to be a bust, so I've expanded to just a plain ol' Google search. I found this, which might be a good place to start, but I would have to scrape it all (which wouldn't be hard):

https://marketing-dictionary.org/

There are quite a few other term lists like this, but this one seems to be the most comprehensive and legitimate. I'll probably try to read through a few transcripts as you recommended; it seems like we will have to exercise a good deal of judgement regardless of how we do it, and understanding the transcript language would only help better inform those judgements.
@wtrumanrose the challenge is the dictionary needs to be vetted specifically for earnings calls, and I don't think this fits the bill. I still think that the marketing terms we want are a subset of the existing earnings calls dictionary.
I worked a bit during my conference trip, but I kept running into the same sort of problems. I think we're both going about this awkwardly. I'm going to go through the entire list of transcript names to identify the patterns needed to finish extracting all the information we need. Please start reading through some of the transcripts and identify what kind of language is being used that might be marketing-related within the context of earnings calls.
@marcdotson Are you going to be here today? I read through a couple, and it would be nice to talk through them.
@marcdotson I apologize I wasn't able to get this pushed sooner--I had a flat tire I was dealing with. I went through the dictionary and pulled ~160 marketing terms, although I'm not super confident in the list. I tried to stick to terms that are pretty unambiguously marketing, which, as it turns out, isn't very easy to do in some cases. I updated the dashboard to reflect those word counts as well. The CSV is in the Google Drive.
@wtrumanrose thanks again for your help. I've made what I think is the most accurate pass I can. We have a complete dataset. I'm uploading `call_data.rds` to the shared drive so you don't have to run it again. Let's get on what we discussed and I'll start figuring out how to use what we have for supervised learning.
@marcdotson awesome opossum, I'm working with it now. Right now I'm working on actually getting the tokenizing done--I neglected to sufficiently address the scalability of the code, and I'm paying the price for it. I have a couple of ideas for dealing with this though, so hopefully it won't take long; worst case is I just split up the data. Also, I probably won't go to the Tanner building today, but if you want to chat just let me know and I can hop on Zoom. I'm planning on completing:
If I have time, I'm going to review the marketing words dictionary and rework the sentiment and revenue aspects of the dashboard. I'll update this list as I work.
Word Embedding Decisions
Following section 5.2 of the SMLTAR book for word embeddings is straightforward, but it requires some judgement calls that could influence our results. These decisions involve function arguments. Your input would be greatly appreciated on the following:
1. SMLTAR first filters out infrequent tokens--in their example, they did it with `n >= 50`. Their corpus was ~117,000 documents compared to our ~177,000, and their documents are significantly shorter than ours, likely ranging from 50-200 words. Filtering out more words is certainly better for computational reasons, but we lose some information.
2. Perhaps the most influential on our results is the window size argument. A smaller window size seems to focus more on the linguistic/semantic properties of words, while larger window sizes provide more information on the topical context. Again, there's a computational tradeoff as well, with larger window sizes being more computationally intensive.
3. Finally, the `widely_svd()` function in the `widyr` package has a couple of arguments which I believe are related to the dimensionality of the vectors. These arguments are `nv` and `maxit`, but I don't understand enough about them to know what might be optimal. I'll look through the documentation later, but for now I might just use the 100/1000 values they set (see the sketch below).
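For what it's worth, here is a minimal sketch of the SMLTAR-style pipeline with those three decisions called out. It uses document-level co-occurrence rather than the book's sliding windows, and assumes a `word_tokens` tibble with `id` and `word` columns:

```r
library(dplyr)
library(widyr)

word_vectors <- word_tokens |>
  add_count(word, name = "word_n") |>
  filter(word_n >= 50) |>            # decision 1: minimum token frequency
  pairwise_pmi(word, id) |>          # co-occurrence within a call; SMLTAR instead slides
                                     # a window over each document (decision 2: window size)
  widely_svd(item1, item2, pmi,      # decision 3: nv (number of dimensions) and maxit
             nv = 100, maxit = 1000) # (iteration cap) are passed through to irlba
```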
@wtrumanrose awesome. I'll have to review word embeddings before I can get back to you on that. This all sounds good otherwise. I'll get the marketing terms list to the co-authors to review if you can get me a visualization of the most used words overall once the tokenizing is working.
@marcdotson Tokenization took a lot longer than expected--I had to slice the documents into 4 groups, remove all the stop words from each separately, then find all the words which occurred fewer than 100 times across all 4 groups and filter those out. It looks like there are still some spacing problems like before, so I'm going to go back soon and look through it, but the words which are stuck together should be essentially random, so I don't think it will influence our results. Even after the extreme pruning, the full data frame has ~400,000,000 rows. I started the word embeddings code late last night and it's still going; I have no idea when it will finish. I should be able to get the dashboard code set up at least, so we would just be waiting on the embeddings to finish.
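A rough sketch of that slice-and-prune approach, with assumed names (`transcripts` with `id` and `text` columns), in case it helps when revisiting the spacing issue:

```r
library(dplyr)
library(stringr)
library(tidytext)

n_groups <- 4

# tokenize and remove stop words one slice at a time to limit memory use
token_list <- transcripts |>
  mutate(group = (row_number() - 1) %% n_groups + 1) |>
  group_split(group) |>
  lapply(function(slice) {
    slice |>
      mutate(text = str_replace_all(text, "\n", " ")) |>
      unnest_tokens(word, text) |>
      anti_join(get_stopwords(), by = "word")
  })

# recombine and prune words that occur fewer than 100 times overall
word_tokens <- bind_rows(token_list) |>
  add_count(word, name = "word_n") |>
  filter(word_n >= 100) |>
  select(-word_n, -group)
```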
@wtrumanrose here are the changes I've made/pushed:

- `call_data.rds` now includes industry information using the GICS standards. From aggregate to disaggregate, they are: `sector`, `group`, `industry`, and `sub_industry`.
- Each call has an `id`.
- `02_data-wrangling` is split into `02_word-counts` and `03_word-embeddings`.
- `title` and `text` are cleaned up (including stripping punctuation and cleaning up contractions), tokenized, and the generic stop words removed.
- Word counts are in `02_word-counts`.
- The scripts are now `01_data-wrangling`, `02_tokenizing`, `03_word-embeddings`, and `04_eda`.

I can't save the `word_tokens.rds` as a file. I don't have enough space on my hard drive to write it and then compress it, which is ridiculous, but I'm out of ideas here, so I have the code in `02_tokenizing` to reconstruct it. If you can save it, great, add it to the shared folder.

As you work through the code, match the commenting and coding style I've done in `01_data-wrangling` especially. The comments are something of an outline of what we might eventually put in the methods or appendix of the paper.
@wtrumanrose the problem when writing the tokens, and maybe you've run into it, is that `.rds` is compressed. However, you need more space than the final `.rds` file takes up to write it initially, since it appears to go through that compression process as it's writing.

That's my best guess. However, 500GB of space was insufficient. I plugged in the old iMac, which has more hard drive space since it's not all SSD, and ran it there, and 2.2TB of space was insufficient.

So I'm at a loss. Sorry to leave you holding the bag, but you figured it out once before, so hopefully you can figure it out this time as well. I've set up loops so if you need to do the tokens processing in more than 4 steps, you just need to change one variable. Keep me posted!
@marcdotson Thank you Marc--this is all great. I was able to save `word_tokens.csv`, which was only 7ish GB. It's strange that you aren't able to save it as an `.rds`; I'll look into it and see what I can get. I'll update this issue tomorrow with what I've found.
@marcdotson I spent a good chunk of my day yesterday looking into some dimensionality reduction methods (mostly hashing), but it doesn't seem to be what we need. I think my work is pretty cut out for me though.
Another option for text preprocessing which I've looked into a little bit is the `textrecipes` package. However, I'm not sure when using a recipe is preferable to "normal" preprocessing--if I remember correctly, recipes are better when you are constantly receiving data, which would be the case for most marketing analytics contexts. I'm not sure about using recipes in academic contexts, but if you'd like me to set up our current text preprocessing using `textrecipes`, just let me know.
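If we do go the `textrecipes` route, a minimal sketch of what the preprocessing might look like (the outcome, column names, and thresholds are assumptions):

```r
library(recipes)
library(textrecipes)

call_recipe <- recipe(revenue ~ text, data = call_data) |>
  step_tokenize(text) |>                        # split text into word tokens
  step_stopwords(text) |>                       # remove Snowball stop words (the default)
  step_tokenfilter(text, max_tokens = 1000) |>  # keep only the most frequent tokens
  step_tf(text)                                 # convert tokens to term-frequency features

call_prep <- prep(call_recipe)
```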
@marcdotson I went ahead and reworked `02_tokenizing` quite a bit so I could get everything to actually run. Even with all the memory-saving changes I made, it still took me 30 splits to do it, but I have `word_tokens.rds` now, so that's nice. I'm not sure if you want to keep the changes I made (e.g., writing and deleting a bunch of `.rds` files, nesting words), but it's what I had to do to get it to work on my system.

I'm going to see if I can keep the words nested for now, as it allows me to keep the firm data and word tokens in the same data frame. If it becomes overly complicated, I'll probably just split up the firm data and the tokens and rejoin them when needed.
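A minimal sketch of the nesting idea, assuming `word_tokens` has columns `id` and `word` and that `call_data` joins on `id`:

```r
library(dplyr)
library(tidyr)

call_tokens <- word_tokens |>
  nest(tokens = word) |>             # one row per call, tokens as a list-column
  left_join(call_data, by = "id")    # keep firm data and tokens in the same data frame
```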
You can proceed with the rest of the analysis. Whether it's a `.csv` or `.rds` shouldn't matter as long as there isn't any weird encoding that gets tacked on with `.csv`.
-Marc
@wtrumanrose `overall_word_counts` and `sector_word_counts` are in the shared drive. I tried to produce some plots, but it broke. Progress?
@marcdotson I actually figured out how to make new nested data frames with the counts of each marketing word in them, but I haven't tried any plotting with them since you've already done the unnested word counts. Do you know what other plots still need to be made? It looks like word counts are taken care of; we could still do regressions, I suppose, though I think that would just take the form of a table.
I haven’t gotten word counts to plot properly yet. It’s all on the table, including saving images.
-Marc
@marcdotson I figured it out--the data was grouped by `id` (I think), so all I did was add `ungroup() |> group_by(word) |> summarise(n = sum(n)) |>` to the plots and it worked. I'll upload them to the drive soon; I'm going to see what else I can get done unless you want me to push the changes now.
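For reference, a minimal sketch of the repaired word count plot, assuming a grouped `word_counts` tibble with `id`, `word`, and `n` columns:

```r
library(dplyr)
library(ggplot2)

word_counts |>
  ungroup() |>                         # drop the lingering group_by(id)
  group_by(word) |>
  summarise(n = sum(n)) |>
  slice_max(n, n = 25) |>              # keep the 25 most frequent words
  ggplot(aes(x = n, y = reorder(word, n))) +
  geom_col() +
  labs(x = "Word Count", y = NULL)
```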
@wtrumanrose nicely done. Stupid `group_by()`. You should be able to just drop them in `Figures`; they should be under a MB each, right?
@wtrumanrose I'm off the grid starting tomorrow, so I added some notes at the top of each of the sections in `04_eda.R`. Overall, we need to do exploration, especially using all those new variables I included. Take a look and see what you can produce, including nice-looking plots (play with the size of plots, etc.) as well as tables of correlations, and we'll send an update Monday.
@marcdotson With the new variables, do you think reviving the dashboard would be the best way to go? Or should we just stick to the plots? There are 12 separate sectors, 25 groups, 69 industries, and 157 sub-industries, so plotting beyond sectors might get crazy.

I made a new folder called "Figures" in the Google Drive too; the first 6 plots should be in there now.
@wtrumanrose the dashboard would be nice if we could get it to work with data this large.
@marcdotson Hey Marc--I hope you survived your family reunion. I've been grinding away at the dashboard, but I'm stuck with trying to get it to work. I ended up abandoning flexdashboard and going for a shiny app, but it didn't really solve any of my problems. Do you have time to meet over zoom tomorrow? I can walk you through what I've done so far and we can figure out how to finally put the nail in the coffin on this EDA. I'm going to tidy up some of the code I've been working with and push it tomorrow morning.
@wtrumanrose I haven't seen the push, but we need to talk so we have something to show Ryan. Let's hop into my Zoom room at 12:30 pm Mountain.

Here's what we need to finish and then we can wrap up `initial-eda`:
Moving into modeling, we need to talk word embeddings (parallelized?) and a lit review, with the "Manager sentiment and stock returns" paper as well as a look through JCP, JCR, MS, JMR, JM, and QME for different techniques used.
@marcdotson I'm pushing my changes now--I've made good headway into all 4 of those tasks, but there are still some things to finish up. I'll hop on your Zoom at 12:30.
@marcdotson The dashboard is pushed and I'm quite pleased with the results. It's not the prettiest thing in the world, but it gets the job done. Feel free to make any changes you see fit or tag me in this issue with things you'd like me to alter or add.
This was the book I referred to the most for making this--you might already be familiar with it. https://mastering-shiny.org/index.html
@wtrumanrose oh, that's definitely the book to use. Let's plan on you walking through the dashboard as part of the talk with Ryan today. I'm trying to make sense of the EDA.
Include additional outcome variables alongside `revenue`:
@marcdotson When you tried to pull XADQ from Compustat, did you exclude all the companies in the Utility sector? The "Manager Sentiment and Stock Returns" paper doesn't use that specific variable, but they do use Compustat and exclude Utility and Financial sector companies. I believe the data dictionary mentioned that XADQ wasn't available for Utility companies.
There isn't a way to filter by industry in the query. That said, it wouldn't make sense for XADQ to not be an option. If it's not present for Utility companies it should just be populated with missing values.
Thanks to Jake's financial database wizardry, we have EPS and analyst forecasts. The `call_data` and `word_tokens` now include the following outcome variables:

- `revenue` (as before)
- `earnings` (IBES EPS)
- `forecast` (median forecast based on one or more analyst predictions of EPS)
- `difference` (`earnings` - `forecast`)

Since it only makes sense to look at observations that have outcome values, we drop earnings calls that have missing revenue, earnings, or forecast data.
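For reference, a minimal sketch of how `difference` and the missing-outcome filter might be constructed, assuming the columns are named as listed above:

```r
library(dplyr)

call_data <- call_data |>
  mutate(difference = earnings - forecast) |>                  # earnings surprise
  filter(!is.na(revenue), !is.na(earnings), !is.na(forecast))  # keep calls with outcomes
```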
@wtrumanrose once you push whatever changes you're making to the dashboard, we should be ready to close out this branch.
There are a number of things we can consider trying. Use the `initial-eda` branch and discuss what works here. Some places to start: