@marcdotson Latest push is a function I've been cooking up to get some summary info (Company, Quarter, Word Count, and Date). This may or may not be redundant, but it's a start.
As for dictionaries, there's a package called 'edgar' which is highly useful. It includes the Loughran-McDonald main dictionary, as well as a bunch of tools to use it. Here's the link: https://search.r-project.org/CRAN/refmans/edgar/html/00Index.html
I'm going to work on importing the stop word lists as well as searching for other useful packages with relevant dictionaries.
@wtrumanrose I think I fixed the issue with the line returns. Line returns are special characters (`\n`) that `unnest_tokens()` removes like punctuation or HTML, so I first replace them with spaces so the words at the end and beginning of each line don't get concatenated. Try running `01_import-data` again and see if that addresses it for your visualizations.
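For reference, here is a minimal sketch of the fix described above, assuming a `transcripts` tibble with a `text` column (these names are illustrative, not necessarily what `01_import-data` uses):

```r
library(dplyr)
library(stringr)
library(tidytext)

tidy_calls <- transcripts |>
  # replace line returns with spaces so the last word of one line and the
  # first word of the next don't get concatenated during tokenization
  mutate(text = str_replace_all(text, "\n", " ")) |>
  unnest_tokens(word, text)
```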
@marcdotson It works!
On another note, here are some updates. The code currently is super memory intensive, mostly because I'm running the data through a variety of different stop word lexicons. Currently, there are 8 variations of the data, one per stop word list (e.g., from the `stopwords` package, which also has SMART and Snowball). I initially ran the word embeddings with all the stop words removed, which may be more informative but excludes a ton of words. I've updated it to remove none of the stop words, though neither extreme is ideal for what we will do. If you want, I can do an embedding for each stop word list, but that would really tax the memory, so I figured we might as well just figure out what stop words we want to use first and go from there.
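For concreteness, a minimal sketch of producing one version of the tokenized data per stop word lexicon, assuming a `tidy_calls` tibble with a `word` column (names and lexicon choices are illustrative):

```r
library(dplyr)
library(purrr)
library(tidytext)

# one tokenized data frame per stop word lexicon
lexicons <- c("snowball", "smart", "stopwords-iso")

tidy_by_lexicon <- map(
  set_names(lexicons),
  ~ anti_join(tidy_calls, get_stopwords(source = .x), by = "word")
)
```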
Also, I need to keep looking for a list of business/marketing related words. I don't think Loughran-McDonald has a list of those specific words, but I need to double check.
My brain hurts from trying to get `ggplot2` to work, but everything is mostly ready to plot. I'm not sure how fancy you want to get with it; I was just going to do a whole mess of word counts.
@wtrumanrose stick with no stop words for now and produce the visualizations. Fancy is the enemy at the moment, so let's just get some word counts put together. It's time we start producing the report as well, so I'll outline that.
@wtrumanrose added a Data section to the renamed paper template right after the Introduction. Let's just work on it there: `earnings-calls.Rmd` in `Writing`.
@marcdotson Remember how I said the main dictionary didn't have what we needed? Turns out I'm a liar. The dictionary from 2011 is contained within the main dictionary, although the CSV labels it as 2009. The dictionary is already in the `edgar` package; all we need to do is filter by 2009, and we get the finance words they used in the Manager Sentiment paper. I need to go through the 2011 Loughran and McDonald paper again, but we should at least have finance/business terms. We still need marketing-specific terms, however.
@wtrumanrose good. See if there are marketing terms as a subset of those finance/business terms.
@marcdotson will do. This paper also discusses three other word lists in section 3.2, as well as examples of where they were used. I'll look into those as well, but it seems Loughran-McDonald is the best place to start for now.
@wtrumanrose okay, I've pushed changes to `01_import-data`:

- `quarter` and `year` are extracted from the title so the join with the firm performance data is exactly correct.
- `date` is now `call_date` to distinguish it from `quarter` and `year`, which may not match.
- A `call_data.rds` that has everything.

@marcdotson In `01_import-data`, when you're cleaning up the year/quarter, should there be a "d" at the end of `(S|s)econ`?

Yep, thanks.
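For reference, a hedged sketch of the kind of year/quarter cleanup being discussed, using a hypothetical `title` column (the actual patterns in `01_import-data` may differ):

```r
library(dplyr)
library(stringr)

calls <- calls |>
  mutate(
    quarter = case_when(
      str_detect(title, "Q1|(F|f)irst")  ~ 1L,
      str_detect(title, "Q2|(S|s)econd") ~ 2L,  # note the "d" at the end
      str_detect(title, "Q3|(T|t)hird")  ~ 3L,
      str_detect(title, "Q4|(F|f)ourth") ~ 4L
    ),
    year = as.integer(str_extract(title, "(19|20)\\d{2}"))
  )
```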
@marcdotson The most recent push imports the Loughran-McDonald word list and makes it binary--I was attempting to make some visualizations, but nothing really came of it. The positive/negative columns are orthogonal, but the uncertainty, litigious, and modal columns aren't, which means there is some overlap. As far as word lists go, I have yet to find any that are explicitly tailored to marketing.
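As an illustration of what "making it binary" can look like, here is a minimal sketch using the Loughran-McDonald lexicon exposed through tidytext's `get_sentiments("loughran")` (requires the textdata package); the push itself pulls the dictionary from the `edgar` package, so treat the source here as a stand-in:

```r
library(dplyr)
library(tidyr)
library(tidytext)

# word-by-category indicator matrix; categories can overlap for a given word
lm_binary <- get_sentiments("loughran") |>   # columns: word, sentiment
  mutate(flag = 1L) |>
  pivot_wider(
    names_from  = sentiment,                 # positive, negative, uncertainty, litigious, ...
    values_from = flag,
    values_fill = 0L
  )

# tag each token with its (possibly overlapping) categories
# tagged_tokens <- word_tokens |> left_join(lm_binary, by = "word")
```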
@wtrumanrose let's not close this until we're done with the initial EDA and have something to report.
I'll be gone this week for a conference, but I've started the process of working with the entire dataset, so hopefully what you've set up for visualizations and word embeddings is ready to scale.
@marcdotson whoops--I didn't realize I hit close with comment, my bad. Are there any other groupings I should consider for visualizations (e.g., by company, by quarter)? Or just by year?
Also, when I run the import data code, I only get the Apple earnings calls. Is this intended?
@wtrumanrose I'm not done with it yet, but I've replaced the text in the shared folder with a single `transcripts.rds` that has everything.
@marcdotson I made a dashboard in the latest push--let me know what you think. The `transcripts.rds` file took quite a while to load, so I wasn't able to clear my environment and rerun everything like I normally do--there might be some errors. Tomorrow I'm going to make sure everything is able to be scaled up and see if I can clean up some of the code.
@wtrumanrose awesome use of a dashboard! A few things:
@marcdotson Academic papers seem to be a bust, so I've expanded to just a plain ol' Google search. I found this, which might be a good place to start, but I would have to scrape it all (which wouldn't be hard):

https://marketing-dictionary.org/

There are quite a few other term lists like this, but this one seems to be the most comprehensive and legitimate. I'll probably try to read through a few transcripts as you recommended; it seems like we will have to exercise a good deal of judgement regardless of how we do it, and understanding the transcript language would only help better inform those judgements.
@wtrumanrose the challenge is the dictionary needs to be vetted specifically for earnings calls, and I don't think this fits the bill. I still think that the marketing terms we want are a subset of the existing earnings calls dictionary.
I worked a bit during my conference trip, but I kept running into the same sort of problems. I think we're both going about this awkwardly. I'm going to go through the entire list of transcript names to identify the patterns needed to finish extracting all the information we need. Please start reading through some of the transcripts and identify what kind of language is being used that might be marketing-related within the context of earnings calls.
@marcdotson Are you going to be here today? I read through a couple, and it would be nice to talk through them.
@marcdotson I apologize I wasn't able to get this pushed sooner--I had a flat tire I was dealing with. I went through the dictionary and pulled ~160 marketing terms, although I'm not super confident in the list. I tried to stick to terms that are pretty unambiguously marketing, which, as it turns out, isn't very easy to do in some cases. I updated the dashboard to reflect those word counts as well. The CSV is in the Google Drive.
@wtrumanrose thanks again for your help. I've made what I think is the most accurate pass I can. We have a complete dataset. I'm uploading `call_data.rds` to the shared drive so you don't have to run it again. Let's get on what we discussed and I'll start figuring out how to use what we have for supervised learning.
@marcdotson awesome opossum, I'm working with it now. Right now I'm working on actually getting the tokenizing done--I neglected to sufficiently address the scalability of the code, and I'm paying the price for it. I have a couple of ideas for dealing with this though, so hopefully it won't take long; worst case is I just split up the data. Also, I probably won't go to the Tanner building today, but if you want to chat just let me know and I can hop on Zoom. I'm planning on completing:
If I have time, I'm going to review the marketing words dictionary and rework the sentiment and revenue aspects of the dashboard. I'll update this list as I work.
Word Embedding Decisions
Following section 5.2 of the SMLTAR book for word embeddings is straightforward, but it requires some judgement calls that could influence our results. These decisions involve function arguments. Your input would be greatly appreciated on the following:
1. SMLTAR first filters out infrequent tokens--in their example, they did it with `n >= 50`. Their corpus was ~117,000 documents compared to our ~177,000, and their documents are significantly shorter than ours, likely ranging from 50-200 words. Filtering out more words is certainly better for computational reasons, but we lose some information.
2. Perhaps the most influential on our results is the window size argument. A smaller window size seems to focus more on the linguistic/semantic properties of words, while larger window sizes provide more information on the topical context. Again, there's a computational tradeoff as well, with larger window sizes being more computationally intensive.
3. Finally, the `widely_svd()` function in the `widyr` package has a couple of arguments which I believe are related to the dimensionality of the vectors. These arguments are `nv` and `maxit`, but I don't understand enough about them to know what might be optimal. I'll look through the documentation later, but for now I might just use the 100/1000 values they set (see the sketch below).
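For what it's worth, here is a minimal sketch of the SMLTAR-style pipeline with those three decisions called out. It uses document-level co-occurrence rather than the book's sliding windows, and assumes a `word_tokens` tibble with `id` and `word` columns:

```r
library(dplyr)
library(widyr)

word_vectors <- word_tokens |>
  add_count(word, name = "word_n") |>
  filter(word_n >= 50) |>            # decision 1: minimum token frequency
  pairwise_pmi(word, id) |>          # co-occurrence within a call; SMLTAR instead slides
                                     # a window over each document (decision 2: window size)
  widely_svd(item1, item2, pmi,      # decision 3: nv (number of dimensions) and maxit
             nv = 100, maxit = 1000) # (iteration cap) are passed through to irlba
```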
@wtrumanrose awesome. I'll have to review word embeddings before I can get back to you on that. This all sounds good otherwise. I'll get the marketing terms list to the co-authors to review if you can get me a visualization of the most used words overall once the tokenizing is working.
@marcdotson Tokenization took a lot longer than expected--I had to slice the documents into 4 groups, remove all the stop words from each separately, then find all the words which occurred fewer than 100 times across all 4 groups and filter those out. It looks like there are still some spacing problems like before, so I'm going to go back soon and look through it, but the words which are stuck together should be essentially random, so I don't think it will influence our results. Even after the extreme pruning, the full data frame has ~400,000,000 rows. I started the word embeddings code late last night and it's still going; I have no idea when it will finish. I should be able to get the dashboard code set up at least, so we would just be waiting on the embeddings to finish.
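A rough sketch of that slice-and-prune approach, with assumed names (`transcripts` with `id` and `text` columns), in case it helps when revisiting the spacing issue:

```r
library(dplyr)
library(stringr)
library(tidytext)

n_groups <- 4

# tokenize and remove stop words one slice at a time to limit memory use
token_list <- transcripts |>
  mutate(group = (row_number() - 1) %% n_groups + 1) |>
  group_split(group) |>
  lapply(function(slice) {
    slice |>
      mutate(text = str_replace_all(text, "\n", " ")) |>
      unnest_tokens(word, text) |>
      anti_join(get_stopwords(), by = "word")
  })

# recombine and prune words that occur fewer than 100 times overall
word_tokens <- bind_rows(token_list) |>
  add_count(word, name = "word_n") |>
  filter(word_n >= 100) |>
  select(-word_n, -group)
```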
@wtrumanrose here are the changes I've made/pushed:

- `call_data.rds` now includes industry information using the GICS standards. From aggregate to disaggregate, they are: `sector`, `group`, `industry`, and `sub_industry`.
- Each call has an `id`.
- `02_data-wrangling` is split into `02_word-counts` and `03_word-embeddings`.
- `title` and `text` are cleaned up (including stripping punctuation and cleaning up contractions), tokenized, and the generic stop words removed.
- Word counts are in `02_word-counts`.
- The scripts are now `01_data-wrangling`, `02_tokenizing`, `03_word-embeddings`, and `04_eda`.

I can't save the `word_tokens.rds` as a file. I don't have enough space on my hard drive to write it and then compress it, which is ridiculous, but I'm out of ideas here, so I have the code in `02_tokenizing` to reconstruct it. If you can save it, great, add it to the shared folder.

As you work through the code, match the commenting and coding style I've done in `01_data-wrangling` especially. The comments are something of an outline of what we might eventually put in the methods or appendix of the paper.
@wtrumanrose the problem when writing the tokens, and maybe you've run into it, is that `.rds` is compressed. However, you need more space than the final `.rds` file takes up to write it initially, since it appears to go through that compression process as it's writing.

That's my best guess. However, 500GB of space was insufficient. I plugged in the old iMac, which has more hard drive space since it's not all SSD, and ran it there, and 2.2TB of space was insufficient.

So I'm at a loss. Sorry to leave you holding the bag, but you figured it out once before, so hopefully you can figure it out this time as well. I've set up loops so if you need to do the tokens processing in more than 4 steps, you just need to change one variable. Keep me posted!
@marcdotson Thank you Marc--this is all great. I was able to save `word_tokens.csv`, which was only 7ish GB. It's strange that you aren't able to save it as an `.rds`; I'll look into it and see what I can get. I'll update this issue tomorrow with what I've found.
@marcdotson I spent a good chunk of my day yesterday looking into some dimensionality reduction methods (mostly hashing), but it doesn't seem to be what we need. I think my work is pretty cut out for me though.
Another option for text preprocessing which I've looked into a little bit is the `textrecipes` package. However, I'm not sure when using a recipe is preferable to "normal" preprocessing--if I remember correctly, recipes are better when you are constantly receiving data, which would be the case for most marketing analytics contexts. I'm not sure about using recipes in academic contexts, but if you'd like me to set up our current text preprocessing using `textrecipes`, just let me know.
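If we do go the `textrecipes` route, a minimal sketch of what the preprocessing might look like (the outcome, column names, and thresholds are assumptions):

```r
library(recipes)
library(textrecipes)

call_recipe <- recipe(revenue ~ text, data = call_data) |>
  step_tokenize(text) |>                        # split text into word tokens
  step_stopwords(text) |>                       # remove Snowball stop words (the default)
  step_tokenfilter(text, max_tokens = 1000) |>  # keep only the most frequent tokens
  step_tf(text)                                 # convert tokens to term-frequency features

call_prep <- prep(call_recipe)
```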
@marcdotson I went ahead and reworked `02_tokenizing` quite a bit so I could get everything to actually run. Even with all the memory-saving changes I made, it still took me 30 splits to do it, but I have `word_tokens.rds` now, so that's nice. I'm not sure if you want to keep the changes I made (e.g., writing and deleting a bunch of `.rds` files, nesting words), but it's what I had to do to get it to work on my system.

I'm going to see if I can keep the words nested for now, as it allows me to keep the firm data and word tokens in the same data frame. If it becomes overly complicated, I'll probably just split up the firm data and the tokens and rejoin them when needed.
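A minimal sketch of the nesting idea, assuming `word_tokens` has columns `id` and `word` and that `call_data` joins on `id`:

```r
library(dplyr)
library(tidyr)

call_tokens <- word_tokens |>
  nest(tokens = word) |>             # one row per call, tokens as a list-column
  left_join(call_data, by = "id")    # keep firm data and tokens in the same data frame
```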
You can proceed with the rest of the analysis. Whether it's a `.csv` or `.rds` shouldn't matter as long as there isn't any weird encoding that gets tacked on with `.csv`.
-Marc
@wtrumanrose `overall_word_counts` and `sector_word_counts` are in the shared drive. I tried to produce some plots, but it broke. Progress?
@marcdotson I actually figured out how to make new nested data frames with the counts of each marketing word in them, but I haven't tried any plotting with them since you've already done the unnested word counts. Do you know what other plots still need to be made? It looks like word counts are taken care of; we could still do regressions, I suppose, though I think that would just take the form of a table.
I haven’t gotten word counts to plot properly yet. It’s all on the table, including saving images.
-Marc
@marcdotson I figured it out--the data was grouped by `id` (I think), so all I did was add `ungroup() |> group_by(word) |> summarise(n = sum(n)) |>` to the plots and it worked. I'll upload them to the drive soon; I'm going to see what else I can get done unless you want me to push the changes now.
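For reference, a minimal sketch of the repaired word count plot, assuming a grouped `word_counts` tibble with `id`, `word`, and `n` columns:

```r
library(dplyr)
library(ggplot2)

word_counts |>
  ungroup() |>                         # drop the lingering group_by(id)
  group_by(word) |>
  summarise(n = sum(n)) |>
  slice_max(n, n = 25) |>              # keep the 25 most frequent words
  ggplot(aes(x = n, y = reorder(word, n))) +
  geom_col() +
  labs(x = "Word Count", y = NULL)
```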
@wtrumanrose nicely done. Stupid `group_by()`. You should be able to just drop them in `Figures`; they should be under a MB each, right?
@wtrumanrose I'm off the grid starting tomorrow, so I added some notes at the top of each of the sections in `04_eda.R`. Overall, we need to do exploration, especially using all those new variables I included. Take a look and see what you can produce, including nice-looking plots (play with the size of plots, etc.) as well as tables of correlations, and we'll send an update Monday.
@marcdotson With the new variables, do you think reviving the dashboard would be the best way to go? Or should we just stick to the plots? There are 12 separate sectors, 25 groups, 69 industries, and 157 sub-industries, so plotting beyond sectors might get crazy.

I made a new folder called "Figures" in the Google Drive too; the first 6 plots should be in there now.
@wtrumanrose the dashboard would be nice if we could get it to work with data this large.
@marcdotson Hey Marc--I hope you survived your family reunion. I've been grinding away at the dashboard, but I'm stuck with trying to get it to work. I ended up abandoning flexdashboard and going for a shiny app, but it didn't really solve any of my problems. Do you have time to meet over zoom tomorrow? I can walk you through what I've done so far and we can figure out how to finally put the nail in the coffin on this EDA. I'm going to tidy up some of the code I've been working with and push it tomorrow morning.
@wtrumanrose I haven't seen the push, but we need to talk so we have something to show Ryan. Let's hop into my Zoom room at 12:30 pm Mountain.

Here's what we need to finish and then we can wrap up `initial-eda`:
Moving into modeling, we need to talk word embeddings (parallelized?) and a lit review, with the "Manager sentiment and stock returns" paper as well as a look through JCP, JCR, MS, JMR, JM, and QME for different techniques used.
@marcdotson I'm pushing my changes now--I've made good headway into all 4 of those tasks, but there are still some things to finish up. I'll hop on your Zoom at 12:30.
@marcdotson The dashboard is pushed and I'm quite pleased with the results. It's not the prettiest thing in the world, but it gets the job done. Feel free to make any changes you see fit or tag me in this issue with things you'd like me to alter or add.
This was the book I referred to the most for making this--you might already be familiar with it. https://mastering-shiny.org/index.html
@wtrumanrose oh, that's definitely the book to use. Let's plan on you walking through the dashboard as part of the talk with Ryan today. I'm trying to make sense of the EDA.
Include additional outcome variables alongside `revenue`:
@marcdotson When you tried to pull XADQ from Compustat, did you exclude all the companies in the Utility sector? The "Manager Sentiment and Stock Returns" paper doesn't use that specific variable, but they do use Compustat and exclude Utility and Financial sector companies. I believe the data dictionary mentioned that XADQ wasn't available for Utility companies.
There isn't a way to filter by industry in the query. That said, it wouldn't make sense for XADQ to not be an option. If it's not present for Utility companies it should just be populated with missing values.
Thanks to Jake's financial database wizardry, we have EPS and analyst forecasts. The `call_data` and `word_tokens` now include the following outcome variables:

- `revenue` (as before)
- `earnings` (IBES EPS)
- `forecast` (median forecast based on one or more analyst predictions of EPS)
- `difference` (`earnings` - `forecast`)

Since it only makes sense to look at observations that have outcome values, we drop earnings calls that have missing revenue, earnings, or forecast data.
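For reference, a minimal sketch of how `difference` and the missing-outcome filter might be constructed, assuming the columns are named as listed above:

```r
library(dplyr)

call_data <- call_data |>
  mutate(difference = earnings - forecast) |>                  # earnings surprise
  filter(!is.na(revenue), !is.na(earnings), !is.na(forecast))  # keep calls with outcomes
```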
@wtrumanrose once you push whatever changes you're making to the dashboard, we should be ready to close out this branch.
There are a number of things we can consider trying. Use the `initial-eda` branch and discuss what works here. Some places to start: