jakeschurch opened this issue 7 years ago
As you noted in your fourth bullet point, the amount of boilerplate in SEC filings makes it fairly hard to extract meaningful measures of sentiment. However, a fair bit of work has been done by Loughran and McDonald; see their website and their 2011 (I think) paper on this. They have a sentiment dictionary, with terms tagged for what they might mean in the context of an SEC filing, and they make it available on the website. Probably worth a look, although it would be easy to get out of scope if we go down that path.
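A minimal sketch of how we might use their word lists, assuming the dictionary is downloaded as a CSV; the file name and column names below are guesses about its layout, so check the file actually obtained from their site:

```python
import csv
from collections import Counter

def load_lm_dictionary(path="LoughranMcDonald_MasterDictionary.csv"):
    """Load positive/negative word sets from the Loughran-McDonald CSV.
    Column names here are assumptions about the downloaded file."""
    positive, negative = set(), set()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            word = row["Word"].lower()
            # A nonzero entry marks membership in a category.
            if row.get("Positive", "0") not in ("", "0"):
                positive.add(word)
            if row.get("Negative", "0") not in ("", "0"):
                negative.add(word)
    return positive, negative

def lm_sentiment(tokens, positive, negative):
    """Net-tone score over a token list: (pos - neg) / total."""
    counts = Counter(t.lower() for t in tokens)
    pos = sum(n for w, n in counts.items() if w in positive)
    neg = sum(n for w, n in counts.items() if w in negative)
    total = sum(counts.values())
    return (pos - neg) / total if total else 0.0
```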
One thing to note here is that this tool needs to support broad, flexible semantic analysis of various bodies of documents, not just SEC filings. So while one of my current projects focuses heavily on sentiment, our development efforts should not tilt too far in that direction. In our initial conversation about this I recall talking quite a bit about a model that would interface with plugins to be developed later as the need arises. I was envisioning two types: one for pulling in a specific class of document, i.e. SEC filings, conference call transcripts, FOMC minutes, etc.; and another for doing different types of analysis of the sort supported by NLTK and similar packages. Stanford has a toolkit that does some similar things, and I believe there are a few others too, but so far most of my experience has been with NLTK. If this makes sense, then there are obvious design implications that suggest interfaces to these two plugin types based on clearly defined generic class structures.
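To make the two plugin types concrete, here is one possible shape for those interfaces; all class and method names below are illustrative, not a settled API:

```python
from abc import ABC, abstractmethod

class DocumentSource(ABC):
    """Plugin type 1: pulls in one class of document
    (SEC filings, call transcripts, FOMC minutes, ...)."""

    @abstractmethod
    def fetch(self, identifier: str) -> str:
        """Return the raw text of a single document."""

class Analyzer(ABC):
    """Plugin type 2: wraps one kind of analysis
    (NLTK sentiment, frequency stats, topic models, ...)."""

    @abstractmethod
    def analyze(self, text: str) -> dict:
        """Return analysis results keyed by measure name."""

# A concrete pair would then plug together roughly like:
#   text = EdgarSource().fetch("AAPL-10-K-2016")
#   scores = NltkSentimentAnalyzer().analyze(text)
```

Keeping the two concerns behind separate interfaces means a new document class or a new analysis can be added without touching the other side.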
Primary packages for this project will be nltk, pandas, numpy, and scikit-learn.
Had no problem reading in a sample 10-K filing as a .txt file and tokenizing the text.
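For reference, the read-and-tokenize step was essentially this; the file name is a placeholder:

```python
import nltk

nltk.download("punkt", quiet=True)  # tokenizer models, first run only

with open("sample_10k.txt", encoding="utf-8") as f:
    raw = f.read()

tokens = nltk.word_tokenize(raw)
print(len(tokens), tokens[:10])
```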
Was also able to create simple frequency counts for each token. A large majority of the text consisted of stop words. The plan is to use NLTK's built-in stop words; however, I'm also going to need to include additional words. With this initial filing, some of the most common tokens include ' ', '0', the company's name, and numbers. These values make up more than 40% of the tokens in this particular sample. It shouldn't be difficult to extend the list of stop words: I can drive it off the frequency counts so it is a repeatable process for each unique filing (see the sketch below).
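A sketch of that repeatable step, assuming the same placeholder file as above; the extra stop words are examples from this one sample, and the `most_common` cutoff is an arbitrary choice:

```python
import nltk
from nltk.corpus import stopwords

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

with open("sample_10k.txt", encoding="utf-8") as f:
    tokens = nltk.word_tokenize(f.read())

freq = nltk.FreqDist(t.lower() for t in tokens)

# NLTK's built-in English stop words, extended with filing-specific noise:
# bare digits, the company's own name (placeholder here), and so on.
stop = set(stopwords.words("english"))
stop.update({"0", "company"})

# Repeatable per-filing step: also drop very common non-alphabetic tokens.
stop.update(w for w, _ in freq.most_common(50) if not w.isalpha())

filtered = [t for t in tokens if t.lower() not in stop]
```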
Using a basic conditional I was able to pull tokens where len > x (I used 15 characters), and this resulted in more sentiment-rich content, e.g. excerpts from Management's Discussion sections. I think this is where most of the sentiment value will be found, and NLTK did a surprisingly good job of tokenizing the text without splitting up important paragraphs and sentences.
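One reading of that filter, assuming the chunks being kept are sentence-level tokens rather than single words; the file name and threshold mirror the experiment above:

```python
import nltk

nltk.download("punkt", quiet=True)

with open("sample_10k.txt", encoding="utf-8") as f:
    raw = f.read()

MIN_LEN = 15  # characters; the cutoff used in the experiment above
sentences = nltk.sent_tokenize(raw)

# Keep the longer chunks, which surfaced sentiment-rich passages such as
# Management's Discussion excerpts in the sample filing.
rich = [s for s in sentences if len(s) > MIN_LEN]
```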
The biggest challenge I see is removing unrelated tokens.
The biggest opportunity will be in using regular expressions to create a sentiment score for each section of the filing, then weighting those section scores to produce an overall outlook.
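A sketch of that per-section idea: split the filing on "Item N." headers (the pattern is an assumption about how the filing is laid out), score each section, and combine with hand-chosen weights. `score_section` is a stand-in for whatever sentiment measure we settle on:

```python
import re

ITEM_HEADER = re.compile(r"^\s*item\s+\d+[a-z]?\.", re.IGNORECASE | re.MULTILINE)

def split_sections(text):
    """Split a filing into sections at each 'Item N.' header.
    Any front matter before the first header is discarded."""
    starts = [m.start() for m in ITEM_HEADER.finditer(text)] + [len(text)]
    return [text[a:b] for a, b in zip(starts, starts[1:])]

def overall_outlook(text, score_section, weights=None):
    """Weighted average of per-section sentiment scores."""
    sections = split_sections(text)
    weights = weights or [1.0] * len(sections)
    scores = [score_section(s) for s in sections]
    total = sum(weights[: len(scores)])
    return sum(w * s for w, s in zip(weights, scores)) / total if total else 0.0
```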
Questions: In my previous NLP experience, there was a ground-truth value to compare each prediction against. For example, I classified text messages as either spam or not spam, and the data set included the correct label, so I could measure the accuracy of the model. What is the appropriate way to measure the accuracy of the sentiment value for each filing? Is it equity performance over the next 5 minutes, day, week? @dlouton