jakeschurch / SEC-Swiss-Army-Knife

MIT License

Phase 2 Implementation Introduction #6

Open jakeschurch opened 7 years ago

jakeschurch commented 7 years ago
dlouton commented 7 years ago

As you noted in your fourth bullet point, the amount of boilerplate in SEC filings makes it fairly hard to extract meaningful measures of sentiment. However, Loughran and McDonald have done a fair bit of work on this - see their website and their 2011 (I think) paper. They maintain a sentiment dictionary, with terms tagged for what they might mean in the context of an SEC filing, and they make it available on the website. It is probably worth a look, although it would be easy to drift out of scope if we go down that path.
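To make the dictionary idea concrete, here is a minimal sketch of how a category-tagged word list could be used to score a piece of text. The word lists, `TOY_DICTIONARY`, `tokenize`, and `score_sentiment` are all hypothetical names invented for illustration; the entries are placeholders and are not taken from the Loughran and McDonald dictionary, whose actual file format I have not assumed here.

```python
import re
from collections import Counter

# Placeholder word lists for illustration only.
# These are NOT actual entries from the Loughran-McDonald dictionary.
TOY_DICTIONARY = {
    "negative": {"loss", "impairment", "litigation"},
    "positive": {"gain", "improvement"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def score_sentiment(text, dictionary=TOY_DICTIONARY):
    """Return, per category, the share of tokens found in that category's word list."""
    tokens = tokenize(text)
    counts = Counter()
    for token in tokens:
        for category, words in dictionary.items():
            if token in words:
                counts[category] += 1
    total = len(tokens)
    # Guard against empty input so we never divide by zero.
    return {
        category: (counts[category] / total if total else 0.0)
        for category in dictionary
    }


if __name__ == "__main__":
    print(score_sentiment("The loss led to litigation and a gain"))
```

Something this simple would count boilerplate words just like any others, so stripping boilerplate first would still be a separate step.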

One thing to note here is that this tool needs to support broad and flexible semantic analysis of various bodies of documents - not just SEC filings. So while one of my current projects focuses heavily on sentiment, our development efforts should not tilt too far in that direction. In our initial conversation about this, I recall talking quite a bit about a model that would interface with plugins to be developed later as the need arises. I was envisioning two types: one for pulling in a specific class of document (e.g. SEC filings, conference call transcripts, FOMC minutes), and another for running different kinds of analysis of the sort supported by NLTK and similar packages. Stanford has a toolkit that does some similar things, and I believe there are a few others too, but so far most of my experience has been with NLTK. If this makes sense, then the design implication is that each of the two plugin types should get its own interface, based on a clearly defined generic class structure.
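As a rough sketch of what those two plugin interfaces might look like, assuming Python abstract base classes: every name below (`Document`, `DocumentSource`, `Analyzer`, `InMemorySource`, `WordCountAnalyzer`, `run_pipeline`) is hypothetical and not part of the existing code base. The two concrete classes are trivial stand-ins for a real SEC-filing source and a real NLTK-backed analyzer.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class Document:
    """Common representation that every source plugin produces."""
    doc_id: str
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


class DocumentSource(ABC):
    """Plugin type 1: pulls in one class of document (filings, transcripts, ...)."""

    @abstractmethod
    def fetch(self, query: str) -> Iterable[Document]:
        """Yield the documents matching the query."""


class Analyzer(ABC):
    """Plugin type 2: runs one kind of analysis on a single document."""

    @abstractmethod
    def analyze(self, document: Document) -> Dict[str, float]:
        """Return named measures computed from the document."""


class InMemorySource(DocumentSource):
    """Stand-in source that searches a dict of texts instead of a remote service."""

    def __init__(self, texts: Dict[str, str]):
        self._texts = texts

    def fetch(self, query: str) -> Iterable[Document]:
        for doc_id, text in self._texts.items():
            if query.lower() in text.lower():
                yield Document(doc_id, text)


class WordCountAnalyzer(Analyzer):
    """Stand-in analyzer; a real one might wrap NLTK or a sentiment dictionary."""

    def analyze(self, document: Document) -> Dict[str, float]:
        return {"word_count": float(len(document.text.split()))}


def run_pipeline(source: DocumentSource, analyzers: List[Analyzer], query: str):
    """Run every analyzer over every document the source returns."""
    results = {}
    for doc in source.fetch(query):
        merged: Dict[str, float] = {}
        for analyzer in analyzers:
            merged.update(analyzer.analyze(doc))
        results[doc.doc_id] = merged
    return results


if __name__ == "__main__":
    source = InMemorySource({"a": "Revenue grew this quarter", "b": "No mention here"})
    print(run_pipeline(source, [WordCountAnalyzer()], "revenue"))
```

The point of the shared `Document` type is that any source can be paired with any analyzer, so a sentiment analyzer becomes just one plugin among many and does not shape the core.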