errors in parser

I have a really good way for you to track the errors in your 10-K parser: count the words in each section year to year. In the file below you can see there are some really big anomalies.

errors_aapl.csv

Some differences are expected, like no risk section before 2006, but the management discussion, for example, is an anomaly.
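Here is a rough sketch of how the year-to-year check could be automated. The function name, the threshold, and the counts are made up for illustration; they are not taken from errors_aapl.csv or from datamule. It flags any year whose word count is far from the median of the other years.

```python
from statistics import median


def flag_word_count_anomalies(counts_by_year, ratio_threshold=0.5):
    """Return the years whose word count is far from the median of the other years.

    counts_by_year: dict mapping year -> word count for one section.
    ratio_threshold: hypothetical cutoff; a year is flagged when its count is
    below threshold * baseline or above baseline / threshold.
    """
    flagged = []
    for year, count in sorted(counts_by_year.items()):
        # Baseline is the median of the other years that have any text at all.
        others = [c for y, c in counts_by_year.items() if y != year and c > 0]
        if not others:
            continue
        baseline = median(others)
        if count < baseline * ratio_threshold or count > baseline / ratio_threshold:
            flagged.append(year)
    return flagged


# Made-up counts for one section, for illustration only.
example_counts = {2003: 9800, 2004: 10200, 2005: 1200, 2006: 11000, 2007: 10700}
print(flag_word_count_anomalies(example_counts))  # [2005]
```

Expected gaps (a section that did not exist yet in earlier years) would also get flagged by this, so those would need to be excluded by hand.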

This seems to be related to the parser picking up "part 1" in the text and cutting the section off after that. Here is what you probably want to do as a form of validation: when you have found, for example, the start location of ITEM7A_MARKET_RISK_DISCLOSURES, take the words between the start of ITEM7_MANAGEMENT_DISCUSSION and the start of ITEM7A_MARKET_RISK_DISCLOSURES and make sure that span is similar in size to the parsed ITEM7_MANAGEMENT_DISCUSSION. This could serve not only as a validation mechanism but also as a fix: if the text from the end of ITEM6_RESERVED to the start of ITEM7A_MARKET_RISK_DISCLOSURES is larger than the parsed ITEM7_MANAGEMENT_DISCUSSION, then perhaps use that text for ITEM7_MANAGEMENT_DISCUSSION instead. In short, this method could help both for validating and for improving the parser.
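A minimal sketch of that check, assuming the parser can give you character offsets for the section boundaries. The function, its parameters, and the 20% tolerance are hypothetical and are not part of the datamule API.

```python
def word_count(text):
    return len(text.split())


def check_item7(full_text, parsed_item7, item6_end, item7_start, item7a_start,
                tolerance=0.2):
    """Validate a parsed ITEM7 against spans bounded by the neighbouring items.

    All offsets are character positions in full_text (assumed to be available).
    Returns (is_consistent, best_text).
    """
    span_7_to_7a = full_text[item7_start:item7a_start]
    span_6_to_7a = full_text[item6_end:item7a_start]

    parsed_words = word_count(parsed_item7)
    span_words = word_count(span_7_to_7a)

    # Validation: the parsed section should be about as long as the text
    # between the ITEM7 start and the ITEM7A start.
    larger = max(parsed_words, span_words)
    is_consistent = larger == 0 or abs(parsed_words - span_words) <= tolerance * larger

    # Fallback: if the text from the end of ITEM6 to the start of ITEM7A is
    # larger than the parsed section, use that text instead.
    if word_count(span_6_to_7a) > parsed_words:
        best_text = span_6_to_7a.strip()
    else:
        best_text = parsed_item7
    return is_consistent, best_text


# Toy filing text, for illustration only.
full_text = ("Item 6. Reserved. Item 7. Overview of results, see Part 1 for "
             "details, then more discussion. Item 7A. Market risk.")
parsed_item7 = "Item 7. Overview of results, see"  # cut off at "Part 1"

item6_end = full_text.index("Reserved.") + len("Reserved.")
item7_start = full_text.index("Item 7.")
item7a_start = full_text.index("Item 7A.")

is_consistent, best_text = check_item7(
    full_text, parsed_item7, item6_end, item7_start, item7a_start)
print(is_consistent)
print(best_text)
```

In the toy example the parsed section is much shorter than the span up to ITEM7A, so the check fails and the longer span is returned. Note that the ITEM6-to-ITEM7A span also contains the ITEM7 heading, so in practice the fallback comparison probably needs some margin.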

Appendix: some screenshots for 2005

And for 2006

john-friedman / datamule-python

errors in parser #19