Hi @firmai, these problems should be fixed in the generalized parser coming out soon. (It got delayed by logistics - flew to SF for ODF23, interviews w/ accelerators, etc.)
I'm going to close this thread and will DM you when the update is out - let's chat soon.
I have a really good way for you to track the errors in your 10-K parse: count the words in each section year to year. Here you can see there are some really big anomalies in those counts.
errors_aapl.csv
Some are expected differences, like no risk section before 2006, but this management discussion, for example, is an anomaly.
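To make the year-to-year check concrete, here is a rough sketch of how the counts could be screened automatically. The function name, the `max_ratio` threshold, and the sample numbers are all made up for illustration; they are not taken from errors_aapl.csv or from the parser.

```python
def flag_word_count_anomalies(counts_by_year, max_ratio=3.0):
    """Flag years whose section word count differs from the previous
    available year by more than a factor of max_ratio (hypothetical threshold)."""
    flagged = []
    years = sorted(counts_by_year)
    for prev_year, year in zip(years, years[1:]):
        prev_count = counts_by_year[prev_year]
        count = counts_by_year[year]
        if prev_count == count:
            continue  # no change, nothing to flag
        low, high = min(prev_count, count), max(prev_count, count)
        # A drop to (or jump from) zero is always suspicious.
        if low == 0 or high / low > max_ratio:
            flagged.append((year, prev_count, count))
    return flagged


# Illustrative numbers only: word counts of one section per filing year.
mdna_counts = {2004: 9000, 2005: 350, 2006: 9500}
for year, prev_count, count in flag_word_count_anomalies(mdna_counts):
    print(f"{year}: {prev_count} -> {count} words")
```

Expected differences (a section that did not exist in earlier years) would still get flagged by something this simple, so the output is a list to review by hand rather than a list of confirmed errors.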
The management discussion anomaly seems to be related to the parser picking up "part 1" in the text and cutting the section off after that. Here is what you probably want to do as a form of validation: once you have found the start location of ITEM7A_MARKET_RISK_DISCLOSURES, take the words between the start of ITEM7_MANAGEMENT_DISCUSSION and the start of ITEM7A_MARKET_RISK_DISCLOSURES and make sure that span is similar in size to the parsed ITEM7_MANAGEMENT_DISCUSSION. This could serve not only as a validation mechanism but also as a fallback: if the text from the end of ITEM6_RESERVED to the start of ITEM7A_MARKET_RISK_DISCLOSURES is larger than the parsed ITEM7_MANAGEMENT_DISCUSSION, then perhaps use that text for ITEM7_MANAGEMENT_DISCUSSION instead. In short, this method could help both with validating and with improving the parser.
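Something like the following is what I have in mind, as a minimal sketch only. It assumes the parser can give character offsets for the section boundaries; `check_item7`, its parameters, and the 20% `tolerance` are hypothetical names and values, not part of the existing parser.

```python
def word_count(text):
    return len(text.split())


def check_item7(doc, item7_start, item7a_start, parsed_item7,
                item6_end=None, tolerance=0.2):
    """Compare the parsed Item 7 text with the raw span that runs from the
    start of Item 7 to the start of Item 7A.

    Returns (consistent, best_text). All offsets are character positions
    in doc and are assumed to come from the parser.
    """
    span_wc = word_count(doc[item7_start:item7a_start])
    parsed_wc = word_count(parsed_item7)
    # Validation: the two word counts should be close to each other.
    consistent = abs(span_wc - parsed_wc) <= tolerance * max(span_wc, parsed_wc, 1)

    best_text = parsed_item7
    if item6_end is not None:
        # Fallback: if the text between the end of Item 6 and the start of
        # Item 7A is larger than what was parsed, prefer it.
        candidate = doc[item6_end:item7a_start]
        if word_count(candidate) > parsed_wc:
            best_text = candidate
    return consistent, best_text


# Toy filing text, made up for illustration.
doc = "Item 6. Reserved. Item 7. " + "mdna " * 50 + "Item 7A. market risk"
item7_start = doc.index("Item 7.")
item7a_start = doc.index("Item 7A.")
truncated = "Item 7. mdna mdna"  # what a cut-off parse might return

consistent, best_text = check_item7(doc, item7_start, item7a_start,
                                    truncated, item6_end=item7_start)
print(consistent, word_count(best_text))
```

Counting words rather than characters keeps the check comparable to the year-to-year counts in the CSV, and the fallback only ever replaces the parsed section with a larger one, so a correct parse is left alone.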
Appendix: some screenshots for 2005
And for 2006