john-friedman / datamule-python

A package to work with SEC data. Incorporates datamule endpoints.
MIT License
73 stars 7 forks

errors in parser #19

Closed firmai closed 5 days ago

firmai commented 5 days ago

I have a really good way for you to track the errors in your 10-K parser: as you can see here, there are some serious anomalies when you count the words year to year.

errors_aapl.csv

Some differences are expected, like no risk section before '06, but this management discussion, for example, is an anomaly:

image
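
A minimal sketch of the year-to-year word count check, in case it helps. Everything here is an assumption for illustration: the function name, the `low`/`high` ratio thresholds, and the sample counts are made up and are not taken from errors_aapl.csv or from the datamule API. It flags any year whose count for a section is far from the median of the other years.

```python
from statistics import median


def flag_word_count_anomalies(counts_by_year, low=0.5, high=2.0):
    """Return the years whose word count is far from the median of the other years.

    counts_by_year: dict mapping year -> word count for one parsed section.
    low, high: hypothetical ratio thresholds; tune them on real filings.
    """
    flagged = []
    for year, count in sorted(counts_by_year.items()):
        # Compare against the other years only, ignoring years with no text.
        others = [c for y, c in counts_by_year.items() if y != year and c > 0]
        if not others:
            continue
        ratio = count / median(others)
        if ratio < low or ratio > high:
            flagged.append(year)
    return flagged


# Made-up counts for illustration only.
item7_counts = {2003: 9800, 2004: 10200, 2005: 450, 2006: 10900, 2007: 11300}
print(flag_word_count_anomalies(item7_counts))  # [2005]
```

Sections that are legitimately missing in some years (like the risk section in older filings) would also be flagged, so those need to be excluded by hand or with a per-section start year.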

This seems to be related to the parser picking up "Part 1" in the text and cutting the section off after that. Here is what you probably want to do as a form of validation: once you have found, for example, the start of ITEM7A_MARKET_RISK_DISCLOSURES, take the words between the starts of ITEM7_MANAGEMENT_DISCUSSION and ITEM7A_MARKET_RISK_DISCLOSURES and make sure their count is similar to the size of the parsed ITEM7_MANAGEMENT_DISCUSSION. This could also serve as more than a validation mechanism: if the text from the end of ITEM6_RESERVED to the start of ITEM7A_MARKET_RISK_DISCLOSURES is larger than the parsed ITEM7_MANAGEMENT_DISCUSSION, then perhaps use that text for ITEM7_MANAGEMENT_DISCUSSION instead. In short, this method could help both to validate and to improve the parser.
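
Roughly, the span check and fallback could look like the sketch below. This is only an illustration under stated assumptions: `validate_section`, its parameters, the 20% `tolerance`, and the toy filing string are all hypothetical, and the heading offsets are assumed to come from whatever the parser already uses to locate item headings.

```python
def word_count(text):
    """Count whitespace-separated words."""
    return len(text.split())


def validate_section(full_text, span_start, span_end, parsed_section, tolerance=0.2):
    """Compare a parsed section with the raw text between two heading offsets.

    span_start: offset where the section should begin (e.g. end of Item 6).
    span_end: offset where the next section begins (e.g. start of Item 7A).
    tolerance: hypothetical allowed relative difference in word count.
    Returns (is_consistent, best_text).
    """
    span_text = full_text[span_start:span_end]
    span_words = word_count(span_text)
    parsed_words = word_count(parsed_section)
    if span_words == 0:
        # Nothing to compare against, so keep what the parser produced.
        return True, parsed_section
    is_consistent = abs(span_words - parsed_words) / span_words <= tolerance
    # Fallback: prefer the raw span when it is larger than the parsed section.
    best_text = span_text if span_words > parsed_words else parsed_section
    return is_consistent, best_text


# Toy filing where the parsed Item 7 was cut off at a "Part I" mention.
filing = (
    "ITEM 6. Reserved. "
    "ITEM 7. Management discussion. See Part I for details. "
    "Revenue grew and margins improved. "
    "ITEM 7A. Market risk."
)
start = filing.index("ITEM 7.")
end = filing.index("ITEM 7A.")
truncated = "ITEM 7. Management discussion. See"
ok, text = validate_section(filing, start, end, truncated)
print(ok, word_count(truncated), word_count(text))
```

Here the truncated section fails the check and the longer raw span is returned in its place. Whether the raw span should always win is a judgment call; a real parser might only log the mismatch instead.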

Appendix: some screenshots for 2005

image

And for 2006

image

john-friedman commented 5 days ago

Hi @firmai, these problems should be fixed in the generalized parser coming out soon. (It got delayed by logistics: flew to SF for ODF23, interviews w/ accelerators, etc.)

I'm going to close this thread, and will dm you when the update is out - let's chat soon.