The only open-source toolkit that can download EDGAR financial reports and extract textual data from specific item sections into nice and clean JSON files.
Support for 8-K reports

It is now possible to crawl and extract information from 8-K reports. To enable this, we:
Added a new file, item_lists.py, which contains the items each supported report type should have. Older 8-K reports use differently named items, so a different item list is used when a filing dated before a certain cutoff is processed.
Added a new function, determine_items_to_extract(), which determines which item list to use for each report.
Added a new test for 8-K reports, along with fixtures
Restructured the RAW_FILINGS and EXTRACTED_FILINGS folders so that different report types are not mixed. Each report is now saved in a folder named after its report type.
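The selection logic can be sketched as follows. The item lists are abbreviated and illustrative, the function signature is assumed, and the cutoff date is an assumption based on the SEC's 2004 revision of Form 8-K, which renumbered its items — the values in the actual item_lists.py may differ:

```python
from datetime import date

# Abbreviated, illustrative item lists -- the real lists in item_lists.py are longer.
ITEM_LISTS = {
    "10-K": ["1", "1A", "1B", "2", "3", "9C"],        # illustrative subset
    "8-K": ["1.01", "2.01", "5.02", "8.01", "9.01"],  # current numbering
}
ITEMS_8K_OLD = ["1", "2", "4", "5", "7", "9"]  # pre-revision naming (illustrative)

# Assumed cutoff: the SEC's revised Form 8-K took effect on August 23, 2004.
CUTOFF_8K = date(2004, 8, 23)

def determine_items_to_extract(filing_type, filing_date):
    """Pick the item list matching the report type and, for 8-K, its date."""
    if filing_type == "8-K" and filing_date < CUTOFF_8K:
        return ITEMS_8K_OLD
    return ITEM_LISTS[filing_type]
```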
Bug-Fixes and Refactoring
While working on this we discovered several problems with the old approach. Multiple changes were implemented in order to make the extraction more reliable and to make the edgar-crawler codebase more readable:
We added the SIGNATURE section to the end of all item lists. Previously, the last item contained everything from the start of its section to the end of the document/report. Treating SIGNATURE as the final 'item section' limits the last item to the parts of the report actually relevant to it.
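The effect of the terminal SIGNATURE marker can be illustrated with a minimal sketch; the real extraction uses far more robust matching than a plain substring search, and the function name here is hypothetical:

```python
import re

def extract_item_sections(text, item_names):
    """Slice `text` into item sections, using SIGNATURE as the final boundary."""
    # Appending SIGNATURE bounds the last real item: it now ends where the
    # signature block starts instead of running to the end of the document.
    markers = item_names + ["SIGNATURE"]
    positions = []
    for name in markers:
        match = re.search(re.escape(name), text, re.IGNORECASE)
        if match:
            positions.append((match.start(), name))
    positions.sort()
    sections = {}
    for i, (start, name) in enumerate(positions):
        end = positions[i + 1][0] if i + 1 < len(positions) else len(text)
        sections[name] = text[start:end].strip()
    sections.pop("SIGNATURE", None)  # excluded by default (configurable)
    return sections
```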
In rare cases, the SIGNATURE section is hard to identify, so the last item section may still contain some unwanted text; this should happen in less than 1% of cases.
For old reports, the exhibits and other files were just appended to the report. These files are now included in the SIGNATURE section.
Whether the SIGNATURE section is extracted can be controlled with a parameter in the config file; by default, it is set to False.
We added a new parameter, filing_types, to the extract_items section of the config file. Here, you can set which report types you want to extract items for.
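As a sketch, the relevant part of the config file might look like this; filing_types is the parameter named above, while the name of the SIGNATURE toggle (include_signature here) is an assumption:

```json
{
  "extract_items": {
    "filing_types": ["10-K", "8-K"],
    "include_signature": false
  }
}
```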
We added Item-9C to the list of items to be extracted. This item was added over the course of 2021 and not considered before.
Sometimes text was nested inside a span element. Previously this text was removed; now we extract it and add whitespace/newlines when a horizontal/vertical margin is detected (see the function handle_spans()).
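The margin handling can be sketched roughly like this. The real handle_spans() operates on parsed HTML elements; the helper name and the style-string interface here are illustrative assumptions:

```python
import re

def span_separator(style):
    """Return the whitespace to insert for a <span> based on its CSS margins.

    A simplified stand-in for handle_spans(): a vertical margin becomes a
    newline, a horizontal margin becomes a space.
    """
    style = style or ""
    if re.search(r"margin-(top|bottom)\s*:\s*[1-9]", style):
        return "\n"
    if re.search(r"margin-(left|right)\s*:\s*[1-9]", style):
        return " "
    return ""
```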
Fixed some bugs where items could not be detected, e.g., "Item7" (no space) and "Item 5(e)" (parenthesis after the number).
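These variants can be caught with a more tolerant pattern, sketched below; the actual regexes in edgar-crawler are more elaborate:

```python
import re

# Optional whitespace between "Item" and the number catches "Item7";
# an optional parenthesised letter catches "Item 5(e)".
ITEM_HEADER = re.compile(r"item\s*(\d+[A-Z]?)(\s*\([a-z]\))?", re.IGNORECASE)

def find_item_headers(text):
    """Return every item header found in `text`, as written in the source."""
    return [m.group(0) for m in ITEM_HEADER.finditer(text)]
```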
We adjusted the tests to print, for each failed report, the items that caused problems.
We generalized variable names and comments to cover more report types (previously they were largely 10-K specific).