nlpaueb / edgar-crawler

The only open-source toolkit that can download EDGAR financial reports and extract textual data from specific item sections into nice and clean JSON files.
GNU General Public License v3.0

Issue with extracting items #20

Closed limostrom closed 4 months ago

limostrom commented 4 months ago

Hi - I'm trying to work with the extracted Item 1 Business Descriptions, and I noticed that for many filings the extract_items.py code appears to cut the section off prematurely. This usually happens when the text refers to a later section (e.g. Item 1A Risk Factors) and the code interprets that reference as the header of the next section.

One example is this filing: https://www.sec.gov/Archives/edgar/data/872448/000087244813000005/atml-201210k.htm

When the code gets to this sentence from "Forward Looking Statements": "... including the risk factors set forth in this discussion and in Item 1A — Risk Factors, and elsewhere in this Form 10-K." it cuts off item_1 at "including the risk factors set forth in this discussion and in". The issue occurs in approximately 10-15% of the filings I've looked at.
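For anyone trying to reproduce the behavior, here is a minimal sketch of the failure mode as I understand it. It is not the actual logic in extract_items.py: the sample text, both regular expressions, and the extract_item_1 helper are hypothetical and only illustrate how an inline mention of "Item 1A" can be mistaken for the next section header, and how requiring the header to start a line avoids that in this toy case.

```python
import re

# Hypothetical sample text, loosely modeled on the filing quoted above.
text = (
    "Item 1. Business\n"
    "Forward Looking Statements\n"
    "... including the risk factors set forth in this discussion and in "
    "Item 1A - Risk Factors, and elsewhere in this Form 10-K.\n"
    "We design and sell microcontrollers.\n"
    "Item 1A. Risk Factors\n"
    "Our business faces many risks.\n"
)

# Naive pattern (assumption): matches "Item 1A" anywhere, so an inline
# reference looks like the next section header.
NAIVE_NEXT = re.compile(r"Item\s+1A", re.IGNORECASE)

# Stricter pattern (assumption): only matches "Item 1A" at the start of a line.
ANCHORED_NEXT = re.compile(r"^\s*Item\s+1A\b", re.IGNORECASE | re.MULTILINE)


def extract_item_1(text, next_header):
    """Return the text from the Item 1 header up to the next header match."""
    start = re.search(r"^\s*Item\s+1\.", text, re.IGNORECASE | re.MULTILINE)
    if start is None:
        return ""
    end = next_header.search(text, start.end())
    stop = end.start() if end else len(text)
    return text[start.start():stop].strip()


naive = extract_item_1(text, NAIVE_NEXT)        # stops at the inline reference
anchored = extract_item_1(text, ANCHORED_NEXT)  # stops at the real header
```

With the naive pattern the extracted section ends at "... set forth in this discussion and in", which matches the truncation described above. Anchoring to the start of a line is only one possible heuristic, and real filings (HTML tables of contents, headers split across tags) may need more than this.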

I am very inexperienced at working with text data, so I'm not sure how to fix this problem myself. Please let me know if you need more information or if I can help in any way. Thanks!

eloukas commented 4 months ago

Hi @limostrom, we also saw a similar problem in some other item sections. We fixed those in PR #21 (credits to @Bailefan), which should probably fix your problem as well.

If the problem still occurs, please reopen the issue.