kamilkrukowski / EDGAR-DOC-PARSER

A package for downloading, extracting, parsing, and processing data from SEC-EDGAR, a public online database of all documents filed with the USA's Securities and Exchange Commission.
MIT License

Create a new function that reimplements a combined `unpack_bulk` and `featurize_file`. #142

Open kamilkrukowski opened 1 year ago

kamilkrukowski commented 1 year ago

Basically, `unpack_bulk` saves intermediate files to disk, and `featurize_file` in low-memory mode loads those files back in.

If we just kept them in memory, our pipeline would probably double in speed, since we are I/O bound and would no longer need to write to and read from disk nearly as often. We could even skip unpacking files that we don't want to featurize, using an 'early_quit' unpacking step that depends on the featurization `document_type`.
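
A rough sketch of what the combined function might look like, to make the idea concrete. Everything here is an assumption for illustration: `iter_documents` and `unpack_and_featurize` are hypothetical names, not existing functions in this package, the `featurize` callable stands in for whatever `featurize_file` does per document, and the `<DOCUMENT>`/`<TYPE>`/`<FILENAME>`/`<TEXT>` layout is assumed for the bulk submission text.

```python
import re
from typing import Callable, Dict, Iterator, Optional, Tuple

# Assumed layout of a bulk submission: a series of <DOCUMENT> blocks, each
# carrying <TYPE>, <FILENAME>, and a <TEXT>...</TEXT> body.
_DOC_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL)
_TYPE_RE = re.compile(r"<TYPE>([^\n<]+)")
_NAME_RE = re.compile(r"<FILENAME>([^\n<]+)")
_TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def iter_documents(
    bulk_text: str, document_type: Optional[str] = None
) -> Iterator[Tuple[str, str, str]]:
    """Lazily yield (type, filename, body) for each document, in memory.

    Hypothetical helper. If `document_type` is given, documents of any
    other type are skipped before their body is extracted ('early_quit').
    """
    for match in _DOC_RE.finditer(bulk_text):
        chunk = match.group(1)
        type_match = _TYPE_RE.search(chunk)
        doc_type = type_match.group(1).strip() if type_match else ""
        if document_type is not None and doc_type != document_type:
            continue  # early quit: never extract or featurize this body
        name_match = _NAME_RE.search(chunk)
        text_match = _TEXT_RE.search(chunk)
        filename = name_match.group(1).strip() if name_match else ""
        body = text_match.group(1).strip() if text_match else ""
        yield doc_type, filename, body


def unpack_and_featurize(
    bulk_text: str,
    featurize: Callable[[str], object],
    document_type: Optional[str] = None,
) -> Dict[str, object]:
    """Unpack and featurize in one pass, with no intermediate files on disk."""
    return {
        filename: featurize(body)
        for _, filename, body in iter_documents(bulk_text, document_type)
    }
```

Called as `unpack_and_featurize(bulk_text, featurize, document_type="10-Q")`, only the matching documents would ever reach the featurizer, and nothing is written to or read back from disk in between.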