A package for downloading, extracting, parsing, and processing data from SEC-EDGAR, a public online database of all documents filed with the USA's Securities and Exchange Commission.
MIT License
Create a new function that combines ```unpack_bulk``` and ```featurize_file``` #142
Basically, `unpack_bulk` saves intermediate files to disk, and `featurize_file` in low memory mode loads those files back in.
If we just kept them in memory, our pipeline would probably double in speed, since we are I/O bound and would no longer need to write to and read from disk nearly as often. We could even skip unpacking documents that we don't want to featurize, using an 'early_quit' unpacking step that depends on the featurization document_type.
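A rough sketch of what the combined function could look like. This is only an illustration: the names `iter_documents` and `unpack_and_featurize`, the `featurize` callback, and the simplified `<DOCUMENT>` / `<TYPE>` / `<TEXT>` layout are all assumptions made for the example, not the package's current API or its real parsing logic.

```python
import re
from typing import Callable, Iterator, List, Optional, Tuple

# Assumed, simplified layout: each document is wrapped in
# <DOCUMENT>...</DOCUMENT>, with a <TYPE> line and a <TEXT>...</TEXT> body.
_DOC_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL)
_TYPE_RE = re.compile(r"<TYPE>([^\n<]+)")
_TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def iter_documents(
    bulk_text: str, document_type: Optional[str] = None
) -> Iterator[Tuple[str, str]]:
    """Yield (doc_type, body) pairs from memory, never touching disk."""
    for match in _DOC_RE.finditer(bulk_text):
        block = match.group(1)
        type_match = _TYPE_RE.search(block)
        doc_type = type_match.group(1).strip() if type_match else ""
        # 'early_quit': skip unwanted document types before extracting
        # the body or running any featurization on it.
        if document_type is not None and doc_type != document_type:
            continue
        text_match = _TEXT_RE.search(block)
        yield doc_type, (text_match.group(1) if text_match else "")


def unpack_and_featurize(
    bulk_text: str,
    featurize: Callable[[str], dict],
    document_type: Optional[str] = None,
) -> List[dict]:
    """Hypothetical combined step: unpack and featurize in one pass."""
    return [featurize(body) for _, body in iter_documents(bulk_text, document_type)]


# Tiny made-up input, just to show the call shape.
sample = (
    "<DOCUMENT>\n<TYPE>10-K\n<TEXT>\nannual report body\n</TEXT>\n</DOCUMENT>\n"
    "<DOCUMENT>\n<TYPE>EX-99\n<TEXT>\nexhibit body\n</TEXT>\n</DOCUMENT>\n"
)
features = unpack_and_featurize(
    sample, lambda text: {"n_words": len(text.split())}, document_type="10-K"
)
```

Using a generator means only one document body is held at a time, which should keep memory use close to what low memory mode gives us today while dropping the intermediate files.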