nlpaueb / edgar-crawler

The only open-source toolkit that can download EDGAR financial reports and extract textual data from specific item sections into nice and clean JSON files.
GNU General Public License v3.0

Parallelism not working #14

Closed ureshvahalia closed 1 year ago

ureshvahalia commented 1 year ago

In edgar_crawler.py, it appears we are trying to issue a number of downloads in parallel by creating a list_of_series in main. However, the way get_specific_indices() is coded, it returns a DataFrame with one entry per filing, so list_of_series is really a list of single-item entries. As a result, crawl is called separately for each filing to download, which seems very slow. Am I missing something? Is this a bug, or is it intentional for some reason?
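
For context, here is a minimal sketch of the pattern described above. The function bodies and signatures are simplified stand-ins, not the actual edgar_crawler.py code; only the names get_specific_indices, crawl, and list_of_series come from the issue.

```python
import pandas as pd

# Hypothetical, simplified stand-ins for the real functions in edgar_crawler.py.
def get_specific_indices() -> pd.DataFrame:
    # Returns one row per filing to download (illustrative data only).
    return pd.DataFrame(
        {"CIK": ["0000320193", "0000789019"], "Type": ["10-K", "10-K"]}
    )

def crawl(series: pd.Series) -> None:
    # Downloads a single filing described by one row of the index DataFrame.
    print(f"Downloading {series['Type']} for CIK {series['CIK']}")

if __name__ == "__main__":
    df = get_specific_indices()
    # Each element of list_of_series wraps exactly one row (one filing),
    # so the loop below ends up downloading filings one at a time.
    list_of_series = [df.iloc[i] for i in range(len(df))]
    for series in list_of_series:
        crawl(series)
```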

eloukas commented 1 year ago

Hi @ureshvahalia, this is true: the method is called separately for each single-item entry.

In a previous version of the software, I did use multiple-item lists, but I found that this hit the EDGAR database too quickly and caused a temporary suspension of my IP. Thus, I've defaulted it to one entry per filing, which is safe according to their policy.
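
For anyone who still wants to download many filings in one run, a simple throttled loop is one way to stay within EDGAR's fair-access guidance (the SEC documents a ceiling of roughly 10 requests per second and asks clients to declare a User-Agent with contact details). This is a minimal sketch, not edgar-crawler's actual download code; the URL list, delay value, and User-Agent string are placeholders.

```python
import time
import requests

# Placeholder values for illustration only.
FILING_URLS = [
    "https://www.sec.gov/Archives/edgar/data/.../filing-index.htm",  # replace with real URLs
]
HEADERS = {"User-Agent": "Your Name your.email@example.com"}  # SEC asks clients to identify themselves
MIN_DELAY = 0.15  # seconds between requests, well under ~10 requests/second

def download_all(urls):
    for url in urls:
        response = requests.get(url, headers=HEADERS)
        response.raise_for_status()
        # ... write response.text to disk here ...
        time.sleep(MIN_DELAY)  # throttle so the request rate stays within EDGAR's limits

if __name__ == "__main__":
    download_all(FILING_URLS)
```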