nlpaueb / edgar-crawler

The only open-source toolkit that can download EDGAR financial reports and extract textual data from specific item sections into nice and clean JSON files.
GNU General Public License v3.0

empty filings metadata #19

Open codinguncut opened 2 months ago

codinguncut commented 2 months ago

When edgar-crawler is interrupted with Ctrl-C, the filings_metadata_file often ends up empty, with a size of 0. This is likely because .to_csv is called after every retrieval and takes a long time to run with 100k+ entries; since .to_csv rewrites the target file from scratch, an interrupt mid-write leaves it empty or partial.

Suggestions:

  • Don't write the CSV file on every loop iteration
  • Prevent filings_metadata_file from becoming empty
  • Alternatively, find a way to skip existing files even when filings_metadata_file is empty/missing

import os
import shutil
import tempfile

for i, series in enumerate(tqdm(list_of_series, ncols=100)):
    ...
    # Checkpoint every 100 filings (and after the last one) instead of on every iteration.
    if i % 100 == 0 or i == len(list_of_series) - 1:
        # Write to a temporary file first so filings_metadata_filepath is never left truncated.
        with tempfile.NamedTemporaryFile(mode="w", newline="", suffix=".csv", delete=False) as tf:
            final_df.to_csv(tf, index=False, header=True)
        shutil.copy(tf.name, filings_metadata_filepath)
        os.remove(tf.name)
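
One caveat: shutil.copy still rewrites the destination in place, so a very unlucky Ctrl-C during the copy could still leave a partial file. A variant of the same idea, sketched below as a rough, untested helper (the function name and the assumption that final_df is a pandas DataFrame are mine, not edgar-crawler's actual code), swaps the temp file in with os.replace, which is atomic when both paths are on the same filesystem:

import os
import tempfile

import pandas as pd

def save_metadata_atomically(final_df: pd.DataFrame, filings_metadata_filepath: str) -> None:
    # Write the CSV to a temp file in the target's directory, then atomically
    # rename it over the real path, so the metadata file is always either the
    # old complete version or the new complete version, never a truncated one.
    target_dir = os.path.dirname(os.path.abspath(filings_metadata_filepath))
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".csv")
    try:
        with os.fdopen(fd, "w", newline="") as tmp_file:
            final_df.to_csv(tmp_file, index=False, header=True)
        os.replace(tmp_path, filings_metadata_filepath)
    except BaseException:
        # Remove the leftover temp file if writing or replacing was interrupted.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

Calling something like this every 100 iterations keeps the serialization cost bounded while making a Ctrl-C at any point harmless to the existing metadata file.
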
hahoangnhan commented 2 months ago

I got the same issue: the metadata file was empty or incomplete after I opened the CSV file to check how the download function was going, even though the filings themselves had been downloaded completely. The item extraction function then worked only on the filings present in the metadata (not on all downloaded filings). I made a small adjustment to check whether a given filing already exists before downloading it, but writing a row to the metadata still takes time. As a consequence, I had to start the process again to get a complete metadata file before running the extraction function. It took days of waiting...

codinguncut commented 2 months ago

> I got the same issue: the metadata file was empty or incomplete after I opened the CSV file to check how the download function was going, even though the filings themselves had been downloaded completely. The item extraction function then worked only on the filings present in the metadata (not on all downloaded filings). I made a small adjustment to check whether a given filing already exists before downloading it, but writing a row to the metadata still takes time. As a consequence, I had to start the process again to get a complete metadata file before running the extraction function. It took days of waiting...

same. I think I'm on my 3rd or 4th iteration ;)

hahoangnhan commented 2 months ago

> I got the same issue: the metadata file was empty or incomplete after I opened the CSV file to check how the download function was going, even though the filings themselves had been downloaded completely. The item extraction function then worked only on the filings present in the metadata (not on all downloaded filings). I made a small adjustment to check whether a given filing already exists before downloading it, but writing a row to the metadata still takes time. As a consequence, I had to start the process again to get a complete metadata file before running the extraction function. It took days of waiting...
>
> same. I think I'm on my 3rd or 4th iteration ;)

It took me one week to recover the metadata for 100k filings, what a good lesson!

hahoangnhan commented 2 months ago

> When edgar-crawler is interrupted with Ctrl-C, the filings_metadata_file often ends up empty, with a size of 0. This is likely because .to_csv is called after every retrieval and takes a long time to run with 100k+ entries.
>
> Suggestions:
>
>   • Don't write the CSV file on every loop iteration
>   • Prevent filings_metadata_file from becoming empty
>   • Alternatively, find a way to skip existing files even when filings_metadata_file is empty/missing
>
> for i, series in enumerate(tqdm(list_of_series, ncols=100)):
>     ...
>     if i % 100 == 0 or i == len(list_of_series) - 1:
>         with tempfile.NamedTemporaryFile(delete=False) as tf:
>             final_df.to_csv(tf, index=False, header=True)
>             tf.close()
>             shutil.copy(tf.name, filings_metadata_filepath)
>             os.remove(tf.name)

I would think of writing a small JSON file to save the metadata after each download, rather than rewriting the whole CSV. Then we would only need to check whether the JSON file and the filing already exist before executing the download function, which would save a lot of time.
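
A rough sketch of that idea, with the caveat that the function names, the accession-number keying, and the folder layout below are assumptions for illustration, not how edgar-crawler actually organizes its files:

import json
import os

def sidecar_path(raw_filings_folder: str, accession_number: str) -> str:
    # Hypothetical per-filing metadata file stored next to the downloaded filing.
    return os.path.join(raw_filings_folder, f"{accession_number}.json")

def already_downloaded(raw_filings_folder: str, filing_filename: str, accession_number: str) -> bool:
    # Skip a filing if both the document and its metadata sidecar already exist.
    filing_path = os.path.join(raw_filings_folder, filing_filename)
    return os.path.isfile(filing_path) and os.path.isfile(sidecar_path(raw_filings_folder, accession_number))

def save_filing_metadata(raw_filings_folder: str, accession_number: str, metadata: dict) -> None:
    # Writing one tiny JSON per filing is cheap enough to do after every single download.
    with open(sidecar_path(raw_filings_folder, accession_number), "w") as fp:
        json.dump(metadata, fp)

The full filings_metadata CSV could then be rebuilt at any time by globbing the JSON files, so an empty or missing metadata file would no longer force re-downloading everything.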

eloukas commented 2 months ago

Thank you everyone for the feedback. Yes, indeed, that has happened to me once before. It's a bit risky to depend on the filings_metadata_file since such scenarios can occur.

I am traveling this month, but I'll look into this as soon as possible. If anyone alleviates this issue on their own (by forking the repo and patching it with a solution), feel free to open a PR and I'll merge it.