Closed · codinguncut closed this 1 month ago
I got the same issue: the metadata file was empty or incomplete after I opened the CSV file to check on the download function's progress, even though the filings themselves had downloaded completely. The item extraction function then worked only for the filings listed in the metadata (not for all downloaded filings). I made a small adjustment to check whether a given filing already exists before downloading it, but writing a metadata line still takes time. As a consequence, I had to restart the process to get a complete metadata file before running the extraction function. It took me several days of waiting...
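For reference, a minimal sketch of the kind of skip-if-exists check I mean; `download_dir` and the one-file-per-filing naming scheme are assumptions for illustration, not edgar-crawler's actual layout:

```python
import os

def should_download(filing_id: str, download_dir: str) -> bool:
    """Skip filings that are already on disk (hypothetical helper)."""
    # Assumes each filing is stored as <filing_id>.htm inside download_dir
    local_path = os.path.join(download_dir, f"{filing_id}.htm")
    return not os.path.exists(local_path)
```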
same. I think I'm on my 3rd or 4th iteration ;)
It took me one week to recover the metadata for 100k filings. What a good lesson!
When Ctrl-C interrupting edgar-crawler, the `filings_metadata_file` often becomes empty, with a size of 0. This is likely because `.to_csv` is called after every retrieval and takes a long time to run with 100k+ entries.

Suggestions:

- Don't write the CSV file on every loop iteration (see the snippet below)
- Prevent `filings_metadata_file` from becoming empty
- Alternatively, find a way to skip existing files even when `filings_metadata_file` is empty/missing

```python
import os
import shutil
import tempfile

for i, series in enumerate(tqdm(list_of_series, ncols=100)):
    ...
    # Write the metadata CSV only every 100 iterations (and on the last one)
    if i % 100 == 0 or i == len(list_of_series) - 1:
        # Write to a temporary file first, then copy it over the real file,
        # so an interrupt can no longer leave filings_metadata_file truncated
        with tempfile.NamedTemporaryFile(mode="w", delete=False, newline="") as tf:
            final_df.to_csv(tf, index=False, header=True)
        shutil.copy(tf.name, filings_metadata_filepath)
        os.remove(tf.name)
```
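A further hardening on top of the copy-from-temp-file approach would be an atomic rename: `os.replace` swaps the new file in as a single step, so `filings_metadata_file` is never observable in a half-written state. A minimal sketch, assuming `final_df` and `filings_metadata_filepath` as in the snippet above:

```python
import os
import tempfile

def write_csv_atomically(df, dest_path):
    """Write df to a temp file next to dest_path, then atomically rename over it."""
    # The temp file must be on the same filesystem as dest_path for the rename
    # to be atomic, hence dir=os.path.dirname(dest_path).
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(dest_path) or ".", suffix=".csv")
    try:
        with os.fdopen(fd, "w", newline="") as tmp_file:
            df.to_csv(tmp_file, index=False, header=True)
        os.replace(tmp_path, dest_path)  # atomic on both POSIX and Windows
    except BaseException:
        os.remove(tmp_path)  # discard the partial temp file; the old CSV survives
        raise

# usage inside the loop:
# write_csv_atomically(final_df, filings_metadata_filepath)
```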
I would consider saving metadata to a JSON file after each download rather than rewriting the CSV. Then we only need to check whether the JSON record and the filing already exist before executing the download function, which would save a lot of time.
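A minimal sketch of how that could look, using an append-only JSON Lines file so each record costs one small write; `metadata.jsonl` and the `filing_id` field are hypothetical names, not edgar-crawler's actual format:

```python
import json
import os

METADATA_PATH = "metadata.jsonl"  # hypothetical append-only metadata file

def load_seen_ids(path=METADATA_PATH):
    """Collect the IDs of filings whose metadata was already written."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return {json.loads(line)["filing_id"] for line in f if line.strip()}

def append_metadata(record, path=METADATA_PATH):
    """Append one record as a single line; an interrupt can't truncate old lines."""
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Before downloading: skip if record["filing_id"] is in load_seen_ids()
# and the filing file already exists on disk.
```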
Thank you everyone for the feedback. Yes, indeed, that has happened to me once before. It's a bit risky to depend on the `filings_metadata_file`, since such scenarios can occur.
I am traveling this month, but I'll look into this as soon as possible. If anyone alleviates this issue on their own (by forking the repo and patching it with a solution), feel free to open a PR and I'll merge it.
Hi all! @codinguncut, @hahoangnhan
I fixed the issue in PR #23.
edgar-crawler should now be Ctrl-C safe (;