jadchaar / sec-edgar-downloader

📈 Download filings from the SEC EDGAR database using Python
https://sec-edgar-downloader.readthedocs.io
MIT License

Add public function for retrieving filing URLs without downloading #32

Open mksamelson opened 4 years ago

mksamelson commented 4 years ago

It would be nice to be able to access the files online for scraping, as opposed to downloading them all. A feature that just returns filing URLs would be handy.

jadchaar commented 4 years ago

Hey @mksamelson, thanks for reaching out and using the tool!

I actually have an internal utility function that does exactly what you are requesting:

env ❯ python3
Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 16:52:21)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from sec_edgar_downloader._utils import get_filing_urls_to_download
>>> get_filing_urls_to_download("10-K", "AAPL", 20, "2010-12-31", "2019-12-31", False)
[FilingMetadata(filename='0000320193-19-000119.txt', url='https://www.sec.gov/Archives/edgar/data/320193/000032019319000119/0000320193-19-000119.txt'),
 FilingMetadata(filename='0000320193-18-000145.txt', url='https://www.sec.gov/Archives/edgar/data/320193/000032019318000145/0000320193-18-000145.txt'),
 FilingMetadata(filename='0000320193-17-000070.txt', url='https://www.sec.gov/Archives/edgar/data/320193/000032019317000070/0000320193-17-000070.txt'),
 FilingMetadata(filename='0001628280-16-020309.txt', url='https://www.sec.gov/Archives/edgar/data/320193/000162828016020309/0001628280-16-020309.txt'),
 FilingMetadata(filename='0001193125-15-356351.txt', url='https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/0001193125-15-356351.txt'),
 FilingMetadata(filename='0001193125-14-383437.txt', url='https://www.sec.gov/Archives/edgar/data/320193/000119312514383437/0001193125-14-383437.txt'),
 FilingMetadata(filename='0001193125-13-416534.txt', url='https://www.sec.gov/Archives/edgar/data/320193/000119312513416534/0001193125-13-416534.txt'),
 FilingMetadata(filename='0001193125-12-444068.txt', url='https://www.sec.gov/Archives/edgar/data/320193/000119312512444068/0001193125-12-444068.txt'),
 FilingMetadata(filename='0001193125-11-282113.txt', url='https://www.sec.gov/Archives/edgar/data/320193/000119312511282113/0001193125-11-282113.txt'),
 FilingMetadata(filename='0001193125-10-238044.txt', url='https://www.sec.gov/Archives/edgar/data/320193/000119312510238044/0001193125-10-238044.txt')]

The function sec_edgar_downloader._utils.get_filing_urls_to_download returns a list of FilingMetadata objects, which contain the URL you are looking for. The parameters and interface are exactly the same as the get method, but all parameters are required. Since this is an internal method, I have not gotten around to putting a docstring on it.
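If you only want the URLs from that return value, something like the following should work. This is a minimal sketch: the namedtuple is a local stand-in for FilingMetadata (its shape is assumed from the repr above, so the snippet runs without the package or network access), and extract_urls is a hypothetical helper name, not part of the library.

```python
from collections import namedtuple

# Stand-in for the library's FilingMetadata (assumed shape, based on the repr above).
FilingMetadata = namedtuple("FilingMetadata", ["filename", "url"])

# Two entries copied from the output above.
filings = [
    FilingMetadata(
        filename="0000320193-19-000119.txt",
        url="https://www.sec.gov/Archives/edgar/data/320193/000032019319000119/0000320193-19-000119.txt",
    ),
    FilingMetadata(
        filename="0000320193-18-000145.txt",
        url="https://www.sec.gov/Archives/edgar/data/320193/000032019318000145/0000320193-18-000145.txt",
    ),
]


def extract_urls(filings):
    """Return only the URL of each filing, preserving order (hypothetical helper)."""
    return [filing.url for filing in filings]


urls = extract_urls(filings)
for url in urls:
    print(url)
```

With the real package you would pass the list returned by get_filing_urls_to_download instead of the hand-built one.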

Let me know if this helps, or if you would like to see something different implemented in a future release!

mksamelson commented 4 years ago

Thanks, this is helpful. It would be great if a future release had a utility that provided URLs for other file formats. Your utility accesses the *.txt document (the full filing). A way to (1) list the URLs and (2) download the HTML and XML files would be great.

The image below shows the file you reference (circled in red). The file types highlighted in yellow are also very useful.

image
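In the meantime, one possible starting point for finding the other file types is the folder that holds the full-submission file. The sketch below only does string manipulation on a URL from the output above; it assumes the other documents of a filing sit in the same folder as the .txt file, which is an assumption about EDGAR's layout and not something the library exposes. filing_folder_url is a hypothetical helper name.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def filing_folder_url(full_submission_url):
    """Drop the trailing file name, leaving the filing's folder URL (hypothetical helper)."""
    parts = urlsplit(full_submission_url)
    # posixpath.dirname removes the last path segment (the .txt file name).
    folder_path = posixpath.dirname(parts.path) + "/"
    return urlunsplit((parts.scheme, parts.netloc, folder_path, "", ""))


txt_url = "https://www.sec.gov/Archives/edgar/data/320193/000032019319000119/0000320193-19-000119.txt"
folder = filing_folder_url(txt_url)
print(folder)
```

Whether that folder can be listed or scraped for HTML and XML documents would need to be checked against EDGAR itself.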

jadchaar commented 4 years ago

Your request has been noted! This is actually quite related to https://github.com/jadchaar/sec-edgar-downloader/issues/31. When I get a free moment, I will work toward adding this feature!

Originally I created this tool for text parsing purposes, but I have seen a nice influx of users requesting the ability to download XML and HTML versions as well, so this will hopefully be the next feature I work on!

mksamelson commented 4 years ago

Thanks.

Just for additional clarity: the txt files contain HTML tags, but they often also contain a lot of other junk that causes issues when you try to use an HTML/XML parser, so you usually have to resort to regular expressions to parse them. The raw HTML and XML files don't have this issue.
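For anyone who wants a rough idea of the regex approach, here is a minimal sketch. It assumes the full-submission text wraps each embedded file in DOCUMENT and TEXT tags with a TYPE line; the sample string is made up for illustration and is far simpler than a real filing, and split_documents is a hypothetical helper name.

```python
import re

# Tiny made-up sample imitating the assumed layout of a full-submission .txt file.
SAMPLE = """<SEC-HEADER>
header junk
</SEC-HEADER>
<DOCUMENT>
<TYPE>10-K
<FILENAME>main.htm
<TEXT>
<html><body><p>Annual report</p></body></html>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>EX-21.1
<FILENAME>ex21.htm
<TEXT>
<html><body><p>Subsidiaries</p></body></html>
</TEXT>
</DOCUMENT>
"""

# Capture the TYPE line and everything between <TEXT> and </TEXT> for each document.
DOCUMENT_RE = re.compile(
    r"<DOCUMENT>\s*<TYPE>(?P<type>[^\n]+)\n.*?<TEXT>\n(?P<body>.*?)\n</TEXT>",
    re.DOTALL,
)


def split_documents(submission_text):
    """Return (type, body) pairs for each embedded document (hypothetical helper)."""
    return [
        (match.group("type").strip(), match.group("body"))
        for match in DOCUMENT_RE.finditer(submission_text)
    ]


docs = split_documents(SAMPLE)
for doc_type, body in docs:
    print(doc_type, len(body))
```

Each extracted body can then be handed to an HTML parser on its own, without the surrounding header text.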

jadchaar commented 4 years ago

Thanks for letting me know and thanks for finding a regex workaround in the meantime :).

jadchaar commented 3 years ago

v4 of this package will add the ability to download XML and HTML filing details in addition to the full submission TXT: https://github.com/jadchaar/sec-edgar-downloader/pull/52. I still need to make a public-facing function for obtaining the URLs without downloading, but the utility function can serve this purpose until a public function is added to the Downloader class.

jadchaar commented 3 years ago

Another user requested this functionality in an email to me:

I don't use it to download files. Instead, I use it to generate the full_submission_url, and save the urls. i.e., I modified the Downloader() function so that it returns the filings_to_fetch FilingMetadata object.

As such, I'm wondering whether, in future versions of sec-edgar-downloader, you could add an option to return the FilingMetadata object filings_to_fetch?