Closed rpocase closed 2 years ago
For anyone who hits this, I'm able to work around it pretty seamlessly by using requests-random-user-agent. In my use case, the primary issue seems to be the lack of a custom user agent, which gets requests denied as automation much sooner. As long as I am not doing an excessive amount of scraping, I can mostly not worry about the rate limiting for now. Just importing the library is enough for any subsequent calls to requests to get a random user agent assigned.
I'm seeing this error as well. It's throwing an error on line 18 of Edgar.__init__ when trying to parse line 2 (" _name and _cik. I can work around the error (without a code change) by going through NordVPN. @rpocase Do you think you can submit an MR for it?
@joeyism I'd love to, but don't know that I'll find the time to do a proper fix. My workaround has been sufficient for my needs, but I wouldn't recommend introducing it into the base library as the "right" fix.
Just slow it down with time.sleep(10) and it will work fine.
If a user or application submits more than 10 requests per second, further requests from the IP address(es) may be limited for a brief period.
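The quoted limit of 10 requests per second can also be respected with a small throttle instead of a long fixed sleep. This is a minimal sketch, not part of this library: `Throttle` and its parameters are hypothetical names, and the clock and sleep functions are injectable only so the behavior can be checked without waiting.

```python
import time


class Throttle:
    """Space calls at least 1 / max_per_second seconds apart."""

    def __init__(self, max_per_second=10, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call; None before the first one

    def wait(self):
        """Block until the next request is allowed, then record the time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each request keeps a single process under the limit; it does not coordinate several processes sharing one IP address.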
I think there are two issues blended together here: rate limiting and the user agent string.
I am working on a fix for the user agent string that uses a singleton session object where you can set default headers. This also has the added benefit of reusing a pooled TCP connection.
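A rough sketch of what such a shared session could look like with the requests library. This is not the actual fix: `get_session` and the placeholder user agent string are assumptions for illustration only.

```python
import requests

_session = None  # module-level singleton, created on first use


def get_session(user_agent="Sample Company Name admin@example.com"):
    """Return the shared Session, creating it with default headers once.

    The user_agent argument only takes effect on the first call; later
    calls return the session that already exists.
    """
    global _session
    if _session is None:
        _session = requests.Session()
        # Default headers are sent with every request made through the session.
        _session.headers.update({"User-Agent": user_agent})
    return _session
```

Every caller that goes through `get_session().get(url)` then sends the same headers and reuses the session's pooled connections.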
As far as rate limiting goes the SEC should return a 429 and that should be handled with some sort of back off. Has anyone confirmed that the SEC returns a 429? They are also beta testing RESTful APIs so this may all be moot.
Has anyone confirmed that the SEC returns a 429?
Unless anything has changed since issue creation, they respond with a 403.
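Since the status code reported in this thread is 403 rather than 429, a backoff would have to treat both as possible rate-limit signals. The sketch below uses only the standard library; `fetch_with_backoff`, `RETRY_STATUSES`, and the delay values are hypothetical, and a 403 caused by a rejected user agent will not be cured by waiting.

```python
import time
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# 403 is what this thread reports; 429 is the conventional rate-limit code.
RETRY_STATUSES = {403, 429}


def fetch_with_backoff(url, headers, max_attempts=5, base_delay=1.0,
                       opener=urlopen, sleep=time.sleep):
    """Fetch url, doubling the wait after each rate-limit response."""
    for attempt in range(max_attempts):
        try:
            with opener(Request(url, headers=headers), timeout=20) as response:
                return response.read()
        except HTTPError as err:
            # Re-raise anything that is not a rate-limit status, and give up
            # once the final attempt has failed.
            if err.code not in RETRY_STATUSES or attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

With the defaults, the waits between attempts are 1, 2, 4, and 8 seconds before the last error is re-raised.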
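from time import sleep
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

from fake_useragent import UserAgent  # assumed: the "fake user agent" package

ua = UserAgent()
final_url = 'https://www.sec.gov/Archives/edgar/data/0000866273/000086627321000088/0000866273-21-000088-index.htm'
good_read = False
while not good_read:
    sleep(3)
    try:
        user_agent = {'User-agent': ua.random}
        conn = Request(final_url, headers=user_agent)
        response = urlopen(conn)
        # Read the body from the response that was just opened.
        table_response = response.read()
        good_read = True
    except HTTPError as e:
        print("HTTP Error:", e.code, end=" ")
    except URLError as e:
        print("URL Error:", e.reason, end=" ")
    except TimeoutError as e:
        # TimeoutError has no .reason attribute, so print the exception itself.
        print("Timeout Error:", e, end=" ")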
I am using the above code to scrape filings from the sec.gov website with a random user agent, generated with a fake user agent package. Still, I am getting HTTP Error 403. What is the solution to avoid 403 errors? It was working before, but I have been getting Error 403 for the last few days.
Follow https://www.sec.gov/os/webmaster-faq#developers on how to formulate the User-Agent header. Even a fake email would work; I think they are only doing some regex matching.
Thanks, buddy. A few weeks ago I came across the article: the SEC has restricted requests, and each request now has to declare the hostname and a user agent that includes an email address. I used my email address to access the filing link and it is now working.
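For readers who only need the header shape described here, a minimal standard-library sketch follows. `sec_request` is a hypothetical helper, and the "name followed by email" layout of the user agent is an assumption based on the FAQ linked in this thread, so check it against the current guidance.

```python
from urllib.request import Request


def sec_request(url, contact_name, contact_email):
    """Build a Request whose User-Agent names a contact and an email address."""
    if "@" not in contact_email:
        raise ValueError("contact_email should look like an email address")
    headers = {
        # Assumed layout: a name followed by an email address.
        "User-Agent": f"{contact_name} {contact_email}",
        "Host": "www.sec.gov",
    }
    return Request(url, headers=headers)
```

The returned object can be passed straight to `urllib.request.urlopen`.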
@mahantymanoj Can you post your code snippet within a triple back tick ( ``` ) block for proper markup, and show how to implement it?
@joeyism How can we override the built-in requests that Edgar does, using our own request headers?
from time import sleep
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def urlRequest(final_url, user_agent):
    """Request final_url with urllib, sending the given headers."""
    conn = Request(final_url, headers=user_agent)
    response = urlopen(conn, timeout=20)
    return response


def urlRequestHit(link, ua):
    """Retry link every 5 seconds until it is read; return the decoded body."""
    good_read = False
    while not good_read:
        sleep(5)
        try:
            # Send the email address as the user agent, plus the SEC hostname.
            user_agent = {'User-agent': ua, 'Host': 'www.sec.gov'}
            table_response = urlRequest(link, user_agent)
            table_response = table_response.read()
            table_response = table_response.decode('utf-8')
            good_read = True
        except HTTPError as e:
            print("HTTP Error:", e.code, end=" ")
        except URLError as e:
            print("URL Error:", e.reason, end=" ")
        except TimeoutError as e:
            # TimeoutError has no .reason attribute, so print the exception itself.
            print("Timeout Error:", e, end=" ")
    return table_response


#### function call
### ua = <your email address example@gmail.com>
### xml_htm is assumed to hold the URL of the document to fetch
xml_response = urlRequestHit(xml_htm, ua)
I am working with XBRL data files: scraping all the XML links and transforming them into DataFrames.
@mahantymanoj You need to use back-ticks (`), not single quotes ('), to get correct markup.
edit done...
The folks at the SEC published some "guidance" on their FAQ that includes sample request headers. You can find it here.
The SEC website recently (within the last couple of months) added rate limiting. Currently, none of this library's requests handle it properly. This leads to hard-to-decode errors and makes the library generally much less usable in a scripted fashion. When an IP is detected as needing rate limiting, the SEC website returns a 403 response with a body that looks like the text below.