jamesqo / gun-violence-data

A comprehensive, accessible database that contains records of over 260k US gun violence incidents from January 2013 to March 2018.

503 error on stage 2 #10

Open rdeaconu opened 4 years ago

rdeaconu commented 4 years ago

I am trying to update this repo to include data up to Sep 2019, but I'm not able to complete the fields for a given incident_id in stage 2. The code goes into an infinite loop of 503 server errors when accessing the http://www.gunviolencearchive.org/incident/... links. I'm assuming that's because you are allowing coroutines to execute while calling the get(url) function on line 58 in stage2_session.py: resp = await self._sess.get(url). I'm not familiar enough with asyncio and was curious how you managed to overcome this, or whether you know of a workaround that adds some delay between coroutines to avoid detection. Trying to synchronize self._sess.get(url) runs into a series of other errors.
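For reference, one common way to add a delay between coroutines in asyncio is to wrap the request call in a small throttle that limits concurrency and spaces out request start times. The following is a minimal sketch, not the repository's code: ThrottledFetcher, min_interval, and max_concurrent are hypothetical names, and fetch stands in for any awaitable callable such as self._sess.get. Slowing requests down may not be enough if the server blocks clients on other grounds.

```python
import asyncio
import time


class ThrottledFetcher:
    """Wrap an async fetch callable so that request starts are spaced out.

    Hypothetical helper for illustration; create it inside a running
    event loop (for example, inside the coroutine passed to asyncio.run).
    """

    def __init__(self, fetch, min_interval=1.0, max_concurrent=1):
        self._fetch = fetch                    # e.g. self._sess.get (assumption)
        self._min_interval = min_interval      # seconds between request starts
        self._sem = asyncio.Semaphore(max_concurrent)  # cap on in-flight requests
        self._lock = asyncio.Lock()            # serializes the pacing logic
        self._next_start = 0.0                 # earliest time the next request may start

    async def get(self, url):
        async with self._sem:
            async with self._lock:
                # Sleep until the next allowed start time, then reserve the following slot.
                wait = self._next_start - time.monotonic()
                if wait > 0:
                    await asyncio.sleep(wait)
                self._next_start = time.monotonic() + self._min_interval
            return await self._fetch(url)
```

With something like this in place, the call on line 58 could become resp = await throttled.get(url), where throttled is a single ThrottledFetcher shared by all coroutines. The coroutines still run concurrently, but only one request is started per min_interval seconds.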

zcahhad commented 4 years ago

Did you manage to get this sorted in the end? I'm looking to update the data too.

rdeaconu commented 4 years ago

Unfortunately not. I'm assuming the problem is caused by the lack of a delay between requests, which makes it easy for the system on their end to reject them, but I couldn't find a way around this.
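If the 503 responses are a reaction to request rate, a bounded retry with exponential backoff at least avoids looping forever. This is a rough sketch under stated assumptions: get_with_backoff and TooManyRetries are hypothetical names, and fetch is assumed to be an awaitable callable returning a (status, body) tuple, which is not necessarily how stage2_session.py is structured. It only limits and spaces out retries; it does nothing about server-side blocking for other reasons.

```python
import asyncio
import random


class TooManyRetries(Exception):
    """Raised when the server keeps answering 503 after all retries."""


async def get_with_backoff(fetch, url, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Call fetch(url), retrying on HTTP 503 with exponential backoff and jitter.

    fetch is a hypothetical coroutine function returning (status, body).
    """
    for attempt in range(max_retries + 1):
        status, body = await fetch(url)
        if status != 503:
            return status, body
        if attempt == max_retries:
            break
        # Double the delay on each attempt, capped at max_delay,
        # plus random jitter so that coroutines do not retry in lockstep.
        delay = min(max_delay, base_delay * 2 ** attempt)
        await asyncio.sleep(delay + random.uniform(0, delay / 2))
    raise TooManyRetries(f"gave up on {url} after {max_retries} retries")
```

Raising after a fixed number of attempts lets the caller record the failed incident and move on instead of retrying the same URL indefinitely.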

TomSelleck commented 4 years ago

Looks like the website introduced some Cloudflare protection. I switched to using Selenium for this stage; it worked for a while, but then the site bans your IP.
