palewire / django-calaccess-scraped-data

A Django app to scrape campaign-finance data from the California Secretary of State’s CAL-ACCESS website
http://django-calaccess.californiacivicdata.org
MIT License

Scrapers being blocked by Imperva CDN service #7

Closed by palewire 6 years ago

palewire commented 6 years ago

The pages being accessed by this scraper, which contain nothing but public data funded by taxpayers, are now being blocked by the Imperva CDN service. A typical response now looks like this:

<html style="height:100%">
 <head>
  <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"><meta name="format-detection" content="telephone=no"><meta name="viewport" content="initial-scale=1.0">
  <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
 </head>
 <body style="margin:0px;height:100%">
  <iframe src="/_Incapsula_Resource?CWUDNSAI=9&xinfo=5-21975557-0%200NNN%20RT%281523036235574%201%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B12%284%2c315%2c0%29%20U19&incident_id=543000770030160221-94045064965063893&edet=12&cinfo=04000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 543000770030160221-94045064965063893</iframe>
 </body>
</html>
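
A scraper could recognize this kind of response before treating it as data. The following is a minimal sketch, not code from this repository: the helper name `is_incapsula_block` and the marker strings are assumptions drawn only from the blocked response shown above.

```python
# Hypothetical helper. The marker strings are taken from the blocked
# response above, not from any Imperva documentation.
BLOCK_MARKERS = (
    "_Incapsula_Resource",
    "Incapsula incident ID",
)


def is_incapsula_block(html):
    """Return True if the HTML looks like an Incapsula block page."""
    return any(marker in html for marker in BLOCK_MARKERS)


# Shortened stand-ins for a blocked response and a normal page.
blocked = (
    '<iframe src="/_Incapsula_Resource?CWUDNSAI=9">'
    "Request unsuccessful. Incapsula incident ID: 1-2</iframe>"
)
normal = "<html><body><table><tr><td>Proposition 1</td></tr></table></body></html>"

print(is_incapsula_block(blocked))
print(is_incapsula_block(normal))
```

A check like this would let a scraper raise an error instead of parsing a block page as if it were an empty listing.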

palewire commented 6 years ago

When visiting the URLs, like this one (http://cal-access.sos.ca.gov/Campaign/Candidates/list.aspx?view=certified&electNav=62), I am now asked to check a reCAPTCHA box before I can visit the page.

cephillips commented 6 years ago

not good

palewire commented 6 years ago

This is what the reCAPTCHA page looks like. I've verified it on two separate computers.

[Screenshot from 2018-04-06 10-46-02]

palewire commented 6 years ago

You'll note that the reason given is that my computers, both of them, "may have been infected by malware."

palewire commented 6 years ago

My current IP is 192.187.90.114.

palewire commented 6 years ago

I've requested the URL again from my phone's separate Internet service. It works. This leads me to believe my IP was flagged for running a scraper.

[Screenshot taken on the phone, 2018-04-06]

palewire commented 6 years ago

I've sent an email to our point of contact at the California Secretary of State, David Walker. Here is a copy.

Mr. Walker,

My name is Ben Welsh. I'm a reporter at the Los Angeles Times and also an organizer of the California Civic Data Coalition, a network of journalists, academics and developers that writes free software to make CAL-ACCESS's valuable data easier to access and analyze.

Our computer routines to automatically harvest data from your site via web "scrapers" have recently been blocked by your "Imperva" traffic-management service. I have begun to document the issue in our open-source ticket tracker.

I understand the need to address the threat of DDoS attacks, but I can assure you that our efforts to access your site with our computer routines are far from that. All the data we're seeking to access is public. And we're only requesting a small number of pages a couple of times per day.

If this problem exists for us, it will soon exist for other services built on top of your data, like your department's own Power Search (which also scrapes the site) and the California Target Book.

How can we get this problem solved to restore access to our scrapers and others?

Sincerely,

Ben Welsh

palewire commented 6 years ago

I can verify that the routines running on our production server have also been blocked. It has cached the following result for the page linked above.

<html>
 <head>
  <META NAME="robots" CONTENT="noindex,nofollow">
  <script src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3"></script>
  <script>
    (function() { 
    var z="";
    var b
    for (var i=0;i<b.length;i+=2){z=z+parseInt(b.substring(i, i+2), 16)+",";}
    z = z.substring(0,z.length-1);
    eval(eval('String.fromCharCode('+z+')'));
    })();
  </script>
 </head>
 <body>
  <iframe style="display:none;visibility:hidden;" src="//content.incapsula.com/jsTest.html" id="gaIframe"> 
  </iframe>
 </body>
</html>

palewire commented 6 years ago

David Walker has responded.

Good Afternoon,

The Information Technology Division has been made aware of the issue. Could you please provide me with the affected IP address(es) to whitelist?

David Walker-Moore

cephillips commented 6 years ago

Still odd…

palewire commented 6 years ago

This morning I sent the following reply to Mr. Walker.

Mr. Walker,

Thank you for the prompt response.

What is the reason your site, which publishes only public data, is blocking our scraper? Is it blocking all other such efforts? I'm eager to have our access restored, but I'm concerned special access limited to our group, to the exclusion of others, runs counter to the public interest we seek to serve. Is there not a permanent solution that keeps access open for all?

Sincerely,

Ben Welsh

palewire commented 6 years ago

Still seeing this error on my laptop.

# palewire @ bunkerhill in ~/Code/django-calaccess-scraped-data on git:master o [11:34:27] 
$ rm -rf example/.scraper_cache 

# palewire @ bunkerhill in ~/Code/django-calaccess-scraped-data on git:master o [11:34:44] 
$ python example/manage.py scrapecalaccess --verbosity=3
Scraping propositions
 Retrieving data for Campaign/Measures/list.aspx?session=2013
 Making a GET request for http://cal-access.sos.ca.gov/Campaign/Measures/list.aspx?session=2013
 Writing to cache /home/palewire/Code/django-calaccess-scraped-data/example/.scraper_cache/Campaign/Measures/list.aspx?session=2013
Processing 0 election cycles.
Scraping election candidates
 Retrieving data for /Campaign/Candidates/list.aspx?view=certified&electNav=93
 Making a GET request for http://cal-access.sos.ca.gov/Campaign/Candidates/list.aspx?view=certified&electNav=93
 Writing to cache /home/palewire/Code/django-calaccess-scraped-data/example/.scraper_cache/Campaign/Candidates/list.aspx?view=certified&electNav=93
Processing 0 elections.
Scraping incumbent state officials
 Retrieving data for /Campaign/Candidates/list.aspx?view=incumbent
 Making a GET request for http://cal-access.sos.ca.gov/Campaign/Candidates/list.aspx?view=incumbent
 Writing to cache /home/palewire/Code/django-calaccess-scraped-data/example/.scraper_cache/Campaign/Candidates/list.aspx?view=incumbent
Processing 0 elections.

# palewire @ bunkerhill in ~/Code/django-calaccess-scraped-data on git:master o [11:34:47] 
$ cat example/.scraper_cache/Campaign/Measures/list.aspx\?session=2013 
<html style="height:100%"><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"><meta name="format-detection" content="telephone=no"><meta name="viewport" content="initial-scale=1.0"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><script type="text/javascript" src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3"></script></head><body style="margin:0px;height:100%"><iframe src="/_Incapsula_Resource?CWUDNSAI=9&xinfo=1-22369537-0%200NNN%20RT%281524422086806%203%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B12%284%2c321%2c0%29%20U18&incident_id=444000954651905398-837095199103256049&edet=12&cinfo=04000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 444000954651905398-837095199103256049</iframe></body></html>
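
Because the scraper writes whatever it receives to the cache, a blocked run leaves block pages on disk. Below is a rough sketch for finding them, assuming the cache is a plain directory tree of HTML files like the one above. The function name `find_blocked_pages` and the regular expression are illustrative assumptions, not part of this project. It only catches pages that carry an incident ID, so the JavaScript challenge page shown earlier would need a different check.

```python
import os
import re

# Assumed pattern, based on the "Incapsula incident ID: ..." text
# in the cached file above.
INCIDENT_RE = re.compile(r"Incapsula incident ID: ([\d-]+)")


def find_blocked_pages(cache_dir):
    """Yield (path, incident_id) for each cached file holding a block page."""
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            # Cached pages may not be valid UTF-8, so replace bad bytes.
            with open(path, encoding="utf-8", errors="replace") as f:
                match = INCIDENT_RE.search(f.read())
            if match:
                yield path, match.group(1)
```

For example, `list(find_blocked_pages("example/.scraper_cache"))` would list the cached files to delete before retrying.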

palewire commented 6 years ago

I have developed a workaround to this problem. It has not been fixed by the Secretary of State.