Closed palewire closed 6 years ago
When visiting the URLs, like this one http://cal-access.sos.ca.gov/Campaign/Candidates/list.aspx?view=certified&electNav=62, I am now asked to check a recaptcha box before I can visit the page.
not good
Cheryl Phillips cherylephillips3@gmail.com
This is what the recaptcha page looks like. I've verified it on two separate computers.
You'll note that the reason given is that my computers, both of them, "may have been infected by malware."
My current IP is 192.187.90.114.
I've requested the URL again from my phone's separate Internet service. It works. This leads me to believe my IP was flagged for running a scraper.
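For reference, the challenge page is easy to spot programmatically. Here's a minimal sketch of a check (the helper name and marker list are my own, not part of the scraper):

```python
# Hypothetical helper (not part of django-calaccess-scraped-data): flag a
# response body that is an Incapsula/Imperva challenge page rather than
# real CAL-ACCESS data.
INCAPSULA_MARKERS = (
    "_Incapsula_Resource",
    "Incapsula incident ID",
    "content.incapsula.com",
)

def looks_blocked(html):
    """Return True if the HTML contains a known Incapsula block marker."""
    return any(marker in html for marker in INCAPSULA_MARKERS)
```

A check like this could let the scraper fail loudly instead of silently caching block pages as if they were data.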
I've sent an email to our point of contact at the California Secretary of State, David Walker. Here is a copy.
Mr. Walker,
My name is Ben Welsh. I'm a reporter at the Los Angeles Times and also an organizer of the California Civic Data Coalition, a network of journalists, academics and developers that writes free software to make CAL-ACCESS's valuable data easier to access and analyze.
Our computer routines to automatically harvest data from your site via web "scrapers" have recently been blocked by your "Imperva" traffic-management service. I have begun to document the issue in our open-source ticket tracker.
I understand the need to address the threat of DDOS attacks, but I can assure you our efforts to access your site with our computer routines are far from that. All the data we're seeking to access is public. And we're only requesting a small number of pages a couple times per day.
If this problem exists for us, it will soon exist for other services built on top of your data, like your department's own Power Search (which also scrapes the site) and the California Target Book.
How can we get this problem solved to restore access to our scrapers and others?
Sincerely,
Ben Welsh
I can verify that the routines running on our production server have also been blocked. It has cached the following result for the page linked above.
<html>
<head>
<META NAME="robots" CONTENT="noindex,nofollow">
<script src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3"></script>
<script>
(function() {
var z="";
var b="7472797B766172207868723B76617220743D6E6577204461746528292E67657454696D6528293B766172207374617475733D227374617274223B7661722074696D696E673D6E65772041727261792833293B77696E646F772E6F6E756E6C6F61643D66756E6374696F6E28297B74696D696E675B325D3D22723A222B286E6577204461746528292E67657454696D6528292D74293B646F63756D656E742E637265617465456C656D656E742822696D6722292E7372633D222F5F496E63617073756C615F5265736F757263653F4553324C555243543D363726743D373826643D222B656E636F6465555249436F6D706F6E656E74287374617475732B222028222B74696D696E672E6A6F696E28292B222922297D3B69662877696E646F772E584D4C4874747052657175657374297B7868723D6E657720584D4C48747470526571756573747D656C73657B7868723D6E657720416374697665584F626A65637428224D6963726F736F66742E584D4C4854545022297D7868722E6F6E726561647973746174656368616E67653D66756E6374696F6E28297B737769746368287868722E72656164795374617465297B6361736520303A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2072657175657374206E6F7420696E697469616C697A656420223B627265616B3B6361736520313A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2073657276657220636F6E6E656374696F6E2065737461626C6973686564223B627265616B3B6361736520323A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2072657175657374207265636569766564223B627265616B3B6361736520333A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2070726F63657373696E672072657175657374223B627265616B3B6361736520343A7374617475733D22636F6D706C657465223B74696D696E675B315D3D22633A222B286E6577204461746528292E67657454696D6528292D74293B6966287868722E7374617475733D3D323030297B706172656E742E6C6F636174696F6E2E72656C6F616428297D627265616B7D7D3B74696D696E675B305D3D22733A222B286E6577204461746528292E67657454696D6528292D74293B7868722E6F70656E2822474554222C222F5F496E63617073756C615F5265736F757263653F535748414E45444C3D353730343037363831353232393938343633372C31343935303931303130383836363633383137342C393931303936323037313832383136353639392C323132303636222C66616C7365293B7868722E73656
E64286E756C6C297D63617463682863297B7374617475732B3D6E6577204461746528292E67657454696D6528292D742B2220696E6361705F6578633A20222B633B646F63756D656E742E637265617465456C656D656E742822696D6722292E7372633D222F5F496E63617073756C615F5265736F757263653F4553324C555243543D363726743D373826643D222B656E636F6465555249436F6D706F6E656E74287374617475732B222028222B74696D696E672E6A6F696E28292B222922297D3B";
for (var i=0;i<b.length;i+=2){z=z+parseInt(b.substring(i, i+2), 16)+",";}
z = z.substring(0,z.length-1);
eval(eval('String.fromCharCode('+z+')'));
})();
</script>
</head>
<body>
<iframe style="display:none;visibility:hidden;" src="//content.incapsula.com/jsTest.html" id="gaIframe">
</iframe>
</body>
</html>
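The obfuscated script in that page simply hex-decodes the long `b` string into JavaScript and evals it. The same decoding can be done offline to inspect the payload (a sketch; `decode_incapsula_payload` is my own name for it):

```python
def decode_incapsula_payload(b):
    """Hex-decode the `b` string from an Incapsula challenge page,
    mirroring the page's own parseInt(..., 16) loop."""
    return bytes.fromhex(b).decode("latin-1")
```

Decoding the string above shows a try/catch block that times an XMLHttpRequest to another /_Incapsula_Resource endpoint and reloads the page if it returns a 200.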
David Walker has responded.
Good Afternoon,
The Information Technology Division has been made aware of the issue. Could you please provide me with the affected IP address(es) to whitelist?
David Walker-Moore
Still odd…
Cheryl Phillips cherylephillips3@gmail.com
This morning I sent the following reply to Mr. Walker.
Mr. Walker,
Thank you for the prompt response.
What is the reason your site, which publishes only public data, is blocking our scraper? Is it blocking all other such efforts? I'm eager to have our access restored, but I'm concerned special access limited to our group, to the exclusion of others, runs counter to the public interest we seek to serve. Is there not a permanent solution that keeps access open for all?
Sincerely,
Ben Welsh
Still seeing this error on my laptop.
# palewire @ bunkerhill in ~/Code/django-calaccess-scraped-data on git:master o [11:34:27]
$ rm -rf example/.scraper_cache
# palewire @ bunkerhill in ~/Code/django-calaccess-scraped-data on git:master o [11:34:44]
$ python example/manage.py scrapecalaccess --verbosity=3
Scraping propositions
Retrieving data for Campaign/Measures/list.aspx?session=2013
Making a GET request for http://cal-access.sos.ca.gov/Campaign/Measures/list.aspx?session=2013
Writing to cache /home/palewire/Code/django-calaccess-scraped-data/example/.scraper_cache/Campaign/Measures/list.aspx?session=2013
Processing 0 election cycles.
Scraping election candidates
Retrieving data for /Campaign/Candidates/list.aspx?view=certified&electNav=93
Making a GET request for http://cal-access.sos.ca.gov/Campaign/Candidates/list.aspx?view=certified&electNav=93
Writing to cache /home/palewire/Code/django-calaccess-scraped-data/example/.scraper_cache/Campaign/Candidates/list.aspx?view=certified&electNav=93
Processing 0 elections.
Scraping incumbent state officials
Retrieving data for /Campaign/Candidates/list.aspx?view=incumbent
Making a GET request for http://cal-access.sos.ca.gov/Campaign/Candidates/list.aspx?view=incumbent
Writing to cache /home/palewire/Code/django-calaccess-scraped-data/example/.scraper_cache/Campaign/Candidates/list.aspx?view=incumbent
Processing 0 elections.
# palewire @ bunkerhill in ~/Code/django-calaccess-scraped-data on git:master o [11:34:47]
$ cat example/.scraper_cache/Campaign/Measures/list.aspx\?session=2013
<html style="height:100%"><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"><meta name="format-detection" content="telephone=no"><meta name="viewport" content="initial-scale=1.0"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><script type="text/javascript" src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3"></script></head><body style="margin:0px;height:100%"><iframe src="/_Incapsula_Resource?CWUDNSAI=9&xinfo=1-22369537-0%200NNN%20RT%281524422086806%203%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B12%284%2c321%2c0%29%20U18&incident_id=444000954651905398-837095199103256049&edet=12&cinfo=04000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 444000954651905398-837095199103256049</iframe></body></html>
I have developed a workaround to this problem. It has not been fixed by the Secretary of State.
The pages being accessed by this scraper, which contain nothing but public data funded by taxpayers, are now being blocked by the Imperva CDN service. A typical response now looks like this: