Open mlandauer opened 1 year ago
Hmm, strange. I had created an issue (see #823), and it looks like this had been failing for some time already. I'm puzzled.
My notes from #823 reporting this scraper and initial investigation.
Injecting scraper and running...
/app/vendor/bundle/ruby/2.5.0/gems/mechanize-2.7.6/lib/mechanize/http/agent.rb:329:in `fetch': 429 => Net::HTTPTooManyRequests for https://cdn.plan.sa.gov.au/public-notifications/getpublicnoticessummary -- unhandled response (Mechanize::ResponseCodeError)
from /app/vendor/bundle/ruby/2.5.0/gems/mechanize-2.7.6/lib/mechanize/http/agent.rb:998:in `response_redirect'
from /app/vendor/bundle/ruby/2.5.0/gems/mechanize-2.7.6/lib/mechanize/http/agent.rb:321:in `fetch'
from /app/vendor/bundle/ruby/2.5.0/gems/mechanize-2.7.6/lib/mechanize.rb:1323:in `post_form'
from /app/vendor/bundle/ruby/2.5.0/gems/mechanize-2.7.6/lib/mechanize.rb:536:in `post'
from scraper.rb:9:in
@mlandauer it's a big one, please can you take a look next time you're in scraper-land?
@jamezpolley could you go ahead and see what's causing the error? This is a high priority :)
As far as I can tell, the site is always returning a 429, but includes a page body with some javascript. Executing the javascript sets some session variables and then reloads the page, which then loads successfully.
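A minimal sketch of how the scraper could recognise this situation instead of crashing, assuming the challenge page can be identified by the 429 status plus the `KPSDK` marker seen in the returned script. The helper name `kasada_challenge?` is hypothetical and the marker check is a guess based on the page body quoted in this thread, not a documented contract.

```ruby
# Hypothetical helper: guess whether a response is the javascript challenge
# page rather than a genuine rate limit.
# Assumption: the challenge body always mentions "KPSDK".
def kasada_challenge?(status, body)
  status.to_i == 429 && body.to_s.include?("KPSDK")
end

# Possible use with Mechanize (not run here; `agent`, `url` and `params`
# are placeholders for whatever scraper.rb already has):
#
#   begin
#     page = agent.post(url, params)
#   rescue Mechanize::ResponseCodeError => e
#     raise unless kasada_challenge?(e.response_code, e.page.body)
#     warn "Blocked by javascript challenge, a plain HTTP client won't get through"
#   end
```

This only detects the block so the failure is reported clearly; it does nothing to get past it.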
I suspect that to work with this we'll need to execute the javascript, as hinted at by https://morph.io/documentation/scraping_javascript_sites. However, we'll need to also execute the script on the page; the example doesn't seem to cover that.
I've started work on this - https://morph.io/jamezpolley/saplanningportal. capybara is getting the page contents, but not executing the script:
/app/vendor/ruby-2.7.0/lib/ruby/2.7.0/json/common.rb:156:in `parse': 783: unexpected token at '<html><head></head><body><script>window.KPSDK={};KPSDK.now=typeof performance!=='undefined'&&performance.now?performance.now.bind(performance):Date.now.bind(Date);KPSDK.start=KPSDK.now();</script><script src="/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/ips.js?KP_UIDz=0wPSPsBPz5PEXJYyv4U3GUemm7MGgTj8kTp46Zqfcv7xTaYuMluvTZMGhGcdCITspZDyD1FGLzlbgIQMUglSiaOaJH1R0cJninBXsfCd2WsFJaQOzeWIm2xOFzjnin6ubBOG2rhhNPPpfz8QWjLiTyVOkJLgP8n1&x-kpsdk-im=CiRmZGJiMjFjYy1hM2RiLTRhMzktYjJlMi1lZTAzZGNiNDJiNDg"></script><iframe src="javascript:;" style="display: none;"></iframe></body></html>' (JSON::ParserError)
from /app/vendor/ruby-2.7.0/lib/ruby/2.7.0/json/common.rb:156:in `parse'
from scraper.rb:26:in `<main>'
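The `JSON::ParserError` above comes from handing the HTML challenge page straight to the JSON parser. A small guard, sketched below, would make that failure mode explicit. The names `ChallengePageError` and `parse_notices` are hypothetical, and the "starts with `<`" test is an assumption about what a non-JSON response looks like here.

```ruby
require "json"

# Hypothetical error class for "we got HTML where JSON was expected".
class ChallengePageError < StandardError; end

# Parse the response body as JSON, but fail with a readable message if the
# body looks like an HTML page (assumption: HTML starts with "<").
def parse_notices(body)
  text = body.to_s.lstrip
  if text.start_with?("<")
    raise ChallengePageError, "expected JSON but got HTML: #{text[0, 60]}..."
  end
  JSON.parse(text)
end
```

With this in place the scraper would report that it received a challenge page, rather than an unexpected-token error with the whole page in the message.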
Someone who lives in the area has written to SA asking for PlanningAlerts to be given access again.
Missive conversation: https://mail.missiveapp.com/#inbox/conversations/3820d053-199a-46d8-8bef-55362e80b02c
Missive conversation: https://mail.missiveapp.com/#inbox/conversations/d1788c04-1804-42d8-a233-7f68bdcb2ced
Missive conversation: https://mail.missiveapp.com/#inbox/conversations/499869b9-8965-4e84-bf92-71a0228ddae6
A while ago they put some anti-scraping tech (Kasada) in place. On 7 Feb 2024 I found an endpoint that wasn't being "protected" by Kasada. For a while that fixed things, I think, but not for long. Now that endpoint is being blocked too. It's all too boring. I'm not going to put any more effort into this.
This issue has been automatically created by PlanningAlerts. Only close this issue once the authority is working again on PlanningAlerts.