planningalerts-scrapers / issues

Only for keeping track of all issues related to scraping

South Australia Planning Portal #824

Open mlandauer opened 1 year ago

mlandauer commented 1 year ago

This issue has been automatically created by PlanningAlerts. Only close this issue once the authority is working again on PlanningAlerts.

katska commented 1 year ago

Hmm, strange. I had already created an issue (see #823), and it looks like the scraper had been failing for some time by then. I'm puzzled.

katska commented 1 year ago

My notes from #823, reporting this scraper and my initial investigation:

Injecting scraper and running...
/app/vendor/bundle/ruby/2.5.0/gems/mechanize-2.7.6/lib/mechanize/http/agent.rb:329:in `fetch': 429 => Net::HTTPTooManyRequests for https://cdn.plan.sa.gov.au/public-notifications/getpublicnoticessummary -- unhandled response (Mechanize::ResponseCodeError)
    from /app/vendor/bundle/ruby/2.5.0/gems/mechanize-2.7.6/lib/mechanize/http/agent.rb:998:in `response_redirect'
    from /app/vendor/bundle/ruby/2.5.0/gems/mechanize-2.7.6/lib/mechanize/http/agent.rb:321:in `fetch'
    from /app/vendor/bundle/ruby/2.5.0/gems/mechanize-2.7.6/lib/mechanize.rb:1323:in `post_form'
    from /app/vendor/bundle/ruby/2.5.0/gems/mechanize-2.7.6/lib/mechanize.rb:536:in `post'
    from scraper.rb:9:in `<main>'

I can get to the URL noted, https://cdn.plan.sa.gov.au/public-notifications/getpublicnoticessummary, in the browser just fine.

@mlandauer it's a big one, please can you take a look next time you're in scraper-land?

katska commented 1 year ago

@jamezpolley could you go ahead and see what's causing the error? This is a high priority :)

jamezpolley commented 1 year ago

As far as I can tell, the site always returns a 429, but the response includes a page body with some JavaScript. Executing that JavaScript sets some session variables and then reloads the page, which loads successfully the second time.

I suspect that to work with this we'll need to execute the javascript, as hinted at by https://morph.io/documentation/scraping_javascript_sites. However, we'll need to also execute the script on the page; the example doesn't seem to cover that.
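
To make the distinction above concrete, here is a minimal sketch of how a scraper could tell a 429 that carries a JavaScript challenge apart from an ordinary rate-limit 429. The helper name `javascript_challenge?` is hypothetical and not part of the real scraper, and the "body contains a script tag" check is an assumption based only on the behaviour described here, not a documented property of the site.

```ruby
# Hypothetical helper for illustration; not taken from the actual scraper.
# Returns true when a 429 response looks like a JavaScript challenge page
# rather than a plain "too many requests" message.
#
# status may be an Integer or a String such as "429".
def javascript_challenge?(status, body)
  return false unless status.to_i == 429
  return false if body.nil?

  # Assumption: a challenge page is markup that carries at least one
  # <script> tag, as in the response body described in this thread.
  body.match?(/<script\b/i)
end
```

A scraper could call this from its error handling and, when it returns true, hand the URL to a real browser driver instead of retrying the plain HTTP request.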

I've started work on this at https://morph.io/jamezpolley/saplanningportal. Capybara is getting the page contents, but not executing the script:

/app/vendor/ruby-2.7.0/lib/ruby/2.7.0/json/common.rb:156:in `parse': 783: unexpected token at '<html><head></head><body><script>window.KPSDK={};KPSDK.now=typeof performance!=='undefined'&&performance.now?performance.now.bind(performance):Date.now.bind(Date);KPSDK.start=KPSDK.now();</script><script src="/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/ips.js?KP_UIDz=0wPSPsBPz5PEXJYyv4U3GUemm7MGgTj8kTp46Zqfcv7xTaYuMluvTZMGhGcdCITspZDyD1FGLzlbgIQMUglSiaOaJH1R0cJninBXsfCd2WsFJaQOzeWIm2xOFzjnin6ubBOG2rhhNPPpfz8QWjLiTyVOkJLgP8n1&amp;x-kpsdk-im=CiRmZGJiMjFjYy1hM2RiLTRhMzktYjJlMi1lZTAzZGNiNDJiNDg"></script><iframe src="javascript:;" style="display: none;"></iframe></body></html>' (JSON::ParserError)
    from /app/vendor/ruby-2.7.0/lib/ruby/2.7.0/json/common.rb:156:in `parse'
    from scraper.rb:26:in `<main>'
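
The `JSON::ParserError` above comes from passing the challenge page's HTML straight to `JSON.parse`. As a rough sketch, a guard like the one below would fail with a clearer message when markup comes back instead of JSON. The names `parse_notices` and `NotJsonError` are hypothetical, introduced only for this example, and the "starts with `<`" check is a simplifying assumption.

```ruby
require "json"

# Hypothetical error class, introduced only for this sketch.
class NotJsonError < StandardError; end

# Hypothetical helper: parses a response body as JSON, but raises a
# descriptive error first if the body looks like HTML markup.
def parse_notices(body)
  text = body.to_s.lstrip

  # Assumption: a JSON payload starts with "{" or "[", so a leading "<"
  # means we were served a page (for example a challenge) instead.
  if text.start_with?("<")
    raise NotJsonError, "expected JSON but got markup: #{text[0, 60]}"
  end

  JSON.parse(text)
end
```

This does not get past the challenge; it only makes the failure easier to recognise in the scraper log.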

katska commented 8 months ago

Someone who lives in the area has written to SA asking for PlanningAlerts to be given access again.

katska commented 4 months ago

Missive conversation: https://mail.missiveapp.com/#inbox/conversations/3820d053-199a-46d8-8bef-55362e80b02c

katska commented 4 months ago

Missive conversation: https://mail.missiveapp.com/#inbox/conversations/d1788c04-1804-42d8-a233-7f68bdcb2ced

katska commented 4 months ago

Missive conversation: https://mail.missiveapp.com/#inbox/conversations/499869b9-8965-4e84-bf92-71a0228ddae6

mlandauer commented 4 months ago

A while ago they put some anti-scraping tech (Kasada) in place. On 7 Feb 2024 I found an endpoint that wasn't being "protected" by Kasada. I think that fixed things for a while, but not for long: that endpoint is now being blocked too. It's all too boring. I'm not going to put any more effort into this.