Closed nmcrae85 closed 1 year ago
I think it's going to need some browser emulation software
I'm still seeing this @robbrad
(uk-bin-collection-py3.9) (base) neil.mcrae@Neil-McRaes-MacBook-Pro uk_bin_collection % python collect_data.py WakefieldCityCouncil "https://www.wakefield.gov.uk/site/Where-I-Live-Results?uprn=63161064"
{ "bins": []
I'll take another look 👀
The body returned is slightly different, it's
'\r\n
\r\n\r\n\r\n\r\n\r\n'
curl -s -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36" 'https://www.wakefield.gov.uk/site/Where-I-Live-Results?uprn=63161064'
Hmmm, looks like parsing with requests may get around it too. I'll update the parser to try it out.
@nmcrae85 here's hoping this works. I've updated how it's run too, just so you're aware: you pass the UPRN in now with -u as well
Am I doing something wrong here?
python collect_data.py WakefieldCityCouncil https://www.wakefield.gov.uk/site/Where-I-Live-Results -u "63161064"
Traceback (most recent call last):
File "/Users/neil.mcrae/Documents/UKBinCollectionData/uk_bin_collection/uk_bin_collection/collect_data.py", line 6, in
Sorry, my bad, moved directories and forgot to install dependencies :)
However, still this
(uk-bin-collection-py3.9) (base) neil.mcrae@Neil-McRaes-MacBook-Pro uk_bin_collection % python collect_data.py WakefieldCityCouncil https://www.wakefield.gov.uk/site/Where-I-Live-Results -u "63161064"
{ "bins": [] }
No worries, everyone gets module not found sometimes (even me, and I've written lots of these 😂).
If you want, you can try using this branch in my repo - it's what the current PR is for 😊
:) makes me feel better.
I'm still getting this tho.
{ "bins": [] }
With the new commits, I'm seeing these errors now?
(uk-bin-collection-py3.9) (base) neil.mcrae@Neil-McRaes-MacBook-Pro uk_bin_collection % python collect_data.py WakefieldCityCouncil https://www.wakefield.gov.uk/site/Where-I-Live-Results -u "63161064"
/Users/neil.mcrae/Library/Caches/pypoetry/virtualenvs/uk-bin-collection-SJQL-SeG-py3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1045: InsecureRequestWarning: Unverified HTTPS request is being made to host 'www.wakefield.gov.uk'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
warnings.warn(
Traceback (most recent call last):
File "/Users/neil.mcrae/Documents/UKBinCollectionData/uk_bin_collection/uk_bin_collection/collect_data.py", line 71, in
@nmcrae85 This is related to Wakefield Council's firewall blocking the scraper, and is unfortunately a wontfix issue.
@dp247 are they blocking the IP or the session?
With requests you can make a session using cookies
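A minimal sketch of the session idea, assuming the block is per session rather than per IP (which may well not be the case here). The `make_session` helper and the example cookie name are hypothetical; the User-Agent is the one from the curl command earlier in this thread.

```python
import requests

# User-Agent copied from the curl command earlier in this thread.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
)


def make_session(cookies=None):
    """Build a requests.Session that reuses headers and cookies across calls.

    `cookies` is an optional dict of cookie name -> value (hypothetical input).
    """
    session = requests.Session()
    session.headers.update({"User-Agent": BROWSER_UA})
    for name, value in (cookies or {}).items():
        session.cookies.set(name, value)
    return session


# Usage (not run here, needs network access):
# session = make_session({"example_cookie": "value"})  # hypothetical cookie
# response = session.get(
#     "https://www.wakefield.gov.uk/site/Where-I-Live-Results",
#     params={"uprn": "63161064"},
# )
```

Any cookies the server sets on the first response are kept on the session and sent with later requests, so this only helps if the block is tied to the session.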
Looks like IP for me
@nmcrae85 - reduce your scrape time down to once every 24 hrs or weekly if that works - you should be able to do this on Home Assistant https://www.home-assistant.io/integrations/rest/#scan_interval
I can't even get it to work outside HA. I have just been trying via the terminal on my Mac.
Does it work from the website? As in if you were just using it like a typical user?
If not does it work from another internet connection?
It does via a web browser from the same network yes.
@nmcrae85 - it's not good news, I'm afraid. What's happening is that when you go via your web browser you get the captcha from Imperva, which is Wakefield Council's anti-scraping capability. As a human you answer this and it sets a cookie with a value visid_incap_2049675=<GUID>. This then continues to work for your session. When the Python script grabs the page, it can't complete the Imperva captcha, so the cookie value never gets set and the request fails.
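One way the scraper could make this failure visible, rather than quietly returning {"bins": []}, is to treat a whitespace-only body (like the '\r\n...' one seen earlier in this thread) as a blocked response. This is only a sketch: `BlockedResponseError` and `ensure_not_blank` are hypothetical names, not part of the project, and a blank body is a heuristic rather than proof of a block.

```python
class BlockedResponseError(RuntimeError):
    """Raised when the page body is empty, which may mean the request was blocked."""


def ensure_not_blank(body: str) -> str:
    """Return the body unchanged, or raise if it contains only whitespace."""
    if not body.strip():
        raise BlockedResponseError(
            "Response body was blank; the council site may be blocking the scraper."
        )
    return body
```

Calling `ensure_not_blank(response.text)` before parsing would surface a clear error instead of an empty bin list.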
@dp247 : FYI
Wakefield Council must have a real issue with scraping to be able to fund putting in such a system (and I'm sure it's not bin data that they are trying to govern, more likely property data etc). I would encourage you to reach out to Wakefield Council and ask them if you can get access to an API for your data. This way they can rate limit you based on an API key and you get the data as you need it.
@dp247 What are your thoughts on a list in the wiki for councils like this ?
I was thinking of a "not currently supported list" for ones that are hiding behind firewalls, aye. Either in the wiki or in the readme, so people don't open duplicates
Name of Council
Wakefield Council
Example Postcode
WF15SL
Additional Information
The current integration doesn't seem to provide any output. I'm just getting {"bins": []} returned.