robbrad / UKBinCollectionData

UK Council Bin Collection Data Parser, Outputting Bin Data as JSON
MIT License
161 stars 95 forks

Wakefield Council #152

Closed nmcrae85 closed 1 year ago

nmcrae85 commented 1 year ago

Name of Council

Wakefield Council

Example Postcode

WF15SL

Additional Information

The current integration doesn't seem to provide an output. I'm just getting {"bins": []} returned.

nmcrae85 commented 1 year ago

I think it's going to need some browser emulation software.

nmcrae85 commented 1 year ago

I'm still seeing this @robbrad

(uk-bin-collection-py3.9) (base) neil.mcrae@Neil-McRaes-MacBook-Pro uk_bin_collection % python collect_data.py WakefieldCityCouncil "https://www.wakefield.gov.uk/site/Where-I-Live-Results?uprn=63161064"
{
    "bins": []
}

dp247 commented 1 year ago

I'll take another look 👀

nmcrae85 commented 1 year ago

The body returned is slightly different, it's

'\r\n\r\n\r\n\r\n\r\n\r\n'

nmcrae85 commented 1 year ago

curl -s -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36" 'https://www.wakefield.gov.uk/site/Where-I-Live-Results?uprn=63161064'

%
dp247 commented 1 year ago

Hmmm, looks like parsing with requests may get around it too. I'll update the parser to try it out.
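As a rough illustration of that approach, the request can be built with a browser-like User-Agent header (the URL and UPRN are the ones from earlier in this thread; the actual parser code may differ). The sketch below only constructs and prepares the request, without sending it:

```python
import requests

# Build (but don't send) a GET request that mimics a desktop browser,
# using the URL and UPRN quoted earlier in this thread.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/108.0.0.0 Safari/537.36"
    )
}
req = requests.Request(
    "GET",
    "https://www.wakefield.gov.uk/site/Where-I-Live-Results",
    params={"uprn": "63161064"},
    headers=headers,
)
prepared = req.prepare()
print(prepared.url)  # full URL including the uprn query string
# To actually fetch: requests.Session().send(prepared, timeout=30)
```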

dp247 commented 1 year ago

@nmcrae85 here's hoping this works. I've updated how it's run too, just so you're aware - you pass the UPRN in with -u now as well.

nmcrae85 commented 1 year ago

Am I doing something wrong here?

python collect_data.py WakefieldCityCouncil https://www.wakefield.gov.uk/site/Where-I-Live-Results -u "63161064"
Traceback (most recent call last):
  File "/Users/neil.mcrae/Documents/UKBinCollectionData/uk_bin_collection/uk_bin_collection/collect_data.py", line 6, in <module>
    from uk_bin_collection.uk_bin_collection.get_bin_data import AbstractGetBinDataClass
ModuleNotFoundError: No module named 'uk_bin_collection'

nmcrae85 commented 1 year ago

Sorry, my bad - moved directories and forgot to install dependencies :)

However, still this

(uk-bin-collection-py3.9) (base) neil.mcrae@Neil-McRaes-MacBook-Pro uk_bin_collection % python collect_data.py WakefieldCityCouncil https://www.wakefield.gov.uk/site/Where-I-Live-Results -u "63161064"
{
    "bins": []
}

dp247 commented 1 year ago

No worries, everyone gets module not found sometimes (even me, and I've written lots of these 😂).

If you want, you can try using this branch in my repo - it's what the current PR is for 😊

nmcrae85 commented 1 year ago

:) makes me feel better.

I'm still getting this though:

{ "bins": [] }

nmcrae85 commented 1 year ago

With the new commits, I'm seeing these errors now:

(uk-bin-collection-py3.9) (base) neil.mcrae@Neil-McRaes-MacBook-Pro uk_bin_collection % python collect_data.py WakefieldCityCouncil https://www.wakefield.gov.uk/site/Where-I-Live-Results -u "63161064"
/Users/neil.mcrae/Library/Caches/pypoetry/virtualenvs/uk-bin-collection-SJQL-SeG-py3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1045: InsecureRequestWarning: Unverified HTTPS request is being made to host 'www.wakefield.gov.uk'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Traceback (most recent call last):
  File "/Users/neil.mcrae/Documents/UKBinCollectionData/uk_bin_collection/uk_bin_collection/collect_data.py", line 71, in <module>
    main(sys.argv[1:])
  File "/Users/neil.mcrae/Documents/UKBinCollectionData/uk_bin_collection/uk_bin_collection/collect_data.py", line 57, in main
    return client_code(
  File "/Users/neil.mcrae/Documents/UKBinCollectionData/uk_bin_collection/uk_bin_collection/collect_data.py", line 23, in client_code
    return get_bin_data_class.template_method(address_url, **kwargs)
  File "/Users/neil.mcrae/Documents/UKBinCollectionData/uk_bin_collection/uk_bin_collection/get_bin_data.py", line 53, in template_method
    bin_data_dict = self.parse_data(
  File "/Users/neil.mcrae/Documents/UKBinCollectionData/uk_bin_collection/uk_bin_collection/councils/WakefieldCityCouncil.py", line 62, in parse_data
    datetime.strptime(soup.select("#ctl00_PlaceHolderMain_Waste_output > div:nth-child(4) > "
IndexError: list index out of range
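For context on that IndexError: soup.select() returns an empty list when the expected element isn't in the page (here, because the firewall serves an empty body), and indexing into it then raises. A minimal sketch of the failure mode and a guard, using an empty stand-in page rather than the real response:

```python
from bs4 import BeautifulSoup

# An empty body stands in for the blocked Wakefield response.
html = "<html><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

# select() returns [] when nothing matches; indexing [] raises IndexError.
matches = soup.select("#ctl00_PlaceHolderMain_Waste_output > div:nth-child(4)")
if not matches:
    print("Expected element not found - page may be blocked or empty")
else:
    print(matches[0].get_text())
```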

dp247 commented 1 year ago

@nmcrae85 This is related to Wakefield Council's firewall blocking the scraper, and is unfortunately a wontfix issue.

robbrad commented 1 year ago

@dp247 are they blocking the IP or the session?

With requests you can make a session using cookies
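A minimal sketch of that idea (the cookie name and value here are placeholders, not real ones from the site):

```python
import requests

# A requests.Session stores cookies the server sets via Set-Cookie and
# replays them on every later request, mimicking a browser session.
session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0"  # minimal browser-like UA

# Cookies can also be injected by hand before making requests:
session.cookies.set("example_cookie", "example_value")
print(session.cookies.get("example_cookie"))

# Any subsequent session.get(...) in this session would send that cookie.
```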

dp247 commented 1 year ago

Looks like IP for me


robbrad commented 1 year ago

@nmcrae85 - reduce your scrape interval to once every 24 hours, or weekly if that works - you should be able to do this in Home Assistant: https://www.home-assistant.io/integrations/rest/#scan_interval
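For reference, a Home Assistant RESTful sensor can set that interval in configuration.yaml. The sketch below is illustrative only - the sensor name and resource URL are assumptions, not taken from this thread:

```yaml
# configuration.yaml -- illustrative sketch; resource URL and sensor
# name are placeholders, not from this thread.
sensor:
  - platform: rest
    name: wakefield_bin_collection
    resource: http://localhost:8000/bins.json
    value_template: "{{ value_json.bins }}"
    scan_interval: 86400  # once every 24 hours, in seconds
```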

nmcrae85 commented 1 year ago

I can't even get it to work without being inside HA. I have just been trying via the terminal on my Mac.


robbrad commented 1 year ago

Does it work from the website? As in if you were just using it like a typical user?

If not does it work from another internet connection?

nmcrae85 commented 1 year ago

It does via a web browser from the same network yes.


robbrad commented 1 year ago

@nmcrae85 - it's not good news I'm afraid. What's happening is: when you go via your web browser, you get a CAPTCHA from Imperva, which is Wakefield Council's anti-scraping capability. As a human you answer it, and it sets a cookie with a value visid_incap_2049675=<GUID> - this then continues to work for your session. When the Python grabs the page, it can't complete the Imperva CAPTCHA, so it can't set the cookie value, and it fails.

@dp247 : FYI

Wakefield Council must have a real issue with scraping to fund putting in such a system (and I'm sure it's not bin data they are trying to govern - more than likely property data etc.). I would encourage you to reach out to Wakefield Council and ask if you can get access to an API for your data - that way they can rate-limit you based on an API key, and you get the data as you need it.
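Given the cookie mechanism described above, one fragile workaround sometimes tried is replaying a cookie copied by hand from a browser session after solving the CAPTCHA - fragile because Imperva rotates these values. A sketch, where the GUID is a placeholder and the request is only prepared, not sent:

```python
import requests

# Replaying a browser-issued Imperva cookie by hand. The GUID below is
# a placeholder -- a real value would have to be copied from browser
# dev tools after solving the CAPTCHA, and expires when rotated.
cookies = {"visid_incap_2049675": "00000000-0000-0000-0000-000000000000"}
url = "https://www.wakefield.gov.uk/site/Where-I-Live-Results"

# requests encodes the cookie dict into a Cookie header on the request:
req = requests.Request(
    "GET", url, params={"uprn": "63161064"}, cookies=cookies
).prepare()
print(req.headers["Cookie"])
# To actually fetch: requests.Session().send(req, timeout=30)
```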

robbrad commented 1 year ago

@dp247 What are your thoughts on a list in the wiki for councils like this ?

dp247 commented 1 year ago

> @dp247 What are your thoughts on a list in the wiki for councils like this?

I was thinking of a "not currently supported" list for ones that are hiding behind firewalls, aye. Either in the wiki or in the readme, so people don't open duplicates.