robbrad / UKBinCollectionData

UK Council Bin Collection Data Parser Outputting Bin Data as a JSON
MIT License

Error while setting up uk_bin_collection platform for sensor - Gedling Borough Council #652

Closed sym0nd0 closed 1 month ago

sym0nd0 commented 4 months ago

Name of Council

Gedling Borough Council

Issue Information

I noticed that all four of my bin entities had started to show Unknown and whilst trying to fix it I uninstalled this custom component and reinstalled it, which is now resulting in the following error when trying to set up a Service for Gedling Borough Council.

The Service appears in HA, but no entities appear to be created.

Logger: homeassistant.components.sensor
Source: helpers/entity_platform.py:350
integration: Sensor (documentation, issues)
First occurred: 11:13:51 AM (1 occurrences)
Last logged: 11:13:51 AM

Error while setting up uk_bin_collection platform for sensor
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/helpers/update_coordinator.py", line 318, in _async_refresh
    self.data = await self._async_update_data()
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/config/custom_components/uk_bin_collection/sensor.py", line 133, in _async_update_data
    data = await self.hass.async_add_executor_job(self.ukbcd.run)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/uk_bin_collection/uk_bin_collection/collect_data.py", line 96, in run
    return self.client_code(
           ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/uk_bin_collection/uk_bin_collection/collect_data.py", line 115, in client_code
    return get_bin_data_class.template_method(address_url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/uk_bin_collection/uk_bin_collection/get_bin_data.py", line 78, in template_method
    bin_data_dict = self.parse_data(
                    ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/uk_bin_collection/uk_bin_collection/councils/GedlingBoroughCouncil.py", line 78, in parse_data
    bin_data = self.get_manual_data(bin_refuse_calendar, bin_garden_calendar)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/uk_bin_collection/uk_bin_collection/councils/GedlingBoroughCouncil.py", line 1578, in get_manual_data
    output["Garden Bin"] = raw_data["green"][garden]["Garden Bin"]
                           ~~~~~~~~~~~~~~~~~^^^^^^^^
KeyError: 'Garden%20Waste%20A-2023.pdf'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/helpers/entity_platform.py", line 350, in _async_setup_platform
    await asyncio.shield(awaitable)
  File "/config/custom_components/uk_bin_collection/sensor.py", line 75, in async_setup_entry
    await coordinator.async_config_entry_first_refresh()
  File "/usr/src/homeassistant/homeassistant/helpers/update_coordinator.py", line 290, in async_config_entry_first_refresh
    raise ex
homeassistant.exceptions.ConfigEntryNotReady: 'Garden%20Waste%20A-2023.pdf'

and also

This error originated from a custom integration.

Logger: custom_components.uk_bin_collection.sensor
Source: helpers/update_coordinator.py:318
integration: UK Bin Collection Data (documentation, issues)
First occurred: 11:13:51 AM (1 occurrences)
Last logged: 11:13:51 AM

Unexpected error fetching Home data: 'Garden%20Waste%20A-2023.pdf'
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/helpers/update_coordinator.py", line 318, in _async_refresh
    self.data = await self._async_update_data()
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/config/custom_components/uk_bin_collection/sensor.py", line 133, in _async_update_data
    data = await self.hass.async_add_executor_job(self.ukbcd.run)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/uk_bin_collection/uk_bin_collection/collect_data.py", line 96, in run
    return self.client_code(
           ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/uk_bin_collection/uk_bin_collection/collect_data.py", line 115, in client_code
    return get_bin_data_class.template_method(address_url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/uk_bin_collection/uk_bin_collection/get_bin_data.py", line 78, in template_method
    bin_data_dict = self.parse_data(
                    ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/uk_bin_collection/uk_bin_collection/councils/GedlingBoroughCouncil.py", line 78, in parse_data
    bin_data = self.get_manual_data(bin_refuse_calendar, bin_garden_calendar)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/uk_bin_collection/uk_bin_collection/councils/GedlingBoroughCouncil.py", line 1578, in get_manual_data
    output["Garden Bin"] = raw_data["green"][garden]["Garden Bin"]
                           ~~~~~~~~~~~~~~~~~^^^^^^^^
KeyError: 'Garden%20Waste%20A-2023.pdf'

I've attempted to look for any potential issue in GedlingBoroughCouncil.py, but my very limited skills with Python have failed me lol. The best I can guess at is that Garden Waste A-2023.pdf should now be Garden Waste A.pdf, but this is only based on being unable to find any trace of Garden Waste A-2023.pdf in the file. It's also been failing validation on Allure for this same reason since 13/03/2024.

Hope that's enough to find the issue. More than happy to test or try anything out that will help.

Thanks.


dp247 commented 4 months ago

Ah yeah, if I remember correctly, all those PDF names are hardcoded, so they must have been updated. Basically someone just needs to rewrite them all lol

sym0nd0 commented 4 months ago

I'm more than happy to update the data and submit a PR...I'm just not sure what I need to update and where (and worried I'll break something 😂).

I'm guessing it's GedlingBoroughCouncil.py but not sure if I should be adding -2024 to the end of this lot, as they don't currently have a year value

            {
                "Garden%20Waste%20A.pdf": {"Garden Bin": ["04/03/2024", "18/03/2024"]},
                "Garden%20Waste%20B.pdf": {"Garden Bin": ["05/03/2024", "19/03/2024"]},
                "Garden%20Waste%20C.pdf": {"Garden Bin": ["06/03/2024", "20/03/2024"]},
                "Garden%20Waste%20D.pdf": {"Garden Bin": ["07/03/2024", "21/03/2024"]},
                "Garden%20Waste%20E.pdf": {"Garden Bin": ["08/03/2024", "22/03/2024"]},
                "Garden%20Waste%20F.pdf": {"Garden Bin": ["11/03/2024", "25/03/2024"]},
                "Garden%20Waste%20G.pdf": {"Garden Bin": ["12/03/2024", "26/03/2024"]},
                "Garden%20Waste%20H.pdf": {"Garden Bin": ["13/03/2024", "27/03/2024"]},
                "Garden%20Waste%20I.pdf": {"Garden Bin": ["14/03/2024", "28/03/2024"]},
                "Garden%20Waste%20J.pdf": {
                    "Garden Bin": ["01/03/2024", "15/03/2024", "29/03/2024"]
                },

It does pull down the pdf from Gedling's site as Garden Waste A-2023.pdf though.

🤦🏼‍♂️ Or is it just those dates in that code block that need to be updated for each pdf for the new bin year?

Sorry, probably being very dumb. 😂
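The mismatch described here (the council site serving `Garden Waste A-2023.pdf` while the hardcoded table is keyed on `Garden%20Waste%20A.pdf`) can be sketched roughly as below. This is only an illustration: `GARDEN_DATES`, `normalise_pdf_name` and `lookup_garden_dates` are hypothetical names, not code from `GedlingBoroughCouncil.py`, and the table is a one-row excerpt of the snippet quoted above.

```python
import re

# Hypothetical one-row excerpt, mirroring the shape of the hardcoded table.
GARDEN_DATES = {
    "Garden%20Waste%20A.pdf": {"Garden Bin": ["04/03/2024", "18/03/2024"]},
}


def normalise_pdf_name(name: str) -> str:
    """Strip a trailing -YYYY year suffix that sits just before '.pdf'."""
    return re.sub(r"-\d{4}(?=\.pdf$)", "", name)


def lookup_garden_dates(name: str) -> list[str]:
    """Look up garden bin dates, tolerating a year suffix in the filename."""
    key = normalise_pdf_name(name)
    try:
        return GARDEN_DATES[key]["Garden Bin"]
    except KeyError:
        # Raise a clearer message than the bare filename seen in the traceback.
        raise KeyError(
            f"No hardcoded schedule for {name!r}; the table may need updating"
        ) from None


print(lookup_garden_dates("Garden%20Waste%20A-2023.pdf"))
```

A lookup like this would only hide the filename change; the dates themselves would still need refreshing for each new bin year.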

jamesmacwhite commented 2 months ago

With the publication of the PDF calendars converted to iCal under https://www.gbcbincalendars.co.uk/, I was thinking about the data as JSON and came across a library that converts iCal to JSON.

I tested it locally and it did produce JSON data from the original iCal file, but there may be issues or bugs I didn't see.

One option to remove the static data problem for this library could be to publish JSON endpoints for each schedule. The endpoint URLs wouldn't need to change each year, but their data would be updated from the origin source whenever the iCal data changes.

I don't know if anyone would be interested in updating this parser to leverage JSON endpoints instead?

Ultimately, I feel the format of iCal is probably the best format in terms of functionality and can be used outside of specific integrations in Home Assistant, but if anyone's interested in working with JSON data based off this, I'd be happy to publish JSON endpoints.

jamesmacwhite commented 2 months ago

If anyone is interested, JSON endpoints exist for all schedules now, if someone wanted to potentially parse the iCalendar data into a JSON API for other formats.

Monday:

https://www.gbcbincalendars.co.uk/json/gedling_borough_council_monday_g1_bin_schedule.json
https://www.gbcbincalendars.co.uk/json/gedling_borough_council_monday_g2_bin_schedule.json
https://www.gbcbincalendars.co.uk/json/gedling_borough_council_monday_g3_bin_schedule.json
https://www.gbcbincalendars.co.uk/json/gedling_borough_council_monday_g4_bin_schedule.json
https://www.gbcbincalendars.co.uk/json/gedling_borough_council_monday_a_garden_bin_schedule.json
https://www.gbcbincalendars.co.uk/json/gedling_borough_council_monday_f_garden_bin_schedule.json

Tuesday:

https://www.gbcbincalendars.co.uk/json/gedling_borough_council_tuesday_g1_bin_schedule.json
https://www.gbcbincalendars.co.uk/json/gedling_borough_council_tuesday_g2_bin_schedule.json
https://www.gbcbincalendars.co.uk/json/gedling_borough_council_tuesday_g3_bin_schedule.json
https://www.gbcbincalendars.co.uk/json/gedling_borough_council_tuesday_g4_bin_schedule.json
https://www.gbcbincalendars.co.uk/json/gedling_borough_council_tuesday_b_garden_bin_schedule.json
https://www.gbcbincalendars.co.uk/json/gedling_borough_council_tuesday_g_garden_bin_schedule.json

Wednesday:

https://www.gbcbincalendars.co.uk/json/gedling_borough_council_wednesday_g1_bin_schedule.json
https://www.gbcbincalendars.co.uk/json/gedling_borough_council_wednesday_g2_bin_schedule.json
https://www.gbcbincalendars.co.uk/json/gedling_borough_council_wednesday_g3_bin_schedule.json
https://www.gbcbincalendars.co.uk/json/gedling_borough_council_wednesday_g4_bin_schedule.json
https://www.gbcbincalendars.co.uk/json/gedling_borough_council_wednesday_c_garden_bin_schedule.json
https://www.gbcbincalendars.co.uk/json/gedling_borough_council_wednesday_h_garden_bin_schedule.json

Thursday:

https://www.gbcbincalendars.co.uk/json/gedling_borough_council_thursday_g1_bin_schedule.json
https://www.gbcbincalendars.co.uk/json/gedling_borough_council_thursday_g2_bin_schedule.json
https://www.gbcbincalendars.co.uk/json/gedling_borough_council_thursday_g3_bin_schedule.json
https://www.gbcbincalendars.co.uk/json/gedling_borough_council_thursday_g4_bin_schedule.json
https://www.gbcbincalendars.co.uk/json/gedling_borough_council_thursday_d_garden_bin_schedule.json
https://www.gbcbincalendars.co.uk/json/gedling_borough_council_thursday_i_garden_bin_schedule.json

Friday:

https://www.gbcbincalendars.co.uk/json/gedling_borough_council_friday_g1_bin_schedule.json
https://www.gbcbincalendars.co.uk/json/gedling_borough_council_friday_g2_bin_schedule.json
https://www.gbcbincalendars.co.uk/json/gedling_borough_council_friday_g3_bin_schedule.json
https://www.gbcbincalendars.co.uk/json/gedling_borough_council_friday_g4_bin_schedule.json
https://www.gbcbincalendars.co.uk/json/gedling_borough_council_friday_e_garden_bin_schedule.json
https://www.gbcbincalendars.co.uk/json/gedling_borough_council_friday_j_garden_bin_schedule.json

dp247 commented 2 months ago

This is great - we already have some methods that will generate schedules based on frequencies, which are provided in the RRULE of each collection

"RRULE": "FREQ=WEEKLY;WKST=SU;UNTIL=20241112;INTERVAL=4;BYDAY=TU",

We can just provide an option to add the correct JSON link as the URL, so easy winnings!
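
As a rough sketch of generating a schedule from the RRULE quoted above, using only the standard library: `expand_weekly_rrule` is a hypothetical name (not the project's existing helper), and the `dtstart` value is an assumption, since the event start date is not shown in this thread. It only handles `FREQ=WEEKLY` with a single `BYDAY` and ignores `WKST` and `EXDATE`.

```python
from datetime import date, datetime, timedelta

# iCalendar weekday codes mapped to Python's date.weekday() numbering.
WEEKDAYS = {"MO": 0, "TU": 1, "WE": 2, "TH": 3, "FR": 4, "SA": 5, "SU": 6}


def expand_weekly_rrule(rrule: str, dtstart: date) -> list[date]:
    """Expand a simple weekly RRULE into concrete dates (UNTIL is inclusive)."""
    parts = dict(item.split("=", 1) for item in rrule.split(";"))
    if parts.get("FREQ") != "WEEKLY":
        raise ValueError("only FREQ=WEEKLY is handled in this sketch")
    interval = int(parts.get("INTERVAL", "1"))
    until = datetime.strptime(parts["UNTIL"][:8], "%Y%m%d").date()
    weekday = WEEKDAYS[parts["BYDAY"]]

    # Move forward to the first matching weekday on or after dtstart.
    current = dtstart + timedelta(days=(weekday - dtstart.weekday()) % 7)
    dates = []
    while current <= until:
        dates.append(current)
        current += timedelta(weeks=interval)
    return dates


rule = "FREQ=WEEKLY;WKST=SU;UNTIL=20241112;INTERVAL=4;BYDAY=TU"
# Assumed start date (a Tuesday); the real DTSTART comes from the VEVENT.
for day in expand_weekly_rrule(rule, date(2024, 8, 20)):
    print(day.strftime("%d/%m/%Y"))
```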

(If no one wants to take it, I'll happily do this when I've got some time)

jamesmacwhite commented 2 months ago

@dp247 Awesome!

From reviewing the ical2json output, RRULE is present on the recurring date instances. However, when a changed collection day occurs, e.g. on a bank holiday, there will be a single instance of a VEVENT, as it's a one-off.

The Monday schedules will probably be the best test case for a mixture of this.

https://www.gbcbincalendars.co.uk/json/gedling_borough_council_monday_g1_bin_schedule.json

The JSON data is generated from the ical and kept in sync upon deployments i.e. if the ical file changes, the JSON will be generated and updated with it on a build.

jamesmacwhite commented 2 months ago

If it helps, I've used a wrapper library around ical.js, which can expand RRULE VEVENT instances into each individual occurrence, following the rules defined (e.g. EXDATE), in addition to one-off instances. This means it should require less parsing on this side to get the required data, as the JSON API now provides each collection date with the name.

I wanted to build a HTML front end to view the calendar data for each schedule anyway, as it's something that will be useful and also help cross check any data translated.

The JSON URL endpoints haven't changed but the format is now simply:

[
    {
        "name": "Black Bin Day",
        "type": "black-bin",
        "collectionDate": "2023-12-04",
        "isChangedCollection": false
    },
    {
        "name": "Green Bin Day",
        "type": "green-bin",
        "collectionDate": "2023-12-11",
        "isChangedCollection": false
    },
    {
        "name": "Black Bin Day (Changed Collection)",
        "type": "black-bin",
        "collectionDate": "2023-12-16",
        "isChangedCollection": true
    }
]
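
A minimal sketch of consuming this format, assuming the payload has already been fetched as a string. The field names come from the sample above; `group_collections` is a hypothetical helper, and the `DD/MM/YYYY` output simply mirrors the date style of the hardcoded table quoted earlier in the thread.

```python
import json
from collections import defaultdict
from datetime import datetime

# Sample payload copied from the format shown above.
SAMPLE = """
[
    {"name": "Black Bin Day", "type": "black-bin",
     "collectionDate": "2023-12-04", "isChangedCollection": false},
    {"name": "Green Bin Day", "type": "green-bin",
     "collectionDate": "2023-12-11", "isChangedCollection": false},
    {"name": "Black Bin Day (Changed Collection)", "type": "black-bin",
     "collectionDate": "2023-12-16", "isChangedCollection": true}
]
"""


def group_collections(payload: str) -> dict[str, list[str]]:
    """Group collection dates by bin type, in date order, as DD/MM/YYYY."""
    grouped = defaultdict(list)
    # ISO dates sort correctly as plain strings.
    for entry in sorted(json.loads(payload), key=lambda e: e["collectionDate"]):
        day = datetime.strptime(entry["collectionDate"], "%Y-%m-%d")
        grouped[entry["type"]].append(day.strftime("%d/%m/%Y"))
    return dict(grouped)


print(group_collections(SAMPLE))
```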

Happy to add anything else needed, the Jekyll site uses the JSON data internally also. I found Jekyll was horrible for doing any complex parsing of the original ical2json JSON, so ical.js does the heavy lifting.

jamesmacwhite commented 2 months ago

It's been good to leverage a more advanced iCal parsing library, as some of the original iCal data on a few schedules was wrong: the RRULE wasn't set to the right end date. The HTML format gives much better visibility of any issues, e.g. the collection data looking off or not following the proper recurring rule, as the main risk with human translation is, of course, the HUMAN! It's been corrected now at least!

I'm mostly happy with what's come of this project, which has turned into something bigger, but there are now HTML, iCal and JSON formats available, which hopefully helps open up more options for parsing Gedling's bin collection data.

For test cases if anyone is interested in incorporating this in the future.

# HTML calendar
https://www.gbcbincalendars.co.uk/collections/refuse/wednesday-g2

# iCal as local/subscribe URL
https://www.gbcbincalendars.co.uk/ical/gedling_borough_council_wednesday_g2_bin_schedule.ics

# JSON endpoint
https://www.gbcbincalendars.co.uk/json/gedling_borough_council_wednesday_g2_bin_schedule.json

The JSON API is possibly the best route, but open to other formats. The main factor is ensuring it is generated from the iCal data, as that's the origin format. At build time the HTML/JSON is generated from this to ensure it is consistent and always in sync.

dp247 commented 1 month ago

@jamesmacwhite: This is honestly amazing work! I've had a think and will probably integrate it like this:

I'd love to use the API you mentioned in #757, but because there's no set way to tie a house number to a schedule (council's fault, not yours of course), I think it might be easier and more effective to go for the user intervention route. Does that sound good?

jamesmacwhite commented 1 month ago

Happy for any use of the JSON data you see fit! The original plan was opening up the data from the horrible PDFs, and thanks to some great JS libraries, being able to parse iCal to JSON reliably provided a consistent format for other software use cases.

I ended up basing the identifier around the collection weekday and assigned schedule ID, e.g. G1 - G4 or A - J, because it's something that can be parsed out of the origin search results on the email subscribe URL and used as the key to tie the origin data to the iCal data, from which the JSON is then generated.
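
For the user intervention route, a sketch of turning a weekday and schedule ID into one of the JSON endpoints listed earlier in the thread. The URL pattern is inferred from those listed links; `build_schedule_url` is a hypothetical helper, and it does not check that a given garden letter actually belongs to the chosen weekday.

```python
import re

BASE_URL = "https://www.gbcbincalendars.co.uk/json"
VALID_DAYS = {"monday", "tuesday", "wednesday", "thursday", "friday"}


def build_schedule_url(weekday: str, schedule_id: str) -> str:
    """Build a JSON endpoint URL from a weekday and a schedule ID.

    G1-G4 are treated as refuse schedules and A-J as garden schedules,
    following the naming of the endpoints listed in this thread.
    """
    day = weekday.strip().lower()
    sid = schedule_id.strip().lower()
    if day not in VALID_DAYS:
        raise ValueError(f"unsupported weekday: {weekday!r}")
    if re.fullmatch(r"g[1-4]", sid):
        suffix = "bin_schedule"
    elif re.fullmatch(r"[a-j]", sid):
        suffix = "garden_bin_schedule"
    else:
        raise ValueError(f"unsupported schedule ID: {schedule_id!r}")
    return f"{BASE_URL}/gedling_borough_council_{day}_{sid}_{suffix}.json"


print(build_schedule_url("Wednesday", "G2"))
print(build_schedule_url("Friday", "J"))
```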