vacanza / python-holidays

Generate and work with holidays in Python
https://pypi.org/project/holidays
MIT License
1.43k stars 457 forks source link

Holiday categories: inconsistent behaviour and ghost categories 🌫️ #1559

Closed tpvasconcelos closed 7 months ago

tpvasconcelos commented 10 months ago

Hey holiday folks!

While exploring this package, I came across a few unexpected and inconsistent behaviours regarding holiday categories. I hope you can shed some light on where I'm going wrong.

Strange behaviour 1

While all entities can declare a set of supported_categories, this is never validated, enforced, or otherwise used anywhere else in the codebase.

$ find . -name '*.py' | xargs grep -nr 'supported_categories'
./holiday_base.py:232:    supported_categories: Set[str] = set()
./countries/south_korea.py:53:    supported_categories = {BANK, PUBLIC}
./countries/bulgaria.py:49:    supported_categories = {PUBLIC, SCHOOL}
./countries/austria.py:22:    supported_categories = {BANK, PUBLIC}
./countries/canada.py:46:    supported_categories = {GOVERNMENT, OPTIONAL, PUBLIC}
./countries/greece.py:32:    supported_categories = {HALF_DAY, PUBLIC}
./countries/indonesia.py:53:    supported_categories = {GOVERNMENT, PUBLIC}
./countries/slovakia.py:29:    supported_categories = {PUBLIC, WORKDAY}
./countries/thailand.py:139:    supported_categories = {ARMED_FORCES, BANK, GOVERNMENT, PUBLIC, SCHOOL, WORKDAY}
./countries/japan.py:34:    supported_categories = {BANK, PUBLIC}
./countries/china.py:65:    supported_categories = {PUBLIC, HALF_DAY}
./countries/laos.py:67:    supported_categories = {BANK, PUBLIC, SCHOOL, WORKDAY}
./countries/brazil.py:61:    supported_categories = {OPTIONAL, PUBLIC}
./countries/liechtenstein.py:30:    supported_categories = {BANK, PUBLIC}
./countries/timor_leste.py:44:    supported_categories = {GOVERNMENT, PUBLIC, WORKDAY}
./countries/uruguay.py:40:    supported_categories = {BANK, PUBLIC}
./countries/portugal.py:50:    supported_categories = {OPTIONAL, PUBLIC}
./countries/tanzania.py:61:    supported_categories = {BANK, PUBLIC}
./countries/belgium.py:30:    supported_categories = {BANK, PUBLIC}

For instance, should this not be checked here:

https://github.com/vacanza/python-holidays/blob/daa4b804a342368b21477636f7c04ca7d86088aa/holidays/holiday_base.py#L320-L324

e.g.,

    unsupported_categories = self.categories - self.supported_categories
    if unsupported_categories:
        raise ValueError(f"Categories are not supported: {unsupported_categories}.")

Strange behaviour 2

The categories field has to be a tuple but when left empty defaults to the {PUBLIC} set. On the other hand, the supported_categories attribute is an empty set by default. So, by default, you always query PUBLIC holidays from an entity with no supported categories.

Should this be standardized such that both categories and supported_categories are sets (or just iterables) and both default to {PUBLIC} (such that, by default, all entities support PUBLIC holidays)?

Strange behaviour 3.1

Sometimes... you can query unsupported categories from entities that do specify supported_categories

>>> holidays.country_holidays("BR", subdiv="TO", years=2023).supported_categories
{'optional', 'public'}
>>> holidays.country_holidays("BR", subdiv="TO", years=2023, categories=("islamic",))
{datetime.date(2023, 3, 18): 'Dia da Autonomia', datetime.date(2023, 9, 8): 'Nossa Senhora da Natividade', datetime.date(2023, 10, 5): 'Criação do Estado'}

I believe this happens because of the logic defined in the base _add_subdiv_holidays() method which tries to register _add_subdiv_to_islamic_holidays() (doesn't register anything in this case) and _add_subdiv_to_holidays() which is called no matter the values the categories tuple.

I'm not sure what the solution should be here but it looks like _add_subdiv_{subdiv}_holidays should default to _add_subdiv_{subdiv}_public_holidays to respect the category filters. I believe this would also be consistent with my suggestion in "Strange behaviour 2".

This becomes even more unintuitive when the user gets different results for different categories:

>>> len(holidays.country_holidays("BR", subdiv="TO", years=2023, categories=("optional",)))
10
>>> len(holidays.country_holidays("BR", subdiv="TO", years=2023, categories=("public",)))
12
>>> len(holidays.country_holidays("BR", subdiv="TO", years=2023, categories=("islamic",)))
3

which would lead the user to believe that these 3 holidays were islamic but they are in-fact "categoryless" or "pan-categorical" since they always get returned no matter the categories passed

>>> len(holidays.country_holidays("BR", subdiv="TO", years=2023, categories=("islamic", "christian", "government")))
3

Strange behaviour 3.2

Other times... you can only query categories that are defined in supported_categories.

>>> # Portugal only supports 'public' and 'optional' holidays
>>> holidays.country_holidays("PT").supported_categories
{'optional', 'public'}

>>> # As expected, not filtering is the same as filtering only on `PUBLIC`
>>> holidays.country_holidays("PT", years=2020) == holidays.country_holidays("PT", years=2020, categories=("public",))
True
>>> holidays.country_holidays("PT", years=2020) == holidays.country_holidays("PT", years=2020, categories=("public", 'optional'))
False

>>> # Filtering on 'optional' also works ✅ 
>>> holidays.country_holidays("PT", years=2020, categories=('optional',))
{datetime.date(2020, 2, 24): 'Carnaval', datetime.date(2020, 6, 13): 'Dia de Santo António', ...

However, this time filtering on non-existing categories returned an empty calendar
>>> holidays.country_holidays("PT", years=2020, categories=("bank",))
{}
>>> holidays.country_holidays("PT", years=2020, categories=("bank", "christian", "government"))
{}

Which happens because Portugal explicitly marks all holidays as either OPTIONAL or PUBLIC (i.e., there are no "categoryless" holidays).

https://github.com/vacanza/python-holidays/blob/daa4b804a342368b21477636f7c04ca7d86088aa/holidays/countries/portugal.py#L194-L208

Possible solutions

In order to make the category filtering behaviour more intuitive and internally consistent, I can only think of the following two options:

  1. First-class support for "categoryless" holidays. The user should be able to control whether these are returned or not. If this is desired, maybe the sentinel value of category=None should be used for this special case.
  2. If you don't want to support "categoryless" holidays, this should be implicitly converted to PUBLIC holidays.

Ideally, this snippet would work exactly the same if the user set categories = holidays.ALL_CATEGORIES.

country_alpha2 = "BR"
subdiv = "TO"
categories = holidays.country_holidays(country_alpha2, subdiv=subdiv).supported_categories
all_holidays = [
    {"category": cat, "date": date, "name": name}
    for cat in categories
    for date, name in holidays.country_holidays(
        country=country_alpha2,
        categories=(cat,),
        years=2023,
        language="en_US",
        subdiv=subdiv,
    ).items()
]

👆 Alternatively, you could also raise a ValueError for unsupported categories as suggested in "Strange behaviour 1".


Sorry for the long post... I hope this helps :)

KJhellico commented 10 months ago

Hi, @tpvasconcelos! Thank you for a very clear and detailed issue description!

For instance, should this not be checked here:

Yes, we will most likely do that. 👍

Should this be standardized such that both categories and supported_categories are sets (or just iterables)

categories made as tuple in order to be immutable, to avoid the following situations.

I'm not sure what the solution should be here but it looks like _addsubdiv{subdiv}_holidays should default to _addsubdiv{subdiv}_public_holidays to respect the category filters.

Exactly! Fixed in #1562.

Which happens because Portugal explicitly marks all holidays as either OPTIONAL or PUBLIC (i.e., there are no "categoryless" holidays).

This is how it should be everywhere. Brazil was simply "lost". 🙂

arkid15r commented 10 months ago

Hi @tpvasconcelos, thanks for posting this!

The PH categories support is quite recent addition to the library and is still WIP. I appreciate the meaningful feedback you've provided. This will definitely help us improve the end-user experience.

It has already been addressed partially in today's v0.37 release (https://github.com/vacanza/python-holidays/pull/1562 by @KJhellico) and I expect the rest to be sorted out in v0.38.

tpvasconcelos commented 9 months ago

Thanks for the quick responses @arkid15r and @KJhellico!

This behaviour can also be seen with holiday subdivisions. However, in this case, I guess that the default behaviour does make sense. i.e., holidays with no explicit subdivision should implicitly apply to all country subdivisions.

That said, I still think that this behaviour does not make sense for categories. If you agree that holidays.PUBLIC should be the default category (which already happens for the categories search field), then we could work towards enforcing it.

Option 1 (explicit)

The explicit approach would be to not allow adding categoryless holidays, forcing the user to explicit mark them as PUBLIC

For instance, this line would be removed:

https://github.com/vacanza/python-holidays/blob/4eed2610c58efb6c967b234672157bbe51ecf0f5/holidays/holiday_base.py#L656

The all instances of _add_subdiv_{subdiv}_holidays() would have to be refactored to _add_subdiv_{subdiv}_public_holidays().

A big downside of this approach is that it introduces a big regression and will likely break lots of user code. If this route is considered, them maybe an initial step towards this would be to add a deprecation warning to _add_subdiv_{subdiv}_holidays() usages. And, in a later version, raise a ValueError and only accept _add_subdiv_{subdiv}_{category}_holidays() definitions.

Option 2 (implicit)

This option avoids breaking lots of user code but would probably warrant careful documentation of the behaviours making it very clear that "public" is treated as the default category throughout the package. I think that this approach would require changing the following lines:

https://github.com/vacanza/python-holidays/blob/4eed2610c58efb6c967b234672157bbe51ecf0f5/holidays/holiday_base.py#L656-L659

to:

method_names = []
if PUBLIC in self.categories:
    method_names.append(f"_add_subdiv_{subdiv}_holidays")

for category in sorted(self.categories):
    method_names.append(f"_add_subdiv_{subdiv}_{category}_holidays")

If you wish, I would be happy to help out 👍

KJhellico commented 9 months ago

That said, I still think that this behaviour does not make sense for categories. If you agree that holidays.PUBLIC should be the default category (which already happens for the categories search field), then we could work towards enforcing it.

That's exactly what we planned - PUBLIC as category supported by default. And over time, all countries will switch to this format (with _populate_xxxx_holidays() instead of _populate()). Perhaps this time has already come...

A big downside of this approach is that it introduces a big regression and will likely break lots of user code.

They are the internal methods of HolidayBase and its descendants, and user code should not call them or depend on them.

tpvasconcelos commented 9 months ago

They are the internal methods of HolidayBase and its descendants, and user code should not call them or depend on them

Do you mean that users should only interact with _populate_{category}_holidays() and never use the _add_subdiv_{subdiv}_{category}_holidays() etc variants?

tpvasconcelos commented 9 months ago

I initially assumed that these methods were free game for users of the holidays package to define custom Holiday classes. I can see now that only _populate() is documented in your docs:

image

tpvasconcelos commented 9 months ago

It would be great to support the creation of arbitrary holidays. This might be a new issue altogether but it doesn't seem like it would be hard to implement. Take, for instance, the example where you want to create a custom holiday class for your company's holidays:

class MyCompanyHolidays(holidays.HolidayBase):
    supported_categories = {"ALL", "CONTRACT", "EXTERNAL"}
    subdivisions = ("EMEA", "APLA", "NA")

    # For everyone in EMEA
    def _add_subdiv_emea_all_holidays(self):
        self._add_holiday_jan_1("New Year")

    # Only for full-time employees in North America
    def _add_subdiv_na_contract_holidays(self):
        self._add_holiday_feb_2("Mental health day")

    # This should fail instead of defaulting to PUBLIC
    def _add_subdiv_apla_holidays(self):
        self._add_holiday_mar_3("A holiday for everyone in APLA")

    # For all external workers (across all subdivisions)
    def _populate_external_holidays(self):
        self._add_holiday_jul_4("Forced time off")

I believe that the only thing in the way of allowing a user to create this is again the unknown_categories check (which is straightforward to fix)

https://github.com/vacanza/python-holidays/blob/daa4b804a342368b21477636f7c04ca7d86088aa/holidays/holiday_base.py#L320-L324

arkid15r commented 7 months ago

Hi @tpvasconcelos, it's great to have such a detailed and valuable feedback from time to time. We've created several PRs and significantly improved the code quality based on your ideas.

Taking into account your interest I'd like to mention that this year our goal is to release v1 of the library. We could definitely use suggestions on what features would be useful to implement from users' perspective. Any input is appreciated!