vacanza / holidays

Generate and work with holidays in Python
https://pypi.org/project/holidays
MIT License

np.isin does not have the same effect as "in". #1920

Closed: Dortaj closed this issue 1 month ago

Dortaj commented 1 month ago

Bug Report

If we have an np.array of datetimes and want to check whether they exist in a holidays object, we cannot use np.isin, even though applying in to each element separately works fine. That means we have to resort to other tricks.

Expected Behavior

np.isin should have the same behavior as applying in on each element of a given np.array of datetimes.

Actual Behavior

Currently, np.isin does not recognize anything: it reports essentially none of the elements that exist in holidays. If we use "in" on each of these elements individually, the results are correct.

Steps to Reproduce the Problem

  1. Create an np.array of datetimes (preferably with some holidays in it)
  2. Use np.isin to check whether there are holidays in the given np.array.
  3. It never finds any.

Environment

arkid15r commented 1 month ago

Hi @Dortaj, thanks for raising this!

The holidays package doesn't provide any numpy-related functionality directly. It seems that the issue is at a higher level of abstraction. If you provide some code we might be able to help you better. Perhaps you would get better results if you used date instead of datetime objects (just my guess).

Meanwhile, here is an AI-generated response based on the issue text:

The issue arises because np.isin() converts both of its arguments to arrays and then compares their elements by direct equality. A holidays object is a dict-like collection rather than a sequence, so it is not expanded into its individual dates, and the more complex membership logic that the in operator applies to a holidays object is never used.

Expected Behavior

The expectation is that np.isin() should behave similarly to the in operator applied element-wise to check membership within a holidays object. This would allow you to check if any datetime objects in the numpy array are considered holidays.

Actual Behavior

np.isin() fails to recognize datetime objects as being in the holidays object, even when some of them are indeed holidays. The function returns False for all elements, indicating none of them are holidays, which is incorrect.

Steps to Reproduce

import numpy as np
import datetime
import holidays

# Create an np.array of datetime objects
date_array = np.array([
    datetime.datetime(2024, 12, 25),  # Christmas, should be a holiday
    datetime.datetime(2024, 12, 31),  # New Year's Eve, might be a holiday
    datetime.datetime(2024, 11, 1)    # A regular day, not a holiday
])

# Create a holidays object
us_holidays = holidays.UnitedStates(years=[2024])

# Attempt to use np.isin to check for holidays
result = np.isin(date_array, us_holidays)
print(result)  # Expecting [True, False, False], but will likely get [False, False, False]

Problem Explanation

np.isin() is likely failing because it's performing a straightforward equality check, which doesn't account for the way datetime objects need to be compared to the entries in the holidays object.

Workaround

You can achieve the desired behavior by checking membership element by element:

# Element-wise approach using a list comprehension (a Python-level loop, not a vectorized operation)
result = np.array([date in us_holidays for date in date_array])
print(result)  # This should return [True, False, False]

Environment Details

  • OS: macOS
  • Python version: 3.12
  • holidays version: 0.54

This is less a bug in numpy than a limitation of how np.isin() handles object-valued inputs such as datetime objects and dict-like collections. The list-comprehension workaround, or a vectorized alternative, should give you the correct results.
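
One possible vectorized alternative is sketched below. It is only an illustration under stated assumptions: a plain dict with made-up entries (holiday_map) stands in for the holidays object so the snippet runs without the holidays package, and both sides are truncated to day precision as datetime64[D] before calling np.isin. With a real holidays object, list(us_holidays.keys()) should be usable in place of the dict keys.

```python
import datetime

import numpy as np

# Hypothetical stand-in for a holidays object: a plain dict keyed by
# datetime.date. The entries are illustrative, not taken from the package.
holiday_map = {
    datetime.date(2024, 7, 4): "Independence Day",
    datetime.date(2024, 12, 25): "Christmas Day",
}

# Object array of datetime.datetime values, as in the issue.
date_array = np.array([
    datetime.datetime(2024, 12, 25),
    datetime.datetime(2024, 12, 31),
    datetime.datetime(2024, 11, 1),
])

# Convert to NumPy datetimes, then truncate to day precision so that any
# time-of-day component is ignored.
days = date_array.astype("datetime64[us]").astype("datetime64[D]")

# Turn the holiday dates (the dict keys) into a datetime64[D] array as well.
holiday_days = np.array(list(holiday_map.keys()), dtype="datetime64[D]")

# Both sides now share one dtype, so np.isin compares calendar days.
result = np.isin(days, holiday_days)
print(result)
```

Converting both sides to the same datetime64[D] dtype sidesteps the datetime-versus-date mismatch, at the cost of building the holiday array up front.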

KJhellico commented 1 month ago

According to docs,

element and test_elements are converted to arrays if they are not already

For correct conversion and comparison, you should use something like

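import datetime

import holidays
import numpy as np

# Use datetime.date (not datetime.datetime) so values compare equal to the holiday keys
date_array = np.array([
    datetime.date(2024, 12, 25),  # Christmas, should be a holiday
    datetime.date(2024, 12, 31),  # New Year's Eve, might be a holiday
    datetime.date(2024, 11, 1)    # A regular day, not a holiday
])
us_holidays = holidays.UnitedStates(years=[2024])
# Pass the holiday dates as a list; the dict-like object itself is not expanded by np.isin
result = np.isin(date_array, list(us_holidays.keys()))
print(result)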
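
To illustrate why the conversion matters, here is a minimal sketch that uses a plain dict (holiday_map, a hypothetical stand-in) instead of a real holidays object, on the assumption that a holidays object behaves like a dict keyed by datetime.date. It shows that NumPy wraps a mapping as a single object rather than expanding its keys, and that datetime.datetime and datetime.date values do not compare equal.

```python
import datetime

import numpy as np

# Hypothetical stand-in for a holidays object (assumed to be dict-like,
# keyed by datetime.date).
holiday_map = {datetime.date(2024, 12, 25): "Christmas Day"}

# A mapping is not a sequence, so NumPy wraps it whole in a 0-d object array
# instead of building an array of its keys.
wrapped = np.asarray(holiday_map)
print(wrapped.shape, wrapped.dtype)

dates = np.array([datetime.date(2024, 12, 25), datetime.date(2024, 11, 1)])

# Each date is compared with the dict itself, so nothing matches.
against_mapping = np.isin(dates, holiday_map)

# Passing the keys as a list compares each date with the holiday dates.
against_keys = np.isin(dates, list(holiday_map.keys()))
print(against_mapping, against_keys)

# A datetime.datetime never equals a datetime.date, even at midnight,
# which is why the array should hold date objects.
same = datetime.datetime(2024, 12, 25) == datetime.date(2024, 12, 25)
print(same)
```

The in operator works on a holidays object because it goes through the object's own membership check, whereas np.isin only sees whatever array the argument converts to.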