pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Bad freq invalidation in DatetimeIndex.where #24555

Open TomAugspurger opened 5 years ago

TomAugspurger commented 5 years ago

What's the expected output here?

In [16]: i = pd.date_range('20130101', periods=3, tz='US/Eastern')

In [17]: i2 = pd.Index([pd.NaT, pd.NaT] + i[2:].tolist())

In [18]: i.where(pd.notna(i2), i2)
Out[18]: DatetimeIndex(['NaT', 'NaT', '2013-01-03 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq='D')

The returned DatetimeIndex (bound to result below) doesn't pass freq validation.

In [23]: result._eadata._validate_frequency(result, result.freq)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/sandbox/pandas-alt/pandas/core/arrays/datetimelike.py in _validate_frequency(cls, index, freq, **kwargs)
    863                                           periods=len(index), freq=freq,
--> 864                                           **kwargs)
    865             if not np.array_equal(index.asi8, on_freq.asi8):

~/sandbox/pandas-alt/pandas/core/arrays/datetimes.py in _generate_range(cls, start, end, periods, freq, tz, normalize, ambiguous, nonexistent, closed)
    299         if start is NaT or end is NaT:
--> 300             raise ValueError("Neither `start` nor `end` can be NaT")
    301

ValueError: Neither `start` nor `end` can be NaT

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-23-24fa3f452eb0> in <module>
----> 1 result._eadata._validate_frequency(result, result.freq)

~/sandbox/pandas-alt/pandas/core/arrays/datetimelike.py in _validate_frequency(cls, index, freq, **kwargs)
    877             raise ValueError('Inferred frequency {infer} from passed values '
    878                              'does not conform to passed frequency {passed}'
--> 879                              .format(infer=inferred, passed=freq.freqstr))
    880
    881     # monotonicity/uniqueness properties are called via frequencies.infer_freq,

ValueError: Inferred frequency None from passed values does not conform to passed frequency D

Should the freq be None?
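
For anyone who wants to reproduce the check without the private _validate_frequency call, here is a rough sketch using only public API. It is not the pandas implementation: conforms_to_freq is a hypothetical helper, and the boolean mask is passed to where directly instead of going through i2.

```python
import numpy as np
import pandas as pd


def conforms_to_freq(index, freq):
    """Hypothetical helper: True if `index` equals a range regenerated from `freq`."""
    if len(index) == 0:
        return True
    if index.hasnans:
        # date_range cannot start from NaT, and an index containing NaT
        # cannot match a regenerated range anyway.
        return False
    expected = pd.date_range(start=index[0], periods=len(index), freq=freq)
    return bool((index == expected).all())


i = pd.date_range('20130101', periods=3, tz='US/Eastern')
mask = np.array([False, False, True])  # same values as pd.notna(i2) above
masked = i.where(mask)                 # positions where mask is False become NaT

original_ok = conforms_to_freq(i, 'D')
masked_ok = conforms_to_freq(masked, 'D')
```

Under this check the original index conforms to 'D' and the masked one does not, whatever freq attribute the masked result happens to carry.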

TomAugspurger commented 5 years ago

Another one.

In [16]: idx = pd.date_range('2014-01-02', '2014-04-30', freq='M', tz='UTC')

In [17]: result = idx.tz_convert("US/Eastern")

In [18]: result
Out[18]:
DatetimeIndex(['2014-01-30 19:00:00-05:00', '2014-02-27 19:00:00-05:00',
               '2014-03-30 20:00:00-04:00', '2014-04-29 20:00:00-04:00'],
              dtype='datetime64[ns, US/Eastern]', freq='M')

In [19]: result._eadata._validate_frequency(result, result.freq)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/sandbox/pandas-alt/pandas/core/arrays/datetimelike.py in _validate_frequency(cls, index, freq, **kwargs)
    913             if not np.array_equal(index.asi8, on_freq.asi8):
--> 914                 raise ValueError
    915         except ValueError as e:

ValueError:

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-19-24fa3f452eb0> in <module>
----> 1 result._eadata._validate_frequency(result, result.freq)

~/sandbox/pandas-alt/pandas/core/arrays/datetimelike.py in _validate_frequency(cls, index, freq, **kwargs)
    925             raise ValueError('Inferred frequency {infer} from passed values '
    926                              'does not conform to passed frequency {passed}'
--> 927                              .format(infer=inferred, passed=freq.freqstr))
    928
    929     # monotonicity/uniqueness properties are called via frequencies.infer_freq,

ValueError: Inferred frequency None from passed values does not conform to passed frequency M

Though perhaps there's a bug in the freq validation around DST boundaries? Maybe not. Here's the same range generated directly in US/Eastern:

In [36]: pd.date_range('2014-01-02', '2014-04-30', freq='M', tz='US/Eastern')
Out[36]:
DatetimeIndex(['2014-01-31 00:00:00-05:00', '2014-02-28 00:00:00-05:00',
               '2014-03-31 00:00:00-04:00', '2014-04-30 00:00:00-04:00'],
              dtype='datetime64[ns, US/Eastern]', freq='M')

So should tz_convert invalidate the freq?
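
A small sketch of why the converted values can't be month ends in the new zone, using public API only. It assumes a pd.offsets.MonthEnd() object in place of the 'M' alias (the alias spelling varies between pandas versions); the variable names are illustrative.

```python
import pandas as pd

month_end = pd.offsets.MonthEnd()  # offset object used in place of the 'M' alias

utc_idx = pd.date_range('2014-01-02', '2014-04-30', freq=month_end, tz='UTC')
converted = utc_idx.tz_convert('US/Eastern')
native = pd.date_range('2014-01-02', '2014-04-30', freq=month_end, tz='US/Eastern')

# tz_convert keeps the same instants in time...
same_instants = bool((converted == utc_idx).all())
# ...but those instants are not the local month ends in US/Eastern.
matches_native = bool((converted == native).all())

# Per-element check: does each timestamp fall on a month-end date in its own zone?
converted_on_offset = all(month_end.is_on_offset(ts) for ts in converted)
native_on_offset = all(month_end.is_on_offset(ts) for ts in native)
```

The converted index holds the same instants as the UTC one, but midnight UTC on the last day of a month falls on the previous evening in US/Eastern, so none of its elements sit on a local month end.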

TomAugspurger commented 5 years ago

One more. In this case we seem to generate an array from bdate_range that doesn't have a valid freq (not sure if the bug is in the generation or the freq validation, probably the validation).

START = pd.Timestamp(2009, 3, 13)
END1 = pd.Timestamp(2009, 3, 18)
END2 = pd.Timestamp(2009, 3, 19)

freq = 'CBH'
a = pd.bdate_range(START, END1, freq=freq, weekmask='Mon Wed Fri',
                   holidays=['2009-03-14'])
b = pd.bdate_range(START, END2, freq=freq, weekmask='Mon Wed Fri',
                   holidays=['2009-03-14'])

a._eadata._validate_frequency(a, a.freq)
b._eadata._validate_frequency(b, b.freq)

a validates fine, but b doesn't:

In [44]: b._eadata._validate_frequency(b, b.freq)
    ...:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/sandbox/pandas-alt/pandas/core/arrays/datetimelike.py in _validate_frequency(cls, index, freq, **kwargs)
    913             if not np.array_equal(index.asi8, on_freq.asi8):
--> 914                 raise ValueError
    915         except ValueError as e:

ValueError:

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-44-2b6f5f040d09> in <module>
----> 1 b._eadata._validate_frequency(b, b.freq)

~/sandbox/pandas-alt/pandas/core/arrays/datetimelike.py in _validate_frequency(cls, index, freq, **kwargs)
    925             raise ValueError('Inferred frequency {infer} from passed values '
    926                              'does not conform to passed frequency {passed}'
--> 927                              .format(infer=inferred, passed=freq.freqstr))
    928
    929     # monotonicity/uniqueness properties are called via frequencies.infer_freq,

ValueError: Inferred frequency None from passed values does not conform to passed frequency CBH

In the freq validation for b we generate an on_freq with the wrong(?) number of periods:

ipdb> len(on_freq)
16
ipdb> len(index)
24

TomAugspurger commented 5 years ago

Do we have a policy on when an operation that might invalidate a freq should infer vs. just set it to None? For example, in DatetimeIndex.where we could either do _shallow_copy(freq=None) or _shallow_copy_with_infer.
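
To make the two options concrete with public API (the _shallow_copy variants are internal), the DatetimeIndex constructor offers roughly the same choice: leave freq unset, or pass freq='infer'. A sketch on made-up data, not a proposal for how where should be implemented:

```python
import pandas as pd

base = pd.date_range('2013-01-01', periods=6, freq='D')
values = base.to_numpy()[::2]  # plain datetime64 array holding every other day

# Option 1: drop the freq. Nothing is checked or derived from the values.
dropped = pd.DatetimeIndex(values)

# Option 2: re-derive the freq from the values at construction time.
inferred = pd.DatetimeIndex(values, freq='infer')

same_values = bool((dropped == inferred).all())
```

Both results hold identical values. The difference is that inference has to inspect the data to find a freq, while dropping it does not.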

TomAugspurger commented 5 years ago

I think that a fix for these issues (invalidating the freq in places where needed, maybe fixing some bugs in the current freq validation) and a fix for https://github.com/pandas-dev/pandas/issues/24562 will make it possible to enable freq validation in DatetimeArray.__init__.

jbrockmendel commented 4 years ago

I think [the OP example, not the others] was fixed by a semi-recent PR that implemented DTI/TDI.where and always sets the resulting freq to None.