pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.62k stars 17.91k forks source link

TST: Failing test when built with Cython 3.0b2 #53125

Open h-vetinari opened 1 year ago

h-vetinari commented 1 year ago

Encountered in https://github.com/conda-forge/pandas-feedstock/pull/162 on linux when building pandas 2.0.1 against Cython 3.0.0b2 and running the test suite

=================================== FAILURES ===================================
_______________________ test_add_out_of_pydatetime_range _______________________
[gw0] linux -- Python 3.10.10 $PREFIX/bin/python

    @pytest.mark.xfail(is_numpy_dev, reason="result year is 1973, unclear why")
    def test_add_out_of_pydatetime_range():
        # GH#50348 don't raise in Timestamp.replace
        ts = Timestamp(np.datetime64("-20000-12-31"))
        off = YearEnd()

        result = ts + off
        expected = Timestamp(np.datetime64("-19999-12-31"))
>       assert result == expected
E       AssertionError: assert Timestamp('1973-12-31 00:00:00') == Timestamp('-19999-12-31 00:00:00')

Thankfully, it is the only error:

=========================== short test summary info ============================
FAILED ../_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold/lib/python3.9/site-packages/pandas/tests/tseries/offsets/test_year.py::test_add_out_of_pydatetime_range - AssertionError: assert Timestamp('1973-12-31 00:00:00') == Timestamp('-19999-12-31 00:00:00')
= 1 failed, 172055 passed, 23928 skipped, 912 xfailed, 9 xpassed, 445 warnings in 791.84s (0:13:11) =
Test environment ``` _libgcc_mutex: 0.1-conda_forge conda-forge _openmp_mutex: 4.5-2_gnu conda-forge attrs: 23.1.0-pyh71513ae_1 conda-forge backports.zoneinfo: 0.2.1-py39hf3d152e_7 conda-forge boto3: 1.26.129-pyhd8ed1ab_0 conda-forge botocore: 1.29.129-pyhd8ed1ab_0 conda-forge brotlipy: 0.7.0-py39hb9d737c_1005 conda-forge bzip2: 1.0.8-h7f98852_4 conda-forge ca-certificates: 2022.12.7-ha878542_0 conda-forge certifi: 2022.12.7-pyhd8ed1ab_0 conda-forge cffi: 1.15.1-py39he91dace_3 conda-forge click: 8.1.3-unix_pyhd8ed1ab_2 conda-forge colorama: 0.4.6-pyhd8ed1ab_0 conda-forge coverage: 7.2.5-py39hd1e30aa_0 conda-forge cryptography: 40.0.2-py39h079d5ae_0 conda-forge exceptiongroup: 1.1.1-pyhd8ed1ab_0 conda-forge execnet: 1.9.0-pyhd8ed1ab_0 conda-forge hypothesis: 6.75.2-pyha770c72_0 conda-forge idna: 3.4-pyhd8ed1ab_0 conda-forge importlib-metadata: 6.6.0-pyha770c72_0 conda-forge iniconfig: 2.0.0-pyhd8ed1ab_0 conda-forge jmespath: 1.0.1-pyhd8ed1ab_0 conda-forge ld_impl_linux-64: 2.40-h41732ed_0 conda-forge libblas: 3.9.0-16_linux64_openblas conda-forge libcblas: 3.9.0-16_linux64_openblas conda-forge libffi: 3.4.2-h7f98852_5 conda-forge libgcc-ng: 12.2.0-h65d4601_19 conda-forge libgfortran-ng: 12.2.0-h69a702a_19 conda-forge libgfortran5: 12.2.0-h337968e_19 conda-forge libgomp: 12.2.0-h65d4601_19 conda-forge liblapack: 3.9.0-16_linux64_openblas conda-forge libnsl: 2.0.0-h7f98852_0 conda-forge libopenblas: 0.3.21-pthreads_h78a6416_3 conda-forge libsqlite: 3.40.0-h753d276_1 conda-forge libstdcxx-ng: 12.2.0-h46fd767_19 conda-forge libuuid: 2.38.1-h0b41bf4_0 conda-forge libzlib: 1.2.13-h166bdaf_4 conda-forge ncurses: 6.3-h27087fc_1 conda-forge numpy: 1.24.3-py39h6183b62_0 conda-forge openssl: 3.1.0-hd590300_3 conda-forge packaging: 23.1-pyhd8ed1ab_0 conda-forge pandas: 2.0.1-py39hb9e473a_0 local pip: 23.1.2-pyhd8ed1ab_0 conda-forge pluggy: 1.0.0-pyhd8ed1ab_5 conda-forge psutil: 5.9.5-py39h72bdee0_0 conda-forge pycparser: 2.21-pyhd8ed1ab_0 conda-forge pyopenssl: 23.1.1-pyhd8ed1ab_0 conda-forge pysocks: 1.7.1-pyha2e5f31_6 conda-forge pytest: 7.3.1-pyhd8ed1ab_0 conda-forge pytest-asyncio: 0.21.0-pyhd8ed1ab_0 conda-forge pytest-cov: 4.0.0-pyhd8ed1ab_0 conda-forge pytest-xdist: 3.2.1-pyhd8ed1ab_0 conda-forge python: 3.9.16-h2782a2a_0_cpython conda-forge python-dateutil: 2.8.2-pyhd8ed1ab_0 conda-forge python-tzdata: 2023.3-pyhd8ed1ab_0 conda-forge python_abi: 3.9-3_cp39 conda-forge pytz: 2023.3-pyhd8ed1ab_0 conda-forge readline: 8.2-h8228510_1 conda-forge s3transfer: 0.6.1-pyhd8ed1ab_0 conda-forge setuptools: 67.7.2-pyhd8ed1ab_0 conda-forge six: 1.16.0-pyh6c4a22f_0 conda-forge sortedcontainers: 2.4.0-pyhd8ed1ab_0 conda-forge tk: 8.6.12-h27826a3_0 conda-forge toml: 0.10.2-pyhd8ed1ab_0 conda-forge tomli: 2.0.1-pyhd8ed1ab_0 conda-forge typing_extensions: 4.5.0-pyha770c72_0 conda-forge tzdata: 2023c-h71feb2d_0 conda-forge urllib3: 1.26.15-pyhd8ed1ab_0 conda-forge wheel: 0.40.0-pyhd8ed1ab_0 conda-forge xz: 5.2.6-h166bdaf_0 conda-forge zipp: 3.15.0-pyhd8ed1ab_0 conda-forge ```
h-vetinari commented 1 year ago

CC @lithomas1 (saw you worked on this in #51640) CC @matusvalo (who helped with Cython 3 compat for scipy)

h-vetinari commented 1 year ago

Windows has a couple more failures than unix:

=========================== short test summary info ===========================
FAILED ..\_test_env\lib\site-packages\pandas\tests\tseries\offsets\test_common.py::test_apply_out_of_range[tzlocal()-BQuarterEnd]
FAILED ..\_test_env\lib\site-packages\pandas\tests\tseries\offsets\test_common.py::test_apply_out_of_range[tzlocal()-BQuarterBegin]
FAILED ..\_test_env\lib\site-packages\pandas\tests\tseries\offsets\test_common.py::test_apply_out_of_range[tzlocal()-QuarterEnd]
FAILED ..\_test_env\lib\site-packages\pandas\tests\tseries\offsets\test_year.py::test_add_out_of_pydatetime_range
= 4 failed, 171914 passed, 24049 skipped, 912 xfailed, 8 xpassed, 445 warnings in 1371.60s (0:22:51) =

Failure for the first three looks like:


self = tzlocal(), dt = datetime.datetime(4507, 12, 1, 0, 0, tzinfo=tzlocal())

    def _naive_is_dst(self, dt):
        timestamp = _datetime_to_timestamp(dt)
>       return time.localtime(timestamp + time.timezone).tm_isdst
E       OSError: [Errno 22] Invalid argument

..\_test_env\lib\site-packages\dateutil\tz\tz.py:260: OSError
jbrockmendel commented 1 year ago

Same behavior we xfailed for npdev

matusvalo commented 1 year ago

@h-vetinari I have couple of questions:

h-vetinari commented 1 year ago

Was this issue introduced in Cython 3.0b2 or already present in Cython 3.0b1.

First time I'm testing it, no idea about the stat of 3.0.0b1, though we have builds of that in conda-forge and I could answer that if you want.

Have you tried cython 3 master?

Due to the separation of all the various build artefacts, it's hard to use Cython master (though not absolutely impossible, if something is urgent). Otherwise I'd retest with the next beta.

matusvalo commented 1 year ago

First time I'm testing it, no idea about the stat of 3.0.0b1, though we have builds of that in conda-forge and I could answer that if you want.

I did not went in deep with pandas and Cython3 compatibility. So It would be great to know which is last version of Cython without issue.

h-vetinari commented 1 year ago

So It would be great to know which is last version of Cython without issue.

On conda-forge infra, the latest released pandas (2.0.1) passes its test suite when compiled with the latest released Cython (0.29.34), see https://github.com/conda-forge/pandas-feedstock/pull/163

matusvalo commented 1 year ago

I double checked pandas with multiple cython versions:

Additional observation of pandas codebase compatibility with Cython 3.0:

pandas code base was not migrated to new noexcept keyword introduced in Cython beta 1. This should not be major problem in general - generated code will check for exception when dedicated return value occurs. But this can be issue mainly for functions returning void and inline functions because:

  1. void function will be checked everytime when function is called - this is most affecting nogil functions - see https://github.com/cython/cython/pull/5327
  2. inline function should be more performance critical, hence here I suppose we should avoid all overhead.
matusvalo commented 1 year ago

OK. I have found the culprit. The issue is here: https://github.com/pandas-dev/pandas/blob/a90fbc867ca9aeb66cc988f0afa2e0683364af6d/pandas/_libs/tslibs/offsets.pyx#L4540

stamp variables is declared as datetime and hence the Cython (in version 3) uses low level access to year attribute/property:

The simple reproducer is following:

from cpython.datetime cimport datetime

from pandas import Timestamp
import numpy as np

cdef datetime ts = Timestamp(np.datetime64("-20000-12-31"))
ts2 = Timestamp(np.datetime64("-20000-12-31"))
print(ts.year)
print(ts2.year)

Note: I was not investigating what is better/more correct behaviour for this use-case, neither I did not check if it is a bug in cython or not.

matusvalo commented 1 year ago

I played a bit with the issue. I am not sure whether the bug is on Cython side. Cython is using datetime API which should not support dates with year set to -2000:

>>> datetime(year=-2000, month=12, day=31)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: year -2000 is out of range
matusvalo commented 1 year ago

I have investigated another bit. Here is a function definition generated by Cython:

static CYTHON_INLINE int __pyx_f_7cpython_8datetime_8datetime_4year_year(PyDateTime_DateTime *__pyx_v_self) {
  int __pyx_r;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("year", 0);

  /* "cpython/datetime.pxd":112
 *         @property
 *         cdef inline int year(self):
 *             return PyDateTime_GET_YEAR(self)             # <<<<<<<<<<<<<<
 * 
 *         @property
 */
  __pyx_r = PyDateTime_GET_YEAR(((PyObject *)__pyx_v_self));
  goto __pyx_L0;

  /* "cpython/datetime.pxd":111
 *     ctypedef extern class datetime.datetime[object PyDateTime_DateTime]:
 *         @property
 *         cdef inline int year(self):             # <<<<<<<<<<<<<<
 *             return PyDateTime_GET_YEAR(self)
 * 
 */

  /* function exit code */
  __pyx_L0:;
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}

Basically it returns data directly from PyDateTime_GET_YEAR which is declared in python header file: https://github.com/python/cpython/blob/25db95d224d18fb7b7f53165aeaa87632b0230f2/Include/datetime.h#L122