[BUG] ARIMA "Missing values are accepted, represented by NaN" But which NaN?

singlecheeze commented 1 year ago

The docs state that NaN is allowed in the input array: https://docs.rapids.ai/api/cuml/stable/api.html#arima

import numpy as np
import cudf
import cupy as cp
from cuml.tsa.arima import ARIMA

array = cudf.DataFrame()
array[0] = [0, 1, 2, 3, 4]
for n in [np.NaN, np.nan, np.NAN, cudf.NA, None, cp.nan]:
    try:
        array[1] = [n, 1, 2, 3, 4]

        model = ARIMA(
            array,
            order=(1, 1, 1),
            seasonal_order=(0, 0, 0, 0),
            fit_intercept=True
        )

        print(f"{type(n)}{n} worked!")
    except Exception:
        print(f"{type(n)}{n} didn't work")

Output:

<class 'float'>nan didn't work
<class 'float'>nan didn't work
<class 'float'>nan didn't work
<class 'pandas._libs.missing.NAType'><NA> didn't work
<class 'NoneType'>None didn't work
<class 'float'>nan didn't work

Traceback in each case:

Traceback (most recent call last):
  File "/home/dave/anaconda3/envs/rapids-22.10/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/dave/anaconda3/envs/rapids-22.10/lib/python3.9/site-packages/cuml/internals/api_decorators.py", line 360, in inner
    return func(*args, **kwargs)
  File "/home/dave/anaconda3/envs/rapids-22.10/lib/python3.9/site-packages/cuml/common/input_utils.py", line 309, in input_to_cuml_array
    X = convert_dtype(X,
  File "/home/dave/anaconda3/envs/rapids-22.10/lib/python3.9/site-packages/cuml/internals/api_decorators.py", line 360, in inner
    return func(*args, **kwargs)
  File "/home/dave/anaconda3/envs/rapids-22.10/lib/python3.9/site-packages/cuml/common/input_utils.py", line 578, in convert_dtype
    would_lose_info = _typecast_will_lose_information(X, to_dtype)
  File "/home/dave/anaconda3/envs/rapids-22.10/lib/python3.9/site-packages/cuml/common/input_utils.py", line 630, in _typecast_will_lose_information
    X_m = X.values
  File "/home/dave/anaconda3/envs/rapids-22.10/lib/python3.9/site-packages/cudf/core/frame.py", line 429, in values
    return self.to_cupy()
  File "/home/dave/anaconda3/envs/rapids-22.10/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/dave/anaconda3/envs/rapids-22.10/lib/python3.9/site-packages/cudf/core/frame.py", line 529, in to_cupy
    return self._to_array(
  File "/home/dave/anaconda3/envs/rapids-22.10/lib/python3.9/site-packages/cudf/core/frame.py", line 494, in _to_array
    matrix[:, i] = get_column_values_na(col)
  File "/home/dave/anaconda3/envs/rapids-22.10/lib/python3.9/site-packages/cudf/core/frame.py", line 473, in get_column_values_na
    return get_column_values(col)
  File "/home/dave/anaconda3/envs/rapids-22.10/lib/python3.9/site-packages/cudf/core/frame.py", line 532, in <lambda>
    else (lambda col: col.values),
  File "/home/dave/anaconda3/envs/rapids-22.10/lib/python3.9/site-packages/cudf/core/column/column.py", line 175, in values
    raise ValueError("Column must have no nulls.")
ValueError: Column must have no nulls.

singlecheeze commented 1 year ago

It seems CuPy arrays work fine (maybe this belongs in the cuDF bug tracker?):

import numpy as np
import cupy as cp
from cuml.tsa.arima import ARIMA

for n in [np.NaN, np.nan, np.NAN, cp.nan]:
    try:
        array = cp.array([n, 1, 2, 3, 4])

        model = ARIMA(
            array,
            order=(1, 1, 1),
            # simple_differencing=False
        )

        print(f"{type(n)}{n} worked!")
    except Exception:
        print(f"{type(n)}{n} didn't work")

Output:

[W] [23:30:36.795767] Missing observations detected. Forcing simple_differencing=False
<class 'float'>nan worked!
[W] [23:30:36.796889] Missing observations detected. Forcing simple_differencing=False
<class 'float'>nan worked!
[W] [23:30:36.797940] Missing observations detected. Forcing simple_differencing=False
<class 'float'>nan worked!
[W] [23:30:36.798966] Missing observations detected. Forcing simple_differencing=False
<class 'float'>nan worked!

beckernick commented 1 year ago

The underlying issue you're hitting is that nulls are not NaNs. In cuDF, missing values are "null" by default (as in the newer pandas nullable dtypes). This is common in columnar data representations, but less common in array representations. By default, cuDF only gives you a NaN when a computation genuinely produces one (such as taking the square root of a negative number).
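For anyone who wants to see the null/NaN distinction without a GPU, here is a minimal sketch using the pandas nullable dtypes mentioned above. It is only an analogy: it assumes pandas and NumPy are installed, and it says nothing about cuDF internals.

```python
import numpy as np
import pandas as pd

# A nullable column: the missing entry is a null (<NA>), not a float value.
nullable = pd.array([None, 1.0, 2.0], dtype="Float64")
first_is_null = nullable[0] is pd.NA

# A genuine NaN is an ordinary floating-point value produced by arithmetic.
with np.errstate(invalid="ignore"):
    genuine_nan = np.sqrt(np.float64(-1.0))

# Converting to a plain float array requires choosing what the null becomes.
as_floats = nullable.to_numpy(dtype="float64", na_value=np.nan)

print(first_is_null, np.isnan(genuine_nan), as_floats)
```

A plain float array has no slot for "null", so the conversion has to map it to some float value; NaN is the usual choice.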

When we call .values under the hood, we're converting from cuDF to a CuPy array. CuPy doesn't understand nulls, so we prohibit the conversion. Depending on how you're creating your data, you can force the desired behavior with something like:

s = cudf.Series([np.nan, 1, 2, 3, 4], nan_as_null=False)

This parameter is also available in cudf.from_pandas.

singlecheeze commented 1 year ago

Thank you for the timely response, @beckernick!

I'll close this if it will let me and leave this link that might be helpful for others: https://docs.rapids.ai/api/cudf/stable/user_guide/missing-data.html

rapidsai / cuml

[BUG] ARIMA "Missing values are accepted, represented by NaN" But which NaN? #4967