rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[QST] TypeError: Argument 'real' has incorrect type (expected numpy.ndarray, got ndarray) #16029

Closed blue-cat-whale closed 1 month ago

blue-cat-whale commented 1 month ago

The problem comes from line 62 of my script, inside the function my_func_single, where get_macd is called. Code:

import sys, os
import numpy as np
try:
    import cudf.pandas
    cudf.pandas.install()
except:
    print('cudf.pandas load failed')
from cudf.pandas.module_accelerator import disable_module_accelerator
import pandas as pd
from random import randint
from datetime import datetime, timedelta, date

from functools import partial
from concurrent.futures import ProcessPoolExecutor as Pool
from multiprocessing import set_start_method
import talib as tl

def data_generation(nRows: int):
################## unimportant, for reproducing purpose ###################
# This function generates the dataframe obj, which has 5 columns, and the data are sorted by WorkingDay and Minute ascendingly
    my_df = pd.DataFrame(data={'WorkingDay': ['2019-01-02', '2018-01-02', '2019-05-02', '2020-01-02', '2021-01-02'], 'name': ['albert', 'alex', 'alice', 'ben', 'bob'], 'Minute': ['09:00:00', '09:20:00', '08:00:00', '07:00:00', '09:30:00'], 'aaa': np.random.rand(5), 'bbb': np.random.rand(5)})
    my_df = pd.concat([my_df for i in range(int(nRows/5))], axis=0)
    my_df['WorkingDay'] = my_df['WorkingDay'].map(lambda x: (date(randint(2010,2020), randint(1,4), randint(1,5))).strftime('%Y-%m-%d'))
    my_df['Minute'] = np.random.permutation(my_df['Minute'].values)
    my_df = my_df.sort_values(by=['WorkingDay', 'Minute'], inplace=False).reset_index(drop=True,inplace=False)
    return my_df

def my_apply(df, bias: int, n_l: list):
    df_padding = None
    t_now = datetime.strptime(df['Minute'], '%H:%M:%S')
    for i in range(2):
        df_padding = pd.concat([df_padding,df],axis=1)
        df_padding = df_padding.T.reset_index(drop=True, inplace=False)
        df_padding.loc[df_padding.index[-1],'aaa'] = df['aaa'] + i
        df_padding.loc[df_padding.index[-1],'name'] = n_l[i]
        df_padding.loc[df_padding.index[-1],'bbb'] = df['bbb'] + bias
        t_now = t_now+timedelta(minutes=2)
        df_padding.loc[df_padding.index[-1],'Minute'] = t_now.strftime('%H:%M:%S')
        df_padding = df_padding.T
    return df_padding.transpose()

def get_macd(signal: pd.Series, ema1: int, ema2: int, dem: int):
# util function
    macd_dif, macd_dea, macd_bar = tl.MACD(signal.to_numpy(), ema1, ema2, dem)
    diff = signal.ewm(span=ema1).mean() - signal.ewm(span=ema2).mean()
    dea = diff.ewm(span=dem).mean()
    return 2 * (diff - dea)

def my_func_single(branchIndex: int):
    my_df = data_generation(20-5*branchIndex)
    my_df[['WorkingDay','name','Minute']] = my_df[['WorkingDay','name','Minute']].astype('string')
    name_list = ['a_albert', 'b_bob', 'c_chris', 'd_dave']
# data generated
    df_padding = my_df.apply(my_apply,axis=1,bias=branchIndex,n_l=name_list)
    df_padding = df_padding.T.dropna().reset_index(drop=True)
    df_padding = pd.concat([r for r in df_padding],axis=0).reset_index(drop=True)
# -------------------------- The problem comes from below ------------------------
    df_padding['aaa'] = get_macd(df_padding['aaa'],4,7,3)
    print(df_padding)
    return df_padding

def my_func():
    set_start_method('spawn')
    my_func_partial = partial(my_func_single)
    with Pool(max_workers=2) as pool:
        r = pool.map(my_func_partial, range(3))
    for obj in r:
        print('df has length: {}.'.format(obj.shape[0]))

def main():
    print('-------------------- program starts -----------------------')
    my_func()

if __name__ == '__main__':
    main()

Error:

concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib64/python3.11/concurrent/futures/process.py", line 256, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/concurrent/futures/process.py", line 205, in _process_chunk
    return [fn(*args) for args in chunk]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/concurrent/futures/process.py", line 205, in <listcomp>
    return [fn(*args) for args in chunk]
            ^^^^^^^^^
  File "/home/<user_name>/code/test_cuda/tt.py", line 62, in my_func_single
    df_padding['aaa'] = get_macd(df_padding['aaa'],4,7,3)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/<user_name>/code/test_cuda/tt.py", line 47, in get_macd
    macd_dif, macd_dea, macd_bar = tl.MACD(signal.to_numpy(), ema1, ema2, dem)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/share/.virtualenvs/cudf/lib64/python3.11/site-packages/talib/__init__.py", line 64, in wrapper
    result = func(*_args, **_kwds)
             ^^^^^^^^^^^^^^^^^^^^^
TypeError: Argument 'real' has incorrect type (expected numpy.ndarray, got ndarray)
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/<user_name>/code/test_cuda/tt.py", line 82, in <module>
    main()
    ^^^^^^
  File "/home/<user_name>/code/test_cuda/tt.py", line 78, in main
    my_func()
  File "/home/<user_name>/code/test_cuda/tt.py", line 72, in my_func
    for obj in r:
  File "/usr/lib64/python3.11/concurrent/futures/process.py", line 606, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/usr/lib64/python3.11/concurrent/futures/_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
TypeError: Argument 'real' has incorrect type (expected numpy.ndarray, got ndarray)

package version:

Name: cudf-cu12
Version: 24.6.0
Summary: cuDF - GPU Dataframe
Home-page:
Author: NVIDIA Corporation
Author-email:
License: Apache 2.0
Location: /usr/local/share/.virtualenvs/cudf/lib64/python3.11/site-packages
Requires: cachetools, cuda-python, cupy-cuda12x, fsspec, numba, numpy, nvtx, packaging, pandas, pyarrow, pynvjitlink-cu12, rich, rmm-cu12, typing_extensions
Required-by:

The relevant package in the above code:

(cudf) [root@localhost nn]# pip show ta-lib
Name: TA-Lib
Version: 0.4.28
Summary: Python wrapper for TA-Lib
Home-page: http://github.com/ta-lib/ta-lib-python
Author: John Benediktsson
Author-email: mrjbq7@gmail.com
License: BSD
Location: /usr/local/share/.virtualenvs/cudf/lib64/python3.11/site-packages
Requires: numpy
Required-by:

wence- commented 1 month ago

This is a consequence of not fully supporting wrapped numpy arrays when using cudf.pandas. See the feature issue here: https://github.com/rapidsai/cudf/issues/15397

The broad problem is that, because numpy has a C API, providing proxied objects like we do with pandas and cudf.pandas is more difficult.

As a consequence, although when you call to_numpy() on a cudf.pandas Series you get something that behaves like a numpy array, we do not answer "yes" to the question isinstance(numpy_like_thing, np.ndarray).
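
A minimal sketch of that situation, using a hypothetical ArrayProxy class as a stand-in (this is not the actual cudf.pandas proxy type, only an illustration of the same pattern): an object can behave like an array and still fail an isinstance check against np.ndarray, while np.asarray recovers a genuine ndarray through the object's `__array__` method.

```python
import numpy as np

class ArrayProxy:
    """Hypothetical stand-in for a proxied array; NOT the real cudf.pandas proxy."""

    def __init__(self, data):
        self._wrapped = np.array(data, dtype=np.float64)

    def __array__(self, dtype=None, copy=None):
        # np.asarray() can use this method to obtain a genuine ndarray.
        arr = self._wrapped
        if dtype is not None:
            arr = arr.astype(dtype, copy=False)
        if copy:
            arr = arr.copy()
        return arr

    # A couple of array-like conveniences, forwarded to the wrapped array.
    @property
    def shape(self):
        return self._wrapped.shape

    def mean(self):
        return self._wrapped.mean()

proxy = ArrayProxy([1.0, 2.0, 3.0])

# Behaves like an array for ordinary Python-level use...
looks_like_array = bool(proxy.shape == (3,) and proxy.mean() == 2.0)
# ...but is not an ndarray instance, which is what a strict type check tests.
passes_type_check = isinstance(proxy, np.ndarray)
# np.asarray converts it into a real ndarray.
real = np.asarray(proxy)

print(looks_like_array, passes_type_check, type(real) is np.ndarray)
```

A compiled extension that declares its argument as numpy.ndarray performs a strict type check of this kind, which is consistent with the TypeError reported above.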

Matt711 commented 1 month ago

I'm going to leave this open for a little while as I look into #14537

blue-cat-whale commented 1 month ago

The function requires a numpy array. Is it possible to make small changes to my code such that we can bypass that problem (still using the tl.MACD function while cudf active)?

wence- commented 1 month ago

With cudf.pandas activated, you can obtain a real numpy array from a Series by using numpy.asarray rather than to_numpy(). This works seamlessly with real pandas Series objects as well.

So try:

macd_dif, macd_dea, macd_bar = tl.MACD(np.asarray(signal), ema1, ema2, dem)

One thing to note is that you should treat the view you get back from np.asarray as a read-only view. There are some circumstances in which we can propagate writes back to the original fast-slow signal object, but many in which we can't (that needs #15397 to be solved properly).

However, if you're just doing this to pass the series as input to a third-party library, the above should work well.
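
If the array does need to be modified afterwards, one way to respect the read-only caveat is to take an explicit copy first. The sketch below is only an illustration: detached_float_array is a hypothetical helper name, and the float64 cast is an assumption that suits numeric indicator inputs. Writes to the copy never reach the original object, so nothing depends on whether write propagation is supported.

```python
import numpy as np

def detached_float_array(signal):
    """Return an independent, writable float64 ndarray built from `signal`.

    Hypothetical helper for illustration; not part of cudf, pandas, or TA-Lib.
    """
    # np.asarray yields a base-class ndarray (via __array__ for non-ndarray inputs);
    # np.array(..., copy=True) then detaches the result from the source buffer.
    return np.array(np.asarray(signal), dtype=np.float64, copy=True)

source = np.array([1.0, 2.0, 3.0])
scratch = detached_float_array(source)
scratch[0] = 99.0  # modifies only the copy, never the source

print(source[0], scratch[0])
```

Inside get_macd, the same helper could be applied to signal before the result is handed to tl.MACD, at the cost of one extra copy per call.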