rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG]User defined functions are currently not supported on Series with dtypes str and category #10722

Closed WonderingWJ closed 5 months ago

WonderingWJ commented 2 years ago

Describe the bug: In cudf 21.12.00a+293.g0930f712e6, the code below raises TypeError: User defined functions are currently not supported on Series with dtypes str and category. In cudf 21.08.03, the same code runs successfully.

Steps/Code to reproduce bug

# df_total is the reporter's cudf DataFrame (not shown); np is numpy (import numpy as np).
df_total['shift_timestamp'] = df_total.groupby('passenger_id')['bubble_timestamp'].shift(1).to_arrow()

def kernel(bubble_timestamp, shift_timestamp, second_diff):
    # Row-wise difference between each timestamp and the shifted one.
    for i, (x, y) in enumerate(zip(bubble_timestamp, shift_timestamp)):
        second_diff[i] = x - y

df_total = df_total.apply_rows(
    kernel,
    incols=["bubble_timestamp", "shift_timestamp"],
    outcols=dict(second_diff=np.int64),
    kwargs={},
)
df_total['second_diff'].fillna(0, inplace=True)

Expected behavior: The code runs successfully, as it does in cudf 21.08.03.

Environment details (output of the cudf/print_env.sh script):

     DISTRIB_ID=Ubuntu
     DISTRIB_RELEASE=20.04
     DISTRIB_CODENAME=focal
     DISTRIB_DESCRIPTION="Ubuntu 20.04.3 LTS"
     NAME="Ubuntu"
     VERSION="20.04.3 LTS (Focal Fossa)"
     ID=ubuntu
     ID_LIKE=debian
     PRETTY_NAME="Ubuntu 20.04.3 LTS"
     VERSION_ID="20.04"
     HOME_URL="https://www.ubuntu.com/"
     SUPPORT_URL="https://help.ubuntu.com/"
     BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
     PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
     VERSION_CODENAME=focal
     UBUNTU_CODENAME=focal
     Linux 1d14d3e7c968 4.15.0-96-generic #97-Ubuntu SMP Wed Apr 1 03:25:46 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

     ***GPU Information***
     Sun Apr 24 07:24:15 2022
     +-----------------------------------------------------------------------------+
     | NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.6     |
     |-------------------------------+----------------------+----------------------+
     | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
     | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
     |                               |                      |               MIG M. |
     |===============================+======================+======================|
     |   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
     | N/A   31C    P0    42W / 300W |      3MiB / 32510MiB |      0%      Default |
     |                               |                      |                  N/A |
     +-------------------------------+----------------------+----------------------+
     |   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
     | N/A   33C    P0    43W / 300W |      3MiB / 32510MiB |      0%      Default |
     |                               |                      |                  N/A |
     +-------------------------------+----------------------+----------------------+
     |   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
     | N/A   32C    P0    42W / 300W |      3MiB / 32510MiB |      0%      Default |
     |                               |                      |                  N/A |
     +-------------------------------+----------------------+----------------------+
     |   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
     | N/A   30C    P0    42W / 300W |      3MiB / 32510MiB |      0%      Default |
     |                               |                      |                  N/A |
     +-------------------------------+----------------------+----------------------+
     |   4  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
     | N/A   31C    P0    43W / 300W |      3MiB / 32510MiB |      0%      Default |
     |                               |                      |                  N/A |
     +-------------------------------+----------------------+----------------------+
     |   5  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
     | N/A   33C    P0    43W / 300W |      3MiB / 32510MiB |      0%      Default |
     |                               |                      |                  N/A |
     +-------------------------------+----------------------+----------------------+
     |   6  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
     | N/A   34C    P0    44W / 300W |      3MiB / 32510MiB |      0%      Default |
     |                               |                      |                  N/A |
     +-------------------------------+----------------------+----------------------+
     |   7  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
     | N/A   32C    P0    42W / 300W |      3MiB / 32510MiB |      0%      Default |
     |                               |                      |                  N/A |
     +-------------------------------+----------------------+----------------------+

     +-----------------------------------------------------------------------------+
     | Processes:                                                                  |
     |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
     |        ID   ID                                                   Usage      |
     |=============================================================================|
     |  No running processes found                                                 |
     +-----------------------------------------------------------------------------+

     ***CPU***
     Architecture:                    x86_64
     CPU op-mode(s):                  32-bit, 64-bit
     Byte Order:                      Little Endian
     Address sizes:                   46 bits physical, 48 bits virtual
     CPU(s):                          80
     On-line CPU(s) list:             0-79
     Thread(s) per core:              2
     Core(s) per socket:              20
     Socket(s):                       2
     NUMA node(s):                    2
     Vendor ID:                       GenuineIntel
     CPU family:                      6
     Model:                           79
     Model name:                      Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
     Stepping:                        1
     CPU MHz:                         3443.694
     CPU max MHz:                     3600.0000
     CPU min MHz:                     1200.0000
     BogoMIPS:                        4390.07
     Virtualization:                  VT-x
     L1d cache:                       1.3 MiB
     L1i cache:                       1.3 MiB
     L2 cache:                        10 MiB
     L3 cache:                        100 MiB
     NUMA node0 CPU(s):               0-19,40-59
     NUMA node1 CPU(s):               20-39,60-79
     Vulnerability Itlb multihit:     KVM: Vulnerable
     Vulnerability L1tf:              Mitigation; PTE Inversion; VMX vulnerable
     Vulnerability Mds:               Vulnerable; SMT vulnerable
     Vulnerability Meltdown:          Vulnerable
     Vulnerability Spec store bypass: Vulnerable
     Vulnerability Spectre v1:        Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
     Vulnerability Spectre v2:        Vulnerable, IBPB: disabled, STIBP: disabled
     Vulnerability Tsx async abort:   Vulnerable


brandon-b-miller commented 2 years ago

Hi @WonderingWJ, can you provide some information about df_total? In particular, the dtypes of the columns would help; you can get them with print(df_total.dtypes).

WonderingWJ commented 2 years ago

There are a lot of columns in df_total. Here is the output of print(df_total.dtypes):

passenger_id                  object
bubbling_id                   object
bubble_time           datetime64[us]
is_send                        int64
is_finish                     object
                           ...
bubble_minute                  int16
bubble_second                  int16
bubble_time_period             int64
minute_to_period               int64
is_workday                     int64
Length: 92, dtype: object

Which column's dtype do you want to know?

brandon-b-miller commented 2 years ago

Thanks @WonderingWJ, could you provide the dtype of the column named 'bubble_timestamp'?

WonderingWJ commented 2 years ago

int64
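
For reference, the dtype check discussed in this exchange can be narrowed to just the columns passed as incols. The following is a minimal sketch, not part of cudf: the helper name unsupported_udf_columns and the set of dtype names are assumptions based on the wording of the error message (str and category). It takes any mapping from column name to dtype, so it should also accept the result of df_total.dtypes, since it only indexes by label and calls str() on each entry. The dtype of shift_timestamp was not reported in this thread, so the example leaves it out.

```python
# Hypothetical helper, not a cudf API: list the UDF input columns whose dtype
# name matches the "str and category" wording of the error message.
# The set of names below is an assumption, not taken from the cudf source.
UNSUPPORTED_DTYPE_NAMES = {"object", "str", "string", "category"}


def unsupported_udf_columns(dtypes, incols):
    """Return the columns in `incols` whose dtype name looks unsupported."""
    return [col for col in incols if str(dtypes[col]) in UNSUPPORTED_DTYPE_NAMES]


# Dtypes as reported in this thread (shift_timestamp is unknown, so omitted).
dtypes = {
    "passenger_id": "object",
    "bubble_time": "datetime64[us]",
    "is_send": "int64",
    "bubble_timestamp": "int64",
}

print(unsupported_udf_columns(dtypes, ["bubble_timestamp"]))
print(unsupported_udf_columns(dtypes, ["passenger_id", "bubble_timestamp"]))
```

With the dtypes reported here, bubble_timestamp passes the check, while an object column such as passenger_id is flagged if it is passed as an input column.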

brandon-b-miller commented 2 years ago

Thanks - looking into this now.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

vyasr commented 5 months ago

Without the original data it is hard to say for sure, but as best as I can tell the original snippet no longer fails using some toy data that matches the data types discussed above:

import cudf
import numpy as np

df_total = cudf.DataFrame({
    'passenger_id': [1, 1, 1, 2, 2, 2],
    'bubble_timestamp': [1, 2, 3, 1, 2, 3],
    'shift_timestamp': [0, 1, 2, 0, 1, 2],
})

df_total['shift_timestamp'] = df_total.groupby('passenger_id')['bubble_timestamp'].shift(1).to_arrow()

def kernel(bubble_timestamp, shift_timestamp, second_diff):
    for i, (x, y) in enumerate(zip(bubble_timestamp, shift_timestamp)):
        second_diff[i] = x - y

df_total = df_total.apply_rows(
    kernel,
    incols=["bubble_timestamp", "shift_timestamp"],
    outcols=dict(second_diff=np.int64),
    kwargs={},
)
df_total['second_diff'].fillna(0, inplace=True)

The broader issue raised in the title (UDFs not supporting some types) is being tracked more holistically in other issues (such as #9639).
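
To check the result of the toy snippet above without a GPU, the computation it appears intended to perform can be sketched in plain Python: for each passenger, take the difference between each timestamp and the previous one, and use 0 for the first row of each group (the role played by shift(1) followed by fillna(0)). The function name second_diff_reference is hypothetical, and this sketch says nothing about how apply_rows itself handles null inputs.

```python
# Pure-Python reference for the intended result; no cudf required.
# second_diff_reference is a hypothetical name used only for illustration.
def second_diff_reference(passenger_ids, timestamps):
    """Per-passenger difference from the previous timestamp; 0 for the first row of each group."""
    last_seen = {}  # passenger_id -> most recent timestamp seen so far
    diffs = []
    for pid, ts in zip(passenger_ids, timestamps):
        prev = last_seen.get(pid)
        # The first row of a group has no previous value, so it is filled with 0.
        diffs.append(0 if prev is None else ts - prev)
        last_seen[pid] = ts
    return diffs


# Same toy data as in the snippet above.
passenger_ids = [1, 1, 1, 2, 2, 2]
timestamps = [1, 2, 3, 1, 2, 3]
print(second_diff_reference(passenger_ids, timestamps))  # [0, 1, 1, 0, 1, 1]
```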