Closed WonderingWJ closed 5 months ago
Hi @WonderingWJ, can you provide some information about df_total? In particular the dtypes of the columns, which can be found using print(df_total.dtypes).
There are a lot of columns in df_total. Output of print(df_total.dtypes):
passenger_id object
bubbling_id object
bubble_time datetime64[us]
is_send int64
is_finish object
...
bubble_minute int16
bubble_second int16
bubble_time_period int64
minute_to_period int64
is_workday int64
Length: 92, dtype: object
Which column's dtype do you want to know?
Thanks @WonderingWJ, could you provide the dtype of the column named 'bubble_timestamp'?
int64
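Since the error reported in this issue mentions str and category dtypes, one plausible first check is to list which columns carry those dtypes. The sketch below is only an illustration: the helper name unsupported_udf_columns and the hand-written sample mapping are hypothetical, and it assumes string columns report their dtype as "object", as in the listing above.

```python
# Dtype names associated with the error message in this issue.
# Assumption: string columns report their dtype as "object".
UNSUPPORTED_UDF_DTYPES = {"object", "category"}


def unsupported_udf_columns(dtypes):
    """Return the column names whose dtype name is str-like or category.

    `dtypes` is any mapping of column name -> dtype (or dtype name),
    for example dict(df_total.dtypes.items()).
    """
    return [
        name
        for name, dtype in dtypes.items()
        if str(dtype) in UNSUPPORTED_UDF_DTYPES
    ]


# A few of the columns from the listing above, written out by hand.
sample_dtypes = {
    "passenger_id": "object",
    "bubble_time": "datetime64[us]",
    "is_send": "int64",
    "is_finish": "object",
    "bubble_timestamp": "int64",
}

print(unsupported_udf_columns(sample_dtypes))
```

On a real frame, passing dict(df_total.dtypes.items()) instead of the sample mapping should give the full list of candidate columns.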
Thanks - looking into this now.
Without the original data it is hard to say for sure, but as best as I can tell the original snippet no longer fails using some toy data that matches the data types discussed above:
import cudf
import numpy as np
df_total = cudf.DataFrame({
    'passenger_id': [1, 1, 1, 2, 2, 2],
    'bubble_timestamp': [1, 2, 3, 1, 2, 3],
    'shift_timestamp': [0, 1, 2, 0, 1, 2],
})
df_total['shift_timestamp'] = df_total.groupby('passenger_id')['bubble_timestamp'].shift(1).to_arrow()

def kernel(bubble_timestamp, shift_timestamp, second_diff):
    for i, (x, y) in enumerate(zip(bubble_timestamp, shift_timestamp)):
        second_diff[i] = x - y

df_total = df_total.apply_rows(
    kernel,
    incols=["bubble_timestamp", "shift_timestamp"],
    outcols=dict(second_diff=np.int64),
    kwargs={},
)
df_total['second_diff'].fillna(0, inplace=True)
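For checking the result without a GPU, the same computation can be sketched in plain Python. This is a CPU reference only, not how cudf implements it; the function name grouped_second_diff is hypothetical, and it assumes the intended result is 0 for the first row of each passenger (the rows where the shifted value is missing and fillna(0) applies).

```python
def grouped_second_diff(passenger_ids, timestamps):
    """Difference between each timestamp and the previous one for the
    same passenger, in row order; 0 when there is no previous row."""
    last_seen = {}  # passenger id -> most recent timestamp seen so far
    diffs = []
    for pid, ts in zip(passenger_ids, timestamps):
        prev = last_seen.get(pid)
        diffs.append(0 if prev is None else ts - prev)
        last_seen[pid] = ts
    return diffs


# Same toy data as the snippet above.
passenger_ids = [1, 1, 1, 2, 2, 2]
bubble_timestamps = [1, 2, 3, 1, 2, 3]

print(grouped_second_diff(passenger_ids, bubble_timestamps))
```

Comparing this list against df_total['second_diff'] from the GPU run is one way to confirm the kernel behaves as expected on the toy data.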
The broader issue raised in the title (UDFs not supporting some types) is being tracked more holistically in other issues (such as #9639).
Describe the bug
In cudf 21.12.00a+293.g0930f712e6, the following error is raised: TypeError: User defined functions are currently not supported on Series with dtypes str and category. In cudf 21.08.03, the code below runs successfully.
Steps/Code to reproduce bug
Follow this guide http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports to craft a minimal bug report. This helps us reproduce the issue you're having and resolve the issue more quickly.
Expected behavior
The code runs successfully.