Traceback (most recent call last):
  File "chicago_taxi_fare.py", line 25, in <module>
    dataset_transformed = chained_pp.fit_transform(dataset)
  File "/Users/xwjiang/ray/python/ray/ml/preprocessors/chain.py", line 54, in fit_transform
    ds = preprocessor.fit_transform(ds)
  File "/Users/xwjiang/ray/python/ray/ml/preprocessor.py", line 96, in fit_transform
    self.fit(dataset)
  File "/Users/xwjiang/ray/python/ray/ml/preprocessor.py", line 81, in fit
    return self._fit(dataset)
  File "/Users/xwjiang/ray/python/ray/ml/preprocessors/imputer.py", line 55, in _fit
    self.stats_ = _get_most_frequent_values(dataset, *self.columns)
  File "/Users/xwjiang/ray/python/ray/ml/preprocessors/imputer.py", line 98, in _get_most_frequent_values
    final_counters[i] += col_value_counts
IndexError: list index out of range
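The failing line, `final_counters[i] += col_value_counts`, indexes a list by position, so the error means `i` went past the end of `final_counters`. The sketch below is a hypothetical, standard-library-only illustration of how that can happen; it is not the code in `imputer.py`. It assumes one counter per column and assumes that a single batch can carry more per-column count entries than there are columns (for example, if the results of several blocks end up in one batch). The helper names `merge_counts_by_position` and `merge_counts_by_name` are invented for this example.

```python
from collections import Counter

# Column names taken from the reproduction script in this report.
columns = ["pickup_census_tract", "dropoff_census_tract"]


def merge_counts_by_position(batches, columns):
    """Hypothetical positional merge: entry i of a batch goes to counter i."""
    final_counters = [Counter() for _ in columns]
    for batch in batches:
        for i, col_value_counts in enumerate(batch):
            # Raises IndexError as soon as a batch has more entries than columns.
            final_counters[i].update(col_value_counts)
    return final_counters


def merge_counts_by_name(batches):
    """Hypothetical alternative: key each entry by column name, not position."""
    totals = {}
    for batch in batches:
        for column, col_value_counts in batch:
            totals.setdefault(column, Counter()).update(col_value_counts)
    return totals


# Two batches, each with exactly one entry per column: the positional merge works.
well_formed = [
    [{"a": 2}, {"x": 1}],
    [{"a": 1, "b": 3}, {"x": 4}],
]
positional_totals = merge_counts_by_position(well_formed, columns)

# One batch holding the entries of two blocks: four entries for two columns.
oversized = [[{"a": 2}, {"x": 1}, {"a": 1, "b": 3}, {"x": 4}]]
try:
    merge_counts_by_position(oversized, columns)
    positional_failed = False
except IndexError:
    positional_failed = True

# The same data, tagged with column names, merges regardless of batch shape.
named = [[
    ("pickup_census_tract", {"a": 2}),
    ("dropoff_census_tract", {"x": 1}),
    ("pickup_census_tract", {"a": 1, "b": 3}),
    ("dropoff_census_tract", {"x": 4}),
]]
named_totals = merge_counts_by_name(named)
```

Whether this is the actual cause in `_get_most_frequent_values` has not been confirmed here; the sketch only shows that a positional merge is sensitive to how batches are formed, while a name-keyed merge is not.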
Versions / Dependencies
master
Reproduction script
import ray
from ray.ml.train.integrations.tensorflow import TensorflowTrainer
from ray.ml.preprocessors import BatchMapper, OneHotEncoder, Chain, SimpleImputer
import pandas as pd
taxi_data = pd.read_csv("https://raw.githubusercontent.com/tensorflow/tfx/master/tfx/examples/chicago_taxi_pipeline/data/simple/data.csv")
dataset = ray.data.from_pandas(taxi_data)
# Remove these columns because we don't have a bucketizer yet.
pp1 = BatchMapper(lambda x: x.drop(["dropoff_latitude", "dropoff_longitude", "pickup_latitude", "pickup_longitude"], axis=1))
pp2 = SimpleImputer(["pickup_census_tract", "dropoff_census_tract"], strategy="most_frequent")
pp3 = OneHotEncoder(columns=['trip_start_hour', 'trip_start_day', 'trip_start_month',
                             'pickup_census_tract', 'dropoff_census_tract', 'pickup_community_area',
                             'dropoff_community_area'])
chained_pp = Chain(pp1, pp2, pp3)
dataset_transformed = chained_pp.fit_transform(dataset)
Issue Severity
Medium: It is a significant difficulty but I can work around it.
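One possible workaround, sketched here under assumptions, is to fill the missing values with plain pandas before the data reaches the preprocessor chain, so that `SimpleImputer` can be dropped from the `Chain`. The helper `impute_most_frequent` is a hypothetical name introduced for this example, and the small frame below stands in for the real taxi data.

```python
import pandas as pd


def impute_most_frequent(df, columns):
    """Return a copy of df with NaNs in `columns` replaced by each column's mode."""
    out = df.copy()
    for col in columns:
        modes = out[col].mode(dropna=True)
        # mode() returns an empty Series when the column is entirely NaN;
        # in that case the column is left unchanged.
        if not modes.empty:
            out[col] = out[col].fillna(modes.iloc[0])
    return out


# Tiny stand-in frame with the two columns imputed in the reproduction script.
sample = pd.DataFrame(
    {
        "pickup_census_tract": [1.0, None, 1.0, 2.0],
        "dropoff_census_tract": [None, 5.0, 5.0, None],
    }
)
imputed = impute_most_frequent(
    sample, ["pickup_census_tract", "dropoff_census_tract"]
)
```

In the reproduction script this could be applied to `taxi_data` before `ray.data.from_pandas(taxi_data)`. Doing it on the full frame, rather than inside a `BatchMapper`, keeps the most frequent value global instead of per batch.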