ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[air - preprocessor] "most frequent" SimpleImputer crashes #24494

Closed: xwjiang2010 closed this issue 2 years ago

xwjiang2010 commented 2 years ago

What happened + What you expected to happen

Fitting a chained preprocessor that includes a SimpleImputer with strategy="most_frequent" crashes with an IndexError during fit:
Traceback (most recent call last):
  File "chicago_taxi_fare.py", line 25, in <module>
    dataset_transformed = chained_pp.fit_transform(dataset)
  File "/Users/xwjiang/ray/python/ray/ml/preprocessors/chain.py", line 54, in fit_transform
    ds = preprocessor.fit_transform(ds)
  File "/Users/xwjiang/ray/python/ray/ml/preprocessor.py", line 96, in fit_transform
    self.fit(dataset)
  File "/Users/xwjiang/ray/python/ray/ml/preprocessor.py", line 81, in fit
    return self._fit(dataset)
  File "/Users/xwjiang/ray/python/ray/ml/preprocessors/imputer.py", line 55, in _fit
    self.stats_ = _get_most_frequent_values(dataset, *self.columns)
  File "/Users/xwjiang/ray/python/ray/ml/preprocessors/imputer.py", line 98, in _get_most_frequent_values
    final_counters[i] += col_value_counts
IndexError: list index out of range
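
For context, a rough sketch of how per-column "most frequent" statistics are typically aggregated across batches; the helper name and batch handling below are illustrative, not the actual ray.ml code. The traceback suggests the list of per-column counters went out of sync with the per-batch value counts (fewer counters than columns), which is exactly the shape of bug that raises IndexError on final_counters[i].

import pandas as pd
from collections import Counter
from typing import Dict, Iterable, List

def aggregate_most_frequent(batches: Iterable[pd.DataFrame], columns: List[str]) -> Dict[str, object]:
    # One Counter per requested column. An indexed list like this is what the
    # traceback's "final_counters[i] += col_value_counts" line operates on.
    counters = [Counter() for _ in columns]
    for batch in batches:
        for i, col in enumerate(columns):
            # value_counts() drops NaNs by default, which is what an imputer wants.
            counters[i].update(batch[col].value_counts().to_dict())
    # The most common value per column becomes the imputation statistic.
    return {col: counters[i].most_common(1)[0][0] for i, col in enumerate(columns)}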

Versions / Dependencies

master

Reproduction script

import ray
from ray.ml.train.integrations.tensorflow import TensorflowTrainer  # unused in this minimal repro
from ray.ml.preprocessors import BatchMapper, OneHotEncoder, Chain, SimpleImputer
import pandas as pd

taxi_data = pd.read_csv("https://raw.githubusercontent.com/tensorflow/tfx/master/tfx/examples/chicago_taxi_pipeline/data/simple/data.csv")
dataset = ray.data.from_pandas(taxi_data)

# Remove these columns as we don't have a bucketizer yet.
pp1 = BatchMapper(lambda x: x.drop(["dropoff_latitude", "dropoff_longitude", "pickup_latitude", "pickup_longitude"], axis=1))
pp2 = SimpleImputer(["pickup_census_tract", "dropoff_census_tract"], strategy="most_frequent")
pp3 = OneHotEncoder(columns=['trip_start_hour', 'trip_start_day', 'trip_start_month',
    'pickup_census_tract', 'dropoff_census_tract', 'pickup_community_area',
    'dropoff_community_area'])

chained_pp = Chain(pp1, pp2, pp3)
dataset_transformed = chained_pp.fit_transform(dataset)

Issue Severity

Medium: It is a significant difficulty but I can work around it.
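
A possible workaround while the imputer is broken (a sketch, not a fix for the preprocessor itself): fill the two census-tract columns with their modes in pandas before building the Dataset, then drop pp2 from the Chain.

import pandas as pd
import ray

taxi_data = pd.read_csv("https://raw.githubusercontent.com/tensorflow/tfx/master/tfx/examples/chicago_taxi_pipeline/data/simple/data.csv")

# Impute "most frequent" in pandas so the failing SimpleImputer step is not needed.
for col in ["pickup_census_tract", "dropoff_census_tract"]:
    mode = taxi_data[col].mode(dropna=True)
    if not mode.empty:
        taxi_data[col] = taxi_data[col].fillna(mode.iloc[0])

dataset = ray.data.from_pandas(taxi_data)  # then Chain(pp1, pp3) runs without pp2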

xwjiang2010 commented 2 years ago

@matthewdeng

Yard1 commented 2 years ago

Cannot repro on master, seems to be fixed already.
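
For anyone re-checking, a minimal sketch of the same code path on a tiny dataset, assuming the SimpleImputer API used in the reproduction script above (the import path later moved under ray.data.preprocessors):

import pandas as pd
import ray
from ray.ml.preprocessors import SimpleImputer

# Tiny frame with missing values in both columns the imputer should fill.
df = pd.DataFrame({"a": [1.0, 1.0, None, 2.0], "b": ["x", None, "x", "y"]})
ds = ray.data.from_pandas(df)

imputer = SimpleImputer(["a", "b"], strategy="most_frequent")
out = imputer.fit_transform(ds)
print(out.to_pandas())  # expect the missing entries replaced by 1.0 and "x"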