uber / petastorm

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
Apache License 2.0
1.78k stars 285 forks

in_set predicate raises error unhashable type: 'Series' #773

Open Joachim-Sh opened 2 years ago

Joachim-Sh commented 2 years ago

The in_set predicate raises TypeError: unhashable type: 'Series' when used with make_batch_reader and make_petastorm_dataset. I am using pandas 1.3.5. See below for a minimal working example.

import pandas as pd
from petastorm.predicates import in_set
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

output_url = 'file:///tmp/hello_world_dataset'
hello_world = pd.DataFrame({'id': list(range(100))})
hello_world.to_parquet(output_url)

predicate_id = in_set([1, 2, 3, 4, 5], 'id')
with make_batch_reader(output_url, num_epochs=1, workers_count=1, predicate=predicate_id) as reader:
    ds = make_petastorm_dataset(reader)
    train_values = list(ds.as_numpy_iterator())

For me, the issue is resolved by applying the in operator elementwise in the do_include method of the predicates.in_set predicate:

def do_include(self, values):
    def apply_elementwise(value):
        return value in self._inclusion_values
    return values[self._predicate_field].apply(apply_elementwise)

Instead of testing membership on the whole Series at once:

def do_include(self, values):
    return values[self._predicate_field] in self._inclusion_values
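To illustrate the failure mode outside petastorm: a set membership test (x in some_set) hashes x, and pandas.Series is deliberately unhashable, so the original do_include raises TypeError as soon as a Series is passed in. The elementwise apply avoids the hash; Series.isin would be a vectorized alternative. This is a standalone sketch, not petastorm code; the variable names (values, inclusion_values) are chosen to mirror the predicate above.

```python
import pandas as pd

# Mirror the predicate's inputs: a batch of rows and the inclusion set.
values = pd.DataFrame({'id': [1, 7, 3]})
inclusion_values = frozenset({1, 2, 3})

# Broken form: `in` against a set hashes the Series, which is unhashable.
try:
    values['id'] in inclusion_values
except TypeError as e:
    print(e)  # unhashable type: 'Series'

# Elementwise fix, as proposed above: test each value individually.
mask = values['id'].apply(lambda v: v in inclusion_values)
print(mask.tolist())  # [True, False, True]

# Vectorized alternative with the same result.
print(values['id'].isin(inclusion_values).tolist())  # [True, False, True]
```

Either form returns a boolean mask per row, which is what the reader needs to filter batches.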