The Petastorm library enables single-machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can also be used from pure Python code.
The `in_set` predicate raises `TypeError: unhashable type: 'Series'` when used with `make_batch_reader` and `make_petastorm_dataset`. I am using pandas 1.3.5. See below for a minimal working example.
```python
import pandas as pd

from petastorm import make_batch_reader
from petastorm.predicates import in_set
from petastorm.tf_utils import make_petastorm_dataset

output_url = 'file:///tmp/hello_world_dataset'
hello_world = pd.DataFrame({'id': [i for i in range(100)]})
hello_world.to_parquet(output_url)

predicate_id = in_set([1, 2, 3, 4, 5], 'id')
with make_batch_reader(output_url, num_epochs=1, workers_count=1, predicate=predicate_id) as reader:
    ds = make_petastorm_dataset(reader)
    train_values = list(ds.as_numpy_iterator())
```
For me, the issue is resolved by applying the `in` operator elementwise in the `predicates.in_set` function, instead of to the whole dataframe at once: