How to transform the string data to numerical when using make_batch_reader?

My Parquet dataset consists of two files with the following contents:

File 1:
  item_name  price
0    laptop   10.0
1      book   20.0
2       cup   30.0

File 2:
  item_name  price
0     phone   11.0
1     dress   22.0

Since make_batch_reader only supports scalar data types, I tried to use TransformSpec to convert the item_name field to a one-hot encoding, using the following function:

import pandas as pd

def encode_and_bind(original_dataframe, feature_to_encode):
    # One-hot encode the column, append the dummy columns, drop the original.
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    res = res.drop([feature_to_encode], axis=1)
    return res
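
One thing to keep in mind: pd.get_dummies only creates columns for the values present in the frame it receives, so a chunk read from the second file would get no item_name_laptop column. Below is a minimal sketch of an encoder that always emits the same columns, assuming the full list of item names is known in advance. ITEM_NAMES and encode_fixed are hypothetical names used for illustration and are not part of petastorm.

```python
import numpy as np
import pandas as pd

# Assumption: the complete vocabulary of item names is known up front.
ITEM_NAMES = ["laptop", "book", "cup", "phone", "dress"]

def encode_fixed(df, feature="item_name", categories=ITEM_NAMES):
    """One-hot encode `feature` into a fixed set of columns, then drop it."""
    # A categorical with explicit categories makes get_dummies emit one
    # column per known category, even for values absent from this frame.
    as_cat = pd.Series(
        pd.Categorical(df[feature], categories=categories), index=df.index
    )
    dummies = pd.get_dummies(as_cat, prefix=feature, dtype=np.uint8)
    return pd.concat([df.drop(columns=[feature]), dummies], axis=1)

# Contents of the second file from the example above.
second_file = pd.DataFrame({"item_name": ["phone", "dress"], "price": [11.0, 22.0]})
encoded = encode_fixed(second_file)
print(list(encoded.columns))
```

With this approach, every chunk yields the same six columns regardless of which items it happens to contain.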

My code is as follows:

from petastorm import make_batch_reader
from petastorm.pytorch import BatchedDataLoader

dataset_url = "hdfs://my_data/parquet_dataset"
reader_epochs = 1
B_SIZE = 2

for training_epoch in range(1):
    with BatchedDataLoader(
        make_batch_reader(
            dataset_url,
            num_epochs=reader_epochs,
            schema_fields=[
                "item_name_cup",
                "item_name_book",
                "price",
                "item_name_laptop",
                "item_name_dress",
                "item_name_phone",
            ],
            transform_spec=transform,
            seed=1,
            shuffle_rows=False,
            shuffle_row_groups=False),
        batch_size=B_SIZE
    ) as train_loader:

        for batch_idx, row in enumerate(train_loader):
            print(f"batch_idx:{batch_idx}")
            print(f"row:{row}")
            break

But I got KeyError: "None of [Index(['item_name'], dtype='object')] are in the [columns]". How can I resolve this? I was expecting the following schema:

"price"             --> float
"item_name_cup"     --> int (0 or 1)
"item_name_book"    --> int (0 or 1)
"item_name_laptop"  --> int (0 or 1)
"item_name_dress"   --> int (0 or 1)
"item_name_phone"   --> int (0 or 1)
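
A sketch of how this schema might be declared through TransformSpec, under two assumptions that are not confirmed in this thread. First, that schema_fields is matched against the columns stored in the Parquet files, so listing only the not-yet-existing item_name_* columns means item_name is never loaded (which would explain the KeyError). Second, that every column added by the transform has to be announced in edit_fields as a (name, numpy dtype, shape, nullable) tuple, with the dropped column listed in removed_fields. ITEM_NAMES and one_hot are hypothetical names.

```python
import numpy as np
import pandas as pd

# Assumption: the complete vocabulary of item names is known up front.
ITEM_NAMES = ["laptop", "book", "cup", "phone", "dress"]

def one_hot(df):
    """Add one 0/1 column per known item name and drop item_name."""
    out = df.copy()
    for name in ITEM_NAMES:
        out[f"item_name_{name}"] = (out["item_name"] == name).astype(np.uint8)
    return out.drop(columns=["item_name"])

# One (name, numpy dtype, shape, nullable) entry per column the function adds.
edit_fields = [(f"item_name_{name}", np.uint8, (), False) for name in ITEM_NAMES]

try:
    from petastorm.transform import TransformSpec

    transform = TransformSpec(
        one_hot,
        edit_fields=edit_fields,
        removed_fields=["item_name"],
    )
except ImportError:
    transform = None  # petastorm is not installed in this environment
```

Under the same assumptions, the reader would then be created with schema_fields=["item_name", "price"] (or with schema_fields omitted) and transform_spec=transform, so that item_name is available to the transform function.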
