uber / petastorm

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
Apache License 2.0

AttributeError: 'bool' object has no attribute 'map' while using Predicate #789

Open littlehomelessman opened 1 year ago

littlehomelessman commented 1 year ago

Hello team,

I'm trying to split my data into a training set and a test set in an 80:20 ratio using a predicate, and I got the following error:

/home/xzk/.local/lib/python3.7/site-packages/petastorm/hdfs/namenode.py:270: FutureWarning: pyarrow.hdfs.connect is deprecated as of 2.0.0, please use pyarrow.fs.HadoopFileSystem instead.
  return pyarrow.hdfs.connect(hostname, url.port or 8020, **kwargs)
Worker 3 terminated: unexpected exception:
Traceback (most recent call last):
  File "/home/xzk/.local/lib/python3.7/site-packages/petastorm/workers_pool/thread_pool.py", line 62, in run
    self._worker_impl.process(*args, **kargs)
  File "/home/xzk/.local/lib/python3.7/site-packages/petastorm/arrow_reader_worker.py", line 150, in process
    all_cols = self._load_rows_with_predicate(parquet_file, piece, worker_predicate, shuffle_row_drop_partition)
  File "/home/xzk/.local/lib/python3.7/site-packages/petastorm/arrow_reader_worker.py", line 258, in _load_rows_with_predicate
    erase_mask = match_predicate_mask.map(operator.not_)
AttributeError: 'bool' object has no attribute 'map'
Iteration on Petastorm DataLoader raise error: AttributeError("'bool' object has no attribute 'map'")

I notice that:

~/.local/lib/python3.7/site-packages/petastorm/arrow_reader_worker.py in _load_rows_with_predicate(self, pq_file, piece, worker_predicate, shuffle_row_drop_partition)
    256 
    257         match_predicate_mask = worker_predicate.do_include(predicates_data_frame)
--> 258         erase_mask = match_predicate_mask.map(operator.not_)

Here do_include(...) seems to return only a single bool, while line 258 calls .map on the result.
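
The mismatch can be reproduced without petastorm. This is a minimal sketch, assuming only that the value reaching line 258 is a plain Python bool, as the traceback suggests; the variable names mirror the traceback and nothing here calls the library itself.

```python
import operator

# Stand-in for the value a per-row predicate returns: one plain Python bool.
match_predicate_mask = True

try:
    # Same call shape as the failing line in the traceback.
    erase_mask = match_predicate_mask.map(operator.not_)
except AttributeError as exc:
    message = str(exc)
    print(message)
```

An object with a .map method, such as a pandas Series of per-row booleans, would pass this line, which suggests the batch code path expects a mask covering the whole batch rather than one bool.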

Is this a bug, or am I using the predicate incorrectly? Please help, thank you!

My code:

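from petastorm import make_batch_reader
from petastorm.predicates import in_pseudorandom_split
from petastorm.pytorch import DataLoader

# Note: dataset_url and reader_epochs are not defined in this snippet.
def train_model(num_epochs=100, batch_size=1000):
    for epoch in range(num_epochs):
        # Subset index 0 selects the 80% part of the [0.8, 0.2] split.
        with DataLoader(
            make_batch_reader(dataset_url, num_epochs=reader_epochs, schema_fields=None,
                              transform_spec=None, seed=1, shuffle_rows=False, shuffle_row_groups=False,
                              predicate=in_pseudorandom_split([0.8, 0.2], 0, "some_column_name")),
            batch_size=150) as dataloader:
            # Print the first batch only, to trigger the error.
            for raw in dataloader:
                print(raw)
                break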