uber / petastorm

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
Apache License 2.0

Failure in "test_pytorch_dataloader.py::test_batched_data_loader_with_in_memory_cache" #642

Closed. chongxiaoc closed this issue 3 years ago

chongxiaoc commented 3 years ago

The parameter shuffling_queue_capacity is not used in the unit test, which means it is always 0.

https://github.com/uber/petastorm/blob/15b35798d6140efe90f8467072dd55d12a8f79c1/petastorm/tests/test_pytorch_dataloader.py#L230

However, if I add shuffling_queue_capacity to extra_loader_params, all tests with shuffling_queue_capacity=20 fail. https://github.com/uber/petastorm/blob/15b35798d6140efe90f8467072dd55d12a8f79c1/petastorm/tests/test_pytorch_dataloader.py#L238
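
For illustration, here is a minimal sketch of what forwarding the parameter into the test's loader kwargs could look like. The parametrize values and the name extra_loader_params mirror the linked test, but the exact fixture wiring shown here is an assumption, not the real test code:

    import pytest

    # Hypothetical sketch only, not the actual test: the point is that the
    # parametrized value has to be placed into extra_loader_params so the
    # DataLoader is constructed with a non-zero shuffling_queue_capacity.
    @pytest.mark.parametrize('shuffling_queue_capacity', [20, 0])
    def test_batched_data_loader_with_in_memory_cache(shuffling_queue_capacity):
        extra_loader_params = dict(shuffling_queue_capacity=shuffling_queue_capacity)
        # ... build the reader and the loader with **extra_loader_params, enable the
        # in-memory cache, and iterate a few batches, as in the linked test ...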

(petastorm_venv3.7) root@4d7bc42e93c2:/petastorm/petastorm/tests# pytest -v test_pytorch_dataloader.py::test_batched_data_loader_with_in_memory_cache
==================================================================== test session starts ====================================================================
platform linux -- Python 3.7.9, pytest-6.2.1, py-1.10.0, pluggy-0.13.1 -- /petastorm_venv3.7/bin/python3.7
cachedir: .pytest_cache
rootdir: /petastorm
plugins: forked-1.3.0, timeout-1.4.2, cov-2.11.1, logger-0.5.1
collected 16 items

test_pytorch_dataloader.py::test_batched_data_loader_with_in_memory_cache[1-make_batch_reader-20] FAILED                                              [  6%]
test_pytorch_dataloader.py::test_batched_data_loader_with_in_memory_cache[1-make_batch_reader-0] PASSED                                               [ 12%]
test_pytorch_dataloader.py::test_batched_data_loader_with_in_memory_cache[1-make_reader-20] FAILED                                                    [ 18%]
test_pytorch_dataloader.py::test_batched_data_loader_with_in_memory_cache[1-make_reader-0] PASSED                                                     [ 25%]
test_pytorch_dataloader.py::test_batched_data_loader_with_in_memory_cache[2-make_batch_reader-20] FAILED                                              [ 31%]
test_pytorch_dataloader.py::test_batched_data_loader_with_in_memory_cache[2-make_batch_reader-0] PASSED                                               [ 37%]
test_pytorch_dataloader.py::test_batched_data_loader_with_in_memory_cache[2-make_reader-20] FAILED                                                    [ 43%]
test_pytorch_dataloader.py::test_batched_data_loader_with_in_memory_cache[2-make_reader-0] PASSED                                                     [ 50%]
test_pytorch_dataloader.py::test_batched_data_loader_with_in_memory_cache[3-make_batch_reader-20] FAILED                                              [ 56%]
test_pytorch_dataloader.py::test_batched_data_loader_with_in_memory_cache[3-make_batch_reader-0] PASSED                                               [ 62%]
test_pytorch_dataloader.py::test_batched_data_loader_with_in_memory_cache[3-make_reader-20] FAILED                                                    [ 68%]
test_pytorch_dataloader.py::test_batched_data_loader_with_in_memory_cache[3-make_reader-0] PASSED                                                     [ 75%]
test_pytorch_dataloader.py::test_batched_data_loader_with_in_memory_cache[None-make_batch_reader-20] FAILED                                           [ 81%]
test_pytorch_dataloader.py::test_batched_data_loader_with_in_memory_cache[None-make_batch_reader-0] PASSED                                            [ 87%]
test_pytorch_dataloader.py::test_batched_data_loader_with_in_memory_cache[None-make_reader-20] FAILED                                                 [ 93%]
test_pytorch_dataloader.py::test_batched_data_loader_with_in_memory_cache[None-make_reader-0] PASSED                                                  [100%]

Is this a known issue, or did we just miss something?

@abditag2 @selitvin @tgaddair

chongxiaoc commented 3 years ago

OK, with pytest -s, it shows that _add_many raises a RuntimeError, as below:

            retrieved_so_far = None
            for idx in range(5):
>               batch = next(it)

test_pytorch_dataloader.py:257:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../pytorch.py:124: in __iter__
    for batch in self._iter_impl():
../pytorch.py:394: in _iter_impl
    for b in self._iter_impl_worker():
../pytorch.py:441: in _iter_impl_worker
    other_shuffling_buffer.add_many(batch.values())
../reader_impl/pytorch_shuffling_buffer.py:36: in add_many
    return self._add_many(items)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <petastorm.reader_impl.pytorch_shuffling_buffer.BatchedRandomShufflingBuffer object at 0x7f26794ad8d0>
items = [tensor([230000,  50000, 400000, 300000, 390000, 310000, 110000,      0, 180000,
         70000], dtype=torch.int32)]

    def _add_many(self, items):
        if self._done_adding:
            raise RuntimeError('Can not call add_many after done_adding() was called.')

        if not self.can_add():
>           raise RuntimeError('Can not enqueue. Check the return value of "can_enqueue()" to check if more '
                               'items can be added.')
E           RuntimeError: Can not enqueue. Check the return value of "can_enqueue()" to check if more items can be added.

../reader_impl/pytorch_shuffling_buffer.py:238: RuntimeError
--------------------------------------------------------------------- Captured log call ---------------------------------------------------------------------
ERROR    petastorm.pytorch:pytorch.py:128 Iteration on Petastorm DataLoader raise error: RuntimeError('Can not enqueue. Check the return value of "can_enqueue()" to check if more items can be added.')

chongxiaoc commented 3 years ago

Found the root cause of this bug: when in-memory caching is enabled, a secondary shuffling queue is created: https://github.com/uber/petastorm/blob/15b35798d6140efe90f8467072dd55d12a8f79c1/petastorm/pytorch.py#L345

This secondary shuffling queue is given the same capacity as the normal shuffling queue.

However, while iterating over the files, the normal shuffling queue grows as rows are added and shrinks as shuffled batches are produced, whereas the secondary queue only grows because it caches all the data. It eventually overflows once more data has been fed in than its capacity allows.

To fix this, I think the unit test should set the shuffling queue capacity to at least the number of rows.
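
To make the failure mode concrete, here is a minimal, self-contained sketch (not the Petastorm implementation) of a fixed-capacity buffer that is only ever filled, the way the secondary cache queue is during the first pass over the data. Once more rows arrive than the capacity allows, it hits the same kind of "can not enqueue" error, which is why sizing the queue to at least the row count avoids the failure:

    # Illustrative sketch only (not petastorm.reader_impl.pytorch_shuffling_buffer):
    # a bounded buffer that is filled but never drained, like the in-memory cache's
    # secondary shuffling queue.
    class GrowOnlyBuffer:
        def __init__(self, capacity):
            self._capacity = capacity
            self._items = []

        def can_add(self):
            return len(self._items) < self._capacity

        def add_many(self, items):
            if not self.can_add():
                raise RuntimeError('Can not enqueue: buffer already holds %d items' % len(self._items))
            self._items.extend(items)

    buf = GrowOnlyBuffer(capacity=20)          # same capacity as the normal shuffling queue
    try:
        for start in range(0, 50, 10):         # 50 cached rows arriving in batches of 10
            buf.add_many(list(range(start, start + 10)))
    except RuntimeError as e:
        print(e)                               # hit on the third batch: the 20-item cap is exceeded
    # With capacity >= 50 (the total number of rows), every batch fits and no error is raised.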

I will also add a comment in pytorch.py to document this requirement.

I will draft a fix soon.