uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

TF Dataset Runs Out Of Data Despite num_epochs=None #432

Closed cupdike closed 5 years ago

cupdike commented 5 years ago

I have a petastorm dataset with 10000+ images. I'm trying to train for 10 epochs with a batch size of 32. Training runs are failing with: `Your dataset ran out of data; interrupting training. Make sure that your dataset can generate at least steps_per_epoch * epochs batches (in this case, 3390 batches)`

I am passing `num_epochs=None` to `make_reader`, which I thought was sufficient to supply "infinite" data:

with make_reader(dataset_url, reader_pool_type='dummy', num_epochs=None) as reader:
    dataset = make_petastorm_dataset(reader)
    dataset = dataset.map(lambda x: tf.numpy_function(func=process_inputs, inp=(x.image1, x.category), Tout=(tf.float32, tf.uint8)))
    dataset = dataset.batch(bs).shuffle(buffer_size=15000)

Am I doing something wrong here?

cupdike commented 5 years ago

So it appears that there are different behaviors depending on the code that processes the dataset. If I simply iterate over the dataset (for sample in dataset: ...), I see painfully slow loading of the buffer and then a crash when memory is exhausted:

2019-10-02 17:35:16.297163: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:143] Filling up shuffle buffer (this may take a while): 1 of 15000
2019-10-02 17:35:23.878243: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:143] Filling up shuffle buffer (this may take a while): 33 of 15000
2019-10-02 17:35:31.980430: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:143] Filling up shuffle buffer (this may take a while): 65 of 15000
2019-10-02 17:35:47.316917: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:143] Filling up shuffle buffer (this may take a while): 129 of 15000
2019-10-02 17:35:55.363270: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:143] Filling up shuffle buffer (this may take a while): 161 of 15000
2019-10-02 17:36:02.425192: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:143] Filling up shuffle buffer (this may take a while): 193 of 15000
Killed

If I pass the dataset into `model.fit_generator`, it runs over 96 batches of input (3 files' worth of data at 1024 samples per file == 96*32) and then stops with:

96/339 [=======>......................] - ETA: 2:58 - loss: 3.7439 - accuracy: 0.1354
WARNING:tensorflow:Your dataset ran out of data; interrupting training...

Not sure what to make of this...

selitvin commented 5 years ago

What is the size of a single record in your case (a row with your uncompressed image)? A shuffling queue of 15000 records might be quite big... I tried to reproduce your scenario by running the petastorm hello world example:

examples/hello_world/petastorm_dataset/tensorflow_hello_world.py

    # Example: use tf.data.Dataset API
    with make_reader(dataset_url, num_epochs=None) as reader:
        dataset = make_petastorm_dataset(reader)
        dataset = dataset.map(lambda x: x)

        dataset = dataset.batch(4).shuffle(buffer_size=150)

        iterator = dataset.make_one_shot_iterator()
        tensor = iterator.get_next()
        with tf.Session() as sess:
            for ii in range(100000):
                sample = sess.run(tensor)
                print(ii, sample.id)

I observed that the messages kept printing indefinitely.

cupdike commented 5 years ago

After distilling things down, I was able to isolate it as a context-manager scope problem (the `with` block was not encompassing the lifecycle of the dataset)... just like in #396.

If there is no valid reason to work with a dataset after exiting the scope of the context manager, it would be really useful to just throw an exception. Not sure if that's feasible but it would really help.
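
For illustration, here is a minimal sketch of the scoping difference, assuming a TF 2.x-style `model.fit`, a `model` and `dataset_url` defined elsewhere, and the `image1`/`category` fields from the original question:

import tensorflow as tf
from petastorm import make_reader
from petastorm.tf_utils import make_petastorm_dataset

# Broken pattern: the dataset escapes the `with` block, so the reader and its
# worker pool are already shut down by the time training consumes the data.
# with make_reader(dataset_url, num_epochs=None) as reader:
#     dataset = make_petastorm_dataset(reader)
# model.fit(dataset, epochs=10, steps_per_epoch=339)  # reader already closed

# Working pattern: everything that consumes the dataset stays inside the scope.
with make_reader(dataset_url, num_epochs=None) as reader:
    dataset = make_petastorm_dataset(reader)
    dataset = dataset.map(lambda x: (tf.cast(x.image1, tf.float32), x.category))
    dataset = dataset.shuffle(buffer_size=1000).batch(32)   # shuffle rows, then batch
    model.fit(dataset, epochs=10, steps_per_epoch=339)      # consumed while reader is open

(The shuffle is applied before batching in this sketch so that individual rows rather than whole batches get shuffled.)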

Also, I had to make a much smaller dataset to even get the `make_petastorm_dataset` call to run without exhausting memory. Do I need to use sharding when creating the reader (or something similar) to avoid blowing out the memory?

I guess I was surprised by this because the dataset is 800 MB of PNG files (~80 KB per file x ~10k files). Even with the 3x factor of PNG file size to np.array size (at least for the file I looked at), it still seems like there is several times more RAM available than should be required (at least 12 GB of RAM is free versus the ~2.5 GB needed for the image arrays).
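
As a back-of-envelope check (using only the figures quoted above, which are estimates rather than measurements):

# Rough arithmetic with the numbers from this comment.
num_images = 10_000
png_bytes = 80 * 1024      # ~80 KB per PNG on disk
decode_factor = 3          # observed PNG -> np.array size ratio for one file

on_disk = num_images * png_bytes       # ~0.76 GiB of PNG files
decoded = on_disk * decode_factor      # ~2.3 GiB if every row were decoded at once
print(f"{on_disk / 2**30:.2f} GiB on disk, ~{decoded / 2**30:.2f} GiB decoded")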

selitvin commented 5 years ago

Thanks for chasing this down. I'll try to add protection against reader usage once it gets out of scope. This is a great idea.

The memory footprint can be calculated by:

`row_size` * (`workers_count` * `rows_per_row_group` + `results_queue_size`)

To reduce the memory footprint I would suggest:

- Reduce the number of workers (`workers_count`). This is a tunable parameter, and reducing it will not necessarily reduce throughput; it depends a lot on the machine/network setup.
- Reduce `results_queue_size` to a smaller value. Setting it to something really small (e.g. `3`) should not have any adverse effect, since you have a shuffling queue right after.
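
As an illustration, a minimal sketch of how those two knobs could be passed to `make_reader` (the values are illustrative rather than recommendations, and `dataset_url` is assumed to be defined as in the question):

from petastorm import make_reader
from petastorm.tf_utils import make_petastorm_dataset

# Illustrative settings: fewer workers and a small results queue cap the number
# of decoded rows held in memory at any one time.
with make_reader(dataset_url,
                 num_epochs=None,
                 workers_count=2,                  # fewer row groups decoded concurrently
                 results_queue_size=3) as reader:  # tiny queue; shuffling happens downstream
    dataset = make_petastorm_dataset(reader)
    # ... map / shuffle / batch and train here, inside the reader's scope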
cupdike commented 5 years ago

> I'll try to add protection against reader usage once it gets out of scope.

Excellent... I think this will help folks getting started.

I saw the discussion under #306 regarding memory consumption. I will start trying your suggestions. I tried limiting to one shard but it still didn't stop memory from blowing out.

selitvin commented 5 years ago

Yep. I would not expect the number of shards to help, since loading is done on a per Parquet row-group basis. BTW, reducing the row-group size would also reduce the memory footprint (controlled by the `row_group_size_mb` argument of `materialize_dataset`).
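
For context, a rough sketch of where that knob lives when the dataset is generated with Spark (`spark`, `output_url`, `MySchema`, and `rows_rdd` are placeholders; only `row_group_size_mb` is the point here):

from petastorm.etl.dataset_metadata import materialize_dataset

# Smaller row groups mean each reader worker decodes less data at a time.
with materialize_dataset(spark, output_url, MySchema, row_group_size_mb=64):
    rows = spark.createDataFrame(rows_rdd, MySchema.as_spark_schema())
    rows.write.mode('overwrite').parquet(output_url)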

cupdike commented 5 years ago

Below I list some experiments I ran on different configurations. Of course, the results are specific to the particulars of my dataset, but here are some observations:

Not sure how generally applicable any of that is. This was just iterating over the data once (not actually using it for training).

Here are the different runs: