So it appears that there are different behaviors depending on the code that processes the dataset. If I simply iterate over the dataset (for sample in dataset: ...), I see painfully slow filling of the shuffle buffer and then a crash when memory is exhausted:
2019-10-02 17:35:16.297163: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:143] Filling up shuffle buffer (this may take a while): 1 of 15000
2019-10-02 17:35:23.878243: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:143] Filling up shuffle buffer (this may take a while): 33 of 15000
2019-10-02 17:35:31.980430: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:143] Filling up shuffle buffer (this may take a while): 65 of 15000
2019-10-02 17:35:47.316917: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:143] Filling up shuffle buffer (this may take a while): 129 of 15000
2019-10-02 17:35:55.363270: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:143] Filling up shuffle buffer (this may take a while): 161 of 15000
2019-10-02 17:36:02.425192: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:143] Filling up shuffle buffer (this may take a while): 193 of 15000
Killed
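For reference, a minimal sketch of the iteration pattern described above; dataset_url is a placeholder, the shuffle buffer of 15000 is taken from the log messages, and TF 2.x eager iteration is assumed (this is not the exact original code):

from petastorm import make_reader
from petastorm.tf_utils import make_petastorm_dataset

with make_reader(dataset_url, num_epochs=1) as reader:
    dataset = make_petastorm_dataset(reader).shuffle(15000)
    for sample in dataset:   # every decoded row is held in memory while
        pass                 # the 15000-element shuffle buffer fills up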
If I pass the dataset into model.fit_generator, it runs over 96 batches of input (3 files' worth of data at 1024 samples per file == 96*32) and then stops with:
96/339 [=======>......................] - ETA: 2:58 - loss: 3.7439 - accuracy: 0.1354
WARNING:tensorflow:Your dataset ran out of data; interrupting training...
Not sure what to make of this...
What is the size of a single record in your case (a row with your uncompressed image)? A shuffle buffer of 15000 records might be quite big... I tried to reproduce your scenario by running the petastorm hello world example:
examples/hello_world/petastorm_dataset/tensorflow_hello_world.py
# Example: use tf.data.Dataset API
import tensorflow as tf
from petastorm import make_reader
from petastorm.tf_utils import make_petastorm_dataset

with make_reader(dataset_url, num_epochs=None) as reader:
    dataset = make_petastorm_dataset(reader)
    dataset = dataset.map(lambda x: x)
    dataset = dataset.batch(4).shuffle(buffer_size=150)
    iterator = dataset.make_one_shot_iterator()
    tensor = iterator.get_next()
    with tf.Session() as sess:
        for ii in range(100000):
            sample = sess.run(tensor)
            print(ii, sample.id)
I observed that the print-outs kept going indefinitely.
After distilling things down, I was able to isolate it as a context manager scope problem (the reader's context manager was not encompassing the lifecycle of the dataset)... just like in #396.
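For illustration, a minimal sketch of that kind of scope problem (dataset_url is a placeholder; this shows the general shape of the bug, not the original code):

from petastorm import make_reader
from petastorm.tf_utils import make_petastorm_dataset

# Problematic: the reader is closed as soon as the with-block exits,
# but the dataset built from it is consumed afterwards.
with make_reader(dataset_url, num_epochs=None) as reader:
    dataset = make_petastorm_dataset(reader)
for sample in dataset:
    ...                      # reader is already closed here

# Correct: keep all dataset consumption inside the reader's scope.
with make_reader(dataset_url, num_epochs=None) as reader:
    dataset = make_petastorm_dataset(reader)
    for sample in dataset:
        ...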
If there is no valid reason to work with a dataset after exiting the scope of the context manager, it would be really useful to just throw an exception. Not sure if that's feasible but it would really help.
Also, I had to make a much smaller dataset to even get the make_petastorm_dataset call to run without blowing out memory. Do I need to use sharding when creating the reader (or something similar) to avoid exhausting memory?
I guess I was surprised by this because the dataset is 800 MB of PNG files (~10k files at ~80 KB each). Even with the ~3x expansion from PNG size to np.array size (at least for the file I looked at), it still seems like there is several times more RAM available than should be required (at least 12 GB of RAM free versus the ~2.5 GB needed for the image arrays).
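As a quick back-of-envelope check of those numbers (all values approximate and taken from the comment above, not measured):

num_files = 10_000        # ~10k PNG files
png_size_mb = 0.08        # ~80 KB per file
expansion = 3             # ~3x PNG -> decoded np.array

on_disk_mb = num_files * png_size_mb     # ~800 MB of PNGs on disk
decoded_mb = on_disk_mb * expansion      # ~2400 MB (~2.4 GB) if fully decoded
print(on_disk_mb, decoded_mb)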
Thanks for chasing this down. I'll try to add protection against reader usage once it gets out of scope. This is a great idea.
The memory footprint can be calculated by:
I'll try to add protection against reader usage once it gets out of scope.
Excellent... I think this will help folks getting started.
I saw the discussion under #306 regarding memory consumption. I will start trying your suggestions. I tried limiting to one shard but it still didn't stop memory from blowing out.
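For reference, sharding in make_reader is controlled by the cur_shard / shard_count arguments; a minimal sketch of reading a single shard (dataset_url is a placeholder, and this is only one possible way the attempt above could look):

from petastorm import make_reader

# Read only shard 0 of 4; each reader instance sees roughly 1/4 of the row groups.
with make_reader(dataset_url, cur_shard=0, shard_count=4) as reader:
    for row in reader:
        ...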
Yep. I would not expect the number of shards to help, since loading is done on a per-Parquet-row-group basis. BTW, reducing the row-group size would also reduce the memory footprint (controlled by the row_group_size_mb argument of materialize_dataset).
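A minimal sketch of setting row_group_size_mb at dataset-generation time, assuming an existing Spark session, a Unischema named MySchema, and an RDD of rows (all placeholders):

from petastorm.etl.dataset_metadata import materialize_dataset

ROW_GROUP_SIZE_MB = 64   # smaller row groups => less memory needed per loaded row group

with materialize_dataset(spark, dataset_url, MySchema, row_group_size_mb=ROW_GROUP_SIZE_MB):
    spark.createDataFrame(rows_rdd, MySchema.as_spark_schema()) \
        .write.mode('overwrite').parquet(dataset_url)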
Below I list some experiments I ran on different configurations. Of course, the results are specific to the particulars of my dataset, but here are some observations:
Not sure how generally applicable any of that is. This was just iterating over the data once (not actually using it for training).
Here are the different runs:
I have a petastorm dataset with 10000+ images. I'm trying to train with 10 epochs and a batch size of 32. Training runs are failing with:
Your dataset ran out of data; interrupting training. Make sure that your dataset can generate at least steps_per_epoch * epochs batches (in this case, 3390 batches)
I am applying num_epochs=None to make_reader, which I thought was sufficient to supply "infinite" data. Am I doing something wrong here?
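For context, a minimal sketch of the kind of setup described above; dataset_url, parse_row, build_model, and the step counts are placeholders inferred from the text, not the original code. Note that model.fit is kept inside the reader's with-block, which is exactly the scope issue discussed earlier in this thread:

from petastorm import make_reader
from petastorm.tf_utils import make_petastorm_dataset

with make_reader(dataset_url, num_epochs=None) as reader:   # None => repeat indefinitely
    dataset = make_petastorm_dataset(reader)
    dataset = dataset.map(parse_row)     # hypothetical mapping to (features, label)
    dataset = dataset.batch(32)

    model = build_model()                # hypothetical Keras model
    # 339 steps per epoch * 10 epochs = 3390 batches, matching the error message.
    model.fit(dataset, epochs=10, steps_per_epoch=339)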