Omitg24 opened this issue 2 weeks ago
I have nothing to provide you but solidarity. I am running into this same problem with a TFRecords data pipeline:
def _parse_function(example_proto):
    # Per-record shapes are stored alongside the serialized tensors so the
    # flat payloads can be reshaped after parsing.
    feature_description = {
        'ny':        tf.io.FixedLenFeature([], tf.int64, default_value=0),
        'nx':        tf.io.FixedLenFeature([], tf.int64, default_value=0),
        'ntp':       tf.io.FixedLenFeature([], tf.int64, default_value=0),
        'ntf':       tf.io.FixedLenFeature([], tf.int64, default_value=0),
        'ncp':       tf.io.FixedLenFeature([], tf.int64, default_value=0),
        'ncf':       tf.io.FixedLenFeature([], tf.int64, default_value=0),
        'priors':    tf.io.FixedLenFeature([], tf.string, default_value=''),
        'forecasts': tf.io.FixedLenFeature([], tf.string, default_value=''),
    }
    features = tf.io.parse_example(example_proto, feature_description)
    # The string features hold tensors serialized (presumably) with
    # tf.io.serialize_tensor; parse them back and restore their shapes
    # from the stored dimensions.
    priors = tf.io.parse_tensor(features['priors'], tf.float32)
    forecasts = tf.io.parse_tensor(features['forecasts'], tf.float32)
    ny = features['ny']
    nx = features['nx']
    ntp = features['ntp']
    ntf = features['ntf']
    ncp = features['ncp']
    ncf = features['ncf']
    priors = tf.reshape(priors, shape=[ntp, ny, nx, ncp])
    forecasts = tf.reshape(forecasts, shape=[ntf, ny, nx, ncf])
    return priors, forecasts
...
def create_dataset_onr_tfrecords(path,
                                 glob,
                                 batch_size=32,
                                 compression='GZIP',
                                 shuffle=True,
                                 deterministic=False):
    # List the shard files, interleave reads across them, parse each record,
    # then batch and prefetch, leaving all parallelism to AUTOTUNE.
    return tf.data.Dataset.list_files(str(path / glob), shuffle=shuffle).interleave(
        lambda x: tf.data.TFRecordDataset(x, compression_type=compression),
        cycle_length=tf.data.AUTOTUNE,
        num_parallel_calls=tf.data.AUTOTUNE,
        deterministic=deterministic
    ).map(
        _parse_function,
        num_parallel_calls=tf.data.AUTOTUNE
    ).batch(
        batch_size, drop_remainder=True
    ).prefetch(tf.data.AUTOTUNE)
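For what it's worth, here's how the pipeline gets consumed; the path and glob are placeholders, not my real layout:

from pathlib import Path

# Hypothetical location and shard pattern.
ds = create_dataset_onr_tfrecords(Path('/data/onr'), '*.tfrecord')

for priors, forecasts in ds.take(1):
    print(priors.shape, forecasts.shape)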
I'll spare you the plot, but I am having the same issue with a vanilla TF dataset. I've tried removing interleave, removing GZIP compression, calling TFRecordDataset directly, removing batching, removing prefetching... nothing helps.
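Since I'm sparing you the plot: a quick way to see the growth yourself, assuming psutil is installed, is to just iterate the dataset and log resident memory per epoch, roughly like this:

import gc
import psutil

process = psutil.Process()

for epoch in range(10):
    for _ in ds:   # any tf.data.Dataset, e.g. the one built above
        pass
    gc.collect()
    print(f'epoch {epoch}: RSS = {process.memory_info().rss / 2**20:.0f} MiB')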
I believe this is a TensorFlow problem and (in particular) a tf.data problem: https://github.com/tensorflow/tensorflow/issues/65675
This TF 2.16 + K3 era has been a disaster. Not the Keras part -- that's just growing pains. But TF, man...
I am facing the same problem, using scripts from here: https://github.com/kpertsch/rlds_dataset_mod
which also rely on tf.data features. The scripts are intended to apply modifications to an existing TensorFlow dataset stored in TFRecord format.
For the past 3 weeks I've been searching nonstop for a solution to this problem: when training an LSTM model with a custom DataGenerator, Keras ends up using all my RAM. The context of the project is sleep stage prediction. In this script, the idea is to parallelize across 15 different participants, each with 10 folds (10 train and 10 validation), and in a later phase to test on the corresponding partition. Having said that, this is the LSTM network I'm currently using:
I'm using:
This network has been used in this project.
Then, I've implemented this custom DataGenerator, which suits my problem.
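Roughly, it's a keras.utils.Sequence along these lines (a simplified sketch; class name, file paths, and shapes are made up for illustration):

import numpy as np
from tensorflow import keras

class SleepDataGenerator(keras.utils.Sequence):
    # Simplified sketch of the generator described above.
    def __init__(self, x_path, y_path, batch_size=64):
        # Memory-map the fold so only the requested batch is read into RAM.
        self.x = np.load(x_path, mmap_mode='r')   # (n_windows, timesteps, features)
        self.y = np.load(y_path, mmap_mode='r')   # (n_windows, n_stages)
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        # np.array copies the slice out of the memmap into a regular array.
        return np.array(self.x[sl]), np.array(self.y[sl])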
And finally, the training phase is the following:
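Schematically (again with made-up names; build_lstm stands in for the network above), the loop is:

for participant in range(15):
    model = build_lstm()  # hypothetical builder for the network above
    for fold in range(10):
        train_gen = SleepDataGenerator(f'p{participant}_f{fold}_train_x.npy',
                                       f'p{participant}_f{fold}_train_y.npy')
        val_gen = SleepDataGenerator(f'p{participant}_f{fold}_val_x.npy',
                                     f'p{participant}_f{fold}_val_y.npy')
        # Incremental training: the same model keeps fitting fold after fold.
        model.fit(train_gen, validation_data=val_gen, epochs=10)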
With that, I have this output file (showing the first and the last epoch), where you can see how it ends up using 80 GB of RAM on just one participant with 10 epochs and 10 folds.
I've tried explicitly deleting variables, calling the garbage collector, and using clear_session() after finishing training each model; since this is incremental training, I don't think I'm supposed to call it between folds.
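Concretely, the cleanup after each participant's model looks roughly like this (a sketch of the pattern, not my exact script):

import gc
from tensorflow.keras import backend as K

# After one participant's model has finished all its folds:
del model
K.clear_session()  # between folds this would discard the incrementally
                   # trained weights, so it only runs after each model
gc.collect()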
Finally, in case it helps demonstrate the issue, I also looked at what memory_profiler prints, to check whether memory was actually being freed (it is, but not enough). This is the result for one epoch and 10 folds on one participant.
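The profiler was hooked up in the usual way, roughly like this (the function name is made up):

# Run with: python -m memory_profiler train.py
from memory_profiler import profile

@profile
def train_one_participant(participant):
    model = build_lstm()  # hypothetical, as in the sketch above
    for fold in range(10):
        model.fit(SleepDataGenerator(f'p{participant}_f{fold}_train_x.npy',
                                     f'p{participant}_f{fold}_train_y.npy'),
                  epochs=1)
    return model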
Hope someone knows how to fix this issue. Thanks!
What I've tried
I've tried reading the folds only when needed, explicitly freeing memory by deleting variables and calling the garbage collector, and using different parallelization techniques, but I've always hit the same issue: a single participant consumes too much memory to handle.