nathanin / micron2


MoCo/unsupervised training efficiency #22

Open nathanin opened 3 years ago

nathanin commented 3 years ago

Handle HDF5 input efficiently.

Options:

- Check the reference implementations for clues (one possible tf.data pipeline is sketched below)
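
Not from the thread, just to make the target concrete: a minimal sketch of streaming tiles out of a single HDF5 file with tf.data.Dataset.from_generator (output_signature needs TF 2.4+). The file path, the "images" dataset key, and the tile shape are all assumptions, not values from this repo.

import h5py
import tensorflow as tf

H5_PATH = "dataset.hdf5"                                       # hypothetical path
TILE_SPEC = tf.TensorSpec(shape=(64, 64, 3), dtype=tf.uint8)   # assumed tile shape

def hdf5_tiles():
    # Open the file inside the generator so each tf.data iterator gets its own handle.
    with h5py.File(H5_PATH, "r") as f:
        for tile in f["images"]:                               # assumed dataset key
            yield tile

ds = (tf.data.Dataset.from_generator(hdf5_tiles, output_signature=TILE_SPEC)
      .shuffle(1024)
      .batch(32)
      .prefetch(tf.data.AUTOTUNE))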

nathanin commented 3 years ago
IODataset.interleave might be useful:

interleave(
    map_func, cycle_length=None, block_length=None, num_parallel_calls=None,
    deterministic=None
)

import tensorflow as tf

# Preprocess 4 files concurrently, and interleave blocks of 16 records
# from each file.
filenames = ["/var/data/file1.txt", "/var/data/file2.txt",
             "/var/data/file3.txt", "/var/data/file4.txt"]
dataset = tf.data.Dataset.from_tensor_slices(filenames)
def parse_fn(filename):
  return tf.data.Dataset.range(10)
dataset = dataset.interleave(lambda x:
    tf.data.TextLineDataset(x).map(parse_fn, num_parallel_calls=1),
    cycle_length=4, block_length=16)

https://www.tensorflow.org/io/api_docs/python/tfio/v0/IODataset#interleave
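
The tfio route would go through IODataset as linked above. As a hedged plain-tf.data alternative, the sketch below interleaves reads across several HDF5 shards by combining interleave with from_generator and its args parameter; the shard paths, the "images" key, and the tile shape are assumptions.

import h5py
import tensorflow as tf

shard_paths = ["shard_0.h5", "shard_1.h5", "shard_2.h5", "shard_3.h5"]  # hypothetical
TILE_SPEC = tf.TensorSpec(shape=(64, 64, 3), dtype=tf.uint8)            # assumed shape

def read_shard(path):
    # `path` arrives as a numpy bytes scalar via from_generator's args.
    with h5py.File(path.decode("utf-8"), "r") as f:
        for tile in f["images"]:                                        # assumed key
            yield tile

def shard_to_dataset(path):
    return tf.data.Dataset.from_generator(
        read_shard, args=(path,), output_signature=TILE_SPEC)

ds = (tf.data.Dataset.from_tensor_slices(shard_paths)
      .interleave(shard_to_dataset,
                  cycle_length=4,     # read 4 shards concurrently
                  block_length=16,    # take 16 tiles from each before cycling
                  num_parallel_calls=tf.data.AUTOTUNE)
      .batch(32)
      .prefetch(tf.data.AUTOTUNE))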

nathanin commented 3 years ago

Switched to graph mode for a ~4x speedup.

Still need to test out different dataset formats.

1cd9354
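
For reference on what the graph-mode switch amounts to, a hedged sketch of a training step wrapped in tf.function: the function is traced once into a graph and the compiled graph is reused on every call, which is where the speedup comes from. The tiny encoder and the contrastive-style loss are placeholders, not the MoCo implementation in this repo.

import tensorflow as tf

# Placeholder encoder/optimizer; the real MoCo model lives in micron2.
encoder = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(128)])
optimizer = tf.keras.optimizers.Adam(1e-4)

@tf.function   # traced once to a graph, then reused on every call
def train_step(x_q, x_k):
    with tf.GradientTape() as tape:
        q = tf.math.l2_normalize(encoder(x_q, training=True), axis=1)
        k = tf.math.l2_normalize(encoder(x_k, training=True), axis=1)
        # Placeholder contrastive loss, standing in for the MoCo objective.
        logits = tf.matmul(q, k, transpose_b=True) / 0.07
        labels = tf.range(tf.shape(q)[0])
        loss = tf.reduce_mean(tf.keras.losses.sparse_categorical_crossentropy(
            labels, logits, from_logits=True))
    grads = tape.gradient(loss, encoder.trainable_variables)
    optimizer.apply_gradients(zip(grads, encoder.trainable_variables))
    return loss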

nathanin commented 3 years ago

Reopening because there's a slow memory leak in graph mode.

Trying to eliminate variables and rule out suspects like the data pipeline.
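
One way to watch for the leak while ruling things out, sketched as a Keras callback that logs resident memory after every epoch; psutil and the callback name are assumptions, not anything from this repo.

import os
import psutil
import tensorflow as tf

class MemoryLogger(tf.keras.callbacks.Callback):
    # Log resident set size after every epoch to see whether memory creeps upward.
    def on_epoch_end(self, epoch, logs=None):
        rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1e6
        print(f"epoch {epoch}: rss = {rss_mb:.1f} MB")

# model.fit(dataset, epochs=10, callbacks=[MemoryLogger()])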

nathanin commented 3 years ago

https://github.com/nathanin/micron2/blob/b04cc0a11bcc0509063f55fc2d16c105ca57262c/micron2/data/load_nuclei.py#L64

This data loader, run in graph mode, seems to work on a smaller dataset when the dataset is built with no repeats and the tf.keras.Model.fit epochs argument is used to control the length of training.
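
A hedged, self-contained sketch of that setup with a toy stand-in dataset and model: no .repeat() on the dataset, so the epochs argument to fit alone bounds the length of training.

import tensorflow as tf

# Toy stand-in data; in practice this would come from the HDF5 loader above.
xs = tf.random.uniform((256, 64, 64, 3))
ys = tf.zeros((256,), dtype=tf.int32)
ds = (tf.data.Dataset.from_tensor_slices((xs, ys))
      .shuffle(256)
      .batch(32)
      .prefetch(tf.data.AUTOTUNE))                 # note: no .repeat()

model = tf.keras.Sequential([tf.keras.layers.Flatten(),
                             tf.keras.layers.Dense(2)])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# With a finite dataset, each epoch ends when the data runs out,
# so epochs alone controls the total amount of training.
model.fit(ds, epochs=5)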