neuroailab / tfutils

Utilities for working with tensorflow
MIT License

Improvements to data provider #48

Closed · damro closed this 7 years ago

damro commented 7 years ago

1. Dict of dtypes and shapes as input, or read by convention from saved metadata in a pkl file (or similar); an explicitly specified dict takes priority.
2. imagelist -> shapedict, applied to all keys, not just "image" keys; or combine sourcedict with the shapes and dtypes.
3. Make a general ParallelByFile class that is independent of any particular data reader and takes the reader as input (TFRecords provider, HDF5 provider, LMDB provider, etc.); see the sketch after this list.
4. Add a postprocessing argument: in the ParallelByFile base class it operates on strings, and in the top-level provider it operates on arrays (if specified).
5. Instead of a separate py_postprocess, let the user do the work, or provide a convenience method that modifies the postprocess dict and is called in postprocess_many.
6. Multiple filename queues that are joined after each batch is read.
7. Allow shuffling the filename_queue as an option, and then allow passing a seed (or default to seed=0).
8. Make the old tests run and create new tests.
9. Update sonify in base.py to accept tf.dtypes.
10. Think about how to restore the data-reading state.
11. Parallelize the data provider on attribute.
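Items 3, 6, and 7 could fit together along the following lines. This is only a minimal sketch against the TF 1.x queue API, not the actual TFUtils implementation; the function name `parallel_by_file` and the `parse_fn` argument are hypothetical:

```python
import tensorflow as tf

def parallel_by_file(filenames, reader_class, parse_fn,
                     batch_size=32, num_threads=4, shuffle=True, seed=0):
    # Hypothetical sketch, not existing TFUtils code.
    # A seeded shuffle keeps runs reproducible; two queues built with the
    # same seed visit their filenames in the same order (items 6 and 7).
    filename_queue = tf.train.string_input_producer(
        filenames, shuffle=shuffle, seed=seed)
    # The reader class is injected (item 3): tf.TFRecordReader works here,
    # and any other tf.ReaderBase subclass could be passed in its place.
    reader = reader_class()
    _, serialized = reader.read(filename_queue)
    # User-supplied parse_fn maps a serialized record to a dict of tensors.
    tensors = parse_fn(serialized)
    # num_threads workers run the read/parse subgraph behind one batch op.
    return tf.train.batch(tensors, batch_size=batch_size,
                          num_threads=num_threads, capacity=4 * batch_size)
```

For example, `parallel_by_file(train_files, tf.TFRecordReader, my_parse_fn)` would give a TFRecords pipeline, while an HDF5 or LMDB provider would need to supply its own reader implementation.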
nhaber commented 7 years ago

On #8, we need to think about how to deal with the case where the filenames provided do not line up with the corresponding data -- it is unclear whether there is a check. Specifically:

- check that all filename lists are the same length
- check that corresponding elements have the same number of records
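As a rough illustration of both checks, assuming the filenames point at TFRecords files, a hypothetical helper along these lines could run before any queues are built (the record-counting pass reads every file once, so it is slow on large datasets):

```python
import tensorflow as tf

def check_filename_lists(filename_lists):
    # Hypothetical validation helper, not existing TFUtils code.
    # Check 1: all filename lists have the same length.
    lengths = set(len(fns) for fns in filename_lists)
    assert len(lengths) == 1, 'filename lists differ in length: %r' % lengths
    # Check 2: corresponding files hold the same number of records.
    for files in zip(*filename_lists):
        counts = [sum(1 for _ in tf.python_io.tf_record_iterator(f))
                  for f in files]
        assert len(set(counts)) == 1, (
            'record counts differ across %r: %r' % (list(files), counts))
```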

qbilius commented 7 years ago

How is this issue progressing?

I'm thinking of implementing a benchmark comparing TFRecords and HDF5 reading from the filename queue, varying the number of reading and preprocessing threads and whether data is fed through a placeholder or directly. It seems like it would be generally useful to see how fast each combination is.
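For the queue-fed side of that comparison, a minimal timing harness could look like the sketch below; it assumes `batch_op` was built by some queue-based pipeline (e.g. a `tf.train.batch` op with a given `num_threads`) and measures steady-state throughput. The placeholder variant would instead time `sess.run` calls on a graph fed via `feed_dict`. The helper name is hypothetical:

```python
import time
import tensorflow as tf

def batches_per_second(batch_op, n_batches=100):
    # Hypothetical benchmark harness, not existing TFUtils code.
    with tf.Session() as sess:
        sess.run(tf.local_variables_initializer())  # needed if num_epochs is set
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        sess.run(batch_op)  # warm up: let the queues fill before timing
        start = time.time()
        for _ in range(n_batches):
            sess.run(batch_op)
        elapsed = time.time() - start
        coord.request_stop()
        coord.join(threads)
    return n_batches / elapsed
```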

yamins81 commented 7 years ago

We're nearing completion on this. It turned out to involve some nontrivial technical challenges. I should have something by later today or tomorrow.