tensorflow / tensorboard

TensorFlow's Visualization Toolkit
Apache License 2.0

SummaryWriter/TensorBoard should use a more efficient format #641

Open lespeholt opened 6 years ago

lespeholt commented 6 years ago

The current format used becomes slow in many circumstances:

  1. Saving images/audio/...
  2. Saving many scalars

I suggest that the format is replaced by something that:

  1. Is compressed (and doesn't save tags for every event)
  2. Is columnar (doesn't load images if you only look at scalars)
  3. Is easy to read and load efficiently from Python

wchargin commented 6 years ago

Images are PNG-encoded, which actually seems fine to me.

Audio, on the other hand, uses raw WAV, which is an enormous waste of space. Our plan is to switch that to FLAC, but to do that we need to get a FLAC encoder into TensorFlow core, and we just haven't gotten around to that.

Regarding columnarity: we're working on that, too. @jart has been working on a project to allow TensorBoard to use a SQL datastore instead of an in-memory datastore; you can read about it in the description of #293. This should scale very well. (See also #92.)
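To make the columnarity point concrete, here is a minimal sketch of how a SQL layout could keep scalars and images apart so that a scalar query never reads image bytes. The table and column names (`tags`, `scalars`, `images`, `tag_id`) are hypothetical and chosen for illustration only; this is not the schema described in #293.

```python
import sqlite3

# Hypothetical schema for illustration; not TensorBoard's actual one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tags (
    tag_id INTEGER PRIMARY KEY,
    name   TEXT UNIQUE NOT NULL
);
CREATE TABLE scalars (
    tag_id INTEGER NOT NULL REFERENCES tags(tag_id),
    step   INTEGER NOT NULL,
    value  REAL NOT NULL
);
CREATE TABLE images (
    tag_id INTEGER NOT NULL REFERENCES tags(tag_id),
    step   INTEGER NOT NULL,
    png    BLOB NOT NULL
);
""")


def tag_id(name):
    """Return the id for a tag name, storing the string only once."""
    conn.execute("INSERT OR IGNORE INTO tags (name) VALUES (?)", (name,))
    row = conn.execute(
        "SELECT tag_id FROM tags WHERE name = ?", (name,)
    ).fetchone()
    return row[0]


# Write a few scalar points and one placeholder image blob.
loss = tag_id("loss")
conn.executemany(
    "INSERT INTO scalars VALUES (?, ?, ?)",
    [(loss, step, 1.0 / (step + 1)) for step in range(5)],
)
conn.execute(
    "INSERT INTO images VALUES (?, ?, ?)",
    (tag_id("input"), 0, b"\x89PNG placeholder bytes"),
)

# Reading scalars touches only the tags and scalars tables.
rows = conn.execute(
    "SELECT s.step, s.value FROM scalars AS s "
    "JOIN tags AS t ON t.tag_id = s.tag_id "
    "WHERE t.name = ? ORDER BY s.step",
    ("loss",),
).fetchall()
```

Each tag string is stored once in `tags` and referenced by integer id elsewhere, which is the kind of saving a normalized layout offers over repeating the tag in every event.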

jart commented 6 years ago

Thank you for the feedback @lespeholt. I'm actively working on a SQL database that does all the things you mentioned. Most of the space saving is probably going to come from reservoir sampling. While a normalized SQL format is able to save on things like tag strings, it does introduce new types of storage overhead that the proto event logs don't have.
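For readers unfamiliar with reservoir sampling, the sketch below shows the textbook version (Algorithm R): it keeps a uniform random sample of at most `k` items from a stream of unknown length, so storage stays bounded no matter how many events are written. The function name `reservoir_sample` and its parameters are assumptions made for this example; the sampling that TensorBoard actually performs may differ in its details.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of at most k items from an iterable.

    Textbook Algorithm R, shown for illustration only.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both endpoints inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


# 100,000 scalar steps reduced to a bounded sample of 1,000.
sample = reservoir_sample(range(100_000), k=1000)
```

Memory use depends only on `k`, not on the length of the stream, which is why sampling of this kind can save far more space than compressing every event.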