oweissbarth opened this issue 6 years ago
TensorBoard currently loads all runs into memory even if they aren't initially being displayed, so that when you select or deselect runs it can start showing that data in the UI immediately rather than having to go crawl through the files at that point.
I'm a bit surprised though that you are running into memory issues because TensorBoard should be keeping only a fixed-size sample of the loaded data in memory, not the full size of the original log directory. In the case of images, it should keep only 10 images per tag and per run: https://github.com/tensorflow/tensorboard/blob/1.6.0/tensorboard/backend/application.py#L59 However, if you currently have a large number of unique tag+run combinations, each with only a few images, that could make the sampling a lot less effective.
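For reference, here's a rough sketch of what that sampling knob looks like programmatically. `EventMultiplexer`, its `size_guidance` argument, and the `"images"` key come from TensorBoard 1.x internals and may differ in other versions; the paths and sizes are placeholders:

```python
from tensorboard.backend.event_processing import event_multiplexer

# Sketch against TensorBoard 1.x internals: load a logdir keeping at
# most 4 images per tag per run instead of the default of 10 from
# DEFAULT_SIZE_GUIDANCE.
multiplexer = event_multiplexer.EventMultiplexer(
    size_guidance={"images": 4})
multiplexer.AddRunsFromDirectory("/tmp/logdir")
multiplexer.Reload()          # single-threaded crawl over all event files
print(multiplexer.Runs())     # maps run name -> available tag types
```

Newer TensorBoard versions also expose a `--samples_per_plugin` flag (e.g. `--samples_per_plugin=images=4`) that adjusts the same sampling limits from the command line, if your version supports it.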
The other thing is that with 100GB of logs in the directory, TensorBoard will just take a very long time to load them (even if they fit in memory), since there's only a single thread and that's just a lot of data to process. I agree it'd be useful if it were smarter about prioritizing runs to load, but for now a workaround could be to just create a new log directory and add a symlink to it for each run you want to show. E.g. if you have a log directory with runs like `logdir/run1`, `logdir/run2`, ..., `logdir/run100` and you just want to show runs 1 and 100, you could create a new directory `logdir-links` and add within it symlinks to `logdir/run1` and `logdir/run100`.
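A minimal sketch of that symlink workaround, assuming a POSIX filesystem (the paths and run names are placeholders):

```python
import os

# Placeholder paths: adjust to your actual log directory layout.
src_logdir = "/data/logdir"
link_logdir = "/data/logdir-links"
runs_to_show = ["run1", "run100"]

os.makedirs(link_logdir, exist_ok=True)
for run in runs_to_show:
    target = os.path.join(src_logdir, run)
    link = os.path.join(link_logdir, run)
    if not os.path.exists(link):
        os.symlink(target, link)
```

Then point TensorBoard at the link directory (`--logdir /data/logdir-links`) so it only crawls the linked runs.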
Also, we're working on adding support for a SQLite DB backend which should hopefully open up possibilities for loading run data only when needed and avoiding so much memory consumption, but it will still be a little while before that's ready for general use.
Thank you @nfelt for the information. For the experimental SQLite DB backend that you mentioned, is there an example showing how to create and load the DB given large event files? I'm not sure whether the loader tool should be used or if a different mechanism is available. I noticed the histogram and scalar plugins have recently been updated to support db mode and would like to try it out.
@gweidner We don't really have an example yet, but it should be possible to populate a SQLite DB using either the loader.cc tool (built from TF source) or TF code that calls `tf.contrib.summary.create_db_writer()`. Fair warning: it is still quite experimental, the DB schema is subject to change, and as far as I know we haven't yet done much testing with large event files, but if you're interested please do try it out and let us know how it goes.
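For anyone who wants to experiment, a hedged sketch of the writer route under TF 1.x eager execution; the database path, experiment name, and run name are placeholders, and this contrib API was experimental and may have changed since:

```python
import tensorflow as tf

tf.enable_eager_execution()

# Experimental TF 1.x contrib API; the DB schema is subject to change
# as noted above.
writer = tf.contrib.summary.create_db_writer(
    "/tmp/tensorboard.sqlite",        # SQLite database file to populate
    experiment_name="my_experiment",  # placeholder names
    run_name="run1")

with writer.as_default(), tf.contrib.summary.always_record_summaries():
    for step in range(100):
        # Scalars go straight into the DB instead of an event file.
        tf.contrib.summary.scalar("loss", 1.0 / (step + 1), step=step)
```

TensorBoard would then be started against the database rather than a logdir, e.g. with the experimental `--db sqlite:/tmp/tensorboard.sqlite` flag, if your build supports it.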
Any news?
+1. Is this still experimental?
Any updates on the ETA?
@jart from PRs I got a feeling that you're working on adding support for a SQLite DB backend. Is there a way to help? Does the tensorboard team share the progress / current tasks / etc somewhere publicly? 🤔
It seems like a very useful feature set. Is there any way we can help expedite this by contributing?
Hi folks, thanks for your continued interest and sorry there hasn't been much news. The SQLite DB backend work ran into difficulties and has been on hold for a while, but we're still very much aware that working with large logdirs is a pain point and we're working on more flexible modes of run selection and data management to address this.
Any updates?
I found a workaround in my case which might be useful for others: switch from Chrome/Chromium to Mozilla Firefox.
To add my data point: on a 2018 13" MacBook Pro, loading TensorBoard from a remote server with many large event logs (a few hundred MB in total), Chrome would hang for a while every time I clicked on the TensorBoard page. Then I followed this suggestion and switched to Safari -- woo! It is just amazing and so responsive!
@nfelt is there any progress on that issue?
Any updates on this issue? SQLite backend, anybody? MLflow integration, perhaps?
I am training a CNN and use TensorBoard to visualize the training process and results. Since I create lots of image summaries during training, the event log files often reach about 7GB each. When I point TensorBoard at my runs directory, it seems to load all runs into memory even though none are activated in the UI. All log files in the runs directory total about 100GB, so loading everything into main memory (32GB on my system) doesn't work. Is there a way to only load log files once their runs are activated (on demand)? Am I missing something? Thank you in advance.