tensorflow / tensorboard

TensorFlow's Visualization Toolkit
Apache License 2.0

Dealing with many large event logs #1002

Open oweissbarth opened 6 years ago

oweissbarth commented 6 years ago

I am training a CNN and I use TensorBoard to visualize the training process and results. Because I create lots of image summaries while training, the event log files are often about 7GB each. When I point TensorBoard to my runs directory, it seems to load all runs into memory even though none are activated in the UI. The log files in the runs directory total about 100GB, so loading everything into main memory (32GB on my system) doesn't work. Is there a way to load log files only once the runs are activated (on demand)? Am I missing something? Thank you in advance.

nfelt commented 6 years ago

TensorBoard currently loads all runs into memory even if they aren't initially being displayed, so that when you select or deselect runs it can start showing that data in the UI immediately rather than having to then go crawl through the file.

I'm a bit surprised though that you are running into memory issues because TensorBoard should be keeping only a fixed-size sample of the loaded data in memory, not the full size of the original log directory. In the case of images, it should keep only 10 images per tag and per run: https://github.com/tensorflow/tensorboard/blob/1.6.0/tensorboard/backend/application.py#L59 However, if you currently have a large number of unique tag+run combinations, each with only a few images, that could make the sampling a lot less effective.
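To illustrate why many unique tag+run combinations weaken the sampling, here is a minimal sketch of a per-key fixed-size sample. It uses textbook reservoir sampling as a stand-in, and the class and function names (`PerKeyReservoir`, `kept_after_loading`) are hypothetical, not TensorBoard's actual implementation. The point is only that the cap applies per key, so the total kept grows with the number of keys.

```python
import random
from collections import defaultdict


class PerKeyReservoir:
    """Keeps at most `max_size` items per key (hypothetical illustration)."""

    def __init__(self, max_size, seed=0):
        self.max_size = max_size
        self._rng = random.Random(seed)
        self._items = defaultdict(list)  # key -> sampled items
        self._seen = defaultdict(int)    # key -> number of items offered

    def add(self, key, item):
        self._seen[key] += 1
        bucket = self._items[key]
        if len(bucket) < self.max_size:
            # Still below the cap for this key: keep everything.
            bucket.append(item)
        else:
            # Standard reservoir step: the new item replaces a stored one
            # with probability max_size / seen.
            j = self._rng.randrange(self._seen[key])
            if j < self.max_size:
                bucket[j] = item

    def total_kept(self):
        return sum(len(bucket) for bucket in self._items.values())


def kept_after_loading(num_keys, images_per_key, max_size=10):
    """Counts items held after loading `images_per_key` images for each key."""
    reservoir = PerKeyReservoir(max_size)
    for k in range(num_keys):
        for i in range(images_per_key):
            # The key stands in for a (run, tag) pair.
            reservoir.add(("run%d" % k, "images"), i)
    return reservoir.total_kept()
```

With a cap of 10, one key holding 1000 images keeps only 10 of them, while 200 keys holding 5 images each keep all 1000, because no single key ever reaches the cap.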

The other thing is that with 100GB of logs in the directory, TensorBoard will just take a very long time to load them (even if they fit in memory) since there's only a single thread and that's just a lot of data to process. I agree it'd be useful if it were smarter about prioritizing runs to load, but for now a workaround could be to just create a new log directory and add a symlink to it for each run you want to show. E.g. if you have a log directory with runs like logdir/run1, logdir/run2, ... , logdir/run100 and you just want to show runs 1 and 100, you could create a new directory logdir-links and add within it symlinks to logdir/run1 and logdir/run100.
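The symlink workaround above can be sketched as a small script. This is a minimal sketch, assuming each run lives in its own subdirectory of the log directory; the helper name `link_selected_runs` is hypothetical, and the directory names follow the example in the comment.

```python
import os


def link_selected_runs(logdir, links_dir, run_names):
    """Creates `links_dir` holding symlinks to the chosen runs in `logdir`.

    Hypothetical helper: returns the list of symlink paths it ensured exist.
    """
    os.makedirs(links_dir, exist_ok=True)
    links = []
    for name in run_names:
        target = os.path.abspath(os.path.join(logdir, name))
        if not os.path.isdir(target):
            raise FileNotFoundError("no such run directory: " + target)
        link = os.path.join(links_dir, name)
        # Skip links that already exist so the script can be re-run safely.
        if not os.path.islink(link):
            os.symlink(target, link, target_is_directory=True)
        links.append(link)
    return links


if __name__ == "__main__":
    # Show only runs 1 and 100, as in the example above.
    link_selected_runs("logdir", "logdir-links", ["run1", "run100"])
```

After running it, point TensorBoard at the new directory with `tensorboard --logdir logdir-links`. The original event files are not copied or moved, so removing a link just hides that run.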

Also, we're working on adding support for a SQLite DB backend, which should hopefully open up possibilities for loading run data only when needed and avoiding so much memory consumption, but it will still be a little while before that's ready for general use.

gweidner commented 6 years ago

Thank you @nfelt for the information. For the experimental SQLite DB backend that you mentioned, is there an example showing how to create and load the DB given large event files? I'm not sure whether the loader tool should be used or whether a different mechanism is available. I noticed the histogram and scalar plugins have recently been updated to support DB mode, and I would like to try it out.

nfelt commented 6 years ago

@gweidner We don't really have an example yet, but it should be possible to populate a SQLite DB either by using the loader.cc tool (built from TF source) or by writing TF code that uses tf.contrib.summary.create_db_writer(). Fair warning: it is still quite experimental, with the DB schema subject to change, and as far as I know we haven't yet done much testing with large event files. If you're interested, please do try it out and let us know how it goes.

bhack commented 6 years ago

Any news?

amj commented 6 years ago

+1. Is this still experimental?

zishanahmed08 commented 5 years ago

Any Updates on the ETA?

dimart commented 5 years ago

@jart from PRs I got a feeling that you're working on adding support for a SQLite DB backend. Is there a way to help? Does the tensorboard team share the progress / current tasks / etc somewhere publicly? 🤔

nav13n commented 5 years ago

It seems like a very useful feature set. Is there any way we can help expedite this with contributions?

nfelt commented 4 years ago

Hi folks, thanks for your continued interest and sorry there hasn't been much news. The SQLite DB backend work ran into difficulties and has been on hold for a while, but we're still very much aware that working with large logdirs is a pain point and we're working on more flexible modes of run selection and data management to address this.

Strateus commented 4 years ago

Any updates?

zishanahmed08 commented 4 years ago

I found a workaround in my case which might be useful for others: switch from Chrome/Chromium to Mozilla Firefox.

jayleicn commented 3 years ago

> I found a workaround in my case which might be useful for others: switch from Chrome/Chromium to Mozilla Firefox.

Adding my data point: on a 2018 13" MacBook Pro, loading TensorBoard from a remote server with many large event logs (a few hundred MB in total), Chrome hangs for a while every time I click on the TensorBoard page. I followed this suggestion and switched to Safari. Woo! It is just amazing and so responsive!

MaLiN2223 commented 3 years ago

@nfelt is there any progress on that issue?

LarsDu commented 1 year ago

Any updates on this issue? SQLite backend, anybody? MLflow integration, perhaps?