tensorflow / tensorboard

TensorFlow's Visualization Toolkit
Apache License 2.0

Large memory consumption 0.4 #766

Open plooney opened 6 years ago

plooney commented 6 years ago

I have just upgraded to TensorFlow 1.4 and TensorBoard 0.4. I had TensorBoard running for 20 hours, and it was consuming 10GB of memory. I shut it down and restarted it; its memory consumption is increasing steadily at ~10MB per second.

lucasb-eyer commented 6 years ago

I observe the same behaviour, especially when there's a lot of data, be it many small experiments or a few long-running ones. I've had 64GB memory systems start to swap after a while when opening 2-3 such tensorboards.

zaxliu commented 6 years ago

Same observation here.

weberxie commented 6 years ago

What's the progress of this issue now? @jart can you elaborate on the reason for this problem?

jart commented 6 years ago

Any chance you guys could post tensorboard --inspect --logdir mylogdir?

zaxliu commented 6 years ago

Hi @jart , here's the shell output of tensorboard --inspect --logdir mylogdir for one of my experiments. sample.txt

mattphillipskitware commented 6 years ago

Checking in, I'm also getting this in spades and I have to kill tensorboard at least once a day to keep it from grinding everything to a halt.

mattphillipskitware commented 6 years ago

tensorboard_log.txt

Here's one, this only got to about 2GB RAM before I shut it down. Other instances have gotten to 10GB as others have reported.

Sylvus commented 6 years ago

Same here (currently at 6GB). Is there a flag to disable loading the graph for example?

igorgad commented 6 years ago

Hi, I am also observing this behavior. Is this fixed in the 1.5 version?

mingdachen commented 6 years ago

Any updates? Or anybody found a workaround for this problem?

plooney commented 6 years ago

In tensorboard 1.5 the issue is still there; memory consumption is increasing steadily at ~10MB per second. Here is the output of

tensorboard --inspect --logdir mylogdir

sample.txt

jason-morgan commented 6 years ago

I am having this same issue. The model is a simple LSTM that uses a pre-trained 600k × 300 word embedding. I have 16 model versions, and Tensorboard quickly consumes all 64GB of memory on my machine. I am running Tensorboard 1.5. Here is the inspection log.

inspection.txt

Sylvus commented 6 years ago

What helped in my case was never saving the graph. Make sure you do not add the graph anywhere, and also pass graph=None to the FileWriter. Not a real solution, but maybe it helps.
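In TF 1.x code, this workaround might look like the sketch below. The `./logs` path is a placeholder, and the `tf.compat.v1` spelling is used so the snippet degrades gracefully on newer installs; treat it as a mitigation under those assumptions, not a fix.

```python
# Sketch of the graph=None workaround (assumes TF 1.x graph-mode code;
# "./logs" is a placeholder logdir). Passing graph=None keeps the GraphDef,
# often the largest record in an event file, from being written at all.
status = "written"
try:
    import tensorflow as tf

    # Also avoid calling writer.add_graph(...) anywhere else in your code.
    writer = tf.compat.v1.summary.FileWriter("./logs", graph=None)
    writer.close()
except ImportError:
    status = "tensorflow-not-installed"
except RuntimeError:
    # TF 2.x eager mode rejects the v1 FileWriter; the idea still applies
    # to TF 1.x / graph-mode code.
    status = "eager-mode"
print(status)
```

This only shrinks the event files going forward; existing logs with a saved graph are unaffected.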

po0ya commented 6 years ago

+1

cancan101 commented 6 years ago

any news on this?

jart commented 6 years ago

We're currently working on having a DB storage layer that puts information like the graphdef on disk rather than in memory. We'd be happy to accept a contribution that, for example, adds a flag to not load the GraphDef into memory, or perhaps saves a pointer to its file in memory to load it on demand, since the GraphDef is usually the very first thing inside an event log file.
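The "pointer to its file" idea could be sketched roughly as follows. `LazyBlob` and the plain byte-offset layout are illustrative stand-ins only; TensorBoard's real event files use TFRecord framing, which this sketch deliberately ignores.

```python
# Illustrative sketch only: keep a small (path, offset, length) pointer in
# memory and re-read the serialized graph from disk on demand, instead of
# holding the whole blob resident for the lifetime of the server.
class LazyBlob:
    def __init__(self, path, offset, length):
        self.path = path
        self.offset = offset
        self.length = length

    def load(self):
        """Read the blob back from disk only when a request needs it."""
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            return f.read(self.length)
```

Since the GraphDef is usually the first record in an event log, the offset would be known as soon as the file is first scanned.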

inoryy commented 6 years ago

Unfortunately, passing graph=None to the FileWriter didn't solve the issue; memory still runs out quite quickly, even with just a few models.

sharvil commented 6 years ago

I'm also experiencing this issue with TensorBoard 1.9. Evicting the GraphDef from memory might be an okay short-term solution, but the GraphDef is a fixed size, so it should only save a constant amount of memory. The problem for me is memory growth over time.

@jart is someone actively looking into this issue? It's fine if the answer is no, just want to understand where things are. Also, is there any additional information the community can provide to help diagnose what's going on?

rom1504 commented 6 years ago

I'm having the same thing from tensorboard 1.10

mzhaoshuai commented 5 years ago

I also see the same thing with tensorboard 1.12. Tensorboard occupies more and more memory as time goes by.
I run it on a server, and it eventually occupied up to 60GB of memory...

As a stopgap, I use the following script:

# restart interval, in hours
sleep_t=6
times=0

while true
do
    tensorboard --logdir="${logdir}" --port="${port}" &
    last_pid=$!
    sleep "${sleep_t}h"
    kill -9 "${last_pid}"
    times=$((times + 1))
    echo "Restarted tensorboard ${times} times."
done

Kill and restart tensorboard periodically...
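A slightly gentler variant of this stopgap restarts TensorBoard based on its actual memory use rather than a fixed timer. This is a Linux-only sketch (it reads /proc); the command line and the 4GB threshold are examples, not recommendations.

```python
import os
import subprocess
import time

def rss_bytes(pid):
    """Resident set size of `pid` in bytes, from /proc/<pid>/statm."""
    with open("/proc/%d/statm" % pid) as f:
        resident_pages = int(f.read().split()[1])  # second field: resident pages
    return resident_pages * os.sysconf("SC_PAGE_SIZE")

def watchdog(cmd, limit_bytes, poll_seconds=60):
    """Run `cmd`, restarting it whenever its RSS exceeds `limit_bytes`."""
    while True:
        proc = subprocess.Popen(cmd)
        while proc.poll() is None:
            time.sleep(poll_seconds)
            if rss_bytes(proc.pid) > limit_bytes:
                proc.terminate()  # SIGTERM first; gentler than kill -9
                proc.wait()
                break

# Example (not run here; it would loop forever):
# watchdog(["tensorboard", "--logdir", "logs"], limit_bytes=4 * 2**30)
```

Using terminate() instead of kill -9 gives the server a chance to shut down cleanly before the next instance starts.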

Z-Zheng commented 5 years ago

I also see the same thing with tensorboard 1.12. Tensorboard occupies more and more memory as time goes by. I run it on a server, and it eventually occupied up to 60GB of memory...

I am also hitting this problem, with 70+GB used :(

rex-yue-wu commented 5 years ago

Guess what? I encountered the same issue; the only difference here is that I ran tensorboard on a server with 512GB of memory, and yeah, tensorboard ate all of it!!!

rom1504 commented 5 years ago

Yeah, I'm confused why nobody cares about this issue. A memory leak of this magnitude makes the tool basically useless.

pietroastolfi commented 5 years ago

Same problem (I'm using TB 1.12.2). Any news?

GalOshri commented 5 years ago

Does anyone have a set of logs they can share that reliably reproduces this issue?

I know a few people have tried reproducing this with several examples but did not notice steadily increasing memory. A repro would be really helpful in starting to investigate this!

nfelt commented 5 years ago

It would also be especially helpful if this can be reproduced using the latest stable TensorBoard version (1.13.0) or our nightly releases (tb-nightly), and in an environment that we have access to - e.g. a docker container or a standard GCP VM instance.

TylerADavis commented 5 years ago

What type of memory consumption should be considered "normal" for tensorboard?

I've got an instance of tensorboard 1.13.1 that's been running for about a month now, pointed at a logdir that holds info from 127 different runs (the logdir is 3.9 GB on disk), and it's currently consuming 21.8 GB of memory (going by the res column in htop).

Memory usage does appear to be increasing even when I am not actively doing any runs, at a rate of about 1 MB a minute.

TylerADavis commented 5 years ago

As an update: over the last week tensorboard's memory utilization has grown from 21.8 GB to 27 GB, despite my not doing any additional runs.

ismael-elatifi commented 5 years ago

It seems Tensorboard (at least versions below 1.14) always loads all logs into RAM, even if they are deselected in the UI, so the more runs/logs we have, the higher the memory consumption. It forces us to clean our log folder once in a while to reduce the memory taken by Tensorboard.

A good improvement would be to load into memory only the logs for runs selected in the Tensorboard UI. That way, memory consumption would stay constant if we select only the last few runs, even with thousands of runs in the Tensorboard folder. A warning could be raised if one tries to load too many runs (if memory consumption reaches a limit specified in the TB configuration). Another improvement would be to release memory when the TB server has not received any ping from the TB client for a long time, so an idle server would not eat up all the memory.

Apart from that, there could also be a real memory leak...
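Until something like selective loading exists, one blunt workaround is to prune old event files so the logdir the server scans stays bounded. A sketch, where the path and the 30-day cutoff are placeholders to adjust:

```shell
# Delete TensorBoard event files older than 30 days. LOGDIR and the cutoff
# are examples; check what "-print" lists before trusting "-delete".
LOGDIR="./logs"
mkdir -p "$LOGDIR"   # only so the sketch runs standalone
find "$LOGDIR" -type f -name 'events.out.tfevents.*' -mtime +30 -print -delete
```

Run it from cron (or by hand) and restart TensorBoard afterwards, since already-loaded runs stay in memory until the process exits.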

paulguerrero commented 5 years ago

I have the same issue: Tensorboard 1.14 was at 24 GB after running for about a day. This can't be due only to Tensorboard loading all logs into memory, since the total size of my logs on disk is between 1 and 2 GB. After restarting and waiting for all data to be shown in the UI, Tensorboard now uses 400 MB (although it will probably grow again over time).

paulguerrero commented 5 years ago

Here is a set of logs that reliably causes Tensorboard's memory to increase over time for me (after around 1-2 days it uses several 10s of GBs): https://drive.google.com/file/d/1h16uu2GsW5qFNLzuqIu1HV7rkNJEUosR/view?usp=sharing

The log directory contains a lot of non-Tensorboard files as well, maybe this is what causes the issue? I usually leave a browser tab with the Tensorboard client open in the background as well. I run Tensorboard 1.14 on Ubuntu 16.04 in a docker container. The log directory is in a volume mounted in the container. Let me know if you could use any other information.

rom1504 commented 5 years ago

I don't think any of the contributors care about this issue. Nobody has investigated it although it has been open for years, so don't expect much to happen (unless someone wants to pay for support or something).

wchargin commented 5 years ago

I don't think any of the contributors care about this issue

@rom1504: We care. I ran some profiling last week over a couple of days, and will continue investigating as time permits:

https://github.com/wchargin/tensorboard-memory-profiling

sharvil commented 5 years ago

This issue ground my machine to a halt today. Tensorboard chewed through ~50GB of RAM. @wchargin I appreciate you looking into this issue. Is it possible to raise its priority to P1 so you or someone on the team can carve out the necessary time to chase it down?

ntasfi commented 5 years ago

Seeing the same issue. Leaving it running, it consumes 56 of the 64GB of RAM on my machine.

sharvil commented 5 years ago

Ran into this issue again today. Tensorboard is basically unusable for me due to this bug.

perone commented 4 years ago

Same here. I'm using tensorboard 2.0.0, and when I open two log directories of 6MB each, it uses 10GB of memory.

justasz commented 4 years ago

Using tensorboard 2.0.0. I have 900 different logs, which take about 200MB of hard disk space. When I start tensorboard, RAM consumption increases by 30GB.

BostonLobster commented 4 years ago

I encountered the same issue. My remote server has 128GB of memory, and Tensorboard ate 35% of it, causing other programs to halt.

sharvil commented 4 years ago

Happy 2nd birthday #766! 🎂🥳🎉

Look at how big you are now! You've eaten all the RAM we've given you like a good little bug, and you've grown stronger for it. Best wishes, and see you again in 2020.

Lots of love, Tensorflow community

perone commented 4 years ago

The problem still remains in the latest Tensorboard v2.1.0. I have ~10MB of log files and Tensorboard is allocating 13GB of RAM.

nfelt commented 4 years ago

Hi folks - we're trying to get to the bottom of this, and we're sorry it's been such a longstanding problem.

For those of you on the thread who have experienced this, it would really help if you can comment with the following information:

OscarVanL commented 4 years ago

Hi, I was about to open a new issue for this but found you're already working on it. In my case, Tensorboard used 12GB of RAM and 20% of my CPU resources. I'll provide the details you asked for.

  1. tensorboard==2.0.2

  2. tensorflow-gpu==2.0.0, tensorflow-estimator==2.0.1

  3. 3.7.5 (default, Oct 31 2019, 15:18:51) [MSC v.1916 64 bit (AMD64)]

  4. Windows 10 Education (same as Enterprise) Version 1909

  5. pip install within my conda environment

  6. It's static, I performed the tuning on a different machine, then copied my logdir hparam-tuning to my machine, then opened it with tensorboard --logdir C:\Users\Oscar\PycharmProjects\________\hparam-tuning on my own PC to view the results. I have attached the logdir hparam-tuning.zip It is 255 iterations big, 26.7MB unzipped.

    • Size at startup: Immediately blows up to 12GB usage within 10 seconds of starting up the program. Stops expanding after ~30 seconds, but RAM usage is sustained high and using 20% CPU continually.
    • Rate of increase: Haven't been running it for longer than 5 minutes, I can't see how it could grow much more though... lol
    • Task Manager in Windows
  7. I do have the tab open, the auto-refresh behaviour is every 30s.

Additional:

Diagnostics output:

--- check: autoidentify
INFO: diagnose_tensorboard.py version d515ab103e2b1cfcea2b096187741a0eeb8822ef

--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=7, micro=5, releaselevel='final', serial=0)
INFO: os.name: nt
INFO: os.uname(): N/A
INFO: sys.getwindowsversion(): sys.getwindowsversion(major=10, minor=0, build=18363, platform=2, service_pack='')

--- check: package_management
INFO: has conda-meta: True
INFO: $VIRTUAL_ENV: None

--- check: installed_packages
WARNING: Could not generate requirement for distribution -ensorflow-gpu 2.0.0 (c:\users\oscar\anaconda3\envs\_________\lib\site-packages): Parse error at "'-ensorfl'": Expected W:(abcd...)
INFO: installed: tensorboard==2.0.2
INFO: installed: tensorflow-gpu==2.0.0
INFO: installed: tensorflow-estimator==2.0.1

--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '2.0.2'

--- check: tensorflow_python_version
2019-12-20 09:58:34.346839: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
INFO: tensorflow.__version__: '2.0.0'
INFO: tensorflow.__git_version__: 'v2.0.0-rc2-26-g64c3d382ca'

--- check: tensorboard_binary_path
INFO: Could not find files for the given pattern(s).
INFO: which tensorboard: None

--- check: addrinfos
socket.has_ipv6 = True
socket.AF_UNSPEC = socket.SOCK_STREAM = socket.AI_ADDRCONFIG = socket.AI_PASSIVE =
Loopback flags:
Loopback infos: [(, , 0, '', ('::1', 0, 0, 0)), (, , 0, '', ('127.0.0.1', 0))]
Wildcard flags:
Wildcard infos: [(, , 0, '', ('::', 0, 0, 0)), (, , 0, '', ('0.0.0.0', 0))]

--- check: readable_fqdn
INFO: socket.getfqdn(): 'Oscar-XPS-Laptop.lan'

--- check: stat_tensorboardinfo
INFO: directory: C:\Users\Oscar\AppData\Local\Temp\.tensorboard-info
INFO: os.stat(...): os.stat_result(st_mode=16895, st_ino=9570149209514493, st_dev=2585985196, st_nlink=1, st_uid=0, st_gid=0, st_size=0, st_atime=1576835418, st_mtime=1576835418, st_ctime=1576764046)
INFO: mode: 0o40777

--- check: source_trees_without_genfiles
INFO: tensorboard_roots (1): ['C:\\Users\\Oscar\\Anaconda3\\envs\\____________\\lib\\site-packages']; bad_roots (0): []

--- check: full_pip_freeze
WARNING: Could not generate requirement for distribution -ensorflow-gpu 2.0.0 (c:\users\oscar\anaconda3\envs\________________\lib\site-packages): Parse error at "'-ensorfl'": Expected W:(abcd...)
INFO: pip freeze --all: absl-py==0.8.1 astor==0.8.1 attrs==19.3.0 backcall==0.1.0 bleach==3.1.0 cachetools==3.1.1 certifi==2019.11.28 chardet==3.0.4 colorama==0.4.1 cycler==0.10.0 decorator==4.4.1 defusedxml==0.6.0 entrypoints==0.3 gast==0.2.2 google-auth==1.8.2 google-auth-oauthlib==0.4.1 google-pasta==0.1.8 grpcio==1.25.0 h5py==2.10.0 idna==2.8 importlib-metadata==1.2.0 ipykernel==5.1.3 ipython==7.10.1 ipython-genutils==0.2.0 ipywidgets==7.5.1 jedi==0.15.1 Jinja2==2.10.3 jsonschema==3.2.0 jupyter==1.0.0 jupyter-client==5.3.4 jupyter-console==5.2.0 jupyter-core==4.6.1 Keras-Applications==1.0.8 Keras-Preprocessing==1.1.0 kiwisolver==1.1.0 Markdown==3.1.1 MarkupSafe==1.1.1 matplotlib==3.1.2 mistune==0.8.4 more-itertools==7.2.0 nbconvert==5.6.1 nbformat==4.4.0 notebook==6.0.2 numpy==1.17.4 oauthlib==3.1.0 opt-einsum==3.1.0 pandas==0.25.3 pandocfilters==1.4.2 parso==0.5.1 pickleshare==0.7.5 pip==19.3.1 prometheus-client==0.7.1 prompt-toolkit==3.0.2 protobuf==3.11.1 pyasn1==0.4.8 pyasn1-modules==0.2.7 Pygments==2.5.2 pyparsing==2.4.5 pyrsistent==0.15.6 python-dateutil==2.8.1 pytz==2019.3 pywin32==223 pywinpty==0.5.5 pyzmq==18.1.0 qtconsole==4.6.0 requests==2.22.0 requests-oauthlib==1.3.0 rsa==4.0 Send2Trash==1.5.0 setuptools==42.0.2.post20191203 six==1.13.0 tensorboard==2.0.2 tensorflow-estimator==2.0.1 tensorflow-gpu==2.0.0 termcolor==1.1.0 terminado==0.8.3 testpath==0.4.4 tornado==6.0.3 traitlets==4.3.3 urllib3==1.25.7 wcwidth==0.1.7 webencodings==0.5.1 Werkzeug==0.16.0 wheel==0.33.6 widgetsnbextension==3.5.1 wincertstore==0.2 wrapt==1.11.2 zipp==0.6.0

Next steps

No action items identified. Please copy ALL of the above output, including the lines containing only backticks, into your GitHub issue or comment. Be sure to redact any sensitive information.

bileschi commented 4 years ago

Assigning this to @nfelt who is actively looking into this. Please reassign or unassign as appropriate.

nfelt commented 4 years ago

Quick update everyone - we think we've narrowed this down to a memory leak in tf.io.gfile.isdir() which we've reported in TensorFlow as https://github.com/tensorflow/tensorflow/issues/35292.

In terms of a fix, it appears that by pure coincidence a change landed in TensorFlow yesterday that replaces the leaking code, so in our testing we're seeing at least a much lower rate of memory leakage when running TensorBoard against today's tf-nightly==2.1.0.dev20191220.

If you're still seeing the issue, please try running TensorBoard in an environment with that version of TensorFlow (the actual version of TensorFlow you use for generating the log data should not affect this) and let us know if it seems to resolve the issue or not.

We will see what we can do to work around the issue so that we can get a fix to you sooner than the next TF release that would include yesterday's change (2.2). If possible, we'll fix this on the TB side so that those who can't easily update TF to the most recent version still have access to a fix.

zaxliu commented 4 years ago

@nfelt hi, this is good news, thanks. Curious though: are you planning an independent TensorBoard build with this issue fixed?

adizhol commented 4 years ago

Hi all,

I'm running tensorboard without tensorflow, and I no longer experience the huge memory consumption.

perone commented 4 years ago

I tried the tf-nightly==2.1.0.dev20191220 version, but without success; the same problem remains.

I noticed that if I add a lot of files inside of the logdir folder, TensorBoard throws an exception:

TensorBoard 2.2.0a20200106 at http://anonymized:9090/ (Press CTRL+C to quit)
Exception in thread Reloader:
Traceback (most recent call last):
  File "/env/lib/python3.7/threading.py", line 917, in _bootstrap_inner
  File "/env/lib/python3.7/threading.py", line 865, in run
  File "/env/lib/python3.7/site-packages/tensorboard/backend/application.py", line 660, in _reload
  File "/env/lib/python3.7/site-packages/tensorboard/backend/event_processing/plugin_event_multiplexer.py", line 202, in AddRunsFromDirectory
  File "/env/lib/python3.7/site-packages/tensorboard/backend/event_processing/io_wrapper.py", line 213, in <genexpr>
  File "/env/lib/python3.7/site-packages/tensorboard/backend/event_processing/io_wrapper.py", line 164, in ListRecursivelyViaWalking
  File "/env/lib/python3.7/site-packages/tensorflow_core/python/lib/io/file_io.py", line 676, in walk_v2
  File "/env/lib/python3.7/site-packages/tensorflow_core/python/lib/io/file_io.py", line 606, in list_directory
  File "/env/lib/python3.7/site-packages/tensorflow_core/python/lib/io/file_io.py", line 635, in list_directory_v2
tensorflow.python.framework.errors_impl.ResourceExhaustedError: ./; Too many open files

The memory issue also happens without a lot of small files inside the logdir, but since this recursive process opens a lot of files, it might be one of the root causes of the quick memory growth that happens upon startup (as related to tf.io.gfile.isdir()). If the fix really is in tf-nightly==2.1.0.dev20191220, then there might be another leak hidden somewhere in these directory/file-handling routines.
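For the ResourceExhaustedError specifically, one thing worth trying (an assumption on my part, not a confirmed fix for the leak) is raising the per-process file-descriptor limit before launching TensorBoard:

```shell
# Raise the soft fd limit in the current shell; 4096 is just an example,
# and this may fail if the hard limit is lower.
ulimit -n 4096 || echo "could not raise limit (hard cap too low?)"
echo "current soft fd limit: $(ulimit -n)"
```

This only postpones the "Too many open files" error; it does nothing about the memory growth itself.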

perone commented 4 years ago

Just to add another comment, if I run:

pip uninstall tf-nightly

as suggested by @adizhol, TensorBoard works fine and takes only 310MB of resident memory, which really seems to solve the issue. So this appears to be caused by TensorFlow code. It then gives the warning:

TensorFlow installation not found - running with reduced feature set

Which seems to limit the available features on TensorBoard.

perone commented 4 years ago

Just adding more info, I think I found the culprit.

If you just use (on tensorboard/compat/__init__.py):

from tensorboard.compat.tensorflow_stub import pywrap_tensorflow

This forces it to use the pywrap_tensorflow stub from TensorBoard itself, and the memory issue just disappears. However, if you let it import tensorflow.python.pywrap_tensorflow, which is a SWIG extension, the memory leak returns. That explains why removing TensorFlow solves the issue. It seems that some method in TensorFlow's pywrap_tensorflow is leaking a lot of memory.

This change reduces the memory usage from 16GB to around 500MB.