tensorflow / tensorboard

TensorFlow's Visualization Toolkit

Too many open files under RustBoard (EMFILE) #4955

Open · wchargin opened this issue 3 years ago

wchargin commented 3 years ago

I am getting a lot of warnings about too many open files -- is there a way to reduce or cap the number of open file descriptors?

[2021-05-11T14:31:46Z WARN rustboard_core::run] Failed to open event file EventFileBuf("[RUN NAME]"): Os { code: 24, kind: Other, message: "Too many open files" }

I don't have that many runs (~2000), so this shouldn't really be an issue, yet using lsof to count the number of open FDs shows over 12k in use:

>> lsof | awk '{print $1}' | sort | uniq -c | sort -r -n | head
   6210 tokio-run
   6210 Reloader-
   1035 StdinWatc
   1035 server
   1035 Reloader
    184 gmain
    168 gdbus
    134 grpc_glob
     85 bash
     80 snapd

Compared to <500 in "slow" mode.

>> lsof | awk '{print $1}' | sort | uniq -c | sort -r -n | head
    427 tensorboa
    184 gmain
    168 gdbus
     85 bash
     80 snapd
     72 systemd
     71 screen
     52 dconf\x20
     51 dbus-daem
     48 llvmpipe-

In my case, the "slow" mode actually loads files faster since it doesn't run into this issue.

Originally posted by @Raphtor in https://github.com/tensorflow/tensorboard/issues/4784#issuecomment-838599948
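For context, the "slow" mode above is the pre-RustBoard Python event loader; it can be selected explicitly with TensorBoard's --load_fast flag (the logdir path below is illustrative):

tensorboard --logdir experiment --load_fast=false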

wchargin commented 3 years ago

Hi @Raphtor! Thanks for the report and the helpful info. Some questions:

Raphtor commented 3 years ago

Thanks for your quick response! Here are the things you requested.

diagnose_tensorboard.py output

Diagnostics

Diagnostics output:

``````
--- check: autoidentify
INFO: diagnose_tensorboard.py version e43767ef2b648d0d5d57c00f38ccbd38390e38da

--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=6, micro=9, releaselevel='final', serial=0)
INFO: os.name: posix
INFO: os.uname(): posix.uname_result(sysname='Linux', nodename='nai-testing-2', release='4.15.0-143-generic', version='#147-Ubuntu SMP Wed Apr 14 16:10:11 UTC 2021', machine='x86_64')
INFO: sys.getwindowsversion(): N/A

--- check: package_management
INFO: has conda-meta: False
INFO: $VIRTUAL_ENV: '[VIRTUAL ENV]'

--- check: installed_packages
INFO: installed: tb-nightly==2.6.0a20210510
WARNING: no installation among: ['tensorflow', 'tensorflow-gpu', 'tf-nightly', 'tf-nightly-2.0-preview', 'tf-nightly-gpu', 'tf-nightly-gpu-2.0-preview']
WARNING: no installation among: ['tensorflow-estimator', 'tensorflow-estimator-2.0-preview', 'tf-estimator-nightly']
INFO: installed: tensorboard-data-server==0.6.1

--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '2.6.0a20210510'

--- check: tensorflow_python_version
Traceback (most recent call last):
  File "diagnose_tensorboard.py", line 522, in main
    suggestions.extend(check())
  File "diagnose_tensorboard.py", line 75, in wrapper
    result = fn()
  File "diagnose_tensorboard.py", line 278, in tensorflow_python_version
    import tensorflow as tf
ModuleNotFoundError: No module named 'tensorflow'

--- check: tensorboard_data_server_version
INFO: data server binary: '[VIRTUAL ENV]/lib/python3.6/site-packages/tensorboard_data_server/bin/server'
Traceback (most recent call last):
  File "diagnose_tensorboard.py", line 522, in main
    suggestions.extend(check())
  File "diagnose_tensorboard.py", line 75, in wrapper
    result = fn()
  File "diagnose_tensorboard.py", line 301, in tensorboard_data_server_version
    check=True,
  File "/usr/lib/python3.6/subprocess.py", line 423, in run
    with Popen(*popenargs, **kwargs) as process:
TypeError: __init__() got an unexpected keyword argument 'capture_output'

--- check: tensorboard_binary_path
INFO: which tensorboard: b'[VIRTUAL ENV]/bin/tensorboard\n'

--- check: addrinfos
socket.has_ipv6 = True
socket.AF_UNSPEC =
socket.SOCK_STREAM =
socket.AI_ADDRCONFIG =
socket.AI_PASSIVE =
Loopback flags:
Loopback infos: [(, , 6, '', ('::1', 0, 0, 0)), (, , 6, '', ('127.0.0.1', 0))]
Wildcard flags:
Wildcard infos: [(, , 6, '', ('0.0.0.0', 0)), (, , 6, '', ('::', 0, 0, 0))]

--- check: readable_fqdn
INFO: socket.getfqdn(): '[HOST]'

--- check: stat_tensorboardinfo
INFO: directory: /tmp/.tensorboard-info
INFO: os.stat(...): os.stat_result(st_mode=16895, st_ino=6431253, st_dev=64768, st_nlink=2, st_uid=1003, st_gid=1003, st_size=4096, st_atime=1620705993, st_mtime=1620752382, st_ctime=1620752382)
INFO: mode: 0o40777

--- check: source_trees_without_genfiles
INFO: tensorboard_roots (1): ['[VIRTUAL ENV]/lib/python3.6/site-packages']; bad_roots (0): []

--- check: full_pip_freeze
INFO: pip freeze --all:
absl-py==0.12.0 aiohttp==3.7.4.post0 aiohttp-cors==0.7.0 aioredis==1.3.1 async-timeout==3.0.1 attrs==21.2.0 blessings==1.7 cachetools==4.2.2 certifi==2020.12.5 chardet==4.0.0 click==7.1.2 colorama==0.4.4 contextvars==2.4 cycler==0.10.0 dataclasses==0.8 filelock==3.0.12 google-api-core==1.26.3 google-auth==1.30.0 google-auth-oauthlib==0.4.4 googleapis-common-protos==1.53.0 gpustat==0.6.0 grpcio==1.37.1 hiredis==2.0.0 idna==2.10 idna-ssl==1.1.0 immutables==0.15 importlib-metadata==4.0.1 joblib==1.0.1 jsonschema==3.2.0 kiwisolver==1.3.1 Markdown==3.3.4 matplotlib==3.1.0 msgpack==1.0.2
multidict==5.1.0 nengo==3.1.0 numpy==1.19.5 nvidia-ml-py3==7.352.0 oauthlib==3.1.0 opencensus==0.7.12 opencensus-context==0.1.2 packaging==20.9 pandas==1.1.5 pip==21.1.1 prometheus-client==0.10.1 protobuf==3.16.0 psutil==5.8.0 py-spy==0.3.6 pyasn1==0.4.8 pyasn1-modules==0.2.8 pyparsing==2.4.7 pyrsistent==0.17.3 python-dateutil==2.8.1 pytz==2021.1 PyYAML==5.4.1 ray==1.3.0 redis==3.5.3 requests==2.25.1 requests-oauthlib==1.3.0 rsa==4.7.2 scikit-learn==0.23.2 scipy==1.5.2 setuptools==56.0.0 six==1.16.0 tabulate==0.8.9 tb-nightly==2.6.0a20210510 tensorboard-data-server==0.6.1 tensorboard-plugin-wit==1.8.0 tensorboardX==2.2 threadpoolctl==2.1.0 torch==1.7.1 tqdm==4.60.0 typing-extensions==3.10.0.0 urllib3==1.26.4 Werkzeug==1.0.1 wheel==0.36.2 yarl==1.6.3 zipp==3.4.1
``````

Next steps

No action items identified. Please copy ALL of the above output, including the lines containing only backticks, into your GitHub issue or comment. Be sure to redact any sensitive information.

logdir structure

I am using ray[tune] to tune hyperparameters for training an RNN in PyTorch. It produces a directory structure roughly like this:

- experiment/
  - basic-variant-state-[datetime].json
  - experiment_state-[datetime].json
  - '[function name]_[hyperparameters set 1]_[datetime]'/
    - events.out.tfevents.1620750386.nai-testing-2
    - params.json
    - params.pkl
    - progress.csv
    - result.json
  - ...
  - '[function name]_[hyperparameters set 2000]_[datetime]'/

The files in the runs are each under 50 KB.

Using multifile mode

Seems not to reproduce the issue.
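For reference, multifile polling applies to the Python loader and is controlled by the --reload_multifile flag; the exact invocation used here isn't recorded, so the command below is an assumed example:

tensorboard --logdir experiment --load_fast=false --reload_multifile=true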

lsof version

I am using lsof 4.89 -- the whole command lists all FDs, then counts how many are attributed to each process name. However, I am no longer sure that it's accurately counting what we want, since running it just now with Ray active shows over 20k open FDs but no error...
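As a cross-check: lsof also reports non-FD rows (cwd, txt, mem) and can emit one row per thread, which inflates these counts. A plain per-process FD count can be read from procfs instead; the pgrep match below is approximate (the data server binary is just named "server"), so substitute the right PID if it's ambiguous:

ls "/proc/$(pgrep -f tensorboard_data_server | head -1)/fd" | wc -l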

wchargin commented 3 years ago

An easy workaround here would be to just increase the fd cap to the hard limit. On my system, ulimit -n -S (the soft limit) is 1024, but ulimit -n -H (the hard limit) is 1048576. I don’t think that raising this should have any adverse effects: since we never touch fds directly, any such bugs would have to be in our dependencies, which are all fairly widely used.
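For reference, the two limits quoted above can be checked from a shell:

ulimit -n -S
ulimit -n -H

An unprivileged process may raise its own soft limit up to the hard limit, which is what the sketch below does via setrlimit(2).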

At a glance, pulling in the rlimit crate and adding something like

fn increase_fd_limit() -> std::io::Result<()> {
    #[cfg(unix)]
    {
        use rlimit::Resource;
        // Raise the soft NOFILE limit to the hard limit, since each open
        // event file holds a descriptor and the default soft limit is often
        // only 1024.
        let (old_soft_limit, hard_limit) = Resource::NOFILE.get()?;
        Resource::NOFILE.set(hard_limit, hard_limit)?;
        debug!(
            "Changed file descriptor limit from {} to {}",
            old_soft_limit, hard_limit
        );
    }
    #[cfg(not(unix))]
    {
        debug!("Non-Unix; leaving file descriptor limit alone");
    }
    Ok(())
}

// `debug!` and `warn!` here are the usual `log` crate macros.
fn try_increase_fd_limit() {
    if let Err(e) = increase_fd_limit() {
        warn!("Failed to increase file descriptor limit: {}", e);
    }
}

to cli.rs and calling it from main should do the trick.
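For illustration, a minimal call-site sketch (hypothetical; the real entry point in cli.rs does much more):

fn main() {
    // Raise the NOFILE soft limit before any event-file readers start,
    // since each open event file holds a descriptor for its lifetime.
    try_increase_fd_limit();
    // ... parse flags, spawn run loaders, and serve ...
}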

Tilps commented 2 years ago

I've also run into this problem after an upgrade, though I have tens of thousands of runs... Attempting to increase the file limit seemed to cause some weird behaviour where TensorBoard stopped being reachable remotely (but maybe I screwed something else up...). I've gone back to slow loading for now...

janacht commented 2 years ago

I've also encountered the problem and found that raising the "open files" limit by executing e.g.

ulimit -n 50000

solves the problem for me (without requiring superuser permissions).

AlexanderYaroshevichIAC commented 1 year ago

FYI, ulimit only applies to the current shell session.
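So it has to be run in the same shell (or wrapper script) that launches TensorBoard, e.g.:

ulimit -n 50000 && tensorboard --logdir experiment

(the logdir is illustrative). For a persistent limit, the usual routes are raising nofile in /etc/security/limits.conf or setting LimitNOFILE= for a systemd service, though the details vary by system.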