wchargin opened 3 years ago
Hi @Raphtor! Thanks for the report and the helpful info. Some questions:
Could you run diagnose_tensorboard.py in the same environment from which you usually run TensorBoard and post the full output in a comment (sanitizing as desired if you want to redact anything)?
Are you able to share the log directory with us? If not, could you describe the structure of the event files? You say that you only have ~2000 runs, but I wonder if each run tends to have many event files (can happen if your training workers restart a lot). If so, it’s possible that that explains the difference, since the particulars around how we handle multiple event files in the same directory differ somewhat.
Broadly, there are three potential behaviors. In all cases, we read all event files in lexicographical order. When we hit EOF on an event file, we keep polling it iff…

TensorBoard with --load_fast=false uses last-file mode by default (and can also be told to use multifile mode), but with --load_fast=true it uses all-files mode.
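For intuition, a rough sketch of that per-file "keep polling?" decision might look like the following (illustrative only: `ReloadMode` and `keep_polling` are made-up names, not the actual rustboard types):

```rust
/// Illustrative only; not the real rustboard types.
enum ReloadMode {
    /// Keep polling only the lexicographically last event file in each run.
    LastFile,
    /// Keep polling a file until it has been idle longer than a threshold;
    /// a negative threshold (--reload_multifile_inactive_secs=-1) means
    /// "never stop".
    Multifile { inactive_secs: i64 },
    /// Keep polling every event file forever.
    AllFiles,
}

/// Decide whether to keep polling one event file after hitting EOF.
fn keep_polling(mode: &ReloadMode, is_last_file: bool, idle_secs: i64) -> bool {
    match mode {
        ReloadMode::LastFile => is_last_file,
        ReloadMode::Multifile { inactive_secs } => {
            *inactive_secs < 0 || idle_secs <= *inactive_secs
        }
        ReloadMode::AllFiles => true,
    }
}
```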
Can you also reproduce the issue when running TensorBoard with --load_fast=false --reload_multifile=true --reload_multifile_inactive_secs=-1? Same train of thought as above; this enables multifile mode with an unbounded age threshold, making it equivalent to all-files mode.

If this reproduces the issue, we can probably fix this by making --load_fast=true also implement last-file and/or multifile modes, which would be nice anyway.
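For reference, the full invocation I have in mind is something like this (the logdir path is just a placeholder):

```sh
# Placeholder logdir; point this at your actual experiment directory.
tensorboard \
  --logdir ~/experiment \
  --load_fast=false \
  --reload_multifile=true \
  --reload_multifile_inactive_secs=-1
```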
What lsof do you have? My lsof (4.93.2, Linux) uses the first column for the command name, but (e.g.) tensorboard and bash are process names whereas Reloader and StdinWatcher are thread names. So my lsof output has lines like:
```
COMMAND      PID     USER  FD  TYPE DEVICE SIZE/OFF     NODE NAME
server    692802 wchargin  11r  REG  254,1 11096888 15361542 /HOMEDIR/tensorboard_data/mnist/lr_1E-03,conv=2,fc=2/events.out.tfevents.1563406405.HOSTNAME
```
…and I don’t see how your lsof | awk '{ print $1 }' is giving the output that you’re seeing. Probably just a reporting thing, but I’d like to be able to reproduce your interaction if possible.
Thanks for your quick response! Here are the things you requested.
diagnose_tensorboard.py output:

No action items identified. Please copy ALL of the above output, including the lines containing only backticks, into your GitHub issue or comment. Be sure to redact any sensitive information.
I am using ray[tune] to tune hyperparameters for the training of an RNN in PyTorch. It produces a structure sort of like:
- experiment/
  - basic-variant-state-[datetime].json
  - experiment_state-[datetime].json
  - '[function name]_[hyperparameters set 1]_[datetime]'/
    - events.out.tfevents.1620750386.nai-testing-2
    - params.json
    - params.pkl
    - progress.csv
    - result.json
  - ....
  - '[function name]_[hyperparameters set 2000]_[datetime]'/
The files in the runs are each <50 KB.
Running with --load_fast=false --reload_multifile=true --reload_multifile_inactive_secs=-1 seems not to reproduce the issue.
I am using lsof 4.89 -- the whole command lists all FDs, then sorts and counts the unique FDs for each process name. However, I am no longer sure that it's accurately counting what we want, since running it just now with Ray active shows over 20K open FDs but no error...
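Roughly, the pipeline was something like the following (reconstructed; the exact tail of the pipeline may have differed):

```sh
# Count open FDs grouped by the first lsof column (command/thread name).
lsof | awk '{ print $1 }' | sort | uniq -c | sort -rn
```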
An easy workaround here would be to just increase the fd cap to the hard limit. On my system, ulimit -n -S (the soft limit) is 1024, but ulimit -n -H (the hard limit) is 1048576. I don’t think that raising this should have any adverse effects: since we never touch fds directly, any such bugs would have to be in our dependencies, which are all fairly widely used.
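For reference, checking both limits from a shell looks like this (the numbers are the ones from my machine above; yours will vary):

```sh
$ ulimit -n -S   # soft limit on open file descriptors
1024
$ ulimit -n -H   # hard limit
1048576
```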
At a glance, pulling in the rlimit crate and adding something like
```rust
// Assumes the `log` crate's `debug!`/`warn!` macros are in scope.
fn increase_fd_limit() -> std::io::Result<()> {
    #[cfg(unix)]
    {
        use rlimit::Resource;
        // Raise the soft NOFILE limit to the hard limit so we can keep more
        // event files open at once.
        let (old_soft_limit, hard_limit) = Resource::NOFILE.get()?;
        Resource::NOFILE.set(hard_limit, hard_limit)?;
        debug!(
            "Changed file descriptor limit from {} to {}",
            old_soft_limit, hard_limit
        );
    }
    #[cfg(not(unix))]
    {
        debug!("Non-Unix; leaving file descriptor limit alone");
    }
    Ok(())
}

fn try_increase_fd_limit() {
    // Best-effort: a failure here is only worth a warning, not a crash.
    if let Err(e) = increase_fd_limit() {
        warn!("Failed to increase file descriptor limit: {}", e);
    }
}
```
to cli.rs and calling it from main should do the trick.
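The call site would then be roughly (sketch only; the real entry point in cli.rs is simplified here):

```rust
fn main() {
    // Best-effort: raise the fd limit before spawning reload threads so that
    // opening many event files doesn't hit EMFILE ("too many open files").
    try_increase_fd_limit();
    // ... existing startup logic ...
}
```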
I've also run into this problem after an upgrade. I have tens of thousands of runs, however... Attempting to increase the file limit seemed to cause some weird behaviour where TensorBoard stopped being able to be served remotely (but maybe I screwed something else up...). I've gone back to slow loading for now...
I've also encountered the problem and found that raising the "open files" limit by executing e.g. ulimit -n 50000 solves the problem for me (without requiring superuser permissions).
FYI, ulimit only works for the current shell session.
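In practice that means raising the limit and launching TensorBoard from the same shell, e.g. (logdir is a placeholder):

```sh
# Raise the soft limit for this shell session only, then start TensorBoard
# from the same session so it inherits the higher limit.
ulimit -n 50000
tensorboard --logdir ~/experiment --load_fast=true
```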
I am getting a lot of warnings about too many open files -- is there a way to reduce or cap the number of open file descriptors?
```
2021-05-11T14:31:46Z WARN rustboard_core::run] Failed to open event file EventFileBuf("[RUN NAME]"): Os { code: 24, kind: Other, message: "Too many open files" }
```
I don't have that many runs (~2000), so it shouldn't really be an issue. Using lsof to count the number of open FDs shows over 12k being used... compared to <500 in "slow" mode.
In my case, the "slow" mode actually loads files faster since it doesn't run into this issue.
Originally posted by @Raphtor in https://github.com/tensorflow/tensorboard/issues/4784#issuecomment-838599948