golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

runtime: automatically bump RLIMIT_NOFILE on Unix #46279

Closed bradfitz closed 2 years ago

bradfitz commented 3 years ago

I just read http://0pointer.net/blog/file-descriptor-limits.html which in a nutshell says: the RLIMIT_NOFILE soft limit is kept at 1024 only for compatibility with select() and its fixed-size fd_set, while the hard limit (524288 under systemd) is free for programs to take, so programs that don't use select should raise their own soft limit to the hard limit.

I realize that since Go doesn't use select, the Go runtime could automatically do this fd soft limit bumping on Linux.

We do have a Select wrapper at https://pkg.go.dev/golang.org/x/sys/unix#Select, though, so perhaps we could do the same thing we did for #42347 in 18510ae88ffcb9c4a914805fde3e613539f9b6dc (https://go-review.googlesource.com/c/go/+/299671) and do the bumping conditionally based on whether the unix.Select func is in the binary. Or cgo too, I suppose.

I suspect many users are unaware of this 512K hard limit that's free to bump up to. I certainly was unaware. (I normally have to go in and manually tweak my systemd limits instead, usually in response to problems once I hit the limit...) I think fixing it automatically would help more users than it'd hurt. (I actually can't think how it'd hurt anybody?)

I don't think we need it as a backpressure mechanism. As the blog post mentions, memory limits are already that mechanism.
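For concreteness, the bump itself is just a getrlimit/setrlimit pair; a minimal sketch using the syscall package of what a program can already do for itself today, i.e. the thing the runtime would do automatically under this proposal:

package main

import "syscall"

func main() {
    // Raise the NOFILE soft limit to the hard limit.
    var lim syscall.Rlimit
    if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err == nil && lim.Cur < lim.Max {
        lim.Cur = lim.Max
        _ = syscall.Setrlimit(syscall.RLIMIT_NOFILE, &lim)
    }
}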

/cc @ianlancetaylor @aclements @rsc @randall77

ianlancetaylor commented 3 years ago

The limitation on select is not a kernel limitation. It's a limitation on the implementation of fd_set in glibc. And we've inherited that limitation in x/sys/unix.FdSet, but perhaps we could fix that. If we did, then we could raise the soft limit to the hard limit unconditionally.

I note that on my Debian system the soft and hard limits are both 131072. On my CentOS 6 system the soft limit is 1024 and the hard limit is 4096. On my recent Fedora system the soft limit is 1024 and the hard limit is 524288.

bradfitz commented 3 years ago

Yeah, I saw FdSet was struct { Bits [16]int64 }. Make it opaque, with the [16]int64 used by default and a spill-over lazily-allocated bitmap when (*FdSet).Set(fd int) is called with a "big" fd? Doable, but I wonder if it's worth the effort. Does anybody actually use unix.Select?
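Something along these lines, say (a rough sketch of the idea only, not a worked-out API):

// FdSet keeps the traditional 1024-bit array for small fds and spills
// larger fds into a lazily allocated bitmap.
type FdSet struct {
    bits [16]int64 // fds 0..1023, same layout as glibc's fd_set
    big  []int64   // lazily grown bitmap for fds >= 1024
}

func (s *FdSet) Set(fd int) {
    if fd < 1024 {
        s.bits[fd/64] |= 1 << (uint(fd) % 64)
        return
    }
    i := (fd - 1024) / 64
    for len(s.big) <= i {
        s.big = append(s.big, 0)
    }
    s.big[i] |= 1 << (uint(fd) % 64)
}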

We'd still need a conditional mechanism regardless for cgo I assume, as we wouldn't know whether C code was using select as easily?

FWIW, on my various Debian (buster) & Ubuntu (focal LTS, hirsute) machines here, I see 1024 & 1048576.

bradfitz commented 3 years ago

Does anybody actually use unix.Select?

GitHub code search says https://github.com/search?l=&p=2&q=unix.Select+language%3AGo&type=Code .... it's mostly wireguard-go's rwcancel package.

(cc @zx2c4 as FYI)

zx2c4 commented 3 years ago

~I'm happy to get rid of that and replace it with poll. (Want to send a patch?)~ Done: https://git.zx2c4.com/wireguard-go/commit/?id=a9b377e9e10eb5194c0bdff32136c11b17253bfd

This proposal sounds like a good idea, with the caveat that we probably shouldn't do it in initialization for -buildmode=shared.

rsc commented 2 years ago

What happens today, even in programs that do nothing but file I/O (no select etc), is that if you open too many files you get errors. Auto-bumping would let those programs run longer.

If Go did it at startup, it would be inherited by non-Go programs that we fork+exec. That is a potential incompatibility, but probably not a large one. Technically, I suppose we could undo it in the subprocess between fork and exec.

rsc commented 2 years ago

This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group

nightlyone commented 2 years ago

To summarize the limiting use cases where we should not be raising the soft limit:

rsc commented 2 years ago

One problem with restoring the limit in exec is we won't know if the limit was intentionally changed by the program in the interim. What about programs that explicitly raise the limit and then exec today? We would be dropping it back down.

It seems like if we are going to raise the limit, we should just do that, not try to put it back.

I just ran into this problem with gofmt on my Mac, where the limit defaults to 256 (and gofmt was editing many files in parallel). I'd love for Go to raise the limit there too.

How much does it really matter if we raise the limit for a subprocess?

People can always set the hard limit if they want Go not to try to bump the soft limit up.

rsc commented 2 years ago

It's pretty awful that the limit is breaking completely reasonable Go programs like gofmt -w. It seems very wrong for gofmt to have to put a bump in. It seems like we should bump the limit at startup - Go doesn't use select.

It's very hard to see any programs benefiting from this limit in practice anymore. I understand that systemd can't make such a global decision, but I think Go can.

zx2c4 commented 2 years ago

I think that seems quite reasonable.

We can even document this in unix.Select/syscall.Select and mark them as deprecated so that editors bring attention to them, and maybe add something to go vet too. It always seems possible to move to poll or similar.
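For illustration, a minimal sketch (not wireguard-go's actual rwcancel code) of waiting for readability with golang.org/x/sys/unix.Poll instead of Select; poll has no fd_set size ceiling, so it works for any fd:

package main

import (
    "fmt"
    "os"

    "golang.org/x/sys/unix"
)

// waitReadable blocks until fd is readable, retrying on EINTR.
func waitReadable(fd int) error {
    pfd := []unix.PollFd{{Fd: int32(fd), Events: unix.POLLIN}}
    for {
        _, err := unix.Poll(pfd, -1) // -1 means block with no timeout
        if err == unix.EINTR {
            continue
        }
        return err
    }
}

func main() {
    if err := waitReadable(int(os.Stdin.Fd())); err != nil {
        fmt.Fprintln(os.Stderr, err)
    }
}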

rsc commented 2 years ago

Not sure anyone is using syscall.Select for fd's anyway. Every time I've used it in the past decade it has been to get a sub-second-resolution sleeping API (selecting on no fds).

rsc commented 2 years ago

Based on the discussion above, this proposal seems like a likely accept. — rsc for the proposal review group

AlekSi commented 2 years ago

Should the title be updated to mention Unix or something instead of Linux? Personally, I constantly run into that limitation on macOS; would like to see that resolved.

ianlancetaylor commented 2 years ago

The considerations may be different on different Unix systems. On Linux the details are somewhat specific to systemd.

It may well be appropriate to do this on macOS also, but I don't know what the tradeoffs are there. Why does macOS have a default low limit?

AlekSi commented 2 years ago

From what I was able to find, that default goes back to the very first OS X release and probably even back to BSD. The constant is there.

Of course, not doing that on macOS is not a deal-breaker but an annoyance.

kolyshkin commented 2 years ago

The only issue I am aware of that can arise if RLIMIT_NOFILE is set to a very high value is, some binaries (that may be executed from a Go program and thus inherit the limit) want to do something like this (pseudocode):

var lim syscall.Rlimit
syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim) // soft limit, inherited from the parent
for fd := 3; fd < int(lim.Cur); fd++ {
      syscall.Close(fd) // or set the CLOEXEC flag instead of closing
}

For a specific example, the rpm package manager used to do that (fixed by https://github.com/rpm-software-management/rpm/commit/5e6f05cd8dad6c1ee6bd1e6e43f176976c9c3416), and so did some older versions of Python (but I'm not sure).

Most probably this should not be an issue, since Docker also does a similar thing (https://github.com/moby/moby/issues/38814) and since everyone seems to be using containers now, let's hope that issues like this are fixed (better yet, maybe some programs have even started using close_range()).

Also, this is surely not a showstopper to accept the proposal -- just something to keep in mind.

rsc commented 2 years ago

No change in consensus, so accepted. 🎉 This issue now tracks the work of implementing the proposal. — rsc for the proposal review group

gopherbot commented 2 years ago

Change https://go.dev/cl/392415 mentions this issue: os: raise open file rlimit at startup

gopherbot commented 2 years ago

Change https://go.dev/cl/393016 mentions this issue: Revert "os: raise open file rlimit at startup"

bcmills commented 2 years ago

The test for this change is failing on at least three builders — looks like we may need to plumb in OPEN_MAX for the case where getrlimit reports RLIM_INFINITY instead of the true max.

rsc commented 2 years ago

I'm confused about needing OPEN_MAX:

On my Mac with macOS 12.2:

% ulimit -n
ulimit -n 256
% ulimit -nH
ulimit -n unlimited
% ulimit -n unlimited
% ulimit -n
ulimit -n unlimited
% 

I've always used 'ulimit -n unlimited' without trouble on Macs. I wonder if the struct definitions are wrong.

bcmills commented 2 years ago

It looks like the macOS setrlimit behavior may have changed as of 11.0. (The 10.14 and 10.15 builders broke, but the 11.0 and 12.0 builders did not.)

bcmills commented 2 years ago

So I suppose one option might be to try the setrlimit call with the limit as given by getrlimit, and if that fails fall back to increasing to some reasonable-but-arbitrary constant (ideally equal to OPEN_MAX)?

ianlancetaylor commented 2 years ago

Some more info at #40564.

rsc commented 2 years ago

I put in a call to sysctl kern.maxfilesperproc. Hopefully that exists on the older macOS. And I skipped the OpenBSD failure entirely. (It is not a first-class port.)
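The Darwin-specific adjustment ends up looking roughly like this (a sketch of the approach, assuming import "syscall"): cap the requested soft limit at kern.maxfilesperproc instead of passing the unlimited hard limit straight to setrlimit.

// adjustFileLimit caps the soft limit on Darwin, where setrlimit(RLIMIT_NOFILE)
// can fail if asked for more than kern.maxfilesperproc (e.g. RLIM_INFINITY).
func adjustFileLimit(lim *syscall.Rlimit) {
    if n, err := syscall.SysctlUint32("kern.maxfilesperproc"); err == nil && lim.Cur > uint64(n) {
        lim.Cur = uint64(n)
    }
}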

gopherbot commented 2 years ago

Change https://go.dev/cl/393354 mentions this issue: os: raise open file rlimit at startup

gopherbot commented 2 years ago

Change https://go.dev/cl/394094 mentions this issue: os: skip TestOpenFileLimit on openbsd/mips64

polarathene commented 1 year ago

How much does it really matter if we raise the limit for a subprocess? People can always set the hard limit if they want Go not to try to bump the soft limit up.

I understand that systemd can't make such a global decision, but I think Go can.

FWIW, containerd previously could use systemd to configure limits with LimitNOFILE=1024:524288 in a service file. All containers and the processes they run would inherit those limits, and could optionally raise their own soft limits.

Since Go 1.19, providing an explicit soft limit this way no longer works: it gets raised implicitly to the assigned hard limit. I'm not sure containerd was aware of this, as the default config they ship is LimitNOFILE=infinity.

A high RLIMIT_NOFILE (especially the soft limit) has been a source of confusion and hard-to-troubleshoot problems when software (particularly daemons) was not designed to operate over a range spanning more than a billion file descriptors: iterating over them (a close() syscall for each, the vast majority not even open) stalls with heavy CPU load, or allocates ridiculous amounts of memory (thousands of times more).


Presently, if one relies on the default limits systemd assigns (1024:524288), even on Debian-based systems where the system-wide per-process hard limit (fs.nr_open) is 1048576 (2^20) instead of 2^30, the implicit behaviour cannot be opted out of by the user unless they raise the hard limit to infinity. (EDIT: mishap on my end, that is not actually supported.)

Was it intentional to require assigning a hard limit as high as 2^30 to a process, just so a lower soft limit could be applied?

I understand this is technically niche (beyond the popularity of the software itself), but because each process belongs to a container, and each container belongs to the parent containerd, the hard limit for containerd needs to be sufficiently high. A soft limit pinned to that same value, however, produces abnormal behaviour that is harder to troubleshoot than a clear "too many open files" error.


I've reported the regression to containerd, and assume it's something they should attempt to resolve on their end? (assuming they can opt out internally, or know the original soft limit)

ianlancetaylor commented 1 year ago

@polarathene I don't see the connection between what you describe and the Go runtime. Is containerd a Go program? If so, I agree that they should be aware of this change and should configure their limits accordingly.

polarathene commented 1 year ago

I don't see the connection between what you describe and the Go runtime.

containerd creates containers, which run various processes; they are all run from the same parent containerd process and thus inherit the limit.

Setting a soft limit on the OS for containerd to respect worked well until Go 1.19 was released (and a new release of containerd arrived). Now the intended soft limit is ignored.

As a consumer of a Go program, one must now choose a practical hard limit without control of the soft limit.

This Go 1.19 change is the source of that regression slipping through (I don't think many Go programs had tests in place to ensure limits set externally were respected, since that was reliable / consistent when set explicitly via ulimit or similar). This was not something containerd had to decide / manage internally.

Is containerd a Go program?

Yes, it is heavily used for containers with Docker and Kubernetes:


If so, I agree that they should be aware of this change and should configure their limits accordingly.

I am not a Go developer, and it's not clear to me if the "feature" supports opt-out?

I have raised a bug report to containerd as mentioned earlier, but I'm not sure if the changes introduced here for Go 1.19 still allow for running with the original inherited limits like before Go 1.19?

Can containerd still know the original soft limit and restore that value? Or is it lost by the time the limit is raised implicitly?


containerd relied on external limits being set for this (while shipping a default in the past of 1048576, presently infinity) as the relevant limits depend on environment (eg: local dev vs production at scale).

There are workarounds available at the container level, but these were not needed previously (_configuring RLIMIT_NOFILE in a systemd unit that is applied before running containerd_).

I just wanted to chime in and raise awareness of a problem case from this change. As a user, it was a surprise that my explicit soft limit was not respected (because I did not want to have a hard limit of infinity).

ianlancetaylor commented 1 year ago

Thanks. This sounds to me like something to handle in containerd configuration.

polarathene commented 1 year ago

How should a Go program opt out of this behaviour (if that's possible), so that it acts like releases prior to 1.19? (inheriting and using the soft limit assigned outside of the binary).


This sounds to me like something to handle in containerd configuration.

I understand that.

Could you please answer the question I raised in each prior comment?:

Can containerd still know the original soft limit and restore that value? Or is it lost by the time the limit is raised implicitly?

Are you suggesting that instead of the system RLIMIT_NOFILE soft limit being usable, now containerd (and other Go software) can only support lowering the soft limit internally through some configuration? (file / CLI arg / ENV var)


I'll relay that information to the containerd bug report, and for readers landing here it would help to document for Go devs how to handle that too.

The Go 1.19 release notes were not very clear on this beyond:

One impact of this change is that Go programs that in turn execute very old C programs in child processes may run those programs with too high a limit. This can be corrected by setting the hard limit before invoking the Go program.

The hard limit is set, and this advice is about select() compatibility: lowering the hard limit to the desired soft limit... which prevents such a child process from creating any children that might want to raise their soft limit up to a higher hard limit.

I had to look at the source (os/rlimit.go) to discover that the soft limit would be respected if the hard limit was raised to infinity / sysctl fs.nr_open (EDIT: not actually supported, nor would it be a proper solution):

// Some systems set an artificially low soft limit on open file count, for compatibility
// with code that uses select and its hard-coded maximum file descriptor
// (limited by the size of fd_set).
//
// Go does not use select, so it should not be subject to these limits.
// On some systems the limit is 256, which is very easy to run into,
// even in simple programs like gofmt when they parallelize walking
// a file tree.
//
// After a long discussion on go.dev/issue/46279, we decided the
// best approach was for Go to raise the limit unconditionally for itself,
// and then leave old software to set the limit back as needed.
// Code that really wants Go to leave the limit alone can set the hard limit,
// which Go of course has no choice but to respect.
func init() {
    var lim syscall.Rlimit
    if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err == nil && lim.Cur != lim.Max {
        lim.Cur = lim.Max
        adjustFileLimit(&lim)
        syscall.Setrlimit(syscall.RLIMIT_NOFILE, &lim)
    }
}

So what is software like containerd to do? (wanting a high hard limit, but keeping a lower soft limit - appropriate values depend on environment)

ianlancetaylor commented 1 year ago

Can containerd still know the original soft limit and restore that value? Or is it lost by the time the limit is raised implicitly?

I don't know of a way to do that, no. A program can lower the soft limit, of course, but it doesn't currently have a way to know what the soft limit was originally. We could add something, but it's not yet clear to me that that is necessary. Right now you are conveying information to containerd by setting the soft limit, but there must be other ways to convey information to containerd.
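For what it's worth, lowering the soft limit from inside the program is a couple of syscalls; a minimal sketch (the desired value has to come from the program's own config, since the pre-raise value is not exposed):

// lowerSoftLimit drops the NOFILE soft limit to want, leaving the hard limit
// untouched. want must be supplied by the program (flag, config file, env var),
// because the original soft limit is not recoverable at this point.
func lowerSoftLimit(want uint64) error {
    var lim syscall.Rlimit
    if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
        return err
    }
    if want < lim.Cur {
        lim.Cur = want
        return syscall.Setrlimit(syscall.RLIMIT_NOFILE, &lim)
    }
    return nil
}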

polarathene commented 1 year ago

TL;DR: Without a way to restore the pre-1.19 behaviour, instead mandating config to work around it... should I raise a bug report in this repo?

Or must all downstream projects this affects be aware of it and accommodate support via other config means, whilst including disclaimers to sysadmins that soft limits are no longer supported?


We could add something but it's not yet clear to me that that is necessary.

Right...so this is a regression introduced by Go 1.19?

I understand this change with Go 1.19 was for those not informed about being able to raise the soft limit to the hard limit (which I agree is a bad UX problem to run into on both fronts).

Now that implicit penalty is delegated to larger, more complex software built with Go, with no opt-out available. These projects tend to be slower moving due to release cycles and broad platform support / needs. In particular, containerd is a dependency of multiple other projects, also affected in the same way, and this sits low enough in the stack that it affects devs on their local systems, not only production.

The users affected must now:


Right now you are conveying information to container by setting the soft limit. But there must be other ways to convey information to containerd.

This was a standard way to convey that information that Go 1.19 chose to break.

Before Go 1.19, you could convey this same information to your Go program the same way. You'll find other software that additionally adds config to support conveying that information a different way, but that was no longer considered desirable for Go 1.19, which chose to invert the situation.

I use Docker personally, but a quick glance shows containerd as providing an API for projects like Docker to communicate over. I am aware of Docker having some daemon config related to these limits and will investigate (UPDATE: --default-ulimit or in /etc/docker/daemon.json via default-ulimits key), but this would likely mean any other client of containerd would need to add support for the same.

This isn't an issue that applies to just containerd either. It seems a bizarre stance to insist it's OK to break this support without providing an opt-out. I understand why it was decided to raise the soft limit implicitly, and I fully support that, but it should not break software that properly supported / used RLIMIT_NOFILE, requiring each project to work around it without a way to know what the soft limit was intended to be, while all consumers of that software must likewise adjust their configuration accordingly.

ianlancetaylor commented 1 year ago

Does this affect anything other than containerd?

polarathene commented 1 year ago

Does this affect anything other than containerd?

I imagine it applies to any project doing similar process management? How many such projects exist, and whether they have alternative config to support lowering the soft limit for the child processes they run, I don't know :man_shrugging:

Some popular projects that may be affected:

They have much smaller communities in comparison, so the likelihood of bug reports about it would be lower.

Additionally:

I'm not saying this kind of problem is entirely the fault of Go. Other software that raises the hard limit to infinity really shouldn't be doing that (it was more relevant prior to systemd v240, but lingers). It's less of a problem when the soft limit isn't also infinity, but that's now irrelevant when Go 1.19 is involved.

I am just a user and probably exaggerating the problem. If more users chime in affected by other projects, or the devs of those projects express concern, perhaps only then should any action be considered.

cpuguy83 commented 1 year ago

Does this affect anything other than containerd?

This affects everything that calls exec.Command.

--- EDIT ---

It seems like the value should be set back to the original after fork. It is a bit of extra overhead (especially if you are execing another Go program), but it seems like the most correct thing to do here, given that Go programs will often want to exec other programs where this will cause issues.

corhere commented 1 year ago

Quoting the blog post which prompted the proposal in the first place:

If said program you hack on forks off foreign programs, make sure to reset the RLIMIT_NOFILE soft limit back to 1024 for them. Just because your program might be fine with fds >= 1024 it doesn't mean that those foreign programs might. And unfortunately RLIMIT_NOFILE is inherited down the process tree unless explicitly set.

(emphasis mine.)

gopherbot commented 1 year ago

Change https://go.dev/cl/476097 mentions this issue: syscall: restore original NOFILE rlimit in child process

gopherbot commented 1 year ago

Change https://go.dev/cl/476096 mentions this issue: os, syscall: move rlimit code to syscall

ianlancetaylor commented 1 year ago

@corhere Thanks for the pointer. That makes sense to me.

@cpuguy83 Do the changes in https://go.dev/cl/476096 and https://go.dev/cl/476097 fix the problem? Thanks.

cpuguy83 commented 1 year ago

@ianlancetaylor I'm using https://github.com/cpuguy83/test-go-rlimits to test this and not seeing the value reset.

ianlancetaylor commented 1 year ago

@cpuguy83 Thanks. Your program uses syscall.Exec, which is uncommon compared to syscall.StartProcess (used by the os and os/exec packages). I didn't consider syscall.Exec, so thanks for trying it. I just uploaded a new version of https://go.dev/cl/476097 to fix it.

cpuguy83 commented 1 year ago

@ianlancetaylor Thanks! This change works well. Tried with both syscall.Exec and exec.Command().Run().

gopherbot commented 1 year ago

Change https://go.dev/cl/476695 mentions this issue: unix: change Setrlimit/Prlimit to always call syscall functions

ianlancetaylor commented 1 year ago

@gopherbot Please open backport issues.

CL 393354, which first appeared in Go 1.19, causes Go programs that start other non-Go programs to lose the soft NOFILE rlimit. This breaks programs that care about that. CLs 476096 and 476097 fix the problem by changing the syscall fork and exec functions to restore the original soft rlimit if appropriate. This is a request to backport those changes to the currently supported releases.

gopherbot commented 1 year ago

Backport issue(s) opened: #59063 (for 1.19), #59064 (for 1.20).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

gopherbot commented 1 year ago

Change https://go.dev/cl/478659 mentions this issue: [release-branch.go1.20] syscall: restore original NOFILE rlimit in child process

gopherbot commented 1 year ago

Change https://go.dev/cl/478660 mentions this issue: [release-branch.go1.19] syscall: restore original NOFILE rlimit in child process

gopherbot commented 1 year ago

Change https://go.dev/cl/486576 mentions this issue: syscall: add prlimit