golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

runtime: high startup address space usage (RLIMIT_AS) on Linux AMD64 #38010

Open pkramme opened 4 years ago

pkramme commented 4 years ago

What version of Go are you using (go version)?

$ go version
go version go1.14 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/vorvvbgc/.cache/go-build"
GOENV="/home/vorvvbgc/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/vorvvbgc/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/vorvvbgc/go1.14/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/vorvvbgc/go1.14/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build969828813=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I am trying to get a FastCGI server running behind an Apache2 webserver on a shared hosting system using the net/http/fcgi library. The webserver is limiting my software to 512MB memory.

This is the code: https://play.golang.org/p/Z-Gc6icOpw5

What did you expect to see?

I expect to see "This was generated by Go running as a FastCGI app" on the website generated by the FastCGI server.

What did you see instead?

I have modified the sysReserve() function in the runtime to include println() to print out the error code from mmap() and the requested memory size. This is a diff of src/runtime/mem_linux.go and my version:

157a158,159
>       println(err)
>       println(n)

I left this debug output in the trace below in the hope that it might be useful.

The application crashes with this trace:

0
131072
0
1048576
0
8388608
0
67108864
12
536870912
fatal error: failed to reserve page summary memory

runtime stack:
runtime.throw(0x6f3456, 0x25)
        /home/vorvvbgc/go1.14/go/src/runtime/panic.go:1112 +0x72 fp=0x7ffc17e5b170 sp=0x7ffc17e5b140 pc=0x433a12
runtime.(*pageAlloc).sysInit(0x939428)
        /home/vorvvbgc/go1.14/go/src/runtime/mpagealloc_64bit.go:80 +0x13f fp=0x7ffc17e5b1e8 sp=0x7ffc17e5b170 pc=0x42ac1f
runtime.(*pageAlloc).init(0x939428, 0x939420, 0x94db38)
        /home/vorvvbgc/go1.14/go/src/runtime/mpagealloc.go:297 +0x75 fp=0x7ffc17e5b210 sp=0x7ffc17e5b1e8 pc=0x4288b5
runtime.(*mheap).init(0x939420)
        /home/vorvvbgc/go1.14/go/src/runtime/mheap.go:694 +0x274 fp=0x7ffc17e5b238 sp=0x7ffc17e5b210 pc=0x425ad4
runtime.mallocinit()
        /home/vorvvbgc/go1.14/go/src/runtime/malloc.go:470 +0xff fp=0x7ffc17e5b268 sp=0x7ffc17e5b238 pc=0x40c41f
runtime.schedinit()
        /home/vorvvbgc/go1.14/go/src/runtime/proc.go:545 +0x60 fp=0x7ffc17e5b2c0 sp=0x7ffc17e5b268 pc=0x437100
runtime.rt0_go(0x7ffc17e5b2f8, 0x1, 0x7ffc17e5b2f8, 0x0, 0x7fd2029790ca, 0x1, 0x7ffc17e5cbb6, 0x0, 0x7ffc17e5cbc4, 0x7ffc17e5cbe6, ...)
        /home/vorvvbgc/go1.14/go/src/runtime/asm_amd64.s:214 +0x125 fp=0x7ffc17e5b2c8 sp=0x7ffc17e5b2c0 pc=0x460655

The application works fine with golang 1.13.9.

I have no idea how to debug this further.

andybons commented 4 years ago

@aclements

ivzhh commented 4 years ago

@pkramme Hi, would you mind posting your Apache setup for this too? I could not reproduce this on a fresh Apache. Maybe it is due to my configuration.

alexzorin commented 4 years ago

@pkramme is this shared hosting environment cPanel by any chance?

We also started getting reports of this same panic with our Go application, which exposes itself as a .live.cgi FastCGI net/http/cgi server integrating with cPanel's LiveAPI, as soon as we upgraded to 1.14.

Going to downgrade to 1.13.9 for now.

pkramme commented 4 years ago

@aleksator No, it is not, it is a custom build setup. @ivzhh I'm not able to share the config, as it is proprietary.

Theoretically, if we execute any code with 512MB memory limitation, the problem should become visible. I will try to produce something not based on fcgi as a reproducer, so that no apache2 setup is necessary.

aleksator commented 4 years ago

Tagging a proper person here: @alexzorin

alexzorin commented 4 years ago

I think what @pkramme suggested about the 512MB memory limit is correct - specifically RLIMIT_AS.

"Back in the day" (EL5-ish era), shared web hosting admins did not have access to the RSS cgroups controller (because of EL5's ancient kernel), and so controlling VSZ limits was the only choice available to them. In the long term, this has resulted in a lot of misguided admins keeping these VSZ limits around for no good reason.

Anyway, the Apache-based reproducer is straightforward. (For some reason, a simple Go hello world wrapped in a bash ulimit -v didn't repro for me, not sure why).

  1. Compile a very simple net/http/cgi binary using Go 1.14.1 and stick it in Apache httpd 2.4's cgi-bin/:
package main

import (
    "net/http/cgi"
)

func main() {
    if err := cgi.Serve(nil); err != nil {
        panic(err)
    }
}
go build -o /var/www/html/cgi-bin/reproducer.cgi reproducer.go
  2. Configure Apache with a 512MB RLimitMEM and restart Apache (note: don't try this in Docker or LXC-like environments, setrlimit will just fail and the repro won't work):

    RLimitMEM 536870912

    apachectl -k restart

  3. Access http://localhost/cgi-bin/reproducer.cgi. It will produce an HTTP 500, and in Apache's error_log, you will see the panic stack from the original report.

I would prefer not to ask our customers to remove the rlimit (or else we'll be stuck shipping with Go 1.13 for all eternity).

Would it be practical for the Go runtime to try to work within whatever it sees via getrlimit?
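As an illustration of what a program can see through getrlimit, here is a minimal sketch using the standard syscall package on Linux. The helper name describe and the constant rlimInfinity are made up for this example. Note that such a check only helps in a binary that manages to start: under a limit that is too low, the runtime fails before main runs.

```go
package main

import (
	"fmt"
	"syscall"
)

// rlimInfinity is how Linux reports "no limit" in a 64-bit rlimit field.
// It is defined here because syscall.RLIM_INFINITY is an untyped -1 on
// linux/amd64 and cannot be compared directly with a uint64.
const rlimInfinity = ^uint64(0)

// describe formats a soft address-space limit for logging.
// (Hypothetical helper, not part of any library.)
func describe(cur uint64) string {
	if cur == rlimInfinity {
		return "unlimited"
	}
	return fmt.Sprintf("%d MiB", cur>>20)
}

func main() {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_AS, &lim); err != nil {
		fmt.Println("getrlimit failed:", err)
		return
	}
	fmt.Println("RLIMIT_AS soft limit:", describe(lim.Cur))
}
```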

mashedkeyboard commented 4 years ago

Adding that I'm also seeing this issue in a different memory-limited environment with a 512MB limit (a Grid Engine setup). Raising the memory limit to 950MB fixes the issue, but it's unclear to me why it should ever be an issue anyway - the program does not use that much memory while running.

sbinet commented 4 years ago

apologies for the "me too" post, but this has also prevented one of my colleagues at CERN from migrating a little Go-based "script" from Go-1.13.x to the latest Go-1.14.x.

pkramme commented 4 years ago

Well, after reinvestigating this issue I stumbled over the proposal for the new page allocator which was introduced in go1.14: https://github.com/golang/proposal/blob/master/design/35112-scaling-the-page-allocator.md

There are only two known adverse effects of this large mapping on Linux:

  1. ulimit -v, which restricts even PROT_NONE mappings.
  2. Programs like top, when they report virtual memory footprint, include PROT_NONE mappings.

In the grand scheme of things, these are relatively minor consequences. The former is not used often, and in cases where it is, it's used as an inaccurate proxy for limiting a process's physical memory use. The latter is mostly cosmetic, though perhaps some monitoring system uses it as a proxy for memory use, and will likely result in some harmless questions.

So, this explains it. @aclements Is there a workaround for cases like this?

networkimprov commented 4 years ago

cc @mknyszek

mknyszek commented 4 years ago

As @pkramme points out, we were aware of this issue when the changes to the page allocator were proposed. As @alexzorin points out, ulimit -v is an outdated mechanism for limiting memory use.

I would prefer not to ask our customers to remove the rlimit (or else we'll be stuck shipping with Go 1.13 for all eternity).

Would it be practical for the Go runtime to try to work within whatever it sees via getrlimit?

The short answer is no. The virtual memory mappings made to support structures in the page allocator significantly simplified the improvements made in the 1.14 release. Earlier in the release cycle, the amount of memory mapped was much larger, which caused problems on certain platforms where the default ulimit -v value for default users was fairly low, so out-of-the-box Go programs would not work on an out-of-the-box system without additional privileges (see #35568). This is generally not true on Linux, where ulimit -v is unlimited by default (at least for the versions I'm aware of). We took steps to reduce the size of these mappings at the cost of additional complexity and a small performance regression. We experimented a little with additional mitigations but concluded they weren't practical.

@sbinet @mashedkeyboard @alexzorin @pkramme:

In order to understand your situations better, could you elaborate on the reasons why you and/or your customers cannot set RLIMIT_AS/ulimit -v to unlimited, or an otherwise sufficiently high number for your Go programs?

As a side note (and to be totally clear, I'm not recommending this as an official workaround), compiling your code with GOARCH=386 should allow your code to run on amd64 platforms with a low ulimit -v, since the memory mapping we make is proportional to the size of the address space and the address space is much smaller on 386. I recognize that this has its issues, and is not generally a feasible alternative. The most notable issues that come to mind are that your code might run slower (due to 32-bit registers and a lack of certain intrinsics) or some libraries your code depends on might not support 32-bit platforms (I'm not sure how common it is for libraries to support amd64 but not 386, but it is possible).
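To make "proportional to the size of the address space" concrete, here is a rough back-of-the-envelope sketch. It is not the runtime's actual code. It assumes one 8-byte summary entry per 4 MiB of address space at the largest level, which is consistent with the 536870912-byte request in the original trace for a 48-bit address space. The names and constants below are illustrative assumptions, and the 32-bit runtime may lay out this structure differently.

```go
package main

import "fmt"

const (
	chunkBytesLog2 = 22 // assumed: one summary entry per 4 MiB of address space
	entryBytes     = 8  // assumed: each summary entry is 8 bytes
)

// leafSummaryBytes estimates the size of the largest summary reservation
// for an address space of addrBits bits, under the assumptions above.
func leafSummaryBytes(addrBits uint) uint64 {
	return (uint64(1) << (addrBits - chunkBytesLog2)) * entryBytes
}

func main() {
	// 48 bits approximates amd64, 32 bits approximates 386.
	for _, bits := range []uint{48, 32} {
		fmt.Printf("%d-bit address space: %d bytes\n", bits, leafSummaryBytes(bits))
	}
}
```

Under these assumptions the 32-bit estimate is smaller by a factor of 2^16, which is why a 386 build can fit under a limit that an amd64 build cannot.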

alexzorin commented 4 years ago

could you elaborate on the reasons why you and/or your customers cannot set RLIMIT_AS/ulimit -v to unlimited, or an otherwise sufficiently high number for your Go programs?

This is our plan. It's going to be a challenge for XX,000 hosts between X,000 customers, so we are first planning to add telemetry to our 1.13 builds to see how many systems run the CGI under restricted virtual memory.

pkramme commented 4 years ago

Sure!

The Go software I am writing is supposed to run on a shared hosting server. The technical foundation is a LAMP (linux apache2 mysql php/python/...) stack. Inside a shared hosting LAMP stack, the Apache2 webserver is spawning CGI/FastCGI software inside a restricted environment, which is heavily controlled in access and resources by the provider, in order to prevent one user taking up all the resources. On my shared hosting account one FastCGI process is limited to 512MB "memory".

The important part is that in managed hosting, you simply cannot make that change, because you do not control the environment. The only possibility for me is to upgrade to another, more expensive hosting plan, so that I can use 1GB memory or more so that this allocation works.

sbinet commented 4 years ago

We don't have much leverage over how the CGI environment is configured, and CERN-IT is a bit conservative w/ changing the configuration of services they provide for their physicists (who are sometimes a bit "cavalier" with how they set up their things.)

nonetheless, I've sent a ticket on raising the RLIMIT_AS. I've also passed on to my colleague the 32b workaround.

we'll see.

(anyways, it's not a high profile CGI service, we won't miss supersymmetry nor mini-blackholes, or lose the beam if we're stuck w/ Go-1.13.x b/c of that. :P)

pkramme commented 4 years ago

@mknyszek Is there any progress on this on your end?

alexzorin commented 4 years ago

we are first planning to add telemetry to our 1.13 builds to see how many systems run the CGI under restricted virtual memory.

To put a conclusion on this from my end, we gathered some RLIMIT_AS stats and the number of affected users is around 0.5%. The majority have a limit of 4096MB set on Apache, which is the vendor default on this platform.

As long as Go continues to work within that limit, we're happy to live with it and will ask those other users to adapt. Thanks.

theromis commented 2 years ago

any progress on it?

CaledoniaProject commented 1 year ago

Can't believe this remains unresolved. Are we supposed to downgrade golang to 1.13?

f1-outsourcing commented 1 month ago

any progress on it?

ianlancetaylor commented 1 month ago

I don't know that anybody knows of a feasible way to fix this. The Go memory system expects that address space is available. Address space costs nothing. It makes sense to constrain a program's use of actual memory. It does not make sense to constrain a program's use of address space.

f1-outsourcing commented 1 month ago

I don't know that anybody knows of a feasible way to fix this. The Go memory system expects that address space is available. Address space costs nothing. It makes sense to constrain a program's use of actual memory. It does not make sense to constrain a program's use of address space.

I am not really an expert in memory use / allocation. I am not even sure what you mean by address space. If it is only testing and not using. ;) The fact is I wasted quite a bit of time before noticing it was related to this memory limit imposed somewhere by the parent application. This also makes me wonder about container environments applying resource limits: are these affected by this as well?

mknyszek commented 1 month ago

I am not really an expert in memory use / allocation. I am not even sure what you mean by address space. If it is only testing and not using. ;)

See https://go.dev/doc/gc-guide#A_note_about_virtual_memory.

This makes me also wonder about container environments applying resource limits, are these affected as well by this?

They are not. Container limits pertain to actual physical memory usage (RSS in top) not virtual memory footprint (VSS in top).
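To see the difference between the two numbers for a running Go process on Linux, a minimal sketch like the following reads them from /proc/self/status. VmSize is the virtual footprint that RLIMIT_AS restricts, and VmRSS is the resident memory. The function name parseStatus is made up for this example.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseStatus extracts the VmSize and VmRSS values (in kB) from the
// contents of a Linux /proc/<pid>/status file. Fields that are missing
// or not numeric are left at zero.
func parseStatus(status string) (vmSize, vmRSS uint64) {
	for _, line := range strings.Split(status, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		n, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			continue // non-numeric field such as "Name:"
		}
		switch fields[0] {
		case "VmSize:":
			vmSize = n
		case "VmRSS:":
			vmRSS = n
		}
	}
	return vmSize, vmRSS
}

func main() {
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("cannot read /proc/self/status:", err)
		return
	}
	size, rss := parseStatus(string(data))
	fmt.Printf("VmSize: %d kB, VmRSS: %d kB\n", size, rss)
}
```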