github / gh-ost

GitHub's Online Schema-migration Tool for MySQL
MIT License
12.44k stars 1.26k forks source link

[bug] panic: runtime error: invalid memory address or nil pointer dereference #400

Open obi11235 opened 7 years ago

obi11235 commented 7 years ago

On a long running alter using the below setup after 20+ hours it get's an error with too many open files and if i try to query gh-ost via it's socket it will crash. Is this a known issue? I'm using the pre compiled version 1.0.36.

./gh-ost --max-load=Threads_running=50,Threads_connected=700 --critical-load=Threads_running=3000,Threads_connected=5000 --chunk-size=1000 --throttle-control-replicas="10.20.2.4,10.30.2.12:3307,10.30.2.13,10.30.2.2,10.30.2.7,127.0.0.1:9901,127.0.0.1:9904,127.0.0.1:9905" --max-lag-millis=20000 --conf=admin.my --host=10.30.2.10 --database="database" --table="guest" --verbose --alter="engine=innodb" --cut-over=default --exact-rowcount --concurrent-rowcount --default-retries=240 --panic-flag-file=/tmp/gh-ost.panic.flag --postpone-cut-over-flag-file=/tmp/ghost.postpone.flag --execute

Copy: 258060000/496528082 52.0%; Applied: 577507; Backlog: 100/100; Time: 21h32m30s(total), 21h32m27s(copy); streamer: mysql-bin.007696:756485258; State: throttled, 127.0.0.1:9905 replica-lag=924.999579s; ETA: 19h54m19s
2017-04-10 02:51:50 ERROR dial tcp 10.30.2.10:3306: socket: too many open files
2017-04-10 02:51:50 ERROR dial tcp 10.30.2.10:3306: socket: too many open files
2017-04-10 02:51:50 ERROR dial tcp 10.30.2.10:3306: socket: too many open files
2017-04-10 02:51:50 ERROR dial tcp 10.30.2.10:3306: socket: too many open files
2017-04-10 02:51:50 ERROR dial tcp 10.30.2.10:3306: socket: too many open files
2017-04-10 02:51:50 ERROR dial tcp 10.30.2.10:3306: socket: too many open files
2017-04-10 02:51:50 ERROR dial tcp 10.30.2.10:3306: socket: too many open files
2017-04-10 02:51:50 ERROR dial tcp 10.30.2.10:3306: socket: too many open files
2017-04-10 02:51:50 ERROR dial tcp 10.30.2.10:3306: socket: too many open files
2017-04-10 02:51:50 ERROR dial tcp 10.30.2.10:3306: socket: too many open files
2017-04-10 02:51:50 ERROR dial tcp 10.30.2.10:3306: socket: too many open files
2017-04-10 02:51:50 ERROR accept unix /tmp/gh-ost.guest.sock: accept4: too many open files
2017-04-10 02:51:50 ERROR accept unix /tmp/gh-ost.guest.sock: accept4: too many open files
2017-04-10 02:51:50 ERROR accept unix /tmp/gh-ost.guest.sock: accept4: too many open files
2017-04-10 02:51:50 ERROR accept unix /tmp/gh-ost.guest.sock: accept4: too many open files
2017-04-10 02:51:50 ERROR accept unix /tmp/gh-ost.guest.sock: accept4: too many open files
2017-04-10 02:51:50 ERROR accept unix /tmp/gh-ost.guest.sock: accept4: too many open files
2017-04-10 02:51:50 ERROR accept unix /tmp/gh-ost.guest.sock: accept4: too many open files
2017-04-10 02:51:50 ERROR accept unix /tmp/gh-ost.guest.sock: accept4: too many open files
2017-04-10 02:51:50 ERROR accept unix /tmp/gh-ost.guest.sock: accept4: too many open files
2017-04-10 02:51:50 ERROR accept unix /tmp/gh-ost.guest.sock: accept4: too many open files
2017-04-10 02:51:50 ERROR accept unix /tmp/gh-ost.guest.sock: accept4: too many open files
2017-04-10 02:51:50 ERROR accept unix /tmp/gh-ost.guest.sock: accept4: too many open files
2017-04-10 02:51:50 ERROR accept unix /tmp/gh-ost.guest.sock: accept4: too many open files
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x4a3f2d]

goroutine 1451637 [running]:
panic(0x6f6ac0, 0xc42000e0f0)
        /usr/local/Cellar/go/1.7.4_2/libexec/src/runtime/panic.go:500 +0x1a1
github.com/github/gh-ost/go/logic.(*Server).handleConnection(0xc42020e740, 0x0, 0x0, 0x0, 0x0)
        /Users/shlomi-noach/dev/go/src/github.com/github/gh-ost/go/logic/server.go:101 +0x7d
created by github.com/github/gh-ost/go/logic.(*Server).Serve.func1
        /Users/shlomi-noach/dev/go/src/github.com/github/gh-ost/go/logic/server.go:81 +0x87
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x4a3f2d]

goroutine 1451639 [running]:
panic(0x6f6ac0, 0xc42000e0f0)
        /usr/local/Cellar/go/1.7.4_2/libexec/src/runtime/panic.go:500 +0x1a1
github.com/github/gh-ost/go/logic.(*Server).handleConnection(0xc42020e740, 0x0, 0x0, 0x0, 0x0)
        /Users/shlomi-noach/dev/go/src/github.com/github/gh-ost/go/logic/server.go:101 +0x7d
created by github.com/github/gh-ost/go/logic.(*Server).Serve.func1
        /Users/shlomi-noach/dev/go/src/github.com/github/gh-ost/go/logic/server.go:81 +0x87
shlomi-noach commented 7 years ago

This is not a known issue, however:

panic: runtime error: invalid memory address or nil pointer dereference

is likely caused by conn being nil in https://github.com/github/gh-ost/blob/3313cb3933933f4a1c7f20ca62f1084f9f2184ad/go/logic/server.go#L101

I will fix that, and there will be no crash. But that won't solve the fact you have too many open files. I wonder in what manner you query gh-ost via the socket file. From the log see 24 entries for "too many open files" at the same second. Is that the rate at which you query the socket? Perhaps your client does not close the connection?

obi11235 commented 7 years ago

I was querying the socket by hand. This particular alter i performed a different way ultimately but i have more of this size to do in the future. Is there any chance Gh-ost is not closing connections correctly when it get's disconnected from the slaves it is monitoring?

Next time i'm running a big like this on a huge table i'll monitor the open files by gh-ost to see if it yields any useful information.

shlomi-noach commented 7 years ago

Is there any chance Gh-ost is not closing connections correctly when it get's disconnected from the slaves it is monitoring?

Connections to replicas are handled by the go-mysql-driver, underlying the database/sql interface. Connections are actually routinely reused, and are unlikely to timeout, since they really query the replicas frequently.

What you're having is a too many open files on the replica where gh-ost runs.

I'm not sure I understand how you're running these by hand and getting 24 responses within the same second. Are there otherwise too many open files on your system?

obi11235 commented 7 years ago

The system was performing correctly at the time, the only problem was talking to gh-ost. I think the socket just caused the crash, it was having an issue connecting to the master prior to me trying to work with the socket: 2017-04-10 02:51:50 ERROR dial tcp 10.30.2.10:3306: socket: too many open files

The command i issued to the socket was to throttle it since it was having an issue hoping that would help it recover. I wonder if it hit a per process opened file limit.

obi11235 commented 7 years ago

I have started another long alter that i'm monitoring the open files for the gh-ost process, and it looks like every time i get a message about an invalid connection: [MySQL] 2017/04/21 18:24:19 statement.go:27: Invalid Connection

the number of open files spikes for a while, for this operation i'm not monitoring any slaves. I wonder if any of the slaves have an issue that are causing the Invalid Connections if it causes a large spike in the open files for the process as it retries to connect opening more connections