nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

Support for linux/arm64 (ARMv8, aarch64) #466

Closed vielmetti closed 7 years ago

vielmetti commented 7 years ago

OS/Container environment:

ARMv8 server is a Packet 2A (Cavium ThunderX, 96-core at 2 GHz)

Steps or code to reproduce the issue:

Expected result:

linux-arm64 supported release

Actual result:

No files found.

As of 2017-04-04, build works fine, tests fail until timeouts are extended, and we've identified a performance issue on ARMv8 Go 1.8 crypto/tls. Further work pending Go performance improvements on ARMv8.

Feature Requests

Use Case:

Two use cases: one for ARMv8 single-board computers (e.g. Raspberry Pi 3, Odroid C2, Pine64); another for ARMv8 in the data center (e.g. Cavium ThunderX).

Proposed Change:

Build and test for arm64, validate that it works, add as supported release.

Who Benefits From The Change(s)?

Users of arm64 (ARMv8) platforms as listed above.

Alternative Approaches

Planning to build from source and see how that goes; I'll use this issue to identify anything that comes up.

vielmetti commented 7 years ago

test fails; looking into this:

root@docker-build-test:~/src/nats-io/gnatsd# go build
root@docker-build-test:~/src/nats-io/gnatsd# go test ./...
?       nats-io/gnatsd  [no test files]
?       nats-io/gnatsd/auth     [no test files]
ok      nats-io/gnatsd/conf     0.018s
ok      nats-io/gnatsd/logger   0.618s
ok      nats-io/gnatsd/server   24.935s
ok      nats-io/gnatsd/server/pse       0.104s
--- FAIL: TestServerRestartReSliceIssue (10.01s)
panic: Unable to start NATS Server in Go Routine [recovered]
        panic: Unable to start NATS Server in Go Routine

goroutine 44 [running]:
panic(0x8154a0, 0x482000b7e0)
        /usr/lib/go-1.6/src/runtime/panic.go:481 +0x384
testing.tRunner.func1(0x4820250870)
        /usr/lib/go-1.6/src/testing/testing.go:467 +0x168
panic(0x8154a0, 0x482000b7e0)
        /usr/lib/go-1.6/src/runtime/panic.go:443 +0x4b4
nats-io/gnatsd/test.RunServerWithAuth(0x482027c3c0, 0x0, 0x0, 0xffff9c66e110)
        /root/src/nats-io/gnatsd/test/test.go:102 +0x180
nats-io/gnatsd/test.RunServerWithConfig(0x9c72f0, 0x14, 0x0, 0x482027c3c0)
        /root/src/nats-io/gnatsd/test/test.go:79 +0x2a4
nats-io/gnatsd/test.runServers(0x4820250870, 0x0, 0x0, 0x0, 0x0)
        /root/src/nats-io/gnatsd/test/cluster_test.go:66 +0x4c
nats-io/gnatsd/test.TestServerRestartReSliceIssue(0x4820250870)
        /root/src/nats-io/gnatsd/test/client_cluster_test.go:17 +0x3c
testing.tRunner(0x4820250870, 0xbff288)
        /usr/lib/go-1.6/src/testing/testing.go:473 +0xbc
created by testing.RunTests
        /usr/lib/go-1.6/src/testing/testing.go:582 +0x65c
FAIL    nats-io/gnatsd/test     11.477s
?       nats-io/gnatsd/util     [no test files]
?       nats-io/gnatsd/vendor/github.com/nats-io/nuid   [no test files]
?       nats-io/gnatsd/vendor/golang.org/x/crypto/bcrypt        [no test files]
?       nats-io/gnatsd/vendor/golang.org/x/crypto/blowfish      [no test files]
?       nats-io/gnatsd/vendor/golang.org/x/sys/windows  [no test files]
?       nats-io/gnatsd/vendor/golang.org/x/sys/windows/registry [no test files]
vielmetti commented 7 years ago

Run from command line works just fine - at least the server comes up.

Is there a particularly good client you'd recommend to exercise the server, @kozlovic ? Happy to bash on it to see if I can trigger whatever this issue is.

kozlovic commented 7 years ago

Just realized that it worked for the server package. Could you make sure that there is no gnatsd running in the background and then run this, just to check:

go test -race -v -p=1 ./...
kozlovic commented 7 years ago

The -p=1 flag ensures that the packages are tested one after another rather than in parallel. I am just wondering if there could be port conflicts between the tests in different packages. We normally try to use different ports, and it works fine on Travis, but that could be just luck.
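As a rough illustration of how tests can sidestep fixed port numbers, the sketch below asks the kernel for an unused loopback port by listening on port 0. The helper name `freePort` is hypothetical and is not part of the gnatsd test suite, which (per the comment above) assigns distinct ports by hand.

```go
package main

import (
	"fmt"
	"net"
)

// freePort is a hypothetical helper: it listens on port 0 so the kernel
// picks an unused TCP port on loopback, then releases the listener.
// Another process could still take the port before the caller binds it,
// so this narrows the window for conflicts rather than closing it.
func freePort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}

func main() {
	port, err := freePort()
	if err != nil {
		fmt.Println("could not reserve a port:", err)
		return
	}
	fmt.Println("a test server could listen on port", port)
}
```

With ports chosen this way, packages could in principle run in parallel without `-p=1`, at the cost of passing the chosen port into each test's server configuration.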

vielmetti commented 7 years ago

go test -race is not available in Go 1.6.x on arm64 on Ubuntu.

With -p=1 I get a lot more tests to pass, but a few still fail, all related to TLS:

root@docker-build-test:~# grep FAIL gnats-test.out
--- FAIL: TestTLSConnz (1.12s)
--- FAIL: TestPingSentToTLSConnection (0.71s)
--- FAIL: TestTLSConnection (1.19s)
--- FAIL: TestTLSBadAuthError (1.11s)
FAIL
FAIL    nats-io/gnatsd/test     73.991s
root@docker-build-test:~# go version
go version go1.6.3 linux/arm64
vielmetti commented 7 years ago

Looking in a little more detail, here are all of the error messages:

root@docker-build-test:~# grep "version 4552" gnats-test.out
        monitor_test.go:337: Got an error on Connect with Secure Options: tls: received record with version 4552 when expecting version 303
        test.go:128: Error writing command to conn: tls: received record with version 4552 when expecting version 303
        tls_test.go:44: Got an error on Connect with Secure Options: tls: received record with version 4552 when expecting version 303
        tls_test.go:252: Excpected and auth violation, got tls: received record with version 4552 when expecting version 303
kozlovic commented 7 years ago

You may want to try with a newer version of Go, just to make sure.

kozlovic commented 7 years ago

Oh, that's because the timeouts are too small.
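One way to read the `version 4552` message: Go prints the record version in hexadecimal, and 0x45 0x52 are the ASCII bytes for "ER". A TLS record header is one type byte followed by a two-byte version, so a plaintext line beginning with `-ERR` would be parsed as type 0x2d and version 0x4552, while 303 is 0x0303 (TLS 1.2). The sketch below assumes the server answered the slow handshake with a plaintext `-ERR` line. The error text in it is a placeholder, not the server's actual message.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// recordVersion returns bytes 1 and 2 of a would-be TLS record header,
// which is where a TLS parser expects the protocol version.
func recordVersion(header []byte) uint16 {
	return binary.BigEndian.Uint16(header[1:3])
}

func main() {
	// Placeholder plaintext protocol error, as a client might receive it
	// when it was expecting a TLS record instead.
	plaintext := []byte("-ERR 'hypothetical error text'\r\n")
	fmt.Printf("record type byte: %#x\n", plaintext[0])
	fmt.Printf("parsed version:   %x\n", recordVersion(plaintext))
}
```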

kozlovic commented 7 years ago

Let me see in which place you would have to increase this timeout to make sure that's only that.

kozlovic commented 7 years ago

Two things you could try:

vielmetti commented 7 years ago

Single test still fails:

root@docker-build-test:~/src/github.com/nats-io/gnatsd# go test -v -run=TestTLSConnz ./test
=== RUN   TestTLSConnz
--- FAIL: TestTLSConnz (1.11s)
        monitor_test.go:337: Got an error on Connect with Secure Options: tls: received record with version 4552 when expecting version 303
FAIL
exit status 1
FAIL    github.com/nats-io/gnatsd/test  1.134s

My version of Go is 1.6.3 which is older than the one you recommend; I'll report back separately testing under Go 1.8.

root@docker-build-test:~/src/github.com/nats-io/gnatsd# go version
go version go1.6.3 linux/arm64
kozlovic commented 7 years ago

When you ran the test, did you override the timeouts? For that test specifically, if you do not want to tweak the code, you can modify the config file used in this test:

test/configs/tls.conf

Change both timeout values in this file to 10 instead of 2 and 1.

vielmetti commented 7 years ago

With longer timeouts (the 10-second values patched into client.c and server.c as described above), a test passes:

=== RUN   TestTLSConnz
--- PASS: TestTLSConnz (2.25s)
PASS
ok      github.com/nats-io/gnatsd/test  2.269s

minio has some accelerated crypto routines which should speed up TLS, if that timeout is due to slow performance.

kozlovic commented 7 years ago

Ok, now the problem with re-running the whole test suite with the override is that you may then get some test failures, because some tests expect the timeout to occur within, say, 2 seconds. But we should be able to figure out if that's the case based on the test name.

vielmetti commented 7 years ago

All the TLS tests now pass, but there's one test that fails:

=== RUN   TestAuthClientNoConnect
--- FAIL: TestAuthClientNoConnect (3.03s)
        test.go:128: Error reading from conn: read tcp 127.0.0.1:43868->127.0.0.1:10422: i/o timeout

                2 - /root/src/github.com/nats-io/gnatsd/test/auth_test.go:80
                3 - /usr/lib/go-1.6/src/testing/testing.go:473
                4 - /usr/lib/go-1.6/src/runtime/asm_arm64.s:975

The code in auth_test.go:80 reads

        // This is timing dependent..
        time.Sleep(server.AUTH_TIMEOUT)
kozlovic commented 7 years ago

Yes, like I said. So it means that the only failures you got were due to timeouts. What surprises me is that you got the failures in the first place. Even with the current values (sometimes as low as 0.5 in some config files), it works even when running the suite on Travis, which is sometimes way slower than our personal laptops. So it is a bit surprising considering the spec of your machine?

vielmetti commented 7 years ago

The timeouts are very surprising given the spec of the machine. I'm going to rebuild with Go 1.8 next, because I know I've seen speed improvements overall with that, and maybe that is enough to help.

With one failed test, I get this as an overall test time:

FAIL    github.com/nats-io/gnatsd/test  80.181s

and it looks like the last log on Travis runs the same tests in

ok      github.com/nats-io/gnatsd/test  68.013s
vielmetti commented 7 years ago

With Go 1.8 it fails a little faster

FAIL    github.com/nats-io/gnatsd/test  78.930s

still failing in

--- FAIL: TestAuthClientNoConnect (3.03s)

I'm sure that's because Go is using software crypto on arm64, rather than the hardware instructions on the chip. The minio code is at https://github.com/minio/sha256-simd which might help.

vielmetti commented 7 years ago

Nope, gnatsd doesn't use the sha256 code, but I was able to benchmark Go's crypto/tls and found it wanting on arm64. The open issue is

https://github.com/golang/go/issues/19840

I'll chase this upstream; for the moment let's mark this issue as "on hold", and I'll work to get a performance improvement.
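For anyone wanting a quick comparison between machines, a minimal timing sketch along these lines can help. It times ECDSA P-256 signing, chosen here only as an example of a primitive that a TLS handshake can use. The thread does not say which operation was slow on the ThunderX, and this is not the benchmark referenced above. Note that `ecdsa.SignASN1` needs Go 1.15 or later, which is newer than the Go versions discussed in this issue.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
	"time"
)

// timeSign reports how long n ECDSA P-256 signatures take on this machine.
// Key generation is done before the timer starts.
func timeSign(n int) (time.Duration, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return 0, err
	}
	// Placeholder input. Its content does not affect the timing.
	digest := sha256.Sum256([]byte("hypothetical handshake transcript"))
	start := time.Now()
	for i := 0; i < n; i++ {
		if _, err := ecdsa.SignASN1(rand.Reader, key, digest[:]); err != nil {
			return 0, err
		}
	}
	return time.Since(start), nil
}

func main() {
	const n = 200
	d, err := timeSign(n)
	if err != nil {
		fmt.Println("signing failed:", err)
		return
	}
	fmt.Printf("%d signatures in %v (%v each)\n", n, d, d/time.Duration(n))
}
```

Running the same program on an amd64 host and on the arm64 host gives a rough per-operation comparison. For more careful numbers, the `testing` package's benchmark support is the usual tool.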

vielmetti commented 7 years ago

Go 1.9 beta 1 is out and has a binary build for ARM64 (yay).

According to the referenced issue https://github.com/golang/go/issues/19840 the opportunity for this particular performance issue to be resolved in Go for ARM will come in the Go 1.10 timeframe. However there may be other performance improvements in Go 1.9 so that's worth a quick test.

ghost commented 7 years ago

Is it possible for you to summarise the state of the aarch64 server build? We are very interested in using it on our embedded aarch64 platform as a control plane enabler.

vielmetti commented 7 years ago

@salerio - what are the specs for your aarch64 platform? The concern expressed above was that some of the crypto routines in Go on aarch64 are not hardware accelerated, and that the software versions of the algorithms perform poorly on one system (Cavium ThunderX).

ghost commented 7 years ago

It's a Xilinx UltraScale+ MPSoC, which has a 4 x Cortex-A53 CPU complex. Although there are crypto accelerators in the SoC, I doubt anyone (any standard software, that is) will make use of them yet, as the part is very new.

See https://www.xilinx.com/products/silicon-devices/soc/zynq-ultrascale-mpsoc.html

vielmetti commented 6 years ago

@ghost does this MPSoC from Xilinx have an FPGA in it?

vielmetti commented 6 years ago

Go 1.11beta1 is out; I would like to test performance with it.

derekcollison commented 6 years ago

We would be interested in what you find, keep us posted.

vielmetti commented 6 years ago

Thanks @derekcollison, I have opened up #695 to address the question of "how do you test performance".

derekcollison commented 6 years ago

Thanks.