sigp / lighthouse

Ethereum consensus client in Rust
https://lighthouse.sigmaprime.io/
Apache License 2.0

Beacon node exit with code 132 in docker #1395

Closed bjaq closed 4 years ago

bjaq commented 4 years ago

Description

When running Lighthouse with docker-compose on Ubuntu, my beacon node container exits quickly after starting, with exit code 132. I see no errors in the logs apart from the Docker exit code (see below). Geth and the validator client are running fine. I'm a bit lost.

Capture

beacon_node_1       | Jul 25 19:49:19.292 DEBG Completed batch received                awaiting_batches: 4, blocks: 48, id: 1, service: sync
beacon_node_1       | Jul 25 19:49:19.294 DEBG HTTP API request successful             duration_ms: 0, path: /spec/eth2_config, service: http
beacon_node_1       | Jul 25 19:49:19.300 DEBG Completed batch received                awaiting_batches: 4, blocks: 58, id: 4, service: sync
beacon_node_1       | Jul 25 19:49:19.303 DEBG Requesting batch                        peer: 16Uiu2HAm9phNDb4SXXTDoMJDS2mLCGyf5LLTh6ZyxF8LdiGMTDNf, id: 7, end_slot: 448, start_slot: 384, chain_id: 8780547912560591702, service: sync
validator_client_1  | Jul 25 19:49:19.304 INFO Genesis has already occurred            seconds_ago: 2272754
beacon_node_1       | Jul 25 19:49:19.304 DEBG HTTP API request successful             duration_ms: 0, path: /beacon/genesis_time, service: http
beacon_node_1       | Jul 25 19:49:19.305 DEBG HTTP API request successful             duration_ms: 0, path: /beacon/genesis_validators_root, service: http
beacon_node_1       | Jul 25 19:49:19.305 DEBG Processing batch                        end_slot: 63, start_slot: 0, blocks: 48, id: 1, service: sync
geth_1              | INFO [07-25|19:49:19.308] Upgrading chain index                    type=bloombits percentage=18
geth_1              | INFO [07-25|19:49:19.307] Loaded local transaction journal         transactions=0 dropped=0
geth_1              | INFO [07-25|19:49:19.314] Regenerated local transaction journal    transactions=0 accounts=0
validator_client_1  | Jul 25 19:49:19.322 INFO Loaded validator keypair store          voting_validators: 0
validator_client_1  | Jul 25 19:49:19.325 INFO Block production service started        service: block
validator_client_1  | Jul 25 19:49:19.335 INFO Attestation production service started  next_update_millis: 9664, service: attestation
geth_1              | INFO [07-25|19:49:19.360] Allocated fast sync bloom                size=512.00MiB
geth_1              | ERROR[07-25|19:49:19.523] Failed to enumerate USB devices          hub=trezor vendor=4617  failcount=3 err="failed to initialize libusb: libusb: unknown error [code -99]"
geth_1              | ERROR[07-25|19:49:19.523] Failed to enumerate USB devices          hub=trezor vendor=21324 failcount=3 err="failed to initialize libusb: libusb: unknown error [code -99]"
geth_1              | ERROR[07-25|19:49:19.524] Failed to enumerate USB devices          hub=ledger vendor=11415 failcount=3 err="failed to initialize libusb: libusb: unknown error [code -99]"
geth_1              | INFO [07-25|19:49:19.590] New local node record                    seq=12 id=821ecfdc008da69a ip=127.0.0.1 udp=30303 tcp=30303
geth_1              | INFO [07-25|19:49:19.617] Started P2P networking                   self=enode://f9b2771aeab9bc0dfc773916a1d7084888e9bb94e0f106ec685fd869672fb5d092fb224bd4223a8089c1cf9c5407c68a932a3fa6675234af8ea6cb63e7f4d310@127.0.0.1:30303
geth_1              | INFO [07-25|19:49:19.649] IPC endpoint opened                      url=/root/.ethereum/goerli/geth.ipc
geth_1              | INFO [07-25|19:49:19.679] HTTP endpoint opened                     url=http://[::]:8545/               cors= vhosts=*
lighthouse-docker_beacon_node_1 exited with code 132
geth_1              | INFO [07-25|19:49:27.397] Initializing fast sync bloom             items=1763753 errorrate=0.000 elapsed=8.014s
geth_1              | INFO [07-25|19:49:27.450] Upgrading chain index                    type=bloombits percentage=21
geth_1              | INFO [07-25|19:49:29.625] Block synchronisation started
geth_1              | INFO [07-25|19:49:29.943] Disabling direct-ancient mode            origin=3109919 ancient=2818047
geth_1              | INFO [07-25|19:49:30.020] Imported new block headers               count=65 elapsed=39.546ms number=3110009 hash="4448ed…b29e43" ignored=25
geth_1              | INFO [07-25|19:49:30.043] Downloader queue stats                   receiptTasks=0 blockTasks=0 itemSize=555.87B throttle=8192
geth_1              | INFO [07-25|19:49:30.048] Imported new block receipts              count=3  elapsed="313.016µs" number=3109922 hash="9dc0e5…160134" age=22m18s  size=1.79KiB
geth_1              | INFO [07-25|19:49:30.161] Imported new block receipts              count=23 elapsed=9.066ms     number=3109945 hash="54b52f…89191e" age=16m33s  size=59.57KiB
geth_1              | INFO [07-25|19:49:30.297] Imported new state entries               count=17 elapsed="2.3µs"     processed=9496404 pending=241 retry=0 duplicate=0 unexpected=0
geth_1              | INFO [07-25|19:49:31.241] Looking for peers                        peercount=2 tried=18 static=0
geth_1              | INFO [07-25|19:49:31.668] New local node record                    seq=13 id=821ecfdc008da69a ip=20.43.35.139 udp=30303 tcp=30303
geth_1              | INFO [07-25|19:49:31.753] Imported new state entries               count=241 elapsed="3.3µs"     processed=9496645 pending=1636 retry=0 duplicate=0 unexpected=0
geth_1              | INFO [07-25|19:49:32.733] Imported new state entries               count=384 elapsed="2.7µs"     processed=9497029 pending=6470 retry=0 duplicate=0 unexpected=0
geth_1              | INFO [07-25|19:49:33.190] Imported new state entries               count=774 elapsed="965.55µs"  processed=9497803 pending=7903 retry=0 duplicate=0 unexpected=0
geth_1              | INFO [07-25|19:49:33.434] Imported new state entries               count=421 elapsed="948.349µs" processed=9498224 pending=8227 retry=0 duplicate=0 unexpected=0
geth_1              | INFO [07-25|19:49:34.028] Imported new state entries               count=1480 elapsed=4.263ms     processed=9499704 pending=7971 retry=0 duplicate=0 unexpected=0
geth_1              | INFO [07-25|19:49:34.280] Imported new state entries               count=384  elapsed=1.486ms     processed=9500088 pending=8211 retry=0 duplicate=0 unexpected=0
geth_1              | INFO [07-25|19:49:34.501] Imported new state entries               count=384  elapsed=2.226ms     processed=9500472 pending=8403 retry=0 duplicate=0 unexpected=0
geth_1              | INFO [07-25|19:49:34.891] Imported new state entries               count=1152 elapsed=4.103ms     processed=9501624 pending=8554 retry=0 duplicate=0 unexpected=0
validator_client_1  | Jul 25 19:49:35.009 ERRO Unable connect to beacon node           error: ReqwestError(reqwest::Error { kind: Request, url: "http://beacon_node:5052/node/syncing", source: hyper::Error(Connect, ConnectError("dns error", Custom { kind: Other, error: "failed to lookup address information: Name or service not known" })) }), service: notifier

I'm also running mainnet and ropsten nodes on the same VM, but I changed the ports in the docker-compose.yml to avoid any conflicts.

Here is the docker-compose.yml:

version: "3.0"

services:
    beacon_node:
        image: sigp/lighthouse:latest
        volumes:
            - ./.lighthouse:/root/.lighthouse
            - ./scripts:/root/scripts
            - ./network:/root/network
        ports:
            - 5052:5052/tcp
            - 5053:5053/tcp
            - 9000:9000/tcp
            - 9000:9000/udp
        env_file: .env
        command: sh /root/scripts/start-beacon-node.sh
    validator_client:
        image: sigp/lighthouse:latest
        volumes:
            - ./.lighthouse:/root/.lighthouse
            - ./scripts:/root/scripts
        depends_on:
            - beacon_node
        env_file: .env
        command: sh /root/scripts/start-validator-client.sh
    geth:
        build:
            context: geth
        volumes:
            - geth:/root/.ethereum
            - ./scripts:/root/scripts
        env_file: .env
        ports:
            - 30305:30305/tcp
            - 30305:30305/udp

volumes:
    geth:

Version

I'm running the latest docker image within Ubuntu 18.04 on an azure VM.

I followed the instructions on this page: https://lighthouse-book.sigmaprime.io/become-a-validator-docker.html

michaelsproul commented 4 years ago

Hi @bjaq, I think exit code 132 could indicate a SIGILL, i.e. the process was killed for executing an illegal instruction (based on https://alinex.gitlab.io/concepts/exitcodes/). At a guess, this could come from the new BLS crypto library we've recently integrated (#1335).

What CPU are you running on? Could you please run cat /proc/cpuinfo and post the output here?
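As a sketch for anyone hitting this: an exit code above 128 is 128 plus the fatal signal number, so it can be decoded in the shell; the cpuinfo check below assumes a Linux host and is only a heuristic for the instruction-set extensions that optimised crypto assembly often relies on.

```shell
# Decode a container exit code above 128: it is 128 plus the fatal signal.
code=132
sig=$((code - 128))
echo "fatal signal number: $sig"  # 4 corresponds to SIGILL

# On Linux, check whether the CPU advertises extensions such as
# bmi2 (provides mulx) and adx (provides adcx/adox).
if [ -r /proc/cpuinfo ]; then
    grep -m1 -o -E 'bmi2|adx' /proc/cpuinfo | sort -u
fi
```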

michaelsproul commented 4 years ago

The last few lines of dmesg output immediately after triggering the bug could also be useful.

bjaq commented 4 years ago

Hi @michaelsproul. Yesterday I was able to run my beacon node by pulling the stable image instead of latest. Today I tried to reproduce the bug by pulling the latest image again, but everything is running fine. So maybe it was fixed in the meantime.
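For anyone who wants to avoid surprise upgrades from tracking latest, one option is to pin the image tag in the compose file (a sketch; the stable tag is assumed to exist on Docker Hub, as it did here):

```yaml
# docker-compose.yml excerpt (sketch): pin a tag rather than tracking latest
services:
    beacon_node:
        image: sigp/lighthouse:stable
    validator_client:
        image: sigp/lighthouse:stable
```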

For reference, I'm still going to provide my cpuinfo:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
stepping        : 4
microcode       : 0xffffffff
cpu MHz         : 2095.147
cache size      : 36608 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 21
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti tpr_shadow vnmi ept vpid fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec xsaves
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips        : 4190.29
clflush size    : 64
cache_alignment : 64
address sizes   : 44 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
stepping        : 4
microcode       : 0xffffffff
cpu MHz         : 2095.147
cache size      : 36608 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 2
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 21
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti tpr_shadow vnmi ept vpid fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec xsaves
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips        : 4190.29
clflush size    : 64
cache_alignment : 64
address sizes   : 44 bits physical, 48 bits virtual
power management:

processor       : 2
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
stepping        : 4
microcode       : 0xffffffff
cpu MHz         : 2095.147
cache size      : 36608 KB
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 2
apicid          : 2
initial apicid  : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 21
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti tpr_shadow vnmi ept vpid fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec xsaves
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips        : 4190.29
clflush size    : 64
cache_alignment : 64
address sizes   : 44 bits physical, 48 bits virtual
power management:

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
stepping        : 4
microcode       : 0xffffffff
cpu MHz         : 2095.147
cache size      : 36608 KB
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 2
apicid          : 3
initial apicid  : 3
fpu             : yes
fpu_exception   : yes
cpuid level     : 21
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti tpr_shadow vnmi ept vpid fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec xsaves
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips        : 4190.29
clflush size    : 64
cache_alignment : 64
address sizes   : 44 bits physical, 48 bits virtual
power management:

AgeManning commented 4 years ago

I've seen this issue when compiling the lighthouse binary on one machine and moving the binary to another.

The latest code seems to compile to use CPU-specific instructions. Compiling locally resolved the issue for me.

Is there a chance the binary was compiled on a machine or virtual machine other than the one running it?

bjaq commented 4 years ago

I'm using Docker, so yes, the binary was not compiled on my machine. I'm just pulling the pre-built image from Docker Hub: https://hub.docker.com/r/sigp/lighthouse.

bjaq commented 4 years ago

I tried again to reproduce the issue, without success. Everything is running fine for me now, so I will close this for now and reopen if it happens again.

AgeManning commented 4 years ago

This occurs when the lighthouse binary is built on one CPU architecture and run on another.

This is particularly annoying when building our docker image. We need to build the image in a way that works on all CPUs.

Reopening this, as it seems #1416 does not resolve it.

michaelsproul commented 4 years ago

I suspect we need to tweak the C compiler options that BLST is compiled with, as that's a recent change likely to have introduced novel CPU instructions.
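One possible mitigation is to target the baseline x86-64 ISA in both toolchains so that neither rustc nor BLST's C/assembly code assumes the build machine's extensions. This is only a sketch, not the project's actual build setup; the flag values are assumptions.

```shell
# Sketch: target the baseline x86-64 ISA so no BMI2/ADX/AVX instructions
# are emitted for the build host's CPU.
export CFLAGS="-march=x86-64"
export RUSTFLAGS="-C target-cpu=x86-64"
echo "CFLAGS=$CFLAGS"
echo "RUSTFLAGS=$RUSTFLAGS"
# cargo build --release   # run inside the Lighthouse source checkout
```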

michaelsproul commented 4 years ago

Yep, it's definitely the BLS library. I just ran lighthouse bn under GDB and got this backtrace on the SIGILL:

#0  0x0000555556ba55a1 in mulx_mont_384 ()
#1  0x0000555556b9bf18 in POINTonE1_Uncompress ()
#2  0x0000555556b8c5d1 in blst::min_pk::PublicKey::uncompress ()
#3  0x0000555556b8a753 in bls::impls::blst::<impl bls::generic_public_key::TPublicKey for blst::min_pk::PublicKey>::deserialize ()
#4  0x00005555569357a3 in state_processing::per_block_processing::signature_sets::deposit_pubkey_signature_message ()
#5  0x00005555565480d3 in eth1::deposit_log::DepositLog::from_log ()
#6  0x0000555555f48fec in <core::iter::adapters::ResultShunt<I,E> as core::iter::traits::iterator::Iterator>::next ()
#7  0x000055555575b861 in <alloc::vec::Vec<T> as alloc::vec::SpecExtend<T,I>>::from_iter ()

The mulx_mont_384 function is here: https://github.com/supranational/blst/blob/d02b0d86e25d700ce4f4cb0eccac1f743aea6986/build/elf/mulx_mont_384-x86_64.s#L1710 and contains a mulxq instruction that looks suspect; I'll investigate.
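A quick way to check whether a given binary contains such instructions is to scan its disassembly. This is a sketch: the ./lighthouse path is a placeholder for the binary under test, and objdump from binutils is assumed to be installed.

```shell
# Sketch: count BMI2/ADX instructions (mulx, adcx, adox) in a binary;
# these raise SIGILL on CPUs that lack the corresponding extensions.
if command -v objdump >/dev/null && [ -f ./lighthouse ]; then
    objdump -d ./lighthouse | grep -c -E 'mulx|adcx|adox' || true
fi
```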

andreogle commented 4 years ago

If it's any help, I'm getting this same issue on a fresh Ubuntu 20.04 instance on DigitalOcean