Closed bjaq closed 4 years ago
Hi @bjaq, I think exit code 132 corresponds to SIGILL (128 + signal 4), i.e. the process executed an illegal instruction (based on https://alinex.gitlab.io/concepts/exitcodes/). At a guess, this could be from the new BLS crypto library we've recently integrated (#1335).
What CPU are you running on? Could you please run `cat /proc/cpuinfo` and post the output here? The last few lines of `dmesg` output immediately after triggering the bug could also be useful.
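As an aside, Docker reports 128 + N when a container is killed by signal N, so the signal can be recovered from the exit code directly in the shell:

```shell
# Exit code 132 = 128 + 4, and signal 4 is SIGILL (illegal instruction).
kill -l $((132 - 128))   # prints ILL
```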
Hi @michaelsproul. Yesterday I was able to run my beacon node by pulling the `stable` image instead of `latest`. Today I tried to recreate the bug by pulling the `latest` image again, but everything is running fine, so maybe it was fixed in the meantime.
For reference, here is my cpuinfo:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
stepping : 4
microcode : 0xffffffff
cpu MHz : 2095.147
cache size : 36608 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 21
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti tpr_shadow vnmi ept vpid fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec xsaves
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips : 4190.29
clflush size : 64
cache_alignment : 64
address sizes : 44 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
stepping : 4
microcode : 0xffffffff
cpu MHz : 2095.147
cache size : 36608 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 21
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti tpr_shadow vnmi ept vpid fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec xsaves
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips : 4190.29
clflush size : 64
cache_alignment : 64
address sizes : 44 bits physical, 48 bits virtual
power management:
processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
stepping : 4
microcode : 0xffffffff
cpu MHz : 2095.147
cache size : 36608 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 2
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 21
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti tpr_shadow vnmi ept vpid fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec xsaves
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips : 4190.29
clflush size : 64
cache_alignment : 64
address sizes : 44 bits physical, 48 bits virtual
power management:
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
stepping : 4
microcode : 0xffffffff
cpu MHz : 2095.147
cache size : 36608 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 2
apicid : 3
initial apicid : 3
fpu : yes
fpu_exception : yes
cpuid level : 21
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti tpr_shadow vnmi ept vpid fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec xsaves
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips : 4190.29
clflush size : 64
cache_alignment : 64
address sizes : 44 bits physical, 48 bits virtual
power management:
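For anyone hitting this, the implicated instruction-set extensions can be checked against the flags line above: `mulx` belongs to BMI2 and `adcx`/`adox` to ADX, both of which optimized big-integer assembly (including BLST's x86_64 code path) relies on. A quick sketch:

```shell
# Check whether this CPU advertises the BMI2 and ADX extensions; a binary
# compiled to use them will SIGILL on a CPU that lacks either one.
flags=$(grep -m1 '^flags' /proc/cpuinfo)
missing=""
for f in bmi2 adx; do
    echo "$flags" | grep -qw "$f" || missing="$missing $f"
done
if [ -z "$missing" ]; then
    echo "bmi2 and adx present"
else
    echo "missing:$missing (a binary built for them will SIGILL here)"
fi
```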
I've seen this issue when compiling the lighthouse binary on one machine and moving the binary to another.
The latest code seems to compile down to CPU-specific instructions; compiling locally resolved the issue for me.
Is there a chance the binary was compiled on a machine or virtual machine other than the one running it?
I'm using Docker, so yes, the binary was not compiled on my machine. I'm just pulling the pre-compiled image from Docker Hub: https://hub.docker.com/r/sigp/lighthouse.
I tried again to reproduce the issue, without success. Everything is running fine for me now, so I will close this for now and reopen if it happens again.
This occurs when the lighthouse binary is built on one CPU architecture and run on another.
This is particularly annoying when building our Docker image. We need to build the image in a way that is portable across CPUs.
Reopening this, as it seems #1416 does not resolve it.
I suspect we need to tweak the C compiler options that BLST is compiled with, as that's a recent change likely to have introduced novel CPU instructions
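One option along these lines: upstream blst supports a portable C build (the `__BLST_PORTABLE__` define, surfaced as a `portable` feature on its Rust bindings) that avoids the hand-written ADX/BMI2 assembly. A hedged sketch of opting into it, assuming the feature is forwarded through Lighthouse's Cargo features:

```shell
# Hypothetical invocation: `portable` is the blst crate's feature name;
# whether Lighthouse forwards it under the same name is an assumption.
cargo build --release --features portable
```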
Yep, it's definitely the BLS library. I just ran `lighthouse bn` under GDB and got this backtrace on the SIGILL:
#0  0x0000555556ba55a1 in mulx_mont_384 ()
#1  0x0000555556b9bf18 in POINTonE1_Uncompress ()
#2  0x0000555556b8c5d1 in blst::min_pk::PublicKey::uncompress ()
#3  0x0000555556b8a753 in bls::impls::blst::<impl bls::generic_public_key::TPublicKey for blst::min_pk::PublicKey>::deserialize ()
#4  0x00005555569357a3 in state_processing::per_block_processing::signature_sets::deposit_pubkey_signature_message ()
#5  0x00005555565480d3 in eth1::deposit_log::DepositLog::from_log ()
#6  0x0000555555f48fec in <core::iter::adapters::ResultShunt<I,E> as core::iter::traits::iterator::Iterator>::next ()
#7  0x000055555575b861 in <alloc::vec::Vec<T> as alloc::vec::SpecExtend<T,I>>::from_iter ()
The `mulx_mont_384` function is here: https://github.com/supranational/blst/blob/d02b0d86e25d700ce4f4cb0eccac1f743aea6986/build/elf/mulx_mont_384-x86_64.s#L1710 and contains a `mulxq` instruction that looks suspect; I'll investigate.
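To illustrate how this happens at compile time, here is a small sketch (not Lighthouse code): once BMI2 is enabled, GCC will typically emit `mulx` for a widening 64-bit multiply, which is exactly how an extension-specific instruction ends up baked into a binary that then faults on older CPUs:

```shell
# Sketch: compile a tiny widening multiply with BMI2 enabled and inspect
# the object code for mulx. Assumes gcc and binutils are installed.
cat > /tmp/mulx_demo.c <<'EOF'
#include <stdint.h>
uint64_t widen_mul(uint64_t a, uint64_t b, uint64_t *hi) {
    unsigned __int128 p = (unsigned __int128)a * b;
    *hi = (uint64_t)(p >> 64);
    return (uint64_t)p;
}
EOF
if gcc -O2 -mbmi2 -c /tmp/mulx_demo.c -o /tmp/mulx_demo.o 2>/dev/null; then
    # On x86_64 this typically shows a mulx instruction in the listing.
    objdump -d /tmp/mulx_demo.o | grep mulx || echo "no mulx emitted"
else
    echo "compiler does not target x86_64 BMI2"
fi
```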
If it's any help, I'm getting this same issue on a fresh Ubuntu 20.04 instance on DigitalOcean.
Description
When running Lighthouse with docker-compose on Ubuntu, my beacon node container exits quickly after starting, with exit code 132. I see no errors in the logs apart from the Docker exit code (see below). Geth and the validator client are running fine. I'm a bit lost.
I'm also running mainnet and Ropsten nodes on the same VM, but I changed the ports in the docker-compose.yml to avoid any conflicts.
Here is the docker-compose.yml:
Version
I'm running the latest Docker image on Ubuntu 18.04 on an Azure VM.
I followed the instructions on this page: https://lighthouse-book.sigmaprime.io/become-a-validator-docker.html