livepeer / go-livepeer

Official Go implementation of the Livepeer protocol
http://livepeer.org
MIT License

SIGILL from GMP on Celeron processors while GPU transcoding #1385

Open iameli opened 4 years ago

iameli commented 4 years ago

go-livepeer appears to be crashing on some older CPU architectures, specifically the Celeron G3930. Update from comments: the crash appears to originate from the GMP we ship within our static go-livepeer binary. It also only shows up when we're doing GPU transcoding. I do not know why.

Error is:

I0214 23:17:53.076123       1 lb.go:75] LB: Creating transcode session for a2e0a4af-1f2c-4f93-b56d-62178e6f4c58
I0214 23:17:53.076168       1 lb.go:109] LB: Created transcode session for a2e0a4af-1f2c-4f93-b56d-62178e6f4c58_0
I0214 23:17:53.076178       1 lb.go:202] LB: Transcode submitted for a2e0a4af-1f2c-4f93-b56d-62178e6f4c58_0
SIGILL: illegal instruction
PC=0x12805a4 m=3 sigcode=2

goroutine 0 [idle]:
runtime: unknown pc 0x12805a4
stack: frame={sp:0x7f1f2f7349f0, fp:0x0} stack=[0x7f1f2b73a288,0x7f1f2f739e88)

Here's a gist with the full logs: https://gist.github.com/iameli/be4ae26f06f906678556f3e91a16e5a7

> cat /proc/cpuinfo
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 158
model name  : Intel(R) Celeron(R) CPU G3930 @ 2.90GHz
stepping    : 9
microcode   : 0xca
cpu MHz     : 2900.008
cache size  : 2048 KB
physical id : 0
siblings    : 2
core id     : 0
cpu cores   : 2
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 22
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust smep erms invpcid mpx rdseed smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips    : 5808.00
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 6
model       : 158
model name  : Intel(R) Celeron(R) CPU G3930 @ 2.90GHz
stepping    : 9
microcode   : 0xca
cpu MHz     : 2900.000
cache size  : 2048 KB
physical id : 0
siblings    : 2
core id     : 1
cpu cores   : 2
apicid      : 2
initial apicid  : 2
fpu     : yes
fpu_exception   : yes
cpuid level : 22
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust smep erms invpcid mpx rdseed smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips    : 5808.00
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

iameli commented 4 years ago

Seems very likely this is C rather than Go. I'm going to build everything on one of the machines that's not working and see if it enables different CPU flags and whatnot than a regular build.
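
The `runtime: unknown pc` line in the crash output fits that reading: the Go runtime prints it when the faulting address does not map to any Go function it knows about. As a rough sketch of a first triage step, the snippet below pulls the faulting PC out of a crash header like the one above so it can be handed to a symbolizer. The names `pcRe` and `faultPC` are made up for illustration and are not part of go-livepeer.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// pcRe matches the "PC=0x..." field that the Go runtime prints at the
// start of a line in a fatal-signal crash header.
var pcRe = regexp.MustCompile(`(?m)^PC=0x([0-9a-fA-F]+)`)

// faultPC (hypothetical helper) returns the faulting program counter
// found in crashLog, or false if no PC field is present.
func faultPC(crashLog string) (uint64, bool) {
	m := pcRe.FindStringSubmatch(crashLog)
	if m == nil {
		return 0, false
	}
	pc, err := strconv.ParseUint(m[1], 16, 64)
	if err != nil {
		return 0, false
	}
	return pc, true
}

func main() {
	// Crash header copied from the log in this issue.
	crash := "SIGILL: illegal instruction\nPC=0x12805a4 m=3 sigcode=2\n"
	if pc, ok := faultPC(crash); ok {
		fmt.Printf("faulting PC: %#x\n", pc)
	}
}
```

Assuming the binary is not position-independent and still carries symbols, the resulting address could then be passed to a tool such as `addr2line -f -e <binary>` to see which C function it falls in.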

iameli commented 4 years ago

Hmm, I had thought this was unrelated to GPU code, but it looks like it only shows up with -nvidia enabled. CPU transcoding works fine.

iameli commented 4 years ago

I barely know what I'm doing with debugging C code and whatnot, but I think this means the problem is GMP trying to make use of CPU opcodes that aren't there.

(gdb) continue
Continuing.

Thread 10 "livepeer" received signal SIGILL, Illegal instruction.
[Switching to Thread 0x7f9573fff700 (LWP 115)]
0x0000000001284997 in __gmpn_sqr_basecase ()
(gdb) bt
#0  0x0000000001284997 in __gmpn_sqr_basecase ()
#1  0xfffffffffffffffc in ?? ()
#2  0x0000000000000000 in ?? ()

I've tried compiling with CFLAGS="-mnoavx -mnoavx2" without success.
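
For what it's worth, the `flags` lines in the `/proc/cpuinfo` dump above list none of `avx`, `avx2`, `bmi1`, `bmi2` or `adx`. The `mulx` instruction needs BMI2 and `adcx`/`adox` need ADX, so if the bundled GMP was built with assembly tuned for a CPU that has those extensions (an assumption, not something verified here), that would explain a SIGILL that AVX-only compiler flags do not fix. A minimal sketch for checking a machine before scheduling onto it follows; `cpuFlags`, `missingFlags` and the list of wanted flags are illustrative choices, not go-livepeer code.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// cpuFlags returns the set of CPU flags from the first "flags" line of
// /proc/cpuinfo-formatted text.
func cpuFlags(cpuinfo string) map[string]bool {
	flags := make(map[string]bool)
	for _, line := range strings.Split(cpuinfo, "\n") {
		key, val, ok := strings.Cut(line, ":")
		if !ok || strings.TrimSpace(key) != "flags" {
			continue
		}
		for _, f := range strings.Fields(val) {
			flags[f] = true
		}
		break // all cores on this machine report the same flags
	}
	return flags
}

// missingFlags reports which of the wanted flags are absent, in the
// order they were asked for.
func missingFlags(cpuinfo string, wanted []string) []string {
	have := cpuFlags(cpuinfo)
	var missing []string
	for _, w := range wanted {
		if !have[w] {
			missing = append(missing, w)
		}
	}
	return missing
}

func main() {
	data, err := os.ReadFile("/proc/cpuinfo")
	if err != nil {
		fmt.Println("cannot read /proc/cpuinfo:", err)
		return
	}
	// Extensions that tuned x86-64 assembly commonly relies on; the exact
	// set the shipped GMP needs is an assumption.
	wanted := []string{"avx", "avx2", "bmi1", "bmi2", "adx"}
	fmt.Println("missing:", missingFlags(string(data), wanted))
}
```

Run against the dump above, this would report all five flags as missing.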

iameli commented 4 years ago

Confirmed this is a problem with the GMP we ship in the statically-linked binary. Worked around it by using system-provided gnutls, removing the static linking. Not sure how to go about getting a more-compatible gnutls build... presumably some kind of configure flag on GMP and/or gnutls and/or ffmpeg.

It's a mystery to me why this shows up only if we're doing GPU transcoding. GMP is only used as a gnutls dependency, so that implies it was being used when we were making internal HTTPS requests for segments... why wouldn't that also happen when CPU transcoding?

iameli commented 3 years ago

This, or something like it, has started to occur again. Certain orchestrators are getting scheduled onto Celeron processors in ORD and crashlooping. Gotta be... tensorflow?

hthillman commented 2 years ago

@iameli can this be closed? Is this the same issue as #2023?