julia-actions / cache

A shortcut action to cache Julia artifacts, packages, and registries.
MIT License

`signal (4.2): Illegal instruction` #114

Open omus opened 3 months ago

omus commented 3 months ago

I've been seeing some strange behavior when running Julia code after a successful cache restore. The failures show up in multiple workflows at seemingly random locations:

Invalid instruction at 0x7fc6b262308d: 0x62, 0xd1, 0xf7, 0x08, 0x7b, 0xde, 0xc5, 0xe1, 0x57, 0x05, 0xf5, 0xe5, 0xfd, 0xff, 0xc5

[2552] signal (4.2): Illegal instruction

I'm still gathering information on this problem, but my working theory is that the `vcvtusi2sd` instruction (obtained by disassembling the hex above) requires the AVX512F CPU feature. Could the `ubuntu-latest` runners be switching between AMD and Intel CPUs?

Debugging this has been made more challenging by #113.
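One quick way to sanity-check the AVX-512 theory without a disassembler is to look at the first byte of the faulting instruction. In 64-bit mode, a leading `0x62` byte is the EVEX prefix, which is the encoding used by AVX-512 instructions. The sketch below is only a heuristic under that assumption; the function names are hypothetical and it does not decode the instruction itself.

```python
# Log lines copied verbatim from the reports in this thread.
LOG_LINES = [
    "Invalid instruction at 0x7fc6b262308d: 0x62, 0xd1, 0xf7, 0x08, 0x7b, 0xde, "
    "0xc5, 0xe1, 0x57, 0x05, 0xf5, 0xe5, 0xfd, 0xff, 0xc5",
    "Invalid instruction at 0x75fc0a81d157: 0x62, 0xf2, 0x7d, 0x48, 0x7c, 0xc0, "
    "0x62, 0xf1, 0x7d, 0x48, 0xfe, 0x0d, 0x99, 0x40, 0xfe",
]

# In 64-bit mode, an instruction whose first byte is 0x62 is EVEX-encoded.
EVEX_PREFIX = 0x62


def parse_instruction_bytes(line):
    """Return the bytes listed after the faulting address as a list of ints."""
    _, _, tail = line.partition(": ")
    return [int(token.strip(), 16) for token in tail.split(",") if token.strip()]


def starts_with_evex(line):
    """Heuristic: True if the faulting instruction begins with the EVEX prefix."""
    data = parse_instruction_bytes(line)
    return bool(data) and data[0] == EVEX_PREFIX


if __name__ == "__main__":
    for log_line in LOG_LINES:
        print(starts_with_evex(log_line), log_line[:40])
```

Both reported crashes start with `0x62`, which is consistent with the code having been compiled for a CPU that supports AVX-512.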

omus commented 3 months ago

Another example:

Invalid instruction at 0x75fc0a81d157: 0x62, 0xf2, 0x7d, 0x48, 0x7c, 0xc0, 0x62, 0xf1, 0x7d, 0x48, 0xfe, 0x0d, 0x99, 0x40, 0xfe

[1732] signal (4.2): Illegal instruction

omus commented 3 months ago

I ended up printing `/proc/cpuinfo` in my workflow and found that GitHub-hosted `ubuntu-latest` runners do indeed switch between Intel and AMD CPUs. In my particular case, runs were successful on Intel but not on AMD. I suspect the cache from `main` was originally created on Intel.

Intel CPU

```
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
stepping : 7
microcode : 0xffffffff
cpu MHz : 2593.906
cache size : 36608 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 21
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec xsaves md_clear
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit mmio_stale_data retbleed gds
bogomips : 5187.81
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
stepping : 7
microcode : 0xffffffff
cpu MHz : 2593.906
cache size : 36608 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 21
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec xsaves md_clear
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit mmio_stale_data retbleed gds
bogomips : 5187.81
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
```

AMD CPU

```
processor : 0
vendor_id : AuthenticAMD
cpu family : 25
model : 1
model name : AMD EPYC 7763 64-Core Processor
stepping : 1
microcode : 0xffffffff
cpu MHz : 3238.877
cache size : 512 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload umip vaes vpclmulqdq rdpid fsrm
bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass srso
bogomips : 4890.86
TLB size : 2560 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : AuthenticAMD
cpu family : 25
model : 1
model name : AMD EPYC 7763 64-Core Processor
stepping : 1
microcode : 0xffffffff
cpu MHz : 3243.623
cache size : 512 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload umip vaes vpclmulqdq rdpid fsrm
bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass srso
bogomips : 4890.86
TLB size : 2560 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management:
```

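For anyone who wants to log this in their own workflow without dumping the whole file, here is a minimal Python sketch that reduces `/proc/cpuinfo` to the vendor, the model name, and whether `avx512f` is among the flags. The helper names and the abbreviated sample strings are made up for illustration; only the field names (`vendor_id`, `model name`, `flags`) come from the output above.

```python
from pathlib import Path

# Abbreviated samples based on the dumps in this thread (flags heavily trimmed).
INTEL_SAMPLE = """processor : 0
vendor_id : GenuineIntel
model name : Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
flags : fpu sse sse2 avx avx2 avx512f avx512dq xsaveopt
"""

AMD_SAMPLE = """processor : 0
vendor_id : AuthenticAMD
model name : AMD EPYC 7763 64-Core Processor
flags : fpu sse sse2 avx avx2 clwb sha_ni xsaveopt
"""


def parse_first_cpu(cpuinfo_text):
    """Parse the first processor stanza of /proc/cpuinfo into a dict."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        if not line.strip():
            if fields:
                break  # a blank line ends the first stanza
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def summarize(cpuinfo_text):
    """Return the vendor, the model, and whether avx512f is listed in the flags."""
    cpu = parse_first_cpu(cpuinfo_text)
    flags = set(cpu.get("flags", "").split())
    return {
        "vendor": cpu.get("vendor_id", "unknown"),
        "model": cpu.get("model name", "unknown"),
        "avx512f": "avx512f" in flags,
    }


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():  # Linux only
        print(summarize(cpuinfo.read_text()))
```

Printing this summary at the start of a job makes it easier to correlate failing runs with the CPU the runner happened to get.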
IanButterworth commented 3 months ago

Yeah we've seen this on 1.9 but I believe it's fixed on 1.10, though I haven't confirmed. It seems Julia isn't rejecting caches that were generated on different CPU architectures. Is your CI running on different kinds of runners?

omus commented 3 months ago

> Yeah we've seen this on 1.9 but I believe it's fixed on 1.10

Good to know. The reported failures are on Julia 1.9.4.

I've had luck setting `JULIA_CPU_TARGET` in Docker images, so I may try this as a workaround for now:

```yaml
# Set x86_64 targets for improved compatibility
# https://docs.julialang.org/en/v1/devdocs/sysimg/#Specifying-multiple-system-image-targets
env:
  JULIA_CPU_TARGET: "generic;sandybridge,-xsaveopt,clone_all;haswell,-rdrnd,base(1)"
```
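To make the structure of that target string easier to read: targets are separated by `;`, and each target is a CPU name followed by comma-separated options, where a leading `-` disables a feature. The Python sketch below splits the string along those lines. It is a simplified reading of the syntax for illustration, not Julia's own parser, and `parse_cpu_targets` is a hypothetical helper.

```python
# The value from the workaround above.
SPEC = "generic;sandybridge,-xsaveopt,clone_all;haswell,-rdrnd,base(1)"


def parse_cpu_targets(spec):
    """Split a JULIA_CPU_TARGET string into one dict per target.

    Simplified: options starting with '-' are treated as disabled features,
    and everything else (for example clone_all or base(1)) is kept as-is
    under "other".
    """
    targets = []
    for entry in spec.split(";"):
        name, *options = [part.strip() for part in entry.split(",")]
        targets.append({
            "cpu": name,
            "disabled": [opt[1:] for opt in options if opt.startswith("-")],
            "other": [opt for opt in options if not opt.startswith("-")],
        })
    return targets


if __name__ == "__main__":
    for target in parse_cpu_targets(SPEC):
        print(target)
```

Read this way, the string asks for three targets: a `generic` baseline, `sandybridge` without `xsaveopt`, and `haswell` without `rdrnd`.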