illegal hardware instruction on loading any torch module

nonchip commented 6 years ago

I installed torch as per the documentation, then tried to write a simple rnn script, but it crashed immediately with an illegal instruction.

I simplified my code down to just:

print(1)
require 'torch' -- substitute for any of cunn, nn, rnn, ... essentially any torch module I tried behaves the same
print(2)

I run it with just luajit test.lua

Expected output:

1
2

Actual situation:

1
[1]    14719 illegal hardware instruction  luajit test.lua

Running with th instead of luajit behaves the same. In fact, running th with any arguments other than --help (which works) also crashes with an illegal instruction.

my CPU:

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 10
model name      : AMD Phenom(tm) II X6 1090T Processor
stepping        : 0
microcode       : 0x10000bf
cpu MHz         : 3206.252
cache size      : 512 KB
physical id     : 0
siblings        : 6
core id         : 0
cpu cores       : 6
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 6
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt cpb hw_pstate vmmcall npt lbrv svm_lock nrip_save pausefilter
bugs            : tlb_mmatch fxsave_leak sysret_ss_attrs null_seg
bogomips        : 6412.50
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate cpb

my GPU: 02:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX 980 Ti] (rev a1)

driver versions (and their respective sources):

glxinfo:
  OpenGL version string: 4.5.0 NVIDIA 384.98
  OpenGL shading language version string: 4.50 NVIDIA
  OpenGL ES profile version string: OpenGL ES 3.2 NVIDIA 384.98
  OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20
nvcc:
  NVIDIA (R) Cuda compilation tools, release 9.1, V9.1.85
/proc/driver/nvidia:
  NVRM version: NVIDIA UNIX x86_64 Kernel Module  387.26  Thu Nov  2 21:20:16 PDT 2017

cuda device info:

/opt/cuda-9.1/extras/demo_suite/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 980 Ti"
  CUDA Driver Version / Runtime Version          9.1 / 9.1
  CUDA Capability Major/Minor version number:    5.2
  Total amount of global memory:                 6073 MBytes (6368198656 bytes)
  (22) Multiprocessors, (128) CUDA Cores/MP:     2816 CUDA Cores
  GPU Max Clock rate:                            1240 MHz (1.24 GHz)
  Memory Clock rate:                             3505 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 2 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime Version = 9.1, NumDevs = 1, Device0 = GeForce GTX 980 Ti
Result = PASS

tastyminerals commented 6 years ago

This could be because your torch binaries were compiled for a newer processor architecture than the one you are running them on (e.g. a binary compiled with -mavx but run on a pre-Sandy Bridge processor, which has no AVX support).

As mentioned in https://github.com/torch/torch7/issues/666 (haha, this issue :imp: )
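One way to test this hypothesis is to compare the CPU's advertised flags against the instruction set extensions a build might assume. The sketch below is only an illustration: the function names and the default list of extensions are hypothetical choices, not anything torch provides. It parses text in /proc/cpuinfo format and reports which of the named extensions are absent. For the CPU posted above, the flags line lists sse, sse2, pni and sse4a but no avx, which is consistent with this explanation without proving it.

```python
# Illustrative sketch only: the helper names and the default extension list
# are assumptions made for this example, not part of torch or luarocks.

def parse_cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_extensions(cpuinfo_text,
                       wanted=("ssse3", "sse4_1", "sse4_2", "avx", "avx2")):
    """List the extensions in `wanted` that the CPU does not advertise."""
    flags = parse_cpu_flags(cpuinfo_text)
    return [name for name in wanted if name not in flags]
```

On a Linux machine, calling `missing_extensions(open("/proc/cpuinfo").read())` returns the subset of the listed extensions that the CPU lacks. If a library was built with a flag such as -mavx and "avx" appears in that result, an illegal instruction at load time would be an expected symptom.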

nonchip commented 6 years ago

@tastyminerals nope, it couldn't, since luarocks is supposed to (and seemed to) compile them on the target machine. If it actually didn't for some critical parts, maybe you should just stop providing binary blobs.

I finally got it to run, though, by creating a completely isolated setup of luajit/rocks/modules. See https://gist.github.com/nonchip/2c93ff2d9bc1bf2cd12bc6e76010da0f for the cuda part and https://github.com/nonchip/bnbot/blob/master/setup.sh for the kind of setup I made in my project.

torch / torch7

illegal hardware instruction on loading any torch module #1129