halide / Halide

a language for fast, portable data-parallel computation
https://halide-lang.org

Multi-target generated filter attempts to use AVX instructions on non-AVX system #4041

Open SanderVocke opened 5 years ago

SanderVocke commented 5 years ago

Hi,

I have built an application using a Halide-generated filter. Halide 1:2018.02.15-1 was used. The filter was generated for the following targets: linux-x86-64,linux-x86-64-sse41,linux-x86-64-sse41-avx

The application runs without any issue on a VirtualBox Ubuntu VM when AVX and AVX2 are enabled. However, when I disable AVX and AVX2 on the VM and run the application again, I receive the following message:

[ 555.468199] traps: my-application[1648] trap invalid opcode ip:d0360c sp:7efd85ff7250 error:0 in my-application[400000+10c2000]

I looked up address d0360c in the binary with objdump -d my-application | grep -i and found:

d0360c: c5 fb 11 44 24 08 vmovsd %xmm0,0x8(%rsp)

I checked the CPU flags after disabling AVX and AVX2 with cat /proc/cpuinfo | grep flags. The same line is reported four times, once per CPU:

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase invpcid rdseed

I also checked the "cpuid" command on the VM (which I believe the Halide runtime uses for determining target capabilities). It also reports AVX as unsupported.

It seems that Halide does not choose the appropriate target to execute the code when AVX and AVX2 are disabled.

Am I missing something obvious here?

steven-johnson commented 5 years ago

"linux-x86-64,linux-x86-64-sse41,linux-x86-64-sse41-avx".

One issue here is that the code searches in the order specified (left to right), stopping at the first subtarget that is determined to be safe to use at runtime; this ordering should (in theory) only give you base x86-64, never using avx. What you want is

"linux-x86-64-sse41-avx,linux-x86-64-sse41,linux-x86-64".

...that said, that makes this even weirder. Can you tell where the AVX instruction is being used? Or which one is being used? Can you replicate it with a trivial example (which you could post)? Can you replicate it if you specify just "linux-x86-64"?
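As a rough illustration of the left-to-right search described above, here is a minimal C++ sketch. The SubTarget struct, the feature bit values, and select_first_usable are hypothetical names invented for this example; they are not Halide's actual runtime API or its real feature encoding.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical feature bits for this sketch only (not Halide's real values).
enum Feature : uint64_t { kSSE41 = 1u << 0, kAVX = 1u << 1 };

// One entry of a multi-target list: a name plus the CPU features it requires.
struct SubTarget {
    std::string name;
    uint64_t required;
};

// Walk the list left to right and return the index of the first sub-target
// whose required features are all present in cpu_features, or -1 if none fits.
int select_first_usable(const std::vector<SubTarget>& targets, uint64_t cpu_features) {
    for (std::size_t i = 0; i < targets.size(); ++i) {
        if ((targets[i].required & ~cpu_features) == 0) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

Under these assumptions, a list that starts with the baseline target always returns index 0, even on a CPU with AVX, because the baseline requires nothing. With the most specific target first, an AVX-capable CPU picks the AVX variant and other CPUs fall through to the next entry that fits.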

SanderVocke commented 5 years ago

Thanks Steven, I'm looking into trying to make a minimal example of this and getting some extra information.

SanderVocke commented 5 years ago

One thing I have noticed is that not a single vmovsd instruction ends up in the application binary when I remove the AVX target from the list. So the AVX instruction that was triggered is definitely one generated by Halide, though I get that doesn't help much.

I will look at making a reproducible piece of code available.

SanderVocke commented 5 years ago

The faulting instruction is at offset 2c:

0000000000000000 <gaussian_blur_halide_gen_impl1>:
   0:   41 57                   push   %r15
   2:   41 56                   push   %r14
   4:   53                      push   %rbx
   5:   48 83 ec 20             sub    $0x20,%rsp
   9:   49 89 f6                mov    %rsi,%r14
   c:   48 89 fb                mov    %rdi,%rbx
   f:   48 8b 05 00 00 00 00    mov    0x0(%rip),%rax        # 16 <gaussian_blur_halide_gen_impl1+0x16>
  16:   48 85 c0                test   %rax,%rax
  19:   74 11                   je     2c <gaussian_blur_halide_gen_impl1+0x2c>
  1b:   48 89 df                mov    %rbx,%rdi
  1e:   4c 89 f6                mov    %r14,%rsi
  21:   48 83 c4 20             add    $0x20,%rsp
  25:   5b                      pop    %rbx
  26:   41 5e                   pop    %r14
  28:   41 5f                   pop    %r15
  2a:   ff e0                   jmpq   *%rax
  2c:   c5 fb 11 44 24 08       vmovsd %xmm0,0x8(%rsp)
  32:   c5 fb 11 4c 24 10       vmovsd %xmm1,0x10(%rsp)
  38:   c5 fb 11 54 24 18       vmovsd %xmm2,0x18(%rsp)
  3e:   bf 10 00 00 08          mov    $0x8000010,%edi
  43:   e8 00 00 00 00          callq  48 <gaussian_blur_halide_gen_impl1+0x48>
  48:   85 c0                   test   %eax,%eax
  4a:   75 17                   jne    63 <gaussian_blur_halide_gen_impl1+0x63>
  4c:   4c 8b 3d 00 00 00 00    mov    0x0(%rip),%r15        # 53 <gaussian_blur_halide_gen_impl1+0x53>
  53:   bf 00 00 00 08          mov    $0x8000000,%edi
  58:   e8 00 00 00 00          callq  5d <gaussian_blur_halide_gen_impl1+0x5d>
  5d:   85 c0                   test   %eax,%eax
  5f:   75 17                   jne    78 <gaussian_blur_halide_gen_impl1+0x78>
  61:   eb 1c                   jmp    7f <gaussian_blur_halide_gen_impl1+0x7f>
  63:   4c 8b 3d 00 00 00 00    mov    0x0(%rip),%r15        # 6a <gaussian_blur_halide_gen_impl1+0x6a>
  6a:   bf 00 00 00 08          mov    $0x8000000,%edi
  6f:   e8 00 00 00 00          callq  74 <gaussian_blur_halide_gen_impl1+0x74>
  74:   85 c0                   test   %eax,%eax

Looks to me like this is the part where the filter implementation gets selected.

This bit of the assembly looks very different when I reverse the target orders, with no vmovsd instructions at that particular place.

abadams commented 5 years ago

Reading the source, looks like the wrapper code uses the last target in the list, making the assumption that it's the most general one. We could change that code to get it to use the most generic target, or (probably more helpful) we could error out if we detect that any target in the list has an earlier target which is more general, to nudge people into listing them in the desired order.
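The ordering check suggested here could look roughly like the following C++ sketch. It assumes each target's required features are available as a bitmask; more_general and first_misordered are hypothetical helpers written for this example, not existing Halide functions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// True if a is strictly more general than b, i.e. a's required features
// are a proper subset of b's required features.
bool more_general(uint64_t a, uint64_t b) {
    return a != b && (a & ~b) == 0;
}

// Return the index of the first target that is preceded by a more general
// target, or -1 if the list runs from most specific to most general.
int first_misordered(const std::vector<uint64_t>& required) {
    for (std::size_t i = 1; i < required.size(); ++i) {
        for (std::size_t j = 0; j < i; ++j) {
            if (more_general(required[j], required[i])) {
                return static_cast<int>(i);
            }
        }
    }
    return -1;
}
```

A generator could call such a check on the target list and report an error for any non-negative result. In this sketch, a list like baseline, sse41, sse41-avx is flagged at index 1, while the reversed list passes.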

abadams commented 5 years ago

So Sander: putting the targets in the intended order (most advanced to least advanced) will probably fix the problem.

SanderVocke commented 5 years ago

Sorry for the radio silence. Putting targets in the intended order indeed solved our issue. Thanks for the help!

Feel free to close this issue if you don't plan on making any changes.

steven-johnson commented 5 years ago

I vote to leave it open, as I think that generating a compile error in this case is likely a good solution to prevent this in the future. (I'm not likely to get around to it soon, so a PR to do this would be welcome.)