linux-lts + Electron 1.3.2-3 blank screen

repsac-by commented 8 years ago

Electron 1.3.2-3 launched with a white screen. Atom 1.9.6-1 crashed at startup.

Same behavior on X11 and Wayland.

tensor5 commented 8 years ago

Mmm... how about electron -i?

repsac-by commented 8 years ago

@tensor5 it works

$ electron -i
> process.versions
{ http_parser: '2.7.1',
  node: '6.3.0',
  v8: '5.2.361.43',
  uv: '1.9.1',
  zlib: '1.2.8',
  ares: '1.11.0',
  modules: '49',
  openssl: '1.0.2h',
  electron: '1.3.2',
  'atom-shell': '1.3.2',
  chrome: '52.0.2743.82' }
>

repsac-by commented 8 years ago

The problem is specific to linux-lts 4.4.16. Electron 1.3.2-3 on linux 4.6.4 works.

tensor5 commented 8 years ago

That explains why I didn't hit the problem. Out of curiosity, did Chromium 52 work on linux-lts?

repsac-by commented 8 years ago

linux-lts + chromium 52.0.2743.85-2 works fine.

tensor5 commented 8 years ago

You said that electron starts with a blank screen; by chance, are you able to open a developer console there with ctrl+shift+I?

repsac-by commented 8 years ago

With ctrl+shift+I nothing happens, just a blank screen.

tensor5 commented 8 years ago

Tried linux-lts with electron-1.3.3-1 and confirm the issue. The output of dmesg:

[...]
[   42.479909] traps: electron[1537] trap invalid opcode ip:18a4291 sp:7ffefb311030 error:0 in electron[400000+3b8b000]
[...]
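
For anyone reading the trap line: the bracketed part after the binary name is the start address and size of the mapping that contains the faulting instruction pointer, so the crash site can be checked against the mapping and expressed as an offset into it. A minimal C sketch of that arithmetic follows; the macro and function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the dmesg line above:
 *   ip:18a4291 ... in electron[400000+3b8b000]
 * i.e. instruction pointer, mapping start, mapping size. */
#define TRAP_IP       0x18a4291ULL
#define TRAP_MAP_BASE 0x400000ULL
#define TRAP_MAP_SIZE 0x3b8b000ULL

/* True if ip falls inside [start, start + size). */
bool ip_in_mapping(uint64_t ip, uint64_t start, uint64_t size) {
    return ip >= start && ip - start < size;
}

/* Offset of ip from the start of the mapping. */
uint64_t ip_offset(uint64_t ip, uint64_t start) {
    assert(ip >= start); /* caller should check ip_in_mapping first */
    return ip - start;
}
```

With the values above, the ip lies inside the electron mapping and ip_offset(TRAP_IP, TRAP_MAP_BASE) is 0x14a4291.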

tensor5 commented 8 years ago

It works with electron-1.3.2-2, which makes me think that it has something to do with 256d7b9601d6d061c483e476aeee6afb227473af.

pesho commented 8 years ago

@tensor5, in response to your question here: https://bugs.archlinux.org/task/50357#comment149928 I confirm experiencing the same, and I'm also using linux-lts.

Frizi commented 8 years ago

I confirm, atom crashes at startup and electron alone starts blank. Works on non-lts kernel.

Workaround: install the latest version of the electron-prebuilt package from npm; atom will pick it up automatically after a shell restart.

$ npm install -g electron-prebuilt

remexre commented 8 years ago

I'm getting this bug with the linux-samus4 kernel; uname -r is 4.4.2-6ph. Workaround works here too. Can anyone explain why using the system toolchain would cause an illegal instruction, rather than the inverse? (Assuming it is 256d7b9.)

tensor5 commented 8 years ago

I tried reverting 256d7b9601d6d061c483e476aeee6afb227473af, but it doesn't solve the issue.

stefanhusmann commented 8 years ago

The bug seems to be there again in stock Arch Linux kernel 4.7.

tensor5 commented 8 years ago

It works with 4.7.0-1-ck.

tensor5 commented 8 years ago

@stefanhusmann It works for me also on 4.7.0-1-ARCH. What architecture is your machine?

remexre commented 8 years ago

http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html

It sounds like we're invoking UB? The actual function that's giving the error looks like:

    retq ; Previous function's return
crash_here: ; Note: named for convenience; no symbol is present in binary
    ud2 ; We crash on this instruction
    nopw %cs:0x0(%rax,%rax,1) ; For alignment?
    nopl (%rax)               ; For alignment?
    retq

We crash on the ud2 instruction, which intentionally causes SIGILL for debugging purposes. gdb has trouble reading the stack, possibly(?) because crash_here never creates a stack frame.

EDIT: In case there's a correlation:

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 61
Model name:            Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz
Stepping:              4
CPU MHz:               2390.062
CPU max MHz:           3000.0000
CPU min MHz:           500.0000
BogoMIPS:              4788.56
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              4096K
NUMA node0 CPU(s):     0-3
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap xsaveopt

tensor5 commented 8 years ago

This is the diff between the .BUILDINFOs of electron-1.3.2-2 (my old working copy) and 1.3.2-3:

< pkgbuild_sha256sum = 3f266c8d8ceeeefcf72dd4ff585085159823e5b3181a7f916b73ffc25f8d7c09
---
> pkgbuild_sha256sum = 9855352b6780de0be00b6c9f6b435d50937a67709ae386872b3b947f1ca896fe
30c30
< installed = binutils-2.26.1-1
---
> installed = binutils-2.26.1-2
55c55
< installed = fakeroot-1.21-1
---
> installed = fakeroot-1.21-2
63c63
< installed = fontconfig-2.12.0-1
---
> installed = fontconfig-2.12.1-3
68,69c68,69
< installed = gcc-6.1.1-3
< installed = gcc-libs-6.1.1-3
---
> installed = gcc-6.1.1-4
> installed = gcc-libs-6.1.1-4
78c78
< installed = glibc-2.23-5
---
> installed = glibc-2.24-1
126c126
< installed = libcups-2.1.4-1
---
> installed = libcups-2.1.4-2
208c208
< installed = linux-api-headers-4.5.5-1
---
> installed = linux-api-headers-4.7-1
214,215c214,215
< installed = mesa-12.0.1-5
< installed = mesa-libgl-12.0.1-5
---
> installed = mesa-12.0.1-7
> installed = mesa-libgl-12.0.1-7

Maybe the updated linux-api-headers and glibc could play a role here.

remexre commented 8 years ago

I'm guessing glibc, but it's a guess; there's a lot of calls to munmap, mprotect, madvise, etc. in the general neighborhood [EDIT so I assume it's part of malloc, free, etc.]. I'm hand-decompiling right now, I'll update when done. Just for ref. though, alignment rules mean that 0x18a4291 is not the start of the crashing function; I think that it's 0x18a4260, but it might be 0x18a4240 or earlier. The stack frame got all screwed up, so...

EDIT 2 Is there any chance this is compiled with -fomit-frame-pointer? Upstream chromium sounds like it does (ಠ_ಠ), so I assume it does? When I crash, rbp is 4... (which explains the stack issues)

remexre commented 8 years ago

   0x18a4240:   push   %rax
   0x18a4241:   callq  0x33afeb0 <munmap>
   0x18a4246:   test   %eax,%eax
   0x18a4248:   jne    0x18a424c
   0x18a424a:   pop    %rax
   0x18a424b:   retq
   0x18a424c:   ud2
   0x18a424e:   xchg   %ax,%ax
   0x18a4250:   push   %rax
   0x18a4251:   xor    %edx,%edx
   0x18a4253:   callq  0x57a2e0 <mprotect@plt>
   0x18a4258:   test   %eax,%eax
   0x18a425a:   jne    0x18a425e
   0x18a425c:   pop    %rax
   0x18a425d:   retq
   0x18a425e:   ud2
   0x18a4260:   push   %rax
   0x18a4261:   mov    $0x3,%edx
   0x18a4266:   callq  0x57a2e0 <mprotect@plt>
   0x18a426b:   test   %eax,%eax
   0x18a426d:   sete   %al
   0x18a4270:   pop    %rcx
   0x18a4271:   retq
   0x18a4272:   nopw   %cs:0x0(%rax,%rax,1)
   0x18a427c:   nopl   0x0(%rax)
   0x18a4280:   push   %rax
   0x18a4281:   mov    $0x8,%edx
   0x18a4286:   callq  0x5725e0 <madvise@plt>
   0x18a428b:   test   %eax,%eax
   0x18a428d:   jne    0x18a4291
   0x18a428f:   pop    %rax
   0x18a4290:   retq
=> 0x18a4291:   ud2
   0x18a4293:   nopw   %cs:0x0(%rax,%rax,1)
   0x18a429d:   nopl   (%rax)
   0x18a42a0:   retq

I'm translating this roughly to:

void foo(void *addr, size_t len) {
    // 0x18a4240: push   %rax
    // 0x18a4241: callq  0x33afeb0 <munmap>
    // 0x18a4246: test   %eax,%eax
    // 0x18a4248: jne    0x18a424c
    if(munmap(addr, len) == 0) {
        // 0x18a424a: pop    %rax
        // 0x18a424b: retq
        return;
    }
    // 0x18a424c: ud2
    dieWithSIGILL();
}
    // 0x18a424e: xchg   %ax,%ax
void bar(void *addr, size_t len) {
    // 0x18a4250: push   %rax
    // 0x18a4251: xor    %edx,%edx
    // 0x18a4253: callq  0x57a2e0 <mprotect@plt>
    // 0x18a4258: test   %eax,%eax
    // 0x18a425a: jne    0x18a425e
    if(mprotect(addr, len, 0) == 0) {
        // 0x18a425c: pop    %rax
        // 0x18a425d: retq
        return;
    }
    // 0x18a425e: ud2
    dieWithSIGILL();
}
bool baz(void *addr, size_t len) {
    // 0x18a4260: push   %rax
    // 0x18a4261: mov    $0x3,%edx
    // 0x18a4266: callq  0x57a2e0 <mprotect@plt>
    // 0x18a426b: test   %eax,%eax
    // 0x18a426d: sete   %al
    // 0x18a4270: pop    %rcx
    // lolwut why is rax getting moved to rcx?
    // 0x18a4271: retq
    return (mprotect(addr, len, 3) == 0);
}
    // 0x18a4272: nopw   %cs:0x0(%rax,%rax,1)
    // 0x18a427c: nopl   0x0(%rax)
void xyzzy(void *addr, size_t len) {
    // 0x18a4280: push   %rax
    // 0x18a4281: mov    $0x8,%edx
    // 0x18a4286: callq  0x5725e0 <madvise@plt>
    // 0x18a428b: test   %eax,%eax
    // 0x18a428d: jne    0x18a4291
    if(madvise(addr, len, 8) == 0) {
        // 0x18a428f: pop    %rax
        // 0x18a4290: retq
        return;
    }
    // 0x18a4291: ud2
    dieWithSIGILL(); // This is where we die!
    // 0x18a4293: nopw   %cs:0x0(%rax,%rax,1)
    // 0x18a429d: nopl   (%rax)
    // 0x18a42a0: retq
    return; /* Except we return with a borked stack, so we're not returning to the
    caller... Instead, we're returning to the return value of whatever was called
    immediately before xyzzy. */
}

EDIT

The generated assembly looks more like Clang's than GCC's for the xyzzy function. I'm going to try agging the chromium source tree for munmap and friends.

remexre commented 8 years ago

I'm in vendor/node/deps/v8/src/base/platform/platform-posix.cc. From my previous post:

foo is OS::Free
bar is OS::Guard
baz is a mystery... it resembles OS::ProtectCode, but OS::ProtectCode passes PROT_READ | PROT_EXEC to mprotect, and as far as I can tell, 0x3 corresponds to PROT_READ | PROT_WRITE. There's not anywhere else I could find that'd be similar; this might actually be a root cause, if the kernel headers swapped the PROT_WRITE and PROT_EXEC flag values recently. (But I doubt they did...)
xyzzy (our function of interest) is another enigma. It calls madvise, only mentioned in a completely separate file as part of icu-small, which has little to nothing to do with v8's sandboxing... Furthermore, it's called in the file with MADV_RANDOM, while in the core dump it's called with MADV_FREE. (MADV_FREE is not mentioned in electron's tree at all).
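
To make the constants in the two mystery calls easier to read, here is a small C sketch that decodes the third argument of mprotect and madvise using the definitions from <sys/mman.h>. The helper names are invented for this example, and MADV_FREE is defined by hand as 8 in case the installed headers do not provide it (an assumption based on the value seen in the disassembly).

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8 /* assumption: header lacks it; value taken from the disassembly */
#endif

/* Write a "PROT_READ|PROT_WRITE"-style description of prot into buf. */
const char *describe_prot(int prot, char *buf, size_t n) {
    assert(buf != NULL && n > 0);
    buf[0] = '\0';
    if (prot == PROT_NONE) {
        strncat(buf, "PROT_NONE", n - 1);
        return buf;
    }
    static const struct { int bit; const char *name; } flags[] = {
        { PROT_READ,  "PROT_READ"  },
        { PROT_WRITE, "PROT_WRITE" },
        { PROT_EXEC,  "PROT_EXEC"  },
    };
    for (size_t i = 0; i < sizeof flags / sizeof flags[0]; i++) {
        if (prot & flags[i].bit) {
            if (buf[0] != '\0')
                strncat(buf, "|", n - strlen(buf) - 1);
            strncat(buf, flags[i].name, n - strlen(buf) - 1);
        }
    }
    return buf;
}

/* Name a handful of madvise advice values; anything else is "(other)". */
const char *describe_advice(int advice) {
    switch (advice) {
    case MADV_NORMAL:   return "MADV_NORMAL";
    case MADV_RANDOM:   return "MADV_RANDOM";
    case MADV_DONTNEED: return "MADV_DONTNEED";
    case MADV_FREE:     return "MADV_FREE";
    default:            return "(other)";
    }
}
```

On Linux, describe_prot(0x3, ...) yields PROT_READ|PROT_WRITE and describe_advice(8) yields MADV_FREE, matching the reading above; the zero passed by bar decodes to PROT_NONE.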

At this point, I'd advise calling in a speciali- err, a v8 expert.

EDIT Searching the source tree for ud2, there's a few places in OpenSSL where it's used. None of them look close to icu-small though...

tensor5 commented 8 years ago

@remexre thanks for the debugging work :+1:

I will try compiling with gcc next time, that would explain why chromium is not affected by this bug.

remexre commented 8 years ago

No problem; I need more practice with assembly weirdness for machine architecture class :stuck_out_tongue: I'm more than willing to help test, just comment here when you update the PKGBUILD.

stefanhusmann commented 8 years ago

I'm on 64-bit.

remexre commented 8 years ago

@stefanhusmann, could you post the output of lscpu and uname -r?

NicoHood commented 8 years ago

I do not have this bug with 4.7-1. However I got a totally different (unrelated) bug that my laptop display does not work at all. Oh boy...

remexre commented 8 years ago

@repsac-by, @pesho, @Frizi, could you all also post the lscpu and uname -r outputs?

repsac-by commented 8 years ago

@remexre

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            21
Model:                 2
Model name:            AMD FX(tm)-8320 Eight-Core Processor
Stepping:              0
CPU MHz:               1700.000
CPU max MHz:           3500.0000
CPU min MHz:           1400.0000
BogoMIPS:              7049.48
Virtualization:        AMD-V
L1d cache:             16K
L1i cache:             64K
L2 cache:              2048K
L3 cache:              8192K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb cpb hw_pstate vmmcall bmi1 arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold

4.7.0-1-ARCH

with 4.7.0-1-ARCH I have no problems

remexre commented 8 years ago

For me (my lscpu is above):

| Kernel | `uname -r` | Crash? |
| --- | --- | --- |
| `linux-samus4` | `4.4.2-6ph` | Yes |
| `linux` | `4.4.5-1-ARCH` | Yes |
| `linux-lts` | `4.4.16-1-lts` | Yes |
| `linux` | `4.5.0-1-ARCH` | No |
| `linux` | `4.7.0-1-ARCH` | No |

I might try building a kernel 4.7.0 with the same .config as linux-lts, and vice versa, and see if that makes a difference. If not, I'm going to try to see which kernel version it is that causes the issue.

EDIT Apparently I forgot that .config changes between kernel versions... I'm going to try stepping backward through the prebuilt kernel releases until I get it. Also, someone on the LLVM cfe-dev mailing list suggested that I try compiling with -save-temp; apparently, it might be a call to __builtin_trap() that's doing it. So that's what I'll be trying next.

EDIT 2

ಠ_ಠ DEFINE_BOOL(hard_abort, true, "abort by crashing")

I'm guessing this gives a 0.5% performance increase when crashing? mutters darkly

pesho commented 8 years ago

@remexre sure:

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 42
Model name:            Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz
Stepping:              7
CPU MHz:               1269.531
CPU max MHz:           2900,0000
CPU min MHz:           800,0000
BogoMIPS:              3991.18
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              6144K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid xsaveopt

# uname -r
4.4.16-1-lts

remexre commented 8 years ago

gg 1hr of compiling later, OOM... I should probably build outside of a memory-limited VM...

@repsac-by, @pesho, thanks! If we're getting the issue on AMD and Intel, across several microarchitectures, it's probably nothing related to an "actual" invalid opcode anywhere...

tensor5 commented 8 years ago

No good news with GCC either.

remexre commented 8 years ago

Is it related to one of our patches, then?

tensor5 commented 8 years ago

It may be, although I lean more towards the upgraded glibc. Upstream binaries are built using a sysroot, which could be the reason why they are not affected. I'm setting up a build server, so that I will be able to handle rebuilds much more quickly.

tensor5 commented 8 years ago

@remexre I recompiled with debugging, and now I have this extra information:

Program terminated with signal SIGILL, Illegal instruction.
#0  0x00000000018a56d1 in WTF::decommitSystemPages(void*, unsigned long) ()

Does this tell you anything?

tensor5 commented 8 years ago

Can it be the MADV_FREE? man madvise says that it's been introduced in Linux 4.5.

repsac-by commented 8 years ago

@tensor5 chromium from the repo uses a patch which disables MADV_FREE. That is probably why it works on linux-lts 4.4.

tensor5 commented 8 years ago

@repsac-by Thanks for pointing that out 👍, I'll include that patch in the next release.

tensor5 commented 8 years ago

For the record, this is the diff of /usr/include/bits/mman-linux.h between glibc 2.23-5 and 2.24-2:

83a84
> # define MADV_FREE      8 /* Free pages only if memory pressure.  */

This line and the diff above explain why the older electron compiled with the older glibc worked.
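
One way to avoid this class of failure, sketched below in C, is to treat MADV_FREE as optional at run time: try it first and retry with MADV_DONTNEED if the running kernel rejects it, as a kernel older than 4.5 would. This is only an illustration, not the patch used by the Arch chromium package (which, per the comment above, disables MADV_FREE). The function name decommit_pages is made up, and MADV_FREE is defined by hand for headers that lack it.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8 /* value from the mman-linux.h diff above */
#endif

/* Hypothetical helper: tell the kernel the pages are no longer needed.
 * Returns 0 on success, -1 (with errno set) if both attempts fail. */
int decommit_pages(void *addr, size_t len) {
    assert(addr != NULL && len > 0);
    /* Preferred: lazy free. A kernel that does not know this advice
     * value is expected to reject it with EINVAL. */
    if (madvise(addr, len, MADV_FREE) == 0)
        return 0;
    /* Fallback, taken on any failure of the first call. */
    return madvise(addr, len, MADV_DONTNEED);
}
```

The trade-off is one extra failing syscall per call on old kernels; a real implementation might cache the result of the first attempt instead of retrying every time.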

remexre commented 8 years ago

Okay, I must've missed that somehow; I think ag only searches submodules if they're not .gitignore'd? Still not sure why this'd get a ud2, but if Atom works on old kernels again, close?

tensor5 commented 8 years ago

I guess ud2 is generated by RELEASE_ASSERT.

remexre commented 8 years ago

Right, because __builtin_trap(). I've still got ud2 internalized as "undefined behavior," rather than "probably but maybe also these other ten things." :P
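
For reference, the pattern being discussed can be sketched in a few lines of C. This is a guess at the shape of the check, not the actual WTF source: RELEASE_ASSERT_SKETCH and decommit_checked are hypothetical names, and __builtin_trap() is the GCC/Clang builtin, which on x86 is typically emitted as a ud2 instruction.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical stand-in for a release-mode assertion: unlike assert(),
 * it is not compiled out when NDEBUG is defined. */
#define RELEASE_ASSERT_SKETCH(cond) \
    do { if (!(cond)) __builtin_trap(); } while (0)

/* Same shape as the decompiled xyzzy: call madvise, trap on failure. */
void decommit_checked(void *addr, size_t len, int advice) {
    assert(len > 0); /* debug-only sanity check */
    RELEASE_ASSERT_SKETCH(madvise(addr, len, advice) == 0);
}
```

Called with MADV_DONTNEED on a valid mapping this returns normally. Called with an advice value the running kernel does not recognise, madvise fails and the process is killed by the trap, which is consistent with the invalid opcode line in dmesg.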

tensor5 commented 8 years ago

Feel free to reopen if the problem persists.

tensor5 / arch-atom

linux-lts + Electron 1.3.2-3 blank screen #34