nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org
Other
106.54k stars 29.03k forks source link

async-hooks/test-fseventwrap - Flaky on AIX #13577

Closed mhdawson closed 7 years ago

mhdawson commented 7 years ago

Looks like test-fseventwrap fails intermitently on AIX:

https://ci.nodejs.org/job/node-test-commit-aix/nodes=aix61-ppc64/6463/console

not ok 13 async-hooks/test-fseventwrap
  ---
  duration_ms: 0.185
  severity: crashed
  stack: |-
    oh no!
    exit code: CRASHED (Signal: 11)
  ...

I believe I've seen it on at least one other PR unrelated to fswatch/asyncwrap.

@gireeshpunathil can you take a look.

mhdawson commented 7 years ago

@nodejs/platform-aix

refack commented 7 years ago

Seen again https://ci.nodejs.org/job/node-test-commit-aix/6442/nodes=aix61-ppc64/

refack commented 7 years ago

/cc @nodejs/async_hooks

gireeshpunathil commented 7 years ago

Recreated (intermittent) the crash in the compiler:

#0  0x000000010050e3d4 in v8::internal::AstValueFactory::GetString(unsigned int, bool, v8::internal::Vector<unsigned char const>) ()
(gdb) where
#0  0x000000010050e3d4 in v8::internal::AstValueFactory::GetString(unsigned int, bool, v8::internal::Vector<unsigned char const>) ()
#1  0x000000010050e9e8 in v8::internal::AstValueFactory::GetOneByteStringInternal(v8::internal::Vector<unsigned char const>) ()
#2  0x000000010050e9e8 in v8::internal::AstValueFactory::GetOneByteStringInternal(v8::internal::Vector<unsigned char const>) ()
#3  0x000000010052a7ec in v8::internal::Scanner::IsDuplicateSymbol(v8::internal::DuplicateFinder*, v8::internal::AstValueFactory*) const ()
#4  0x0000000100529b5c in v8::internal::ParserBase<v8::internal::Parser>::ParseAndClassifyIdentifier(bool*)
    ()
#5  0x000000010052895c in v8::internal::ParserBase<v8::internal::Parser>::ParsePrimaryExpression(bool*, bool*) ()
#6  0x0000000100528040 in v8::internal::ParserBase<v8::internal::Parser>::ParseFormalParameter(v8::internal::ParserFormalParameters*, bool*) ()
#7  0x000000010055fc24 in v8::internal::ParserBase<v8::internal::Parser>::ParseFormalParameterList(v8::internal::ParserFormalParameters*, bool*) ()
#8  0x000000010055f2d8 in v8::internal::Parser::ParseFunction(v8::internal::AstRawString const*, int, v8::internal::FunctionKind, v8::internal::FunctionLiteral::FunctionType, v8::internal::DeclarationScope*, int*, int*, bool*, int*, bool*) ()
#9  0x000000010055e234 in v8::internal::Parser::ParseFunctionLiteral(v8::internal::AstRawString const*, v8::internal::Scanner::Location, v8::internal::FunctionNameValidity, v8::internal::FunctionKind, int, v8::internal::FunctionLiteral::FunctionType, v8::internal::LanguageMode, bool*) ()
#10 0x00000001005271e8 in v8::internal::Parser::DoParseFunction(v8::internal::ParseInfo*, v8::internal::AstRawString const*) ()
#11 0x000000010050c4e8 in v8::internal::Parser::ParseFunction(v8::internal::Isolate*, v8::internal::ParseInfo*) ()
#12 0x000000010050b878 in v8::internal::parsing::ParseFunction(v8::internal::ParseInfo*, v8::internal::Isolate*, bool) ()
#13 0x00000001004bf730 in v8::internal::(anonymous namespace)::GetUnoptimizedCode(v8::internal::CompilationInf---Type <return> to continue, or q <return> to quit---
o*, v8::internal::Compiler::ConcurrencyMode) ()
#14 0x00000001004c1ef4 in v8::internal::(anonymous namespace)::GetLazyCode(v8::internal::Handle<v8::internal::JSFunction>) ()
#15 0x00000001001b5108 in v8::internal::Compiler::Compile(v8::internal::Handle<v8::internal::JSFunction>, v8::internal::Compiler::ClearExceptionFlag) ()
#16 0x0000000100c7f920 in v8::internal::Runtime_CompileLazy(int, v8::internal::Object**, v8::internal::Isolate*) ()
#17 0x07000000001845f4 in ?? ()
Trott commented 7 years ago

Still a thing...

https://ci.nodejs.org/job/node-test-commit-aix/6522/nodes=aix61-ppc64/console

not ok 13 async-hooks/test-fseventwrap
  ---
  duration_ms: 0.173
  severity: crashed
  stack: |-
    oh no!
    exit code: CRASHED (Signal: 11)
gireeshpunathil commented 7 years ago

there are three mutually exclusive issues:

  1. the test case test-fseventwrap.js employs directory watch, which is not supported in AIX
  2. AIX fails to build in debug mode: throws linker error: ld: 0711-317 ERROR: Undefined symbol: vtable for v8::base::PosixTimezoneCache
  3. Crash in v8::internal::AstValueFactory::GetString

Debugging item 3. at the moment.

trevnorris commented 7 years ago

@gireeshpunathil

  1. the test case test-fseventwrap.js employs directory watch, which is not supported in AIX

Why would only this test fail when there are other tests that check directory watch.

gireeshpunathil commented 7 years ago

@trevnorris - this test did not fail due to (1) above (yet) - the test progress is masked off by the crash

gibfahn commented 7 years ago

Why would only this test fail when there are other tests that check directory watch.

The other test files that check directory watch should already be skipped for AIX.

trevnorris commented 7 years ago

The other test files that check directory watch should already be skipped for AIX.

Ah yup. I see the isAix check in those other tests.

gireeshpunathil commented 7 years ago
(dbx) (0x10050e5c0)/20i
0x10050e5c0 (GetString+0x118) eb21ffc8          ld   r25,-56(r1)
0x10050e5c4 (GetString+0x11c) eb41ffd0          ld   r26,-48(r1)
0x10050e5c8 (GetString+0x120) eb61ffd8          ld   r27,-40(r1)
0x10050e5cc (GetString+0x124) eb81ffe0          ld   r28,-32(r1)
0x10050e5d0 (GetString+0x128) 7c0803a6        mtlr   r0
0x10050e5d4 (GetString+0x12c) eba1ffe8          ld   r29,-24(r1)
0x10050e5d8 (GetString+0x130) ebc1fff0          ld   r30,-16(r1)
0x10050e5dc (GetString+0x134) ebe1fff8          ld   r31,-8(r1)
0x10050e5e0 (GetString+0x138) 4e800020         blr
0x10050e5e4 (GetString+0x13c) 60000000         ori   r0,r0,0x0
0x10050e5e8 (GetString+0x140) f8410028         std   r2,0x28(r1)
0x10050e5ec (GetString+0x144) 7f63db78          mr   r3,r27
0x10050e5f0 (GetString+0x148) e9370010          ld   r9,0x10(r23)
0x10050e5f4 (GetString+0x14c) e9490000          ld   r10,0x0(r9)  >>>>> crash
0x10050e5f8 (GetString+0x150) e9690010          ld   r11,0x10(r9)

Failing instruction corresponds to stringtable.LookupOrInsert

dbx) x
  $r0:0x000000010050ec08  $stkp:0x0fffffffffffc330   $toc:0x00000001101d6fc0  
  $r3:0x0fffffffffffc3a0    $r4:0x0000000110acf270    $r5:0x0000000000000001  
  $r6:0x0000000110aa7430    $r7:0x0000000200000000    $r8:0x0000000000000040  
  $r9:0x5b5b5b5b00000000   $r10:0x0000000110ac73b0   $r11:0x0000000000000000  
 $r12:0x0000000022222244   $r13:0x00000001102086a0   $r14:0x0fffffffffffcd08  
 $r15:0x0fffffffffffcd30   $r16:0x0fffffffffffcd28   $r17:0x0fffffffffffcae0  
 $r18:0x0fffffffffffcd28   $r19:0x0000000000000001   $r20:0x0000000000000081  
 $r21:0x0000000110ace9f8   $r22:0x0fffffffffffcd30   $r23:0x0000000110ac4c90  
 $r24:0x0000000200000000   $r25:0x0000000110aa7430   $r26:0x0000000000000001  
 $r27:0x0fffffffffffc3a0   $r28:0x000000009ba358f2   $r29:0x0000000000000008  
 $r30:0x00000000000000c0   $r31:0x0fffffffffffc330  
 $iar:0x000000010050e5f4   $msr:0xa00000000000d032    $cr:0x482222a2  
$link:0x000000010050ec08   $ctr:0x0000000000000000   $xer:0x00000001  

r23 is string_table_ (base::CustomMatcherHashMap), and r9 is supposed to be its vtable from where to look for function LookupOrInsert().

I verified that the object was proper after the construction, looks like later its fields were overlaid by this pattern:

(dbx) ($r23)/8X
0x0000000110ac4c90:  00000001 10ac73b0 00000040 0000002c
0x0000000110ac4ca0:  5b5b5b5b 00000000 00000000 00000000

Pattern 0x5b5b5b5b is typically invalid block eye-catcher. I ran DEBUGMALLOC on allocations and frees, but it did not catch anything in this area. So it looks like an internal memory corruption through a bad pointer. Debugging further.

krydos commented 7 years ago

Just got the same error - https://ci.nodejs.org/job/node-test-commit-aix/6559/nodes=aix61-ppc64/console

Trott commented 7 years ago

Again today:

https://ci.nodejs.org/job/node-test-commit-aix/6663/nodes=aix61-ppc64/console

not ok 13 async-hooks/test-fseventwrap
  ---
  duration_ms: 0.183
  severity: crashed
  stack: |-
    oh no!
    exit code: CRASHED (Signal: 11)
gireeshpunathil commented 7 years ago

reduced test case that crashes occasionally in AIX:

const fs = require('fs')
const watcher = fs.watch(__dirname, () => {})
watcher.close()
console.log('done')

The watcher.close() seem to be the one which corrupts the memory (string_table of the parser).

Debugging further.

gireeshpunathil commented 7 years ago

Even with single-thread, memory corruption seem to be a hard thing to crack. At the end of a number of attempts, DEBUGMALLOC finally reported a bad free on a pointer which it never allocated into, at uv__fs_event_close:

(gdb) where
#0  0x09000000007e7eb4 in pthread_kill () from /usr/lib/libpthreads.a(shr_xpg5_64.o)
#1  0x09000000007e770c in _p_raise () from /usr/lib/libpthreads.a(shr_xpg5_64.o)
#2  0x090000000002bab0 in raise () from /usr/lib/libc.a(shr_64.o)
#3  0x090000000007c9b0 in abort () from /usr/lib/libc.a(shr_64.o)
#4  0x09000000000a4cfc in free_y_debugging () from /usr/lib/libc.a(shr_64.o)
#5  0x090000000009e994 in free_common_debugging () from /usr/lib/libc.a(shr_64.o)
#6  0x00000001000318f0 in uv__free (ptr=0x110aacb90) at ../deps/uv/src/uv-common.c:85
#7  0x000000010003067c in uv__fs_event_close (handle=0x16) at ../deps/uv/src/unix/aix.c:877
#8  0x000000010002aeec in uv_close (handle=0x16, close_cb=0x110aacb90) at ../deps/uv/src/unix/core.c:155
#9  0x0000000101127da4 in node::HandleWrap::Close(v8::FunctionCallbackInfo<v8::Value> const&) ()
#10 0x000000010144b62c in node::(anonymous namespace)::FSEventWrap::Close(v8::FunctionCallbackInfo<v8::Value> const&) ()
#11 0x0000000100244e68 in v8::internal::FunctionCallbackArguments::Call(void (*)(v8::FunctionCallbackInfo<v8::Value> const&)) ()
#12 0x000000010023d068 in v8::internal::MaybeHandle<v8::internal::Object> v8::internal::(anonymous namespace)::HandleApiCallHelper<false>(v8::internal::Isolate*, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::FunctionTemplateInfo>, v8::internal::Handle<v8::internal::Object>, v8::internal::BuiltinArguments) ()
#13 0x000000010079afb8 in v8::internal::Builtin_HandleApiCall(int, v8::internal::Object**, v8::internal::Isolate*) ()
#14 0x07000000001845f4 in ?? ()
gireeshpunathil commented 7 years ago

In uv__path_is_a_directory the variable handle->dir_filename is freed unconditionally while there exists a control flow wherein this is uninitialized - if the watch event is never fired.

bnoordhuis commented 7 years ago

@gireeshpunathil What does f 10, p *wrap print?

gireeshpunathil commented 7 years ago

This one resolves the issue consistently. I ran it a 3K times, and saw no failures.

diff --git a/deps/uv/src/unix/aix.c b/deps/uv/src/unix/aix.c
index 388c9cc..6314096 100644
--- a/deps/uv/src/unix/aix.c
+++ b/deps/uv/src/unix/aix.c
@@ -855,6 +855,7 @@ int uv_fs_event_start(uv_fs_event_t* handle,
   uv__io_init(&handle->event_watcher, uv__ahafs_event, fd);
   handle->path = uv__strdup(filename);
   handle->cb = cb;
+  handle->dir_filename = NULL;

   uv__io_start(handle->loop, &handle->event_watcher, POLLIN);

@@ -874,8 +875,10 @@ int uv_fs_event_stop(uv_fs_event_t* handle) {
   uv__handle_stop(handle);

   if (uv__path_is_a_directory(handle->path) == 0) {
-    uv__free(handle->dir_filename);
-    handle->dir_filename = NULL;
+    if (handle->dir_filename != NULL) {
+      uv__free(handle->dir_filename);
+      handle->dir_filename = NULL;
+    }
   }

   uv__free(handle->path);

Leaving it for a while for some review and then come up with a PR.

gireeshpunathil commented 7 years ago

@bnoordhuis - I overwrote the core file. (will get it once again.)

bnoordhuis commented 7 years ago

@gireeshpunathil uv__free(ptr) is a no-op when ptr == NULL, you shouldn't need a guard. Looks like it's just a case of improper initialization (and a rather obvious one, in retrospect, but they always are.)

gireeshpunathil commented 7 years ago

@bnoordhuis - Looks like the values are optimized out in the release build. Does the raw offsets reveal / confirm anything?

(gdb) p * wrap
No symbol "wrap" in current context.
(gdb) p *(FSEventWrap *) 0xfffffffffffdcf8
No symbol "FSEventWrap" in current context.
(gdb) p *(node::FSEventWrap *) 0xfffffffffffdcf8
No symbol "node" in current context.
(gdb) x/10x 0xfffffffffffdcf8
0xfffffffffffdcf8:      0x00000004      0x00000000      0x07000000      0x208d3d01
0xfffffffffffdd08:      0x00000000      0x00000016      0x07000000      0x002bbf00
0xfffffffffffdd18:      0x0fffffff      0xffffdd68
(gdb) 

Thanks for the info on the uv__free. So that means the only change is the initializer for the uninitialized field.

gireeshpunathil commented 7 years ago

item no. 3 is being addressed through libuv PR 1400 Next - item no. 2: AIX fails to build in debug mode: throws linker error: ld: 0711-317 ERROR: Undefined symbol: vtable for v8::base::PosixTimezoneCache /cc @jbajwa

gireeshpunathil commented 7 years ago

-O3 build fine, while -O0 throws up the above error.