Heap corruption on 32bit systems

pdtwonotes commented 7 years ago

My Main actor creates another actor. The first line of the other actor is never executed, but the library code goes into a loop. This only happens on ARM - the same code works correctly on X86_64. I interrupt the loop with ^C to use the debugger to see the stack:

Process 22481 stopped
* thread #1: tid = 22481, 0x76e0a454 libpthread.so.0`__libc_do_syscall + 4, name
 = 'tokara', stop reason = signal SIGSTOP
    frame #0: 0x76e0a454 libpthread.so.0`__libc_do_syscall + 4
libpthread.so.0`__libc_do_syscall:
->  0x76e0a454 <+4>: svc    #0x0
    0x76e0a456 <+6>: pop    {r7, pc}

libpthread.so.0`__aeabi_unwind_cpp_pr0:
    0x76e0a458 <+0>: push   {r3, lr}
    0x76e0a45a <+2>: blx    0x76e000c8                ;  + 880
(lldb) bt
* thread #1: tid = 22481, 0x76e0a454 libpthread.so.0`__libc_do_syscall + 4, name
 = 'tokara', stop reason = signal SIGSTOP
  * frame #0: 0x76e0a454 libpthread.so.0`__libc_do_syscall + 4
    frame #1: 0x76e024ac libpthread.so.0`pthread_join + 128
    frame #2: 0x00060430 tokara`pony_thread_join + 8
    frame #3: 0x0005eca4 tokara`ponyint_sched_shutdown + 28
    frame #4: 0x0005f1f2 tokara`ponyint_sched_start + 110
    frame #5: 0x0005f3ec tokara`pony_start + 16
    frame #6: 0x76ceb8aa libc.so.6`__libc_start_main + 158
    frame #7: 0x0000a438 tokara`_start + 32

pdtwonotes commented 7 years ago

See issue #1000 where an interlock problem specific to ARM was causing a segfault.

SeanTAllen commented 7 years ago

On IRC @pdtwonotes you said that #1505 is where this started to happen to you, yes?

If that is the case then the problem already existed it was just being hidden or less likely due to a timing issue.

pdtwonotes commented 7 years ago

Some more clues:

In setting up the new actor, the generated code calls pony_create which creates the actor. Then it wants to send a message - I am assuming this is the actual create method invocation, which by the way has 9 parameters. So it calls pony_alloc_msg in the runtime library. This in turn calls ponyint_pool_alloc. That call (ponyint_pool_alloc) does not return.

pdtwonotes commented 7 years ago

Rebuilding pony with config=debug provides more information, due to the assertions buried in the runtime code. The first line is the actual failure.

tokara: src/libponyrt/mem/pool.c:545: pool_pull: Assertion `top->length == global->count' failed.
Process 5943 stopped
* thread #2: tid = 5947, 0x76cebaf6 libc.so.6, name = 'tokara', stop reason = signal SIGABRT
    frame #0: 0x76cebaf6 libc.so.6
->  0x76cebaf6:      blmi   0x76ddb0fe
libc.so.6`__errno_location:
    0x76cebafa <+2>: mrc    p15, #0x0, r0, c13, c0, #0x3
    0x76cebafe <+6>: add    r3, pc
    0x76cebb00 <+8>: ldr    r3, [r3]
(lldb) bt
* thread #2: tid = 5947, 0x76cebaf6 libc.so.6, name = 'tokara', stop reason = signal SIGABRT
  * frame #0: 0x76cebaf6 libc.so.6
    frame #1: 0x00058628 tokara`handle_message(ctx=<unavailable>, actor=<unavailable>, msg=<unavailable>) + 240 at actor.c:103

pdtwonotes commented 7 years ago

I narrowed the suspects down to transaction d715a6bae "Optimise count zeros operations in the runtime (#1443)" since it had to do with counting in the heap, and the assertion failure also had to do with counting in the heap. And the time of the change also happened during a period when I was not regularly rebuilding my ARM compiler.

So as an experiment I undid just that one transaction, involving the three files platform.h, pool.c, and heap.c and rebuilt ponyc on ARM. The problem went away.

I do not know enough about either Pony's heap structures or ARM instruction fencing to suggest a proper fix but the bug is somewhere in that transaction. I have changed the issue title to more properly reflect the cause.

Praetonus commented 7 years ago

@pdtwonotes Thanks for investigating that. The change in pool.c is indeed wrong on 32 bits systems. I'll make a fix right away.

pdtwonotes commented 7 years ago

I still get the assertion failure. I will check again that I am testing it correctly.

Edit: I rebuilt ponyc on my i686 machine using config=debug and it also gets the assertion failure. It just never got the looping on actor startup.

More: Removing the original commit from November fixes the assertion failure on both ARM and i686. The math involving word length was clearly a problem, but that is not sufficient to fix this, it seems.

Praetonus commented 7 years ago

Can you try reverting only one part of the change, to see which function is causing the problem?

pdtwonotes commented 7 years ago

It would be a complete shot in the dark. I do not know how these things interact:

A change in the builtin function that counts bits in a word. Left-to-right or right-to-left?
Whether a +1 is added to the resulting count. Fencepost errors?
What are these bitmasks in the first place? A map to used/allocated pages in the heap?
What was the purpose of the change?

The source code in the RT library has very few comments explaining what is going on.

Praetonus commented 7 years ago

A change in the builtin function that counts bits in a word. Left-to-right or right-to-left?

It's FFS (Find First Set) and CTZ (Count Trailing Zeros) so it starts from the least-significant bit.

Whether a +1 is added to the resulting count. Fencepost errors?

This is because ffs(x) returns ctz(x) + 1 for a non-zero x.

What are these bitmasks in the first place? A map to used/allocated pages in the heap?

It's a map to used slots in a chunk of heap memory.

What was the purpose of the change?

Speed. FFS checks for 0 while CTZ doesn't, and slots isn't supposed to be 0 here.

The source code in the RT library has very few comments explaining what is going on.

That's true, it's an area that can be improved. In the meantime, you can take a look at @sylvanc's thesis, it contains a lot of explanations on the runtime and its implementation.

For your tests, you can try leaving heap.c in its current state (with the call to __pony_ctz) but reverting pool.c to use ponyint_next_pow2 and __pony_ffsl. I think you'll have to edit platform.h to have the definitions of both __pony_ctz and __pony_ffsl.

pdtwonotes commented 7 years ago

Reverting just pool.c (and the small change to platform.h) as you described does appear to fix the problem. I tested it on ARM.

Praetonus commented 7 years ago

Something else is wrong in that change then. I'll look into it. Thanks for testing all that.

dipinhora commented 7 years ago

@Praetonus I don't have time to test this out but I think I see the error.

The line (https://github.com/ponylang/ponyc/commit/d715a6bae7bfce9719e630d1ec35a1846bdf6554#diff-174c1821e490b93d3d279cc44a9a4268R703)

+ return ((size_t)64 - __pony_clzl(size)) - POOL_MIN_BITS;

should be

+ return ((size_t)64 - __pony_ctzl(size)) - POOL_MIN_BITS;

in commit d715a6b because you want to count trailing zeros when replacing ffs as you did in heap.c.

This combined with your change in PR #1525 to use the appropriate bit count depending on ILP32 should resolve this issue.

I should note though, that I've only got a shallow understanding of the code/changes so I may be misunderstanding the change from ffs to clz.

dipinhora commented 7 years ago

So, ummm, I thought some more about this (which I probably should have done prior to posting my previous comment) and my previous comment was completely wrong. Please ignore. Also, apologies for the unnecessary noise.

The only thing that jumps out to me at this point is the use of __pony_clzl even on ILP32 (with the changes from PR #1525). it seems that __pony_clz should be used on ILP32 instead. Not sure if that makes a difference as I'm unable to find information on the behavior of __pony_clzl on ILP32.

Praetonus commented 7 years ago

The only thing that jumps out to me at this point is the use of pony_clzl even on ILP32 (with the changes from PR #1525). it seems that __pony_clz should be used on ILP32 instead. Not sure if that makes a difference as I'm unable to find information on the behavior of pony_clzl on ILP32.

The difference on Linux/OSX is that __pony_clz takes an int while __pony_clzl takes a long. Since size_t on ILP32 is the same size as long, I think clzl is the correct version to use.

I found another problem with the original change. If the size is a power of two, the resulting index will be off by one. I'll push a fix for that as well.

pdtwonotes commented 7 years ago

This latest fix seems to work. Tested on ARM - i will do i686 tomorrow

pdtwonotes commented 7 years ago

Test on i686 gives different results. My program goes into a CPU loop on startup. But if I run it with GDB to find where, instead I get a SegFault, again in the message processing during startup of the main server actor, but in a different place:

Thread 4 "tokara" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb4b68b40 (LWP 4509)]
0x080bdf65 in ponyint_messageq_push (q=0xb7b69c05, m=0xb31ffc20)
    at src/libponyrt/actor/messageq.c:75
75    atomic_store_explicit(&prev->next, m, memory_order_release);
(gdb) bt
#0  0x080bdf65 in ponyint_messageq_push (q=0xb7b69c05, m=0xb31ffc20)
    at src/libponyrt/actor/messageq.c:75
#1  0x080be8c4 in pony_sendv (ctx=0xb7b6ad44, to=0xb7b69c01, m=0xb31ffc20)
    at src/libponyrt/actor/actor.c:342
#2  0x080c7521 in ponyint_cycle_block (ctx=0xb7b6ad44, actor=0xb7b65000, 
    gc=0xb7b65074) at src/libponyrt/gc/cycle.c:889
#3  0x080be54e in ponyint_actor_run (ctx=0xb7b6ad44, actor=0xb7b65000, 
    batch=100) at src/libponyrt/actor/actor.c:198
#4  0x080bcd1f in run (sched=0xb7b6ad00) at src/libponyrt/sched/scheduler.c:290
#5  0x080bcdf1 in run_thread (arg=0xb7b6ad00)
    at src/libponyrt/sched/scheduler.c:341
#6  0xb7d58295 in start_thread (arg=0xb4b68b40) at pthread_create.c:333
#7  0xb7c5feee in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:114

pdtwonotes commented 7 years ago

In pool.c pool_pull, line 515, there is a while loop that never terminates. (on x86)

pdtwonotes commented 7 years ago

I have adjusted my environment so I can do all three platforms from a shared copy of the repo, to make sure they have identical code. With the latest fix from Praetonus it works fine on x86_64 and ARM, but not on i686. This is odd, becuase both ARM and i686 are 32-bits. So the difference must lie elsewhere. Ideas: timing? LSP32 defined wrong? ARM and x86_64 are running Ubuntu MATE 16.04.1 and i686 is running Ubuntu Server, 16.04.1, so those differences should be minor.

Praetonus commented 7 years ago

ILP32 is defined as

#if defined(__LP64__)
#  define PLATFORM_IS_LP64
#elif defined(_WIN64)
#  define PLATFORM_IS_LLP64
#else
#  define PLATFORM_IS_ILP32
#endif

which seems right to me.

That unterminating loop is very strange. Can you try running it in a debugger and look at the value of index in pool_get when you hit that loop? The full backtrace could also contain useful data. Thanks.

pdtwonotes commented 7 years ago

If I run it without gdb, it goes into a loop. If I run it under GDB, instead it gets the segment fault. There was always a rule for debugging that if behavior is different with the debugger, then there is uninitialized memory somewhere.

So I tried using the GDB 'attach' command to remotely attach to the looping program. In that case, I get this stack:

Using host libthread_db library "/lib/i386-linux-gnu/libthread_db.so.1".
0xb7795c31 in __kernel_vsyscall ()
(gdb) bt
#0  0xb7795c31 in __kernel_vsyscall ()
#1  0xb75143fb in pthread_join (threadid=3048360768, thread_return=0x0)
    at pthread_join.c:90
#2  0x080bd99a in pony_thread_join (thread=3048360768)
    at src/libponyrt/platform/threads.c:213
#3  0x080bcac7 in ponyint_sched_shutdown ()
    at src/libponyrt/sched/scheduler.c:353
#4  0x080bcdd9 in ponyint_sched_start (library=false)
    at src/libponyrt/sched/scheduler.c:430
#5  0x080bc387 in pony_start (library=false) at src/libponyrt/sched/start.c:128
#6  0x080b9a0c in main ()

I tried stppping through the creation of the actor, with a breakpoint on pool_get. index values vary between 0 and 33. From the messages coming from the debugger it would appear two threads are busy in the memory allocation code, as sometimes the breakpoint is hit one one thread, and sometimes in another. (CPU is an Intel Atom with 4 threads in two cores). Eventually this stepping gets to the SegFault in messageq_push noted previously.

Praetonus commented 7 years ago

Those index values aren't normal. A pool index shouldn't be higher than 15 in the current implementation. Can you add assert(index < POOL_COUNT); at the beginning of ponyint_pool_alloc (pool.c:605) and see if that assertion fires?

pdtwonotes commented 7 years ago

Sure enough, it does fire. Here is the bt and you can see that pony_alloc_msg is being passed the 33 as well, directly from Main_tag_create_oo at line 78, which happens to be where it makes the call to create another actor.

#0  0xb7fdac31 in __kernel_vsyscall ()
#1  0xb7ba4ea9 in __GI_raise (sig=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#2  0xb7ba6407 in __GI_abort () at abort.c:89
#3  0xb7b9dd07 in __assert_fail_base (
    fmt=0xb7cd8088 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
    assertion=0x80de9a1 "index < POOL_COUNT", 
    file=0x80de950 "src/libponyrt/mem/pool.c", line=605, 
    function=0x80de9cc <__PRETTY_FUNCTION__.4686> "ponyint_pool_alloc")
    at assert.c:92
#4  0xb7b9dd8b in __GI___assert_fail (
    assertion=0x80de9a1 "index < POOL_COUNT", 
    file=0x80de950 "src/libponyrt/mem/pool.c", line=605, 
    function=0x80de9cc <__PRETTY_FUNCTION__.4686> "ponyint_pool_alloc")
    at assert.c:101
#5  0x080c1051 in ponyint_pool_alloc (index=33) at src/libponyrt/mem/pool.c:605
#6  0x080be466 in pony_alloc_msg (index=33, id=6)
    at src/libponyrt/actor/actor.c:326
#7  0x080758c6 in Main_tag_create_oo (this=0xb7b69800, env=0xb7b69400)
    at main.pony:78
#8  0x0804d527 in Main_Dispatch ()
#9  0x080bde77 in handle_message (ctx=0xb7b6ac44, actor=0xb7b69800, 
    msg=0xb7b6ab80) at src/libponyrt/actor/actor.c:103
#10 0x080bdffb in ponyint_actor_run (ctx=0xb7b6ac44, actor=0xb7b69800, 
    batch=100) at src/libponyrt/actor/actor.c:158
#11 0x080bc91f in run (sched=0xb7b6ac00) at src/libponyrt/sched/scheduler.c:290
#12 0x080bc9f1 in run_thread (arg=0xb7b6ac00)
    at src/libponyrt/sched/scheduler.c:341
#13 0xb7d58295 in start_thread (arg=0xb5369b40) at pthread_create.c:333
#14 0xb7c5feee in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:114

Here is the compiled code from just before the call to pony_alloc_msg and you can clearly see it putting a 33 on the stack.

   0x08075840 <+4288>:  call   0x80be350 <pony_create>
   0x08075845 <+4293>:  mov    $0x21,%ecx
   0x0807584a <+4298>:  mov    $0x6,%edx
   0x0807584f <+4303>:  mov    0xbc(%esp),%esi
   0x08075856 <+4310>:  mov    0xb8(%esp),%edi
   0x0807585d <+4317>:  mov    0xb0(%esp),%ebx
   0x08075864 <+4324>:  mov    0xac(%esp),%ebp
   0x0807586b <+4331>:  mov    %eax,0x98(%esp)
   0x08075872 <+4338>:  mov    0xa0(%esp),%eax
   0x08075879 <+4345>:  mov    %eax,0x94(%esp)
   0x08075880 <+4352>:  mov    0x9c(%esp),%eax
   0x08075887 <+4359>:  movl   $0x21,(%esp)
   0x0807588e <+4366>:  movl   $0x6,0x4(%esp)
   0x08075896 <+4374>:  mov    %edx,0x90(%esp)
   0x0807589d <+4381>:  mov    %ecx,0x8c(%esp)
   0x080758a4 <+4388>:  mov    %eax,0x88(%esp)
   0x080758ab <+4395>:  mov    %esi,0x84(%esp)
   0x080758b2 <+4402>:  mov    %edi,0x80(%esp)
   0x080758b9 <+4409>:  mov    %ebx,0x7c(%esp)
   0x080758bd <+4413>:  mov    %ebp,0x78(%esp)
   0x080758c1 <+4417>:  call   0x80be453 <pony_alloc_msg>

You can see the Pony source code here at line 78. The definition of the actor it is creating is here.

Praetonus commented 7 years ago

The compiler is using ponyint_pool_index to compute that value so we're back to that function. Can you add assert(index < POOL_COUNT); just before returning the value at the end of ponyint_pool_index and see what the results are? In particular, what is the value of size?

pdtwonotes commented 7 years ago

I turns out that the symbol ILP32 is false on this computer, which uname -m identifies as i686. It is indeed a 32-bit processor. This makes ponyint_pool_index assign the wrong value to bits, resulting in an index value that is out of range.

Verified with this test code:

#include <stdio.h>
int main( int agc, const char *arg[] ) {
#ifdef ILP32
  printf("32");
#else
  printf("64");
#endif
}

It prints "64".

I notice that this place in pool.c is the only place in the Pony code that uses ILP32. Other places use PLATFORM_IS_ILP32 which is defined in platform.h

Praetonus commented 7 years ago

You're totally right, it's a typo on my side. I'll fix that.

ponylang / ponyc

Heap corruption on 32bit systems #1517