ssrg-vt / popcorn-kernel

Popcorn Linux kernel for distributed thread execution

arm64 mt example from popcorn-kernel-lib triggers page server panic #69

Closed bxatnarf closed 5 years ago

bxatnarf commented 5 years ago

A page server panic happens when the mt example from https://github.com/ssrg-vt/popcorn-kernel-lib is executed on arm64. This is likely related to issue https://github.com/ssrg-vt/popcorn-kernel/issues/61.

The kernel panic occurs when pthread_join is called for the final time by the origin process. It reproduces even with the LOOPS constant set to 1.

The kernel log for the origin kernel looks as follows:

[  373.064727]   [292] ->munmap [260/1] ffffa4936000+800000
[  373.092546]   [292] zap ffffa5133000
[  373.092853]   [292] zap ffffa5134000
[  373.093575]   [292] zap ffffa5135000
[  373.108497]   [292] ->munmap [260/1] ffffa4136000+800000
[  373.142179]   [292] zap ffffa4933000
[  373.142473]   [292] zap ffffa4934000
[  373.143412]   [292] zap ffffa4935000
[  373.159882]   [292] ->munmap [260/1] ffffa3936000+800000
[  373.192907]   [292] zap ffffa4133000
[  373.193205]   [292] zap ffffa4134000
[  373.193389]   [292] zap ffffa4135000
[  373.207442]   [292] ->munmap [260/1] ffffa3136000+800000
[  373.222413]   [292] zap ffffa3933000
[  373.228842]   [292] zap ffffa3934000
[  373.229069]   [292] zap ffffa3935000
[  373.249536] 
[  373.249536] ## PAGEFAULT [292] 455000 R 455fb0 54 0
[  373.249988]   [292] fresh at origin. continue
[  373.250282] ------------------ Start panicking -----------------
[  373.252084] page_server_panic: 455000 00000000733c3faf 0 00000000fc01446d 0
[  373.252376] CPU: 0 PID: 292 Comm: mt Not tainted 4.19.0-rc5-popcorn+ #72
[  373.252626] Hardware name: linux,dummy-virt (DT)
[  373.252865] pstate: 60000000 (nZCv daif -PAN -UAO)
[  373.253078] pc : 0000000000455fb0
[  373.253265] lr : 0000000000406e5c
[  373.253442] sp : 0000fffff62462f0
[  373.253618] x29: 0000fffff62462f0 x28: 0000000000000000 
[  373.253953] x27: 0000000000000000 x26: 0000000000000000 
[  373.254272] x25: 0000000000406e24 x24: 0000000000406d6c 
[  373.254587] x23: 0000000000000001 x22: 000000000049aeb0 
[  373.255405] x21: 0000000000000000 x20: 000000000049ac88 
[  373.255723] x19: 0000000000000002 x18: 0000000000000000 
[  373.256055] x17: 0000000000000001 x16: 0000000000000000 
[  373.256372] x15: 00000000004a12f8 x14: 0000000000000000 
[  373.256685] x13: 0000000000000000 x12: 0000000000000012 
[  373.257039] x11: 0101010101010101 x10: 0000000000000000 
[  373.257354] x9 : 0000000000000004 x8 : 0000000000000000 
[  373.257671] x7 : 0000000000000000 x6 : 0000000000000000 
[  373.258000] x5 : 304ad781d8199897 x4 : 00000000004a0e98 
[  373.258317] x3 : 0000000000406e24 x2 : 0000000000000001 
[  373.259116] x1 : 0000000000000000 x0 : 0000000000000000 
[  373.259564] Call trace:
[  373.260283] ------------[ cut here ]------------
[  373.260466] kernel BUG at kernel/popcorn/page_server.c:623!
[  373.261241] Internal error: Oops - BUG: 0 [#1] SMP
[  373.261533] Modules linked in: msg_socket
[  373.261851] CPU: 0 PID: 292 Comm: mt Not tainted 4.19.0-rc5-popcorn+ #72
[  373.262055] Hardware name: linux,dummy-virt (DT)
[  373.262209] pstate: 20000005 (nzCv daif -PAN -UAO)
[  373.262426] pc : page_server_panic+0x100/0x110
[  373.262587] lr : page_server_panic+0x100/0x110
[  373.262796] sp : ffff000009fe3b50
[  373.262931] x29: ffff000009fe3b50 x28: ffff8000f7a20000 
[  373.263129] x27: 0000000000000004 x26: ffff8000f79823a8 
[  373.263329] x25: 0000000000000000 x24: ffff8000f7982300 
[  373.263546] x23: 0000000000000000 x22: 0000000000000000 
[  373.263743] x21: ffff8000f95542a8 x20: 0000000000455000 
[  373.263961] x19: ffff8000f7ae22a8 x18: 0000000000000010 
[  373.264154] x17: 0000000000000000 x16: 0000000000000000 
[  373.264354] x15: ffffffffffffffff x14: ffff000008b80908 
[  373.264554] x13: ffff000089867477 x12: ffff00000986747f 
[  373.264758] x11: ffff000008c56fc0 x10: 0000000000000000 
[  373.264967] x9 : ffff000008c56fe8 x8 : 0000000000000000 
[  373.265166] x7 : 0000000055cad208 x6 : 0000000000000001 
[  373.265366] x5 : 0000000000000000 x4 : 0000000000000000 
[  373.265565] x3 : 0000000000000000 x2 : ffff8000f7a21680 
[  373.265777] x1 : 42f4e4c09f614200 x0 : 0000000000000000 
[  373.266035] Process mt (pid: 292, stack limit = 0x00000000b3e9ef26)
[  373.266254] Call trace:
[  373.266372]  page_server_panic+0x100/0x110
[  373.266532]  __handle_mm_fault+0x30c/0xcd0
[  373.266674]  handle_mm_fault+0x1c0/0x310
[  373.266829]  do_page_fault+0x198/0x530
[  373.266982]  do_translation_fault+0xa4/0xb8
[  373.267124]  do_mem_abort+0x68/0x110
[  373.267250]  do_el0_ia_bp_hardening+0x64/0xa8
[  373.267398]  el0_ia+0x1c/0x20
[  373.267798] Code: f9401000 d287d801 8b010000 97f9a687 (d4210000) 
[  373.268318] ---[ end trace abfafdd25b846f0c ]---
[  373.268656] BUG: sleeping function called from invalid context at include/linux/percpu-rwsem.h:34
[  373.268960] in_atomic(): 0, irqs_disabled(): 128, pid: 292, name: mt
[  373.269192] INFO: lockdep is turned off.
[  373.269325] irq event stamp: 7264
[  373.269489] hardirqs last  enabled at (7263): [<ffff00000811d040>] console_unlock+0x3f0/0x5e8
[  373.269765] hardirqs last disabled at (7264): [<ffff000008080f3c>] do_debug_exception+0xec/0x184
[  373.270210] softirqs last  enabled at (7234): [<ffff0000080814a0>] __do_softirq+0x2a0/0x4c0
[  373.270493] softirqs last disabled at (7227): [<ffff0000080aea1c>] irq_exit+0xe4/0x120
[  373.270893] CPU: 0 PID: 292 Comm: mt Tainted: G      D           4.19.0-rc5-popcorn+ #72
[  373.271139] Hardware name: linux,dummy-virt (DT)
[  373.271304] Call trace:
[  373.271415]  dump_backtrace+0x0/0x1c0
[  373.271553]  show_stack+0x24/0x30
[  373.271682]  dump_stack+0xbc/0xf4
[  373.271824]  ___might_sleep+0x158/0x228
[  373.271982]  __might_sleep+0x58/0x90
[  373.272118]  exit_signals+0x3c/0x260
[  373.272243]  do_exit+0xf0/0xaf0
[  373.272364]  die+0x1e0/0x1f8
[  373.272474]  bug_handler+0x68/0x98
[  373.272597]  brk_handler+0xfc/0x1b0
[  373.272720]  do_debug_exception+0xa4/0x184
[  373.272879]  el1_dbg+0x18/0x78
[  373.273014]  page_server_panic+0x100/0x110
[  373.273158]  __handle_mm_fault+0x30c/0xcd0
[  373.273301]  handle_mm_fault+0x1c0/0x310
[  373.273435]  do_page_fault+0x198/0x530
[  373.273532]  do_translation_fault+0xa4/0xb8
[  373.273632]  do_mem_abort+0x68/0x110
[  373.273725]  do_el0_ia_bp_hardening+0x64/0xa8
[  373.273846]  el0_ia+0x1c/0x20
[  373.287410] EXITED [292] local / 0xb
[  373.287703] TERMINATE [260/1] with 0x11
[  373.311839]   [292] zap 400000
[  373.312090]   [292] zap 401000
[  373.312242]   [292] zap 402000
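
For readers following the log: the PAGEFAULT line reports 455000 next to 455fb0, and 455fb0 matches the user-space pc in the first register dump. The trace also ends in el0_ia, so this looks like an instruction fetch fault in the page that contains the pc. Below is a minimal sketch of that page arithmetic. It assumes 4 KiB pages, which is consistent with the 0x1000 stride of the zap lines but is not stated in the log. The helper names are hypothetical and are not taken from page_server.c.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages (the "zap" addresses in the log step by 0x1000). */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((uint64_t)1 << SKETCH_PAGE_SHIFT)

/* Base address of the page containing addr. */
static uint64_t sketch_page_base(uint64_t addr)
{
    return addr & ~(SKETCH_PAGE_SIZE - 1);
}

/* Byte offset of addr within its page. */
static uint64_t sketch_page_offset(uint64_t addr)
{
    return addr & (SKETCH_PAGE_SIZE - 1);
}

/* Checks the values from the log: pc 0x455fb0 lies in page 0x455000. */
static void sketch_check_log_values(void)
{
    assert(sketch_page_base(0x455fb0ULL) == 0x455000ULL);
    assert(sketch_page_offset(0x455fb0ULL) == 0xfb0ULL);
}
```

Calling sketch_check_log_values() confirms that the two addresses on the PAGEFAULT line refer to the same page, which is the page passed to page_server_panic.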
xjtuwxg commented 5 years ago

I used the merge branch (commit id "13be3cc6108f59487eb1a2fa58f4b3960773e53") on two ARM64 VMs and ran mt from popcorn-kernel-lib. It runs fine. I tested two cases: 1) THREADS=32, LOOPS=100; 2) THREADS=2, LOOPS=2.

[   97.294419]   [235] zap ffffc17ad000

popcorn@arm01:~$
popcorn@arm01:~$ uname -a
Linux arm01 4.19.0-rc5-popcorn+ #148 SMP Sun Dec 16 22:36:17 EST 2018 aarch64 GNU/Linux
[   95.236460]   [234] zap ffffa7183000
[   95.236825]   [234] zap ffffa7184000

popcorn@arm02:~$ uname -a
Linux arm02 4.19.0-rc5-popcorn+ #148 SMP Sun Dec 16 22:36:17 EST 2018 aarch64 GNU/Linu
bxatnarf commented 5 years ago

Here is the mt binary that I'm seeing problems with. I tested with the same kernel version. If this mt binary doesn't trigger the bug, then we should compare our Linux build configurations.

bxatnarf commented 5 years ago

The version of mt that triggers the bug has LOOPS=1, THREADS=32. I also added a printf statement to print out the current loop number, but that shouldn't make a difference.

diff --git a/lib/migrate.c b/lib/migrate.c
index 552ac34..a062900 100644
--- a/lib/migrate.c
+++ b/lib/migrate.c
@@ -10,22 +10,22 @@
 #include <string.h>

 #ifdef __x86_64__
-#define SYSCALL_POPCORN_MIGRATE    330
-#define SYSCALL_POPCORN_PROPOSE_MIGRATION  331
-#define SYSCALL_POPCORN_GET_THREAD_STATUS  332
-#define SYSCALL_POPCORN_GET_NODE_INFO  333
+#define SYSCALL_POPCORN_MIGRATE    335
+#define SYSCALL_POPCORN_PROPOSE_MIGRATION  336
+#define SYSCALL_POPCORN_GET_THREAD_STATUS  337
+#define SYSCALL_POPCORN_GET_NODE_INFO  338
 #define SYSCALL_GETTID 186
 #elif __aarch64__
-#define SYSCALL_POPCORN_MIGRATE    285
-#define SYSCALL_POPCORN_PROPOSE_MIGRATION  286
-#define SYSCALL_POPCORN_GET_THREAD_STATUS  287
-#define SYSCALL_POPCORN_GET_NODE_INFO  288
+#define SYSCALL_POPCORN_MIGRATE    294
+#define SYSCALL_POPCORN_PROPOSE_MIGRATION  295
+#define SYSCALL_POPCORN_GET_THREAD_STATUS  296
+#define SYSCALL_POPCORN_GET_NODE_INFO  297
 #define SYSCALL_GETTID 178
 #elif __powerpc64__
-#define SYSCALL_POPCORN_MIGRATE    379
-#define SYSCALL_POPCORN_PROPOSE_MIGRATION  380
-#define SYSCALL_POPCORN_GET_THREAD_STATUS  381
-#define SYSCALL_POPCORN_GET_NODE_INFO  382
+#define SYSCALL_POPCORN_MIGRATE    389
+#define SYSCALL_POPCORN_PROPOSE_MIGRATION  390
+#define SYSCALL_POPCORN_GET_THREAD_STATUS  391
+#define SYSCALL_POPCORN_GET_NODE_INFO  392
 #define SYSCALL_GETTID 207
 #else
 #error Does not support this architecture

diff --git a/src/mt.c b/src/mt.c
index ca68a85..5e43cb1 100644
--- a/src/mt.c
+++ b/src/mt.c
@@ -7,7 +7,7 @@
 #include "migrate.h"

 const int THREADS = 32;
-const int LOOPS = 100;
+const int LOOPS = 1;

 pthread_barrier_t barrier_start;
 pthread_barrier_t barrier_end;
@@ -33,6 +33,7 @@ void *child(void *arg)
        int i;

        for (i = 0; i < LOOPS; i++) {
+               printf("--------loop %d ---\n", i);
                printf("Entering %d %d\n", param->tid, tid);
                pthread_barrier_wait(&barrier_start);
                printf("Entered %d %d\n", param->tid, tid);
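
The migrate.c hunk above hard-codes raw syscall numbers per architecture, so the library only works if those numbers match the syscall table of the running kernel. A number with no matching entry normally fails with ENOSYS. Here is a minimal sketch of the same pattern, using only the SYSCALL_GETTID values from the diff (186, 178, 207) because gettid exists on any Linux kernel. The Popcorn syscalls are deliberately not invoked, since their arguments are not shown in this thread. The SKETCH_ names and helper functions are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Same per-architecture selection as lib/migrate.c, gettid only. */
#ifdef __x86_64__
#define SKETCH_SYSCALL_GETTID 186
#elif __aarch64__
#define SKETCH_SYSCALL_GETTID 178
#elif __powerpc64__
#define SKETCH_SYSCALL_GETTID 207
#else
#error Does not support this architecture
#endif

/* Thread id via the hard-coded number, as the library does. */
static long sketch_gettid_raw(void)
{
    return syscall(SKETCH_SYSCALL_GETTID);
}

/* Thread id via the number exported by the installed kernel headers. */
static long sketch_gettid_headers(void)
{
    return syscall(SYS_gettid);
}

/* If the hard-coded number were stale, these two would disagree. */
static void sketch_check_gettid(void)
{
    assert(sketch_gettid_raw() == sketch_gettid_headers());
}
```

A similar comparison cannot be written for the Popcorn numbers from user space alone, because they are not in the stock headers. The safest check is to compare the library's values against the syscall table of the kernel tree that is actually booted.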
bxatnarf commented 5 years ago

This bug seems to be fixed in https://github.com/ssrg-vt/popcorn-kernel/commit/bf0c818be9a088502956fc058c99a4783156d376. mt now runs on arm64 (even with 32 threads and 32 loops).

.bx