shadow / shadow

Shadow is a discrete-event network simulator that directly executes real application code, enabling you to simulate distributed systems with thousands of network-connected processes in realistic and scalable private network experiments using your laptop, desktop, or server running Linux.
https://shadow.github.io
Other
1.43k stars 237 forks source link

Test `epoll-writeable-shadow-ptrace` failing in `debug` mode #1563

Closed marcosimioni closed 3 years ago

marcosimioni commented 3 years ago

Describe the issue The following test is consistently failing on my system when I build in debug mode. It does not fail when I build in release mode.


99% tests passed, 1 tests failed out of 157

Total Test time (real) =  49.44 sec

The following tests FAILED:
         30 - epoll-writeable-shadow-ptrace (Timeout)
Errors while running CTest
2021-08-07 07:29:25,822 [INFO] ['ctest', '-j40', '--timeout', '20'] returned 8

Attaching a couple of log files:

(had to split this one, it was exceeding the github attachments limit of 25M) LastTest.log.00.tar.gz LastTest.log.01.tar.gz LastTest.log.02.tar.gz

epoll-writeable-shadow-ptrace.data.tar.gz

To Reproduce Build with:

./setup build --clean --test- debug

And run test with:

./setup test

Operating System (please complete the following information):

Shadow (please complete the following information):

Additional context The epoll-writeable-shadow-ptrace test failure was mentioned here https://github.com/shadow/shadow/issues/1545 and I was asked to create a new issue.

stevenengler commented 3 years ago

From the above logs:

The server exits with exit code 0 and stdout:

########## epoll-writeable test starting ##########
########## epoll-writeable test passed! ##########

The client exits with exit code 137 (sigkill) and stdout:

Recvd 1434
Recvd 1434
Recvd 1434
Recvd 1434
Recvd 1434
Recvd 1434
Recvd 1434
Recvd 1434
Recvd 1434
Recvd 1434
Recvd 1434
Recvd 1434
Recvd 1434
Recvd 1434
Recvd 1434
Recvd 1434
Recvd 1434
Recvd 1434
Recvd 1434
Recvd 1434
Recvd 1434
Recvd 1434
Recvd 1434
Recvd 1434

The test seems to end with:

00:00:00.572480 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [descriptor.c:163] [_descriptor_handleStatusChange] Status changed on desc 3, from ACTIVE|WRITEABLE to ACTIVE|READABLE|WRITEABLE
00:00:00.572487 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [syscall_condition.c:248] [_syscallcondition_logListeningState] Process client.test-epoll-writeable.1000 thread 0x7f7b600062d0 status changed while listening for status on descriptor 3
00:00:00.572499 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [tcp.c:2212] [_tcp_processPacket] done processing in state TCPS_ESTABLISHED
00:00:00.572516 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [syscall_condition.c:248] [_syscallcondition_logListeningState] Process client.test-epoll-writeable.1000 thread 0x7f7b600062d0 signaling while listening for status on descriptor 3
00:00:00.572523 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [syscall_condition.c:248] [_syscallcondition_logListeningState] Process client.test-epoll-writeable.1000 thread 0x7f7b600062d0 stopped listening for status on descriptor 3
00:00:00.572527 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [process.c:566] [process_continue] Continuing thread 1000 in process 1000
00:00:00.572532 [worker-1] 00:00:10.038000000 [DEBUG] [client:11.0.0.1] [process.c:574] [process_continue] switching to thread controller to continue executing process 'client.test-epoll-writeable.1000'
00:00:00.572544 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [thread_ptrace.c:1030] [threadptrace_resume] THREAD_PTRACE_CHILD_STATE_SYSCALL
00:00:00.572548 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [thread_ptrace.c:808] [_threadptrace_handleSyscall] Ptrace not allowing native syscalls
00:00:00.572559 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [syscall_handler.c:169] [_syscallhandler_pre_syscall] SYSCALL_HANDLER_PRE(test-epoll-writeable,pid=1000): handling syscall 45 recvfrom (previously BLOCKed)
00:00:00.572564 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [socket.c:439] [_syscallhandler_recvfromHelper] trying to recv 65535 bytes on socket 3
00:00:00.572578 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [descriptor.c:163] [_descriptor_handleStatusChange] Status changed on desc 3, from ACTIVE|READABLE|WRITEABLE to ACTIVE|WRITEABLE
00:00:00.572594 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [packet.c:688] [packet_addDeliveryStatus] [RCV_SOCKET_DELIVERED] packetID=3:328 11.0.0.2:22222 -> 11.0.0.1:61271 seq=328 ack=1 sack=NA window=121 bytes=1434 header=ACK tsval=10037000000 tsechoreply=10036000000 status=SND_CREATED,SND_TCP_ENQUEUE_THROTTLED,SND_SOCKET_BUFFERED,SND_TCP_ENQUEUE_RETRANSMIT,SND_INTERFACE_SENT,INET_SENT,ROUTER_ENQUEUED,ROUTER_DEQUEUED,RCV_INTERFACE_RECEIVED,RCV_SOCKET_PROCESSED,RCV_TCP_ENQUEUE_UNORDERED,RCV_SOCKET_BUFFERED,RCV_SOCKET_DELIVERED
00:00:00.572609 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [packet.c:688] [packet_addDeliveryStatus] [PDS_DESTROYED] packetID=3:328 11.0.0.2:22222 -> 11.0.0.1:61271 seq=328 ack=1 sack=NA window=121 bytes=1434 header=ACK tsval=10037000000 tsechoreply=10036000000 status=SND_CREATED,SND_TCP_ENQUEUE_THROTTLED,SND_SOCKET_BUFFERED,SND_TCP_ENQUEUE_RETRANSMIT,SND_INTERFACE_SENT,INET_SENT,ROUTER_ENQUEUED,ROUTER_DEQUEUED,RCV_INTERFACE_RECEIVED,RCV_SOCKET_PROCESSED,RCV_TCP_ENQUEUE_UNORDERED,RCV_SOCKET_BUFFERED,RCV_SOCKET_DELIVERED,PDS_DESTROYED
00:00:00.572617 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [tcp.c:2473] [_tcp_receiveUserData] 11.0.0.1:61271 (descriptor 3) <-> 11.0.0.2:22222: receiving 1434 user bytes
00:00:00.572622 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [socket.c:496] [_syscallhandler_recvfromHelper] recv returned 1434
00:00:00.572629 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [syscall_handler.c:199] [_syscallhandler_post_syscall] SYSCALL_HANDLER_POST(test-epoll-writeable,pid=1000): syscall 45 recvfrom result: state=BLOCK->DONE code=1434(n/a)
00:00:00.572635 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [timer.c:317] [timer_setTime] Setting timer value to 0.000000000 seconds and timer interval to 0.000000000 seconds on timer fd -1
00:00:00.572639 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [timer.c:139] [_timer_disarm] timer fd -1 disarmed
00:00:00.572643 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [thread_ptrace.c:1058] [threadptrace_resume] Restoring registers
00:00:00.572650 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [thread_ptrace.c:1070] [threadptrace_resume] ptrace resuming with signal 0
00:00:00.572656 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [thread_ptrace.c:1080] [threadptrace_resume] waiting for next state
00:00:00.572668 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [thread_ptrace.c:663] [_threadptrace_hybridSpin] Got ptrace stop
00:00:00.572672 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [thread_ptrace.c:146] [_getStopReason] STOPREASON_SYSCALL
00:00:00.572678 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [thread_ptrace.c:1030] [threadptrace_resume] THREAD_PTRACE_CHILD_STATE_SYSCALL
00:00:00.572682 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [thread_ptrace.c:804] [_threadptrace_handleSyscall] Ptrace allowing native syscalls
00:00:00.572686 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [thread_ptrace.c:1207] [threadptrace_nativeSyscall] threadptrace_nativeSyscall 228
00:00:00.572690 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [thread_ptrace.c:1159] [_threadptrace_ensureStopped] Not in ipc_syscall; should already be stopped
00:00:00.572695 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [thread_ptrace.c:408] [_threadptrace_getregs] Already have regs
00:00:00.572701 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [thread_ptrace.c:1233] [threadptrace_nativeSyscall] threadptrace_nativeSyscall setting regs: rip=0x7ffff77c2a7a n=228 arg5:140737345534200
00:00:00.572718 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [thread_ptrace.c:160] [_getStopReason] STOPREASON_SIGNAL: 5
00:00:00.572731 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [thread_ptrace.c:1276] [threadptrace_nativeSyscall] threadptrace_nativeSyscall r15:0x0 r14:0x0 r15:0x0 r14:0x0 r13:0x7fffffffebb0 r12:0x555555554f40 rbp:0x7fffffffdcd0 rbx:0x0 r11:0x216 r10:0x7fffffffdf50 r9:0x7ffff77cb8f8 r8:0x48 rax:0xe4 rcx:0x7ffff77c2a7c rdx:0x7fffffffdf50 rsi:0x7fffffffde60 rdi:0x1 orig_rax:0xe4 rip:0x7ffff77c2a7a cs:0x33 eflags:0x216 rsp:0x7fffffffdc60 ss:0x2b fs_base:0x7ffff7fd0fc0 gs_base:0x0 ds:0x0 es:0x0 fs:0x0 gs:0x0
00:00:00.572749 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [thread_ptrace.c:160] [_getStopReason] STOPREASON_SIGNAL: 5
00:00:00.572766 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [thread_ptrace.c:1276] [threadptrace_nativeSyscall] threadptrace_nativeSyscall r15:0x0 r14:0x0 r15:0x0 r14:0x0 r13:0x7fffffffebb0 r12:0x555555554f40 rbp:0x7fffffffdcd0 rbx:0x0 r11:0x316 r10:0x7fffffffdf50 r9:0x7ffff77cb8f8 r8:0x48 rax:0x0 rcx:0x7ffff77c2a7c rdx:0x7fffffffdf50 rsi:0x7fffffffde60 rdi:0x1 orig_rax:0xe4 rip:0x7ffff77c2a7c cs:0x33 eflags:0x216 rsp:0x7fffffffdc60 ss:0x2b fs_base:0x7ffff7fd0fc0 gs_base:0x0 ds:0x0 es:0x0 fs:0x0 gs:0x0
00:00:00.572772 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [thread_ptrace.c:1279] [threadptrace_nativeSyscall] Native syscall result 0 (Success)
00:00:00.572776 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [thread_ptrace.c:1070] [threadptrace_resume] ptrace resuming with signal 0
00:00:00.572782 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [thread_ptrace.c:1080] [threadptrace_resume] waiting for next state
00:00:00.572799 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [thread_ptrace.c:663] [_threadptrace_hybridSpin] Got ptrace stop
00:00:00.572803 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [thread_ptrace.c:146] [_getStopReason] STOPREASON_SYSCALL
00:00:00.572808 [worker-1] 00:00:10.038000000 [TRACE] [client:11.0.0.1] [thread_ptrace.c:1030] [threadptrace_resume] THREAD_PTRACE_CHILD_STATE_SYSCALL
<end of output>
Test time =  20.31 sec
----------------------------------------------------------
Test Failed.
"epoll-writeable-shadow-ptrace" end time: Aug 07 07:29 IST
"epoll-writeable-shadow-ptrace" time elapsed: 00:00:20
----------------------------------------------------------
stevenengler commented 3 years ago

@marcosimioni Can you try changing this line to say debug rather than trace, and then rebuild and re-test shadow?

https://github.com/shadow/shadow/blob/c5968af250c5b1450aca29a0123411e7870a09b4/src/test/CMakeLists.txt#L49

I think we're logging way too much data (556 MB) and this test times out.

marcosimioni commented 3 years ago

@stevenengler apologies for the late reply.

you're right! changing it to debug on this machine, rebuilding with ./setup build --clean --test --debug and then launching tests with ./setup test works fine, all tests passed.

100% tests passed, 0 tests failed out of 157

for the records, this the processor that is mounted on this machine:

lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              40
On-line CPU(s) list: 0-39
Thread(s) per core:  2
Core(s) per socket:  10
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
Stepping:            4
CPU MHz:             800.756
BogoMIPS:            4400.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            14080K
NUMA node0 CPU(s):   0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38
NUMA node1 CPU(s):   1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d
stevenengler commented 3 years ago

This should be fixed now with #1597.