oscar-system / Oscar.jl

A comprehensive open source computer algebra system for computations in algebra, geometry, and number theory.
https://www.oscar-system.org

CI failing on nightly #2476

Closed lgoettgens closed 1 year ago

lgoettgens commented 1 year ago

Since around June 14th, the nightly tests have been hanging after running for a few minutes and then get killed after 2h30min.

It started some time between https://github.com/oscar-system/Oscar.jl/commit/d5412e96b6b3921d35e50d008862569b19f009b8 and https://github.com/oscar-system/Oscar.jl/commit/2bca8fc10b57949357d2d2c4cf2fb271155bad89.

I assume this is due to a change in Julia itself, since only nightly fails. Using https://github.com/oscar-system/Oscar.jl/actions/runs/5258937597/jobs/9503831992 and https://github.com/oscar-system/Oscar.jl/actions/runs/5263887152/jobs/9514506365, I could narrow it down to the Julia commit range https://github.com/JuliaLang/julia/compare/320e00db00b...8a1b6422245
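Narrowing the range further could be done by bisecting over the Julia commits in that compare link. The following is only a minimal sketch of the idea: the commit names and the `is_bad` predicate are hypothetical placeholders, and in practice the predicate would have to build that Julia commit, run the tests with a timeout (since the failure is a hang rather than an error), and report whether they got stuck.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first commit for which is_bad(commit) is true.

    Assumptions: commits are ordered oldest to newest, commits[0] is known
    good, commits[-1] is known bad, and every commit after the first bad
    one is bad as well.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Hypothetical example: eight placeholder commits, the hang appears at "c5".
commits = [f"c{i}" for i in range(8)]
culprit = first_bad(commits, lambda c: int(c[1:]) >= 5)
print(culprit)
```

For a failure that only shows up some of the time, the predicate would need to repeat the test run several times before declaring a commit good.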

benlorenz commented 1 year ago

My guess is that https://github.com/JuliaLang/julia/commit/03c4bc128753a0e34ad560e4cc2faa948e0d9e28#diff-b6ee767647e20ffec70782ae28c0f9d50dc5eb5d2e5285f9d7071064434fe3d9 requires a rebuild of libjulia_jll and some GAP jlls, since it changes some structs in gc.h. cc @fingolfin

The process seems to run into a deadlock. I can reproduce this locally with just GAP.jl, which also gets stuck at random (just not every time like Oscar.jl, but maybe this is only because the tests are a lot shorter):

  * frame #0: 0x000014fdceda870c libpthread.so.0`__pthread_cond_wait at futex-internal.h:183
    frame #1: 0x000014fdceda86f1 libpthread.so.0`__pthread_cond_wait at pthread_cond_wait.c:508
    frame #2: 0x000014fdceda8630 libpthread.so.0`__pthread_cond_wait(cond=0x000014fdce1fc920, mutex=0x000014fdce1fc960) at pthread_cond_wait.c:638
    frame #3: 0x000014fdcdd2dd6a libjulia-internal.so.1`uv_cond_wait(cond=0x000014fdce1fc920, mutex=0x000014fdce1fc960) at thread.c:883
    frame #4: 0x000014fdcdcb2175 libjulia-internal.so.1`jl_safepoint_wait_gc at safepoint.c:173:13
    frame #5: 0x000014fdcdcb1f9b libjulia-internal.so.1`segv_handler [inlined] jl_set_gc_and_wait at julia_internal.h:945:5
    frame #6: 0x000014fdcdcb1f7c libjulia-internal.so.1`segv_handler at signals-unix.c:351:9
    frame #7: 0x000014fdcdcb1f12 libjulia-internal.so.1`segv_handler(sig=<unavailable>, info=<unavailable>, context=0x000014fdc1ff93c0) at signals-unix.c:338:24
    frame #8: 0x000014fdcedad8c0 libpthread.so.0`__restore_rt
    frame #9: 0x000014fdcdc68de4 libjulia-internal.so.1`jl_gc_state_save_and_set at julia_threads.h:348:9
    frame #10: 0x000014fdcdc68de0 libjulia-internal.so.1`jl_gc_state_save_and_set [inlined] jl_gc_state_set(old_state='\x01', state='\0', ptls=0x0000000000b90b60) at julia_threads.h:341:22
    frame #11: 0x000014fdcdc68de0 libjulia-internal.so.1`jl_gc_state_save_and_set(ptls=0x0000000000b90b60, state='\0') at julia_threads.h:354:12
    frame #12: 0x000014fdcdc6945f libjulia-internal.so.1`ijl_sig_throw at task.c:756:5
    frame #13: 0x000014fdcdc69447 libjulia-internal.so.1`ijl_sig_throw at task.c:801:5

Thread list:

(lldb) thread info all
thread #1: tid = 6955, 0x000014fdceda870c libpthread.so.0`__pthread_cond_wait at futex-internal.h:183, name = 'julia', stop reason = signal SIGSTOP

thread #2: tid = 6957, 0x000014fdcebedc7c libc.so.6`__GI___sigtimedwait(set=0x000014fdc69fec70, info=0x000014fdc69fecf0, timeout=0x0000000000000000) at sigtimedwait.c:29, name = 'julia', stop reason = signal SIGSTOP

thread #3: tid = 6958, 0x000014fdceda870c libpthread.so.0`__pthread_cond_wait at futex-internal.h:183, name = 'julia', stop reason = signal SIGSTOP

thread #4: tid = 6959, 0x000014fdceda870c libpthread.so.0`__pthread_cond_wait at futex-internal.h:183, name = 'julia', stop reason = signal SIGSTOP

thread #5: tid = 6960, 0x000014fdceda870c libpthread.so.0`__pthread_cond_wait at futex-internal.h:183, name = 'julia', stop reason = signal SIGSTOP

thread #6: tid = 6961, 0x000014fdceda870c libpthread.so.0`__pthread_cond_wait at futex-internal.h:183, name = 'julia', stop reason = signal SIGSTOP

thread #7: tid = 6962, 0x000014fdceda870c libpthread.so.0`__pthread_cond_wait at futex-internal.h:183, name = 'julia', stop reason = signal SIGSTOP

thread #8: tid = 6963, 0x000014fdceda870c libpthread.so.0`__pthread_cond_wait at futex-internal.h:183, name = 'julia', stop reason = signal SIGSTOP

thread #9: tid = 6964, 0x000014fdceda870c libpthread.so.0`__pthread_cond_wait at futex-internal.h:183, name = 'julia', stop reason = signal SIGSTOP

thread #10: tid = 6965, 0x000014fdceda870c libpthread.so.0`__pthread_cond_wait at futex-internal.h:183, name = 'julia', stop reason = signal SIGSTOP

thread #11: tid = 6966, 0x000014fdceda870c libpthread.so.0`__pthread_cond_wait at futex-internal.h:183, name = 'julia', stop reason = signal SIGSTOP

thread #12: tid = 6967, 0x000014fdceda870c libpthread.so.0`__pthread_cond_wait at futex-internal.h:183, name = 'julia', stop reason = signal SIGSTOP

thread #13: tid = 6968, 0x000014fdceda870c libpthread.so.0`__pthread_cond_wait at futex-internal.h:183, name = 'julia', stop reason = signal SIGSTOP

All threads are waiting at 0% CPU load. Might this be memory corruption caused by the changed structs?
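To check quickly that every thread really is parked, the `thread info all` output can be tallied by the function each thread is stopped in. This is a rough sketch, not part of any lldb tooling: the regular expression is an assumption based only on the line format shown above, and `summarize_threads` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches lines such as:
#   thread #2: tid = 6957, 0x... libc.so.6`__GI___sigtimedwait(set=...) at ...
# The pattern is inferred from the output pasted above, not from an lldb spec.
THREAD_LINE = re.compile(r"^thread #(\d+): tid = (\d+), 0x[0-9a-f]+ ([^`\s]+)`(\w+)")


def summarize_threads(lldb_output: str) -> Counter:
    """Count how many threads are stopped in each (library, function) pair."""
    counts = Counter()
    for line in lldb_output.splitlines():
        match = THREAD_LINE.match(line.strip())
        if match:
            _thread_no, _tid, library, function = match.groups()
            counts[(library, function)] += 1
    return counts


# Three lines copied from the thread list above.
sample = """\
thread #1: tid = 6955, 0x000014fdceda870c libpthread.so.0`__pthread_cond_wait at futex-internal.h:183, name = 'julia', stop reason = signal SIGSTOP
thread #2: tid = 6957, 0x000014fdcebedc7c libc.so.6`__GI___sigtimedwait(set=0x000014fdc69fec70, info=0x000014fdc69fecf0, timeout=0x0000000000000000) at sigtimedwait.c:29, name = 'julia', stop reason = signal SIGSTOP
thread #3: tid = 6958, 0x000014fdceda870c libpthread.so.0`__pthread_cond_wait at futex-internal.h:183, name = 'julia', stop reason = signal SIGSTOP
"""

for (library, function), n in summarize_threads(sample).most_common():
    print(f"{n:3d}  {library}`{function}")
```

Run on the full list above, this would group the threads in `__pthread_cond_wait` together and leave the one in `__GI___sigtimedwait` on its own line.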

fingolfin commented 1 year ago

Thank you for filing the issue. I hope we can figure it out soon...

benlorenz commented 1 year ago

This is now fixed on Julia master, thanks Max. Tests with nightly now work on Ubuntu. Unfortunately, the upload job for the Julia macOS nightlies is currently broken; once that is fixed, macOS should work again as well.

Edit: The macOS nightly tests have now succeeded as well.