scipopt / scip

SCIP - Solving Constraint Integer Programs

Flaky Segfault in SCIPsolveConcurrent() #5

Closed Mizux closed 4 weeks ago

Mizux commented 2 years ago

DISCLAIMER: All details (and a minimal reproducible example) are in https://github.com/Mizux/scip-multithread/issues/2

It seems that on GitHub's Linux-hosted runners, SCIPsyncstoreGetWinner() returns -1 nearly every time.

Open questions:

  • Why doesn't the code check the index value or the retcode before using the return value?
  • Why does SCIPsyncstoreGetWinner() return -1 so often?
  • Why is syncstore->lastsync empty when SCIPtpiCollectJobs(jobid) seems to have collected all results correctly?
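To make the first question concrete, here is a minimal sketch of the kind of bounds check being asked about. It is not SCIP code: `DemoStatus`, `DEMO_OKAY`, `DEMO_ERROR` and `demo_select_winner` are hypothetical names invented for illustration, and the real fix in SCIP may look quite different. The idea is only that an out-of-range winner index is reported through a return code rather than caught by an assert.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes for this sketch; these are NOT SCIP's SCIP_RETCODE values. */
typedef enum { DEMO_OKAY = 0, DEMO_ERROR = 1 } DemoStatus;

/* Hypothetical helper: validates a winner index before it is used to index
 * an array of nsolvers entries. On success the index is written to *out;
 * on failure *out is left untouched and DEMO_ERROR is returned. */
DemoStatus demo_select_winner(int idx, int nsolvers, int* out)
{
   assert(out != NULL);

   /* reject negative indices (such as -1) and indices past the end */
   if( idx < 0 || idx >= nsolvers )
      return DEMO_ERROR;

   *out = idx;
   return DEMO_OKAY;
}
```

With a check like this, a caller holding idx = -1 and nsolvers = 16 would get an error code back instead of aborting, in debug builds and release builds alike.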

ref: https://github.com/Mizux/scip-multithread/issues/2#issuecomment-889866195

Mizux commented 2 years ago
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7abe537 in __GI_abort () at abort.c:79
#2  0x00007ffff7abe40f in __assert_fail_base (fmt=0x7ffff7c27128 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x555556147030 "idx >= 0 && idx < nconcsolvers", file=0x555556146ce0 "/usr/local/google/home/corentinl/dev/scip-multithread/build/_deps/scip-src/src/scip/concurrent.c", line=518, function=<optimized out>) at assert.c:92
#3  0x00007ffff7acd662 in __GI___assert_fail (assertion=0x555556147030 "idx >= 0 && idx < nconcsolvers", file=0x555556146ce0 "/usr/local/google/home/corentinl/dev/scip-multithread/build/_deps/scip-src/src/scip/concurrent.c", line=518, function=0x555556147340 <__PRETTY_FUNCTION__.1> "SCIPconcurrentSolve") at assert.c:101
#4  0x0000555555f62d19 in SCIPconcurrentSolve (scip=0x5555561f6eb0) at /usr/local/google/home/corentinl/dev/scip-multithread/build/_deps/scip-src/src/scip/concurrent.c:518
#5  0x00005555558fbf28 in SCIPsolveConcurrent (scip=0x5555561f6eb0) at /usr/local/google/home/corentinl/dev/scip-multithread/build/_deps/scip-src/src/scip/scip_solve.c:3043
#6  0x0000555555558eed in main () at /usr/local/google/home/corentinl/dev/scip-multithread/Foo/src/main.cpp:120

Change frame

(gdb) f 4
#4  0x0000555555f62d19 in SCIPconcurrentSolve (scip=0x5555561f6eb0) at /usr/local/google/home/corentinl/dev/scip-multithread/build/_deps/scip-src/src/scip/concurrent.c:518
518    assert(idx >= 0 && idx < nconcsolvers);

display frame

(gdb) info frame
Stack level 4, frame at 0x7fffffffded0:
 rip = 0x555555f62d19 in SCIPconcurrentSolve (/usr/local/google/home/corentinl/dev/scip-multithread/build/_deps/scip-src/src/scip/concurrent.c:518); saved rip = 0x5555558fbf28
 called by frame at 0x7fffffffdfd0, caller of frame at 0x7fffffffde60
 source language c.
 Arglist at 0x7fffffffdec0, args: scip=0x5555561f6eb0
 Locals at 0x7fffffffdec0, Previous frame's sp is 0x7fffffffded0
 Saved registers:
  rbp at 0x7fffffffdec0, rip at 0x7fffffffdec8

display locals

(gdb) info locals
syncstore = 0x55555621ad40
idx = -1
jobid = 1
i = 16
retcode = SCIP_OKAY
concsolvers = 0x555558327350
nconcsolvers = 16
__PRETTY_FUNCTION__ = "SCIPconcurrentSolve"

Once you are on the correct frame, you can dereference pointers, inspect objects, etc.

(gdb) p (*syncstore)
$10 = {nuses = 17, mode = SCIP_PARA_DETERMINISTIC, initialized = 1, ninitvars = 3, syncdata = 0x5555587ab5a8, lastsync = 0x0, mainscip = 0x5555561f6eb0, stopped = 0, lock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}, nsyncdata = 16, minsyncdelay = 2.7000000000000002, maxnsyncdelay = 7, syncfreqinit = 10, syncfreqmax = 2.7000000000000002, maxnsols = 3, nsolvers = 16}
(gdb) p (*syncstore->lastsync)
Cannot access memory at address 0x0
(gdb) p (syncstore->lastsync)
$11 = (SCIP_SYNCDATA *) 0x0

lastsync = 0x0, i.e. GetWinner seems to return -1 because syncstore->lastsync is null.

Note: retcode is SCIP_OKAY, so retcode = SCIPtpiCollectJobs(jobid); seems to have succeeded. My guess is that lastsync is not being updated accordingly.
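The hypothesis above can be modelled with a toy example. Everything here is an assumption for illustration: `ToySyncData`, `ToySyncStore`, `toy_get_winner` and `toy_winner_or_default` are hypothetical stand-ins, not SCIP types or functions, and the -1 fallback mirrors the guess in this comment rather than confirmed SCIP behaviour. Falling back to solver 0 is just one possible policy, not necessarily what the actual patch does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the sync data and sync store; only the
 * fields needed for this sketch are modelled. */
typedef struct { int winner; } ToySyncData;
typedef struct { ToySyncData* lastsync; int nsolvers; } ToySyncStore;

/* Assumed behaviour: no completed synchronization means no winner (-1). */
int toy_get_winner(const ToySyncStore* store)
{
   assert(store != NULL);

   if( store->lastsync == NULL )
      return -1;

   return store->lastsync->winner;
}

/* Caller-side guard: use solver 0 when the reported winner is out of range. */
int toy_winner_or_default(const ToySyncStore* store)
{
   int idx = toy_get_winner(store);

   if( idx < 0 || idx >= store->nsolvers )
      return 0;

   return idx;
}
```

With lastsync set to NULL and nsolvers = 16, as in the gdb dump, toy_get_winner() yields -1 while toy_winner_or_default() yields 0.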

matbesancon commented 2 years ago

thanks for the detailed report, we will look into the issue

svigerske commented 2 years ago

Still open. 90804ee referred to an issue 5 in a different system.

PS: I don't have permissions to reopen.

Mizux commented 2 years ago

@matbesancon friendly ping to reopen it and have some update from your side ;)

matbesancon commented 1 year ago

@Mizux there is a bug fix on the branch 3258-scip-concurrent-mode-lacking-an-index-check, can you try to see if this fixes your issue?

Mizux commented 1 year ago

Will give a try tomorrow and write my feedback here ! (unless you prefer to give me access to your gitlab and/or create a user using my (at) google.com address ?)

ps: thx for the patch ;)

matbesancon commented 1 year ago

This issue works perfectly thank you!

On Tue, Jan 17, 2023, 19:27 Mizux @.***> wrote:

Will give a try tomorrow and write my feedback here !

(unless you prefer to give me access to your gitlab and/or create a user using my (at) google.com address ?)


matbesancon commented 1 year ago

(unless you prefer to give me access to your gitlab and/or create a user using my (at) google.com address ?)

It would make sense for you to have an account on the internal gitlab, yes :) @Mizux can you send me an email with the address I should use for the account? (to the email on my GitHub profile, or lastname at org dot de)

matbesancon commented 1 year ago

hi @Mizux did you have time to try the fix?

svigerske commented 4 weeks ago

I assume this has been fixed.