Assert failures at CorrespArr.cuh:142

msoos commented 3 years ago

For example, when solving sc2013.cnf from SAT 2019. Command line ran: ./glucose-gpu -write-stats-period-sec=-1 -max-thread-count=12 -max-memory=45000 -mem-lim=50000 003-sc2013.cnf output:

c
c This is glucose-gpu 1.0 --  based on MiniSAT (Many thanks to MiniSAT team)
c
c Setting block count guideline to 30 (twice the number of multiprocessors)
c |  Automatic Adjustement of the number of solvers. MaxMemory=45000, MaxCores= 12.                       |
c |  One Solver is taking 9.22Mb... Let's take 12 solvers for this run (max 40% of the maxMemory).       |
c |<C2><A0> all clones generated. Memory = 22902.70Mb.                                                             |
c ========================================================================================================|
All solvers launched
Got error too many resources requested for launch when launching the GPU, decreasing the number of warps per block from 16 to 13. Total warps was 480
Got error too many resources requested for launch when launching the GPU, decreasing the number of warps per block from 13 to 10. Total warps was 480
c
c Solver 8: try to adapt solver strategies at 80.958074. There are 16065 learned clauses and 4159 permanently learned
c. 96812 clauses were removed 
c Adjusting for low successive conflicts.
c
c Activating Chanseok Strategy: moved 0 clauses to the permanent set.
c
c Solver 6: try to adapt solver strategies at 81.015932. There are 16052 learned clauses and 4334 permanently learned
c. 96668 clauses were removed 
c Adjusting for low successive conflicts.
c
c Activating Chanseok Strategy: moved 0 clauses to the permanent set.
c
c Solver 4: try to adapt solver strategies at 81.056530. There are 16055 learned clauses and 4374 permanently learned
c. 96632 clauses were removed 
c Adjusting for low successive conflicts.
c
c Activating Chanseok Strategy: moved 0 clauses to the permanent set.
c
c Solver 10: try to adapt solver strategies at 81.357662. There are 16092 learned clauses and 4326 permanently learned
c. 96625 clauses were removed 
c Adjusting for low successive conflicts.
c
c Activating Chanseok Strategy: moved 0 clauses to the permanent set.
c
c Solver 0: try to adapt solver strategies at 81.681613. There are 16109 learned clauses and 4512 permanently learned
c. 96879 clauses were removed 
c Adjusting for low successive conflicts.
c
c Activating Chanseok Strategy: moved 0 clauses to the permanent set.
c
c Solver 2: try to adapt solver strategies at 82.128450. There are 16151 learned clauses and 4170 permanently learned
c. 96958 clauses were removed 
c Adjusting for low successive conflicts.
c
c Activating Chanseok Strategy: moved 0 clauses to the permanent set.
c Reducing gpu clause db, clause count is 928537
c There is 11200036864 free memory out of 11996954624
c Keeping clauses with act >= 106.92
c Done reducing gpu clause db, clause count is 465798
c There is 11203182592 free memory out of 11996954624
Allocating large amount of memory: 536870912
c Reducing gpu clause db, clause count is 3939038
c There is 10144120832 free memory out of 11996954624
c Keeping clauses with act >= 4.01347e-09
c Done reducing gpu clause db, clause count is 1972372
c There is 10169286656 free memory out of 11996954624
c Reducing gpu clause db, clause count is 5393567
c There is 9806479360 free memory out of 11996954624
c Keeping clauses with act >= 2189.67
c Done reducing gpu clause db, clause count is 2699701
c There is 9942794240 free memory out of 11996954624
glucose-gpu: /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: Glucose::DArr<T>::DArr(size_t, T*, Glucose::DestrCheckPointer) [with T = char; size_t = long unsigned int]: Assertion `false' failed.

The full /usr/bin/time --verbose output is:

Command terminated by signal 6
        Command being timed: "./doalarm -t real 5000 ./glucose-gpu -write-stats-period-sec=-1 -max-thread-count=12 -max-memory=45000 -mem-lim=50000 003-sc2013.cnf"
        User time (seconds): 1550.31
        System time (seconds): 95.01
        Percent of CPU this job got: 472%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 5:48.37
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 3130668
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 1507523
        Voluntary context switches: 84517
        Involuntary context switches: 338222
        Swaps: 0
        File system inputs: 0
        File system outputs: 40
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

Note that I used doalarm but that should be fine I think? This repeats with other CNFs, too. Here is 001-80-12-sc2014.cnf (also from SAT 2019):

c
c This is glucose-gpu 1.0 --  based on MiniSAT (Many thanks to MiniSAT team)
c
c Setting block count guideline to 30 (twice the number of multiprocessors)
c |  Automatic Adjustement of the number of solvers. MaxMemory=45000, MaxCores= 12.                       |
c |  One Solver is taking 30.87Mb... Let's take 12 solvers for this run (max 40% of the maxMemory).       |
c |<C2><A0> all clones generated. Memory = 23091.96Mb.                                                             |
c ========================================================================================================|
All solvers launched
Got error too many resources requested for launch when launching the GPU, decreasing the number of warps per block from 16 to 13. Total warps was 480
Got error too many resources requested for launch when launching the GPU, decreasing the number of warps per block from 13 to 10. Total warps was 480
c Reducing gpu clause db, clause count is 806313
c There is 11381440512 free memory out of 11996954624
c Keeping clauses with act >= 75.8506
c Done reducing gpu clause db, clause count is 403771
c There is 11421286400 free memory out of 11996954624
c
c Solver 10: try to adapt solver strategies at 19.904424. There are 18142 learned clauses and 2298 permanently learned
c. 128222 clauses were removed 
c Adjusting for low successive conflicts.
c
c Activating Chanseok Strategy: moved 0 clauses to the permanent set.
c
c Solver 2: try to adapt solver strategies at 20.444165. There are 18916 learned clauses and 2623 permanently learned
c. 128722 clauses were removed 
c Adjusting for low successive conflicts.
c
c Activating Chanseok Strategy: moved 0 clauses to the permanent set.
c
c Solver 8: try to adapt solver strategies at 20.504621. There are 18757 learned clauses and 2405 permanently learned
c. 126798 clauses were removed 
c Adjusting for low successive conflicts.
c
c Activating Chanseok Strategy: moved 0 clauses to the permanent set.
c
c Solver 4: try to adapt solver strategies at 20.692872. There are 17955 learned clauses and 2504 permanently learned
c. 128441 clauses were removed 
c Adjusting for low successive conflicts.
c
c Activating Chanseok Strategy: moved 0 clauses to the permanent set.
c
c Solver 6: try to adapt solver strategies at 20.805395. There are 18390 learned clauses and 2562 permanently learned
c. 129321 clauses were removed 
c Adjusting for low successive conflicts.
c
c Activating Chanseok Strategy: moved 0 clauses to the permanent set.
c
c Solver 0: try to adapt solver strategies at 20.829029. There are 18153 learned clauses and 2719 permanently learned
c. 128332 clauses were removed 
c Adjusting for low successive conflicts.
c
c Activating Chanseok Strategy: moved 0 clauses to the permanent set.
c Reducing gpu clause db, clause count is 1315716
c There is 11325865984 free memory out of 11996954624
c Keeping clauses with act >= 78081.7
c Done reducing gpu clause db, clause count is 659219
c There is 11317477376 free memory out of 11996954624
c Reducing gpu clause db, clause count is 1521626
c There is 11335303168 free memory out of 11996954624
c Keeping clauses with act >= 2.0616e+08
c Done reducing gpu clause db, clause count is 761828
c There is 11354177536 free memory out of 11996954624
c Reducing gpu clause db, clause count is 1669553
c There is 11213668352 free memory out of 11996954624
c Keeping clauses with act >= 7.21443e+11

The /usr/bin/time --verbose output on this one is:

Command exited with non-zero status 10
        Command being timed: "./doalarm -t real 5000 ./glucose-gpu -write-stats-period-sec=-1 -max-thread-count=12 -max-memory=45000 -mem-lim=50000 001-80-12-sc2014.cnf"
        User time (seconds): 7382.54
        System time (seconds): 226.03
        Percent of CPU this job got: 862%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 14:41.90
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 3016148
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 1530121
        Voluntary context switches: 386445
        Involuntary context switches: 1573461
        Swaps: 0
        File system inputs: 0
        File system outputs: 248
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 10

Note that I adjusted the command line and ran 12 threads, 2 of them in parallel on a single node, with a single GPU shared by the two. I did this because I wanted to make sure we don't go over our time budget for this particular run (we can increase it for future runs).

Here are all the assert failures we get from a run on SAT'2019 instances:

[matesoos@nscc03 out-2631131.wlm01-0-drat0]$ xzgrep -i "assert" *.out.xz
003-sc2013.cnf.gz.out.xz:glucose-gpu: /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: Glucose::DArr<T>::DArr(size_t, T*, Glucose::DestrCheckPointer) [with T = char; size_t = long unsigned int]: Assertion `false' failed.
003-sc2013.cnf.gz.out.xz:Error in /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: assertion failed: (_size) < (500000000) message:  values are 508070232  and 500000000  
289-unsat-21x14-sc2011.cnf.gz.out.xz:glucose-gpu: /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: Glucose::DArr<T>::DArr(size_t, T*, Glucose::DestrCheckPointer) [with T = char; size_t = long unsigned int]: Assertion `false' failed.
289-unsat-21x14-sc2011.cnf.gz.out.xz:Error in /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: assertion failed: (_size) < (500000000) message:  values are 503738120  and 500000000  
6s166-sc2013.cnf.gz.out.xz:glucose-gpu: /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: Glucose::DArr<T>::DArr(size_t, T*, Glucose::DestrCheckPointer) [with T = char; size_t = long unsigned int]: Assertion `false' failed.
6s166-sc2013.cnf.gz.out.xz:Error in /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: assertion failed: (_size) < (500000000) message:  values are 605562776  and 500000000  
Haystacks-ext-12_c18.cnf.gz.out.xz:glucose-gpu: /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: Glucose::DArr<T>::DArr(size_t, T*, Glucose::DestrCheckPointer) [with T = char; size_t = long unsigned int]: Assertion `false' failed.
Haystacks-ext-12_c18.cnf.gz.out.xz:Error in /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: assertion failed: (_size) < (500000000) message:  values are 1513667184  and 500000000  
Haystacks-ext-13_c18.cnf.gz.out.xz:glucose-gpu: /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: Glucose::DArr<T>::DArr(size_t, T*, Glucose::DestrCheckPointer) [with T = char; size_t = long unsigned int]: Assertion `false' failed.
Haystacks-ext-13_c18.cnf.gz.out.xz:Error in /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: assertion failed: (_size) < (500000000) message:  values are 804146928  and 500000000  
MM-23-2-2-2-2-3.cnf.gz.out.xz:glucose-gpu: /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: Glucose::DArr<T>::DArr(size_t, T*, Glucose::DestrCheckPointer) [with T = char; size_t = long unsigned int]: Assertion `false' failed.
MM-23-2-2-2-2-3.cnf.gz.out.xz:Error in /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: assertion failed: (_size) < (500000000) message:  values are 510282416  and 500000000  
Pb-chnl15-16_c18.cnf.gz.out.xz:glucose-gpu: /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: Glucose::DArr<T>::DArr(size_t, T*, Glucose::DestrCheckPointer) [with T = char; size_t = long unsigned int]: Assertion `false' failed.
Pb-chnl15-16_c18.cnf.gz.out.xz:Error in /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: assertion failed: (_size) < (500000000) message:  values are 1097877168  and 500000000  
QG7a-gensys-icl001.sat05-3822.reshuffled-07-sc2007.cnf.gz.out.xz:glucose-gpu: /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: Glucose::DArr<T>::DArr(size_t, T*, Glucose::DestrCheckPointer) [with T = char; size_t = long unsigned int]: Assertion `false' failed.
QG7a-gensys-icl001.sat05-3822.reshuffled-07-sc2007.cnf.gz.out.xz:Error in /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: assertion failed: (_size) < (500000000) message:  values are 723898112  and 500000000  
eqsparcl12bpwtrc12.cnf.gz.out.xz:glucose-gpu: /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: Glucose::DArr<T>::DArr(size_t, T*, Glucose::DestrCheckPointer) [with T = char; size_t = long unsigned int]: Assertion `false' failed.
eqsparcl12bpwtrc12.cnf.gz.out.xz:Error in /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: assertion failed: (_size) < (500000000) message:  values are 1004844840  and 500000000  
hcp_d20_20.cnf.gz.out.xz:glucose-gpu: /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: Glucose::DArr<T>::DArr(size_t, T*, Glucose::DestrCheckPointer) [with T = char; size_t = long unsigned int]: Assertion `false' failed.
hcp_d20_20.cnf.gz.out.xz:Error in /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: assertion failed: (_size) < (500000000) message:  values are 540988352  and 500000000  
k_unsat-sc2013.cnf.gz.out.xz:glucose-gpu: /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: Glucose::DArr<T>::DArr(size_t, T*, Glucose::DestrCheckPointer) [with T = char; size_t = long unsigned int]: Assertion `false' failed.
k_unsat-sc2013.cnf.gz.out.xz:Error in /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: assertion failed: (_size) < (500000000) message:  values are 873798944  and 500000000  
le450_15b.col.15-sc2018.cnf.gz.out.xz:glucose-gpu: /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: Glucose::DArr<T>::DArr(size_t, T*, Glucose::DestrCheckPointer) [with T = char; size_t = long unsigned int]: Assertion `false' failed.
le450_15b.col.15-sc2018.cnf.gz.out.xz:Error in /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: assertion failed: (_size) < (500000000) message:  values are 545147544  and 500000000  
mod2c-3cage-unsat-10-2.sat05-2567.reshuffled-07-sc2007.cnf.gz.out.xz:glucose-gpu: /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: Glucose::DArr<T>::DArr(size_t, T*, Glucose::DestrCheckPointer) [with T = char; size_t = long unsigned int]: Assertion `false' failed.
mod2c-3cage-unsat-10-2.sat05-2567.reshuffled-07-sc2007.cnf.gz.out.xz:Error in /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: assertion failed: (_size) < (500000000) message:  values are 523740320  and 500000000  
mod2c-3cage-unsat-10-3.sat05-2568.reshuffled-07-sc2007.cnf.gz.out.xz:glucose-gpu: /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: Glucose::DArr<T>::DArr(size_t, T*, Glucose::DestrCheckPointer) [with T = char; size_t = long unsigned int]: Assertion `false' failed.
mod2c-3cage-unsat-10-3.sat05-2568.reshuffled-07-sc2007.cnf.gz.out.xz:Error in /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: assertion failed: (_size) < (500000000) message:  values are 612333624  and 500000000  
ndhf_xits_16_UNKNOWN-sc2009.cnf.gz.out.xz:glucose-gpu: /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: Glucose::DArr<T>::DArr(size_t, T*, Glucose::DestrCheckPointer) [with T = char; size_t = long unsigned int]: Assertion `false' failed.
ndhf_xits_16_UNKNOWN-sc2009.cnf.gz.out.xz:Error in /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: assertion failed: (_size) < (500000000) message:  values are 1123777416  and 500000000  
rbcl_xits_11_UNKNOWN-sc2009.cnf.gz.out.xz:glucose-gpu: /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: Glucose::DArr<T>::DArr(size_t, T*, Glucose::DestrCheckPointer) [with T = char; size_t = long unsigned int]: Assertion `false' failed.
rbcl_xits_11_UNKNOWN-sc2009.cnf.gz.out.xz:Error in /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: assertion failed: (_size) < (500000000) message:  values are 530837360  and 500000000  
sha1r17m147ABCD.cnf.gz.out.xz:glucose-gpu: /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: Glucose::DArr<T>::DArr(size_t, T*, Glucose::DestrCheckPointer) [with T = char; size_t = long unsigned int]: Assertion `false' failed.
sha1r17m147ABCD.cnf.gz.out.xz:Error in /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: assertion failed: (_size) < (500000000) message:  values are 505803384  and 500000000  
size_5_5_5_i171_r12.cnf.gz.out.xz:glucose-gpu: /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: Glucose::DArr<T>::DArr(size_t, T*, Glucose::DestrCheckPointer) [with T = char; size_t = long unsigned int]: Assertion `false' failed.
size_5_5_5_i171_r12.cnf.gz.out.xz:Error in /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: assertion failed: (_size) < (500000000) message:  values are 702172064  and 500000000  
slp-synthesis-aes-top21-sc2011.cnf.gz.out.xz:glucose-gpu: /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: Glucose::DArr<T>::DArr(size_t, T*, Glucose::DestrCheckPointer) [with T = char; size_t = long unsigned int]: Assertion `false' failed.
slp-synthesis-aes-top21-sc2011.cnf.gz.out.xz:Error in /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: assertion failed: (_size) < (500000000) message:  values are 508510056  and 500000000  
stone-width3chain-nmarkers-20_shuffled-sc2016.cnf.gz.out.xz:glucose-gpu: /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: Glucose::DArr<T>::DArr(size_t, T*, Glucose::DestrCheckPointer) [with T = char; size_t = long unsigned int]: Assertion `false' failed.
stone-width3chain-nmarkers-20_shuffled-sc2016.cnf.gz.out.xz:Error in /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: assertion failed: (_size) < (500000000) message:  values are 629348952  and 500000000  
sv-comp19_prop-reachsafety.triangular-longer_false-unreach-call.i-witness.cnf.gz.out.xz:glucose-gpu: /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: Glucose::DArr<T>::DArr(size_t, T*, Glucose::DestrCheckPointer) [with T = char; size_t = long unsigned int]: Assertion `false' failed.
sv-comp19_prop-reachsafety.triangular-longer_false-unreach-call.i-witness.cnf.gz.out.xz:Error in /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: assertion failed: (_size) < (500000000) message:  values are 692878536  and 500000000  
tseitingrid7x160_shuffled-sc2016.cnf.gz.out.xz:glucose-gpu: /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: Glucose::DArr<T>::DArr(size_t, T*, Glucose::DestrCheckPointer) [with T = char; size_t = long unsigned int]: Assertion `false' failed.
tseitingrid7x160_shuffled-sc2016.cnf.gz.out.xz:Error in /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: assertion failed: (_size) < (500000000) message:  values are 521156496  and 500000000  
unsat-set-a-fclqcolor-20-13-14.sat05-1278.reshuffled-07-sc2007.cnf.gz.out.xz:glucose-gpu: /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: Glucose::DArr<T>::DArr(size_t, T*, Glucose::DestrCheckPointer) [with T = char; size_t = long unsigned int]: Assertion `false' failed.
unsat-set-a-fclqcolor-20-13-14.sat05-1278.reshuffled-07-sc2007.cnf.gz.out.xz:Error in /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: assertion failed: (_size) < (500000000) message:  values are 579276464  and 500000000  
x2_128.shuffled-as.sat03-1598-sc2002.cnf.gz.out.xz:glucose-gpu: /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: Glucose::DArr<T>::DArr(size_t, T*, Glucose::DestrCheckPointer) [with T = char; size_t = long unsigned int]: Assertion `false' failed.
x2_128.shuffled-as.sat03-1598-sc2002.cnf.gz.out.xz:Error in /home/projects/11000744/matesoos/gpu-share-sat/gpu-share-sat/gpu/CorrespArr.cuh:142: assertion failed: (_size) < (500000000) message:  values are 576391032  and 500000000

I hope my command line parameter changes (dividing everything by 2) is OK?

Cheers,

Mate

PS: I was running this under:

memlimit="46000000"
ulimit -v $memlimit

As I usually do, to make sure things don't go astray.

nicolasprevot commented 3 years ago

I can reproduce locally. I will investigate.

Looks like it only happens when we reduce the gpu clause db. My bad. I usually run the non-release on only a few instances, and I only run the release version (without assertions, much more performant) on a larger number of instances. I will run the non-release on more instances.

msoos commented 3 years ago

Ahhh, sorry, I forgot I should have compiled the release version! My bad :( Ok, will do that once you fix this issue! Sorry again!

nicolasprevot commented 3 years ago

I have fixed an actual non-release bug. The one you mentioned was an assertion that a device array would would not have a size bigger than 500 millions. I guess that when I added that line, 500 millions seemed really huge. But I suspect it could happen with arrays with a type of size 1. We use such arrays when copying the new clauses to the GPU as well as sending the updates to the assignments. So I've increased the limit.

Thanks for your help!

msoos commented 3 years ago

Thanks for the fix! I have now made sure to pull the update and build a release version, so we get the extra speed as well :) I scheduled a run, I'll get back to you with the results. It seems that we will be able to schedule one 24-thread run on 30 nodes for about 12-15h, to do a full SAT competition run. I'm expecting somewhere around 7000 hours of CPU core time/run. We should be able to burn at least 200k hours, so we should be able to do about 30 full runs to try out different options. We can schedule them "stacked", i.e. we can put 3-5 into the queue and they will be ran that way, perhaps in parallel (on different 30 nodes), or sequentially. So we should be able to get some data back, tune, etc.

Thanks again for the quick fix, I'll get back to you with the results,

Mate

nicolasprevot commented 3 years ago

Hi

I think that the gpu version will do better with 32 threads than with 24. (on my desktop, it did better with 32 threads than with 16, on a 16 cores machine). We're limited to 32 threads though.

So I think the commands we can run are:

glucose-gpu_release -write-stats-period-sec=-1 -max-thread-count=24 -max-memory=90000 -mem-lim=100000 glucose-gpu_release -write-stats-period-sec=-1 -max-thread-count=32 -max-memory=90000 -mem-lim=100000 glucose-gpu_release -write-stats-period-sec=-1 -max-thread-count=32 -gpu-inc-reduce-db=60000 -max-memory=90000 -mem-lim=100000 glucose-gpu_release -write-stats-period-sec=-1 -max-thread-count=32 -gpu-inc-reduce-db=20000 -max-memory=90000 -mem-lim=100000

glucose-syrup_release -maxnbthreads=24 -maxmemory=90000 -mem-lim=100000 glucose-syrup_release -maxnbthreads=32 -maxmemory=90000 -mem-lim=100000

It's really exciting to be able to test a full competition run in half a day rather than weeks, thanks!

If we want to do more run, maybe it would be possible to to run it on the sat 2019 competition?

Nicolas

nicolasprevot / GpuShareSat

Assert failures at CorrespArr.cuh:142 #6