msoos / cryptominisat

An advanced SAT solver
https://www.msoos.org
Other
817 stars 181 forks source link

Free-ing of watchlists is very slow on multi-threaded solving #487

Closed capiman closed 5 years ago

capiman commented 6 years ago

When you let CMS run and especially with big CNF and a lot of memory, I think there could be a way to speed up calculation. This is nothing new, I think it was already in there (in Minisat?) or early versions of CMS.

I think it is because destructors clean up memory/clauses/...

int Main::solve() { ...

ifdef FAST_EXIT

exit(4711);

endif // FAST_EXIT

return correctReturnValue(ret);

}

This is not the correct way, because it destroys the return value, but something like

ifdef FAST_EXIT

exit(correctReturnValue(ret));

endif // FAST_EXIT

I even can't tell why it helps. This is just a flag/a marker, that there is perhaps some easy change, where you can save some seconds (or even a lot of time depending how much memory is currently in use, perhaps even more with multiple threads).

I assume nothing for regular debugging, because I assume memory leak checkers tells lot of memory, which is not freed before program ends.

capiman commented 6 years ago

See https://github.com/msoos/cryptominisat/blob/cryptominisat-2.9.6/cmsat/Main.cpp#L799

msoos commented 6 years ago

Thanks for the heads up :) Actually, I prefer freeing memory. Maybe I'm stupid but that's how it is... I don't like hacks so much. There are other places that can really speed up the solving but this is really only 3-4 secs at most, and we have 5000s timeout. So I'd rather concentrate on things that can help 5-10% of that, i.e. 50-500 seconds :)

Things are getting better for CMS though, it seems like the SAT Competition version will be quite OK :)

capiman commented 6 years ago

Just my current test (as written in my other bugreport). I doing tests with a CNF of 4.5 GBytes. 12 threads. It pumped up to 60 Gbytes RAM. The reading in worked fine (with the "p cnf" header). No error during calculation. No wrong result. Ok, no result yet, but this was already known before, because it is still too difficult. The calculation timeout of 9000s was hit quite good: c Total time 9328.21 Last line (due to timeout) s INDETERMINATE So stat is done, red/irred files are already done. Now it takes since around 2 o'clock to free all memory. It is now almost 7 o'clock, 5 hours later. But progress: From the 60 GBytes RAM, we are meanwhile down to 15.4 GBytes. Only one thread is cleaning up? What is it doing, that takes so long? Removing clause by clause, and maintaining all other structures up-to-date?

msoos commented 6 years ago

That is another issue. It's probably still solving..... It didn't exit :( Crap. Can you attach a debugger and check where it's stuck? :) You can just use gdb and attach it to a running process:

https://stackoverflow.com/questions/2308653/can-i-use-gdb-to-debug-a-running-process

Can you please check it which thread hasn't exited yet and what it's doing? It must be stuck somewhere.... Would be nice to know before tomorrow's deadline :D

capiman commented 6 years ago

What to enter (which gdb commands) once the problem occurs and I attached gdb to the PID (tested it and it seems to work). I use a small timeout --maxtime=90. I am a bit time limited today, but I see what I can do/get in the next few minutes.

capiman commented 6 years ago

I think I am now there where I can enter commands into gdb, which is attached to CMS. But: How can it reach dumpred/dumpirred/stats, when not all threads are finished? Someone signaled finished, but has not finished?

capiman commented 6 years ago

2 threads are running in this situation after printing finals stats: I assume 1 main thread and one worker thread. EXE from buildserver is not helpful, no debug infos. I rerun with my self compiled MinGW build.

Reading symbols from D:\cryptominisat-20180407\src\cryptominisat5.exe...(no debugging symbols found)...done. Attaching to program `D:\cryptominisat-20180407\src\cryptominisat5.exe', process 5892 [New Thread 5892.0x1514] [New Thread 5892.0x22c8] (gdb) thread apply all bt

Thread 2 (Thread 5892.0x22c8):

0 0x00007ffa155d3c51 in ntdll!DbgBreakPoint () from C:\Windows\SYSTEM32\ntdll.dll

1 0x00007ffa156001db in ntdll!DbgUiRemoteBreakin () from C:\Windows\SYSTEM32\ntdll.dll

2 0x00007ffa14071fe4 in KERNEL32!BaseThreadInitThunk () from C:\Windows\System32\kernel32.dll

3 0x00007ffa1559f061 in ntdll!RtlUserThreadStart () from C:\Windows\SYSTEM32\ntdll.dll

4 0x0000000000000000 in ?? ()

Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 1 (Thread 5892.0x1514):

0 0x00007ffa155607d3 in ntdll!RtlAcquireSRWLockExclusive () from C:\Windows\SYSTEM32\ntdll.dll

1 0x00007ffa1555fc26 in ntdll!RtlAcquireSRWLockExclusive () from C:\Windows\SYSTEM32\ntdll.dll

2 0x00007ffa1555e593 in ntdll!RtlAcquireSRWLockExclusive () from C:\Windows\SYSTEM32\ntdll.dll

3 0x00007ffa15556b39 in ntdll!RtlFreeHeap () from C:\Windows\SYSTEM32\ntdll.dll

4 0x00007ff625d0c9c8 in ?? ()

Backtrace stopped: previous frame inner to this frame (corrupt stack?) (gdb)

capiman commented 6 years ago

I cannot deliver the right info at the moment. Tests does not go as wanted. MinGW version takes more memory and much longer time. Half of CNF read and already over 53 Gbytes. I hope I can redo it tomorrow morning.

capiman commented 6 years ago

I hope this help (only 6 threads were used):

Reading symbols from D:\cryptominisat-20180413\src\cryptominisat64fastexitoff.exe...done.
Attaching to program `D:\cryptominisat-20180413\src\cryptominisat64fastexitoff.exe', process 9024
[New Thread 9024.0x2970]
[New Thread 9024.0x1d7c]
(gdb) thread apply all bt

Thread 2 (Thread 9024.0x1d7c):
#0  0x00007ffa155d3c51 in ntdll!DbgBreakPoint () from C:\Windows\SYSTEM32\ntdll.dll
#1  0x00007ffa156001db in ntdll!DbgUiRemoteBreakin () from C:\Windows\SYSTEM32\ntdll.dll
#2  0x00007ffa14071fe4 in KERNEL32!BaseThreadInitThunk () from C:\Windows\System32\kernel32.dll
#3  0x00007ffa1559f061 in ntdll!RtlUserThreadStart () from C:\Windows\SYSTEM32\ntdll.dll
#4  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 1 (Thread 9024.0x2970):
#0  0x00007ffa155607d3 in ntdll!RtlAcquireSRWLockExclusive () from C:\Windows\SYSTEM32\ntdll.dll
#1  0x00007ffa1555fc26 in ntdll!RtlAcquireSRWLockExclusive () from C:\Windows\SYSTEM32\ntdll.dll
#2  0x00007ffa1555e593 in ntdll!RtlAcquireSRWLockExclusive () from C:\Windows\SYSTEM32\ntdll.dll
#3  0x00007ffa15556b39 in ntdll!RtlFreeHeap () from C:\Windows\SYSTEM32\ntdll.dll
#4  0x00007ffa143d984c in msvcrt!free () from C:\Windows\System32\msvcrt.dll
#5  0x0000000000412569 in CMSat::vec<CMSat::Watched>::clear (dealloc=true, this=<optimized out>) at Vec.h:295
#6  CMSat::vec<CMSat::Watched>::~vec (this=<optimized out>, __in_chrg=<optimized out>) at Vec.h:100
#7  CMSat::vec<CMSat::vec<CMSat::Watched> >::clear (this=this@entry=0x59dc200, dealloc=true) at Vec.h:291
#8  0x0000000000412899 in CMSat::vec<CMSat::vec<CMSat::Watched> >::clear (dealloc=true, this=0x59dc200) at cnf.h:83
#9  CMSat::vec<CMSat::vec<CMSat::Watched> >::~vec (this=0x59dc200, __in_chrg=<optimized out>) at Vec.h:100
#10 CMSat::watch_array::~watch_array (this=0x59dc200, __in_chrg=<optimized out>) at watcharray.h:37
#11 CMSat::CNF::~CNF (this=0x59dbd50, __in_chrg=<optimized out>) at cnf.h:83
#12 CMSat::PropEngine::~PropEngine (this=0x59dbd50, __in_chrg=<optimized out>) at propengine.cpp:60
#13 0x0000000000448bbd in CMSat::Solver::~Solver (this=0x59dbd50, __in_chrg=<optimized out>) at solver.cpp:135
#14 0x0000000000408f60 in CMSat::CMSatPrivateData::~CMSatPrivateData (this=0x59d1070, __in_chrg=<optimized out>)
    at cryptominisat.cpp:59
#15 CMSat::SATSolver::~SATSolver (this=<optimized out>, __in_chrg=<optimized out>) at cryptominisat.cpp:136
#16 0x000000000048990b in Main::~Main (this=0x189f510, __in_chrg=<optimized out>) at main.h:54
#17 0x00000000004b6237 in main (argc=<optimized out>, argv=0x1a5f40) at main_exe.cpp:42
(gdb)
capiman commented 6 years ago

Can you give a short status? I saw https://github.com/msoos/cryptominisat/commit/c3752fc980fb2aaa44807ba9f43eada8c9eedb66 which got reverted. You had an idea/clue what it could be? I debugged in same area, and think this is a code which is called very often and could be responsible for the long time at the end. What destructor is called from the removed and reverted code? Just the free itself? Do you agree it is not a stuck calculation of one of the threads? vec::~vec seems to get called twice for each variable and each thread? 130.000 variables, 249 million clauses. Variable sz seems to be around ~ 4000 in almost each call. What is it? Clauses per variable? So perhaps 130.000 variables 2 4000 12 thread = ~1210~9times? (perhaps some factor 2 is too much in my rough calculation) Small tests show that with 1 thread "data[i].~T();" is called 498.643.698 times, with 2 threads it is called double times (exactly 997.287.396).

Another interesting thing: When calling CMS with 1 thread the cleanup goes in 20s. When calling CMS with 2 threads the cleanup takes 30min!!! Why? The above looks like to be linear, but 20s to 30min is 90 times longer. I do the same test for 3 threads, lets see what it needs...

capiman commented 6 years ago

As above: 1 thread 20s (retested 20s) for deleting solver object As above: 2 threads 30min for deleting 2 solver objects 3 threads: 1 hour for deleting 3 solver objects 4 threads: 1 hours 40 minutes for deleting 4 solver objects (698s=11min for deleting first solver, 2135s=35min for deleting second solver, 2322s=38min for third and 836s=14min for the last one)

capiman commented 6 years ago

Where I have added the measurement:

        ~CMSatPrivateData()
        {
            for(Solver* this_s: solvers) {
                const double myTime = cpuTime();
                printf("Call delete solver\n");
                delete this_s;
                const double time_used = cpuTime() - myTime;
                printf("Back from delete solver %g\n", time_used);
            }
            if (must_interrupt_needs_delete) {
                delete must_interrupt;
            }

            delete log; //this will also close the file
            delete shared_data;
        }
msoos commented 6 years ago

Something is off and I have no idea what. I'm sorry. I cannot reproduce this with smaller problems and I only have 16GB of RAM on my personal machine. I will take a peek on a machine with a lot more RAM later, maybe tomorrow -- but I need the problem for that. Can you please upload the offending CNF so I can take a peek at it? Please send the link over personal email so it's not make public. Thanks!

capiman commented 6 years ago

See your personal email

capiman commented 6 years ago

Were you able to reproduce the behaviour on your machine?

msoos commented 6 years ago

I'm sorry, I'm working really hard. I got the ZIP and I got the CNF, I will reproduce it in the weekend. You should see, I am putting in 12+ hours/day. I will get it done though.

capiman commented 6 years ago

No hurry, no pressure, you do a good job! Just wanted to know, if it is a problem, which occurs only on my computer or if you can reproduce the problem also on your side.

msoos commented 6 years ago

Man, so somehow vec<> destruction is slow. It shouldn't be that slow. I can't do this now. If you can fix this yourself that'd be nice. Note that exit(0) or something is a hack and will e.g. not correctly close the file handle on the SQLite database or the DRAT file, leading to serious corruption of these files, meaning incorrect data dumped and incorrect proof generated. That's not acceptable for most people unfortunately :(

cipherboy commented 6 years ago

Mate,

I think this note might explain where some of the vector/vec performance difference comes from: Line 343. std::vector does not deallocate internal elements, whereas CMS's vec implementation does. If CMS is going for clean exit, then it should free vec's elements, otherwise, perhaps a fast exit path (clear_exit?) which frees data (but not its internal elements) and sets it to NULL would fix this issue.

msoos commented 6 years ago

Not true. That deallocates. See https://gcc.gnu.org/onlinedocs/gcc-4.6.2/libstdc++/api/a01048_source.html#l00093

msoos commented 6 years ago

OK, I think I might have fixed this. I think @cipherboy was right... sorry! I think std::vector somehow detects types that don't need to call dealloc. I thought that all compilers do this through their optimisation passes, and I think gcc&clang actually do. But MSVC may not and so it doesn't optimize that loop away. That is crappy. So I fixed it manually. Very brittle and crappy solution, but it works :) Unfortunately, I cannot use vector because that has implications on memory usage -- double memory use, in fact, for problems with few clauses and a large number of variables :(

msoos commented 6 years ago

@capiman Can you please check if it's fixed now? I am curious... Thanks in advance!

msoos commented 5 years ago

I am closing, as this is very old. MSVC may have an issue/bug :( I can't fix MSVC though, so Microsoft will have to do it. I believe TBB (threading building blocks) might fix this on MSVC, though.