google / or-tools

Google's Operations Research tools:
https://developers.google.com/optimization/
Apache License 2.0
11.25k stars 2.13k forks source link

Multithreading: solving different problems with CP-SAT ends in random crash #1958

Closed patr0805 closed 4 years ago

patr0805 commented 4 years ago

What version of OR-tools and what language are you using? Version: 7.5.7466 Language: C#

Which solver are you using (e.g. CP-SAT, Routing Solver, GLOP, BOP, Gurobi) CP-SAT What operating system (Linux, Windows, ...) and version? Windows What did you do? If I run two threads A and B. Lets say A will try to solve problem #1 and B will try to solve problem #2. Lets say the problem remains the same so input model and output is always the same. If I make it so that these threads will try solving these same issues in an endless loop... it will work for a couple of minutes, but then CpSolver.solve(model) suddenly causes my entire application to crash. If I place the solve(model) inside a static lock like this it will work (where cpSolverLock is a static object variable defined in the class):

                var solver = new CpSolver();
                var status = CpSolverStatus.Unknown;
                lock (cpSolverLock)
                {
                    status = solver.Solve(model);
                }

What did you expect to see I expect that an instance of CpSolver is unaffected by another instance of CpSolver running on another thread. This is especially so when nothing I ever provide to CpSolver, CpModel or just ortools classes/methods and so on is accessed by multiple threads at all.

What did you see instead? Crash with the following error: The program '[30320] AutomaticCapacityTester.exe' has exited with code -1073740940 (0xc0000374).

Make sure you include information that can help us debug (full error message, model Proto). Tell me if you need some sample model, but the nature of the problem doesn't feel model specific.

lperron commented 4 years ago

Strange. The C++ solver is totally MT safe. It seems an issue around the swig/interop layer. Laurent Perron | Operations Research | lperron@google.com | (33) 1 42 68 53 00

Le ven. 10 avr. 2020 à 03:50, patr0805 notifications@github.com a écrit :

What version of OR-tools and what language are you using? Version: 7.5.7466 Language: C#

Which solver are you using (e.g. CP-SAT, Routing Solver, GLOP, BOP, Gurobi) CP-SAT What operating system (Linux, Windows, ...) and version? Windows What did you do? If I run two threads A and B. Lets say A will try to solve problem #1 https://github.com/google/or-tools/pull/1 and B will try to solve problem #2 https://github.com/google/or-tools/pull/2. Lets say the problem remains the same so input model and output is always the same. If I make it so that these threads will try solving these same issues in an endless loop... it will work for a couple of minutes, but then CpSolver.solve(model) suddenly causes my entire application to crash. If I place the solve(model) inside a static lock like this it will work (where cpSolverLock is a static object variable defined in the class):

            var solver = new CpSolver();
            var status = CpSolverStatus.Unknown;
            lock (cpSolverLock)
            {
                status = solver.Solve(model);
            }

What did you expect to see I expect that an instance of CpSolver is unaffected by another instance of CpSolver running on another thread. This is especially so when nothing I ever provide to CpSolver, CpModel or just ortools classes/methods and so on is accessed by multiple threads at all.

What did you see instead? Crash with the following error: The program '[30320] AutomaticCapacityTester.exe' has exited with code -1073740940 (0xc0000374).

Make sure you include information that can help us debug (full error message, model Proto). Tell me if you need some sample model, but the nature of the problem doesn't feel model specific.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/google/or-tools/issues/1958, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACUPL3KHRWXOJTDFSL77SELRLZ3PPANCNFSM4MFE3Q2A .

patr0805 commented 4 years ago

I have tried to make a minimal sample, but ran into another problem there that was much more apparent when I added num_search_workers.. it might be related since both seem to be memory issues. I will try to make a sample code that result in 0xc0000374.

I also wonder why simply doubling the size of capacities array makes the solver EXTREMELY slow, but thats a seperate issue I hope can be solved as I otherwise think I have to think away from integer programming for performance reasons.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Threading;
using Google.OrTools.Sat;

namespace MultiThreadingBugReproducer
{
    class Program
    {
        static void Main(string[] args)
        {
            for (var amount = 0; amount < 2; amount++)
            {
                var intialAmount = amount;
                Thread thread0 = new Thread(() => Thread0Task(intialAmount * 3));
                thread0.Start();
            }

            Console.Read();
        }

        private static void Thread0Task(int amount)
        {
            var iteration = 0;
            while (true)
            {
                var model = new CpModel();
                var capacities = new int[] { 223, 123, 236, 9, 12, 5, 2, 7, 2};
                capacities = capacities.Take(capacities.Count()).ToArray();
                var reservationAmount = capacities.Sum() - 8;
                var remainingCapacityExpressions = new List<LinearExpr>();
                var reservationVars = new List<IntVar>();
                for (var index = 0; index < capacities.Length; index++)
                {
                    var resVar = model.NewIntVar(0, int.MaxValue, $"res_{index}");
                    reservationVars.Add(resVar);

                    var remainingCapacityExpression = capacities[index] - resVar;
                    remainingCapacityExpressions.Add(remainingCapacityExpression);
                }

                model.Add(LinearExpr.Sum(reservationVars) == reservationAmount);

                var lowestRemainingCapacityInPairsVars = new List<IntVar>();
                for (var index = 0; index < capacities.Length; index++)
                {
                    for (var index2 = index + 1; index2 < capacities.Length; index2++)
                    {
                        var remainingCapacityPairA = remainingCapacityExpressions[index];
                        var remainingCapacityPairB = remainingCapacityExpressions[index2];

                        var minVar = model.NewIntVar(int.MinValue, int.MaxValue, $"h_{index}_{index2}");

                        // Objective maximization will force all minVar to be as high as allowed, achiving minVar equal minimum of numbers provided.
                        model.Add(minVar <= remainingCapacityPairA);
                        model.Add(minVar <= remainingCapacityPairB);

                        lowestRemainingCapacityInPairsVars.Add(minVar);
                    }
                }

                model.AddDecisionStrategy(lowestRemainingCapacityInPairsVars, DecisionStrategyProto.Types.VariableSelectionStrategy.ChooseHighestMax, DecisionStrategyProto.Types.DomainReductionStrategy.SelectMaxValue);
                model.Maximize(LinearExpr.Sum(lowestRemainingCapacityInPairsVars));

                // Creates a solver and solves the model.
                var solver = new CpSolver();
                solver.StringParameters = "num_search_workers:2";
                var status = solver.Solve(model);
                Console.WriteLine($"Solved: iteration: " + iteration++);
            }
        }
    }
}
CADBIMDeveloper commented 4 years ago

Hi guys, I have the same issue of heap corruption (0xc0000374 error). running the solver in MT. I also tried the code above. It was working good on my machine until I increased the number of threads. Maybe this could be useful for debugging,

Thank you

lperron commented 4 years ago

Are you hitting the memory limit ?

CADBIMDeveloper commented 4 years ago

Yes, but not in this case. It seems in such case I face with SEHException. This is actually my another question. Is there any way to control (or limit) the memory usage of the solver? It is good for me to interrupt the task if it hits some memory threshold.

So about heap corruption. It happens randomly on the same bunch of rather small tasks.

One another interesting thing. I tried to call GC.TryStartNoGCRegion before launching the solvers, It seems that the probability of the crash becomes lower.

lperron commented 4 years ago

Seems related to https://github.com/Azure/azure-service-bus-dotnet/issues/306

CADBIMDeveloper commented 4 years ago

I faced with this issue on Windows 10, .Net framework v 4.7...

patrick-mills commented 4 years ago

Hi Laurent, I am having a similar kind of issue, however in Java this time:

Stack: [0x0000009b19d00000,0x0000009b19e00000], sp=0x0000009b19dfe7e0, free space=1017k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) C [ntdll.dll+0xa5f58] C [ntdll.dll+0x3fb91] C [ucrtbase.dll+0x114cb] C [jniortools.dll+0x49d369]

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code) J 1877 com.google.ortools.sat.mainJNI.SatHelper_solveWithParameters([B[B)[B (0 bytes) @ 0x0000027787c3c536 [0x0000027787c3c4c0+0x0000000000000076] J 1872 c1 com.google.ortools.sat.SatHelper.solveWithParameters(Lcom/google/ortools/sat/CpModelProto;Lcom/google/ortools/sat/SatParameters;)Lcom/google/ortools/sat/CpSolverResponse; (39 bytes) @ 0x0000027780b44884 [0x0000027780b446c0+0x00000000000001c4] J 1858 c1 com.google.ortools.sat.CpSolver.solve(Lcom/google/ortools/sat/CpModel;)Lcom/google/ortools/sat/CpSolverStatus; (26 bytes) @ 0x0000027780b3ede4 [0x0000027780b3ea20+0x00000000000003c4] j test.ThreadingTest.solve(II)V+175 j test.ThreadingTest.access$0(Ltest/ThreadingTest;II)V+3 j test.ThreadingTest$1.run()V+13 j java.lang.Thread.run()V+11 java.base@11.0.7 v ~StubRoutines::call_stub

siginfo: EXCEPTION_ACCESS_VIOLATION (0xc0000005), reading address 0x00000000f4985132

lperron commented 4 years ago

Can you send me the model ? the code ? Laurent Perron | Operations Research | lperron@google.com | (33) 1 42 68 53 00

Le ven. 24 avr. 2020 à 09:09, patrick-mills notifications@github.com a écrit :

Hi Laurent, I am having a similar kind of issue, however in Java this time:

Stack: [0x0000009b19d00000,0x0000009b19e00000], sp=0x0000009b19dfe7e0, free space=1017k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) C [ntdll.dll+0xa5f58] C [ntdll.dll+0x3fb91] C [ucrtbase.dll+0x114cb] C [jniortools.dll+0x49d369]

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code) J 1877 com.google.ortools.sat.mainJNI.SatHelper_solveWithParameters([B[B)[B (0 bytes) @ 0x0000027787c3c536 [0x0000027787c3c4c0+0x0000000000000076] J 1872 c1 com.google.ortools.sat.SatHelper.solveWithParameters(Lcom/google/ortools/sat/CpModelProto;Lcom/google/ortools/sat/SatParameters;)Lcom/google/ortools/sat/CpSolverResponse; (39 bytes) @ 0x0000027780b44884 [0x0000027780b446c0+0x00000000000001c4] J 1858 c1 com.google.ortools.sat.CpSolver.solve(Lcom/google/ortools/sat/CpModel;)Lcom/google/ortools/sat/CpSolverStatus; (26 bytes) @ 0x0000027780b3ede4 [0x0000027780b3ea20+0x00000000000003c4] j test.ThreadingTest.solve(II)V+175 j test.ThreadingTest.access$0(Ltest/ThreadingTest;II)V+3 j test.ThreadingTest$1.run()V+13 j java.lang.Thread.run()V+11 java.base@11.0.7 v ~StubRoutines::call_stub

siginfo: EXCEPTION_ACCESS_VIOLATION (0xc0000005), reading address 0x00000000f4985132

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/google/or-tools/issues/1958#issuecomment-618843468, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACUPL3IU42GRIRQJ3PNPUBTROE3KTANCNFSM4MFE3Q2A .

patrick-mills commented 4 years ago

Below is my JUnit test which reproduces the issue: JDK is 11.07, JUnit 5, OR-Tools 7.5. [My PC spec is Intel Core i9 with 8 cores, 32GB of ram and I am running Windows 10 Home 64-bit]. Hope that helps, Patrick

`package test;

import org.junit.jupiter.api.Test;

import com.google.ortools.sat.CpModel; import com.google.ortools.sat.CpSolver; import com.google.ortools.sat.CpSolverStatus; import com.google.ortools.sat.IntVar;

class ThreadingTest {

int n = 5;
int maxIts = 1000000;
double maxTime = 1.0;

static {
    System.loadLibrary("jniortools");
}

@Test
void testMultiThread() {
    runSolver(8);
}

void runSolver(int nThreads) {
    Object obj = new Object();

    for (int t = 0; t < nThreads; t++) {
        new Thread(new Runnable() {
            @Override
            public void run() {
                solve(8, maxIts);
                synchronized(obj) {
                    obj.notifyAll();
                }
            }

        }).start();
    }
    int running = nThreads;
    while (running > 0) {
        try {
            synchronized(obj) {
                obj.wait();
            }
        } catch (InterruptedException e) {

        }
        running--;
    }
    System.out.println("\nFinished");
}

private void solve(int n, int maxIts) {
    for (int its = 0; its < maxIts; its++) {
        CpModel model = createModel();

        IntVar queens[] = new IntVar[n];
        for (int i = 0; i < n; i++) {
            queens[i] = model.newIntVar(0, n-1, "Queen"+i);

        }
        model.addAllDifferent(queens);
        for (int i = 0; i < n; i++) {
            for (int j = i+1; j < n; j++) {
                model.addDifferentWithOffset(queens[i], queens[j], j - i);
                model.addDifferentWithOffset(queens[i], queens[j], i - j);
            }
        }

        CpSolver solver = new CpSolver();
        solver.getParameters().setMaxTimeInSeconds(maxTime);
        CpSolverStatus status = solver.solve(model);

        if (status == CpSolverStatus.FEASIBLE) {
            System.out.println("Solution found in: "+solver.wallTime()+" seconds.");
        }
        else {
            System.out.println(status);
        }
    }
}

private CpModel createModel() {
    return new CpModel();
}

}`

lperron commented 4 years ago

Thank you all. I rewrote this in C++ and found the problem, and the workaround. By default, CP-SAT catches SIGINT. This handler is not thread safe.

If you run multiple solver in parallel, you need to change this parameter:

lperron commented 4 years ago

Not setting the parameter will work in 7.7.

patrick-mills commented 4 years ago

Thanks very much for your help, Laurent. Best regards, Patrick

On Fri, Apr 24, 2020 at 1:51 PM Laurent Perron notifications@github.com wrote:

Not setting the parameter will work in 7.7.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/google/or-tools/issues/1958#issuecomment-618989362, or unsubscribe https://github.com/notifications/unsubscribe-auth/APKFAHHJ7SYQWU2Z3ZPJXVLROGDMTANCNFSM4MFE3Q2A .