google / or-tools

Google's Operations Research tools:
https://developers.google.com/optimization/
Apache License 2.0
10.77k stars 2.09k forks source link

Segmentation Fault If binary_search_num_conflicts Is Used #4006

Open jawbroken opened 7 months ago

jawbroken commented 7 months ago

What version of OR-Tools and what language are you using? Version: v9.8.3296 Language: Python

Which solver are you using (e.g. CP-SAT, Routing Solver, GLOP, BOP, Gurobi)

CP-SAT

What operating system (Linux, Windows, ...) and version?

macOS Sonoma 14.1.1

What did you do?

Setting the binary_search_num_conflicts to a non-negative value (e.g. 1) causes random segmentation faults in CP-SAT for me. I can reproduce it using the Weighted Latency Problem example code, but you need to set the problem size large enough (e.g. num_nodes = 100, grid_size = 100), remove max_time_in_seconds: 5 from the solver parameters, and add binary_search_num_conflicts: 1. It might take some time to happen (e.g. in my case it took about 110 seconds), and the stack trace is always as below.

What did you expect to see

No segmentation fault.

What did you see instead?

-------------------------------------
Translated Report (Full Report Below)
-------------------------------------

Process:               Python [32966]
Path:                  /opt/homebrew/*/Python.framework/Versions/3.11/Resources/Python.app/Contents/MacOS/Python
Identifier:            org.python.python
Version:               3.11.6 (3.11.6)
Code Type:             ARM-64 (Native)
Parent Process:        zsh [31090]
Responsible:           Electron [30943]
User ID:               501

Date/Time:             2023-11-25 22:18:42.9063 +1100
OS Version:            macOS 14.1.1 (23B81)
Report Version:        12
Anonymous UUID:        4B83107E-5C12-FA78-4D6B-7147F0F91C46

Time Awake Since Boot: 160000 seconds

System Integrity Protection: enabled

Crashed Thread:        14

Exception Type:        EXC_BAD_ACCESS (SIGSEGV)
Exception Codes:       KERN_INVALID_ADDRESS at 0x000000000000000b
Exception Codes:       0x0000000000000001, 0x000000000000000b

Termination Reason:    Namespace SIGNAL, Code 11 Segmentation fault: 11
Terminating Process:   Python [32966]

VM Region Info: 0xb is not in any region.  Bytes before following region: 68719476725
      REGION TYPE                    START - END         [ VSIZE] PRT/MAX SHRMOD  REGION DETAIL
      UNUSED SPACE AT START
--->  
      commpage (reserved)        1000000000-7000000000   [384.0G] ---/--- SM=NUL  ...(unallocated)

Thread 14 Crashed:
0   libsystem_kernel.dylib                 0x1886f911c __pthread_kill + 8
1   libsystem_pthread.dylib                0x188730cc0 pthread_kill + 288
2   libsystem_c.dylib                      0x188609540 raise + 32
3   Python                                 0x10121da00 faulthandler_fatal_error + 448
4   libsystem_platform.dylib               0x18875fa24 _sigtramp + 56
5   libortools.9.dylib                     0x13669c3d4 operations_research::sat::IntegerEncoder::AddImplications(absl::lts_20230802::btree_map<operations_research::StrongInt64<operations_research::sat::IntegerValue_integer_tag_>, operations_research::sat::Literal, std::__1::less<operations_research::StrongInt64<operations_research::sat::IntegerValue_integer_tag_>>, std::__1::allocator<std::__1::pair<operations_research::StrongInt64<operations_research::sat::IntegerValue_integer_tag_> const, operations_research::sat::Literal>>> const&, absl::lts_20230802::container_internal::btree_iterator<absl::lts_20230802::container_internal::btree_node<absl::lts_20230802::container_internal::map_params<operations_research::StrongInt64<operations_research::sat::IntegerValue_integer_tag_>, operations_research::sat::Literal, std::__1::less<operations_research::StrongInt64<operations_research::sat::IntegerValue_integer_tag_>>, std::__1::allocator<std::__1::pair<operations_research::StrongInt64<operations_research::sat::IntegerValue_integer_tag_> const, operations_research::sat::Literal>>, 256, false>> const, std::__1::pair<operations_research::StrongInt64<operations_research::sat::IntegerValue_integer_tag_> const, operations_research::sat::Literal> const&, std::__1::pair<operations_research::StrongInt64<operations_research::sat::IntegerValue_integer_tag_> const, operations_research::sat::Literal> const*>, operations_research::sat::Literal) + 224
6   libortools.9.dylib                     0x13669d4e8 operations_research::sat::IntegerEncoder::AssociateToIntegerLiteral(operations_research::sat::Literal, operations_research::sat::IntegerLiteral) + 688
7   libortools.9.dylib                     0x13669cfec operations_research::sat::IntegerEncoder::GetOrCreateAssociatedLiteral(operations_research::sat::IntegerLiteral) + 644
8   libortools.9.dylib                     0x136726568 operations_research::sat::RestrictObjectiveDomainWithBinarySearch(operations_research::StrongIndex<operations_research::sat::IntegerVariable_index_tag_>, std::__1::function<void ()> const&, operations_research::sat::Model*) + 448
9   libortools.9.dylib                     0x1365f2540 operations_research::sat::(anonymous namespace)::SolveLoadedCpModel(operations_research::sat::CpModelProto const&, operations_research::sat::Model*) + 1392
10  libortools.9.dylib                     0x1366027fc operations_research::sat::(anonymous namespace)::LnsSolver::GenerateTask(long long)::'lambda'()::operator()() const + 2728
11  libortools.9.dylib                     0x1367b8578 std::__1::__function::__func<operations_research::sat::NonDeterministicLoop(std::__1::vector<std::__1::unique_ptr<operations_research::sat::SubSolver, std::__1::default_delete<operations_research::sat::SubSolver>>, std::__1::allocator<std::__1::unique_ptr<operations_research::sat::SubSolver, std::__1::default_delete<operations_research::sat::SubSolver>>>>&, int)::$_2, std::__1::allocator<operations_research::sat::NonDeterministicLoop(std::__1::vector<std::__1::unique_ptr<operations_research::sat::SubSolver, std::__1::default_delete<operations_research::sat::SubSolver>>, std::__1::allocator<std::__1::unique_ptr<operations_research::sat::SubSolver, std::__1::default_delete<operations_research::sat::SubSolver>>>>&, int)::$_2>, void ()>::operator()() + 52
12  libortools.9.dylib                     0x1360b39e0 operations_research::RunWorker(void*) + 112
13  libortools.9.dylib                     0x1360b42a4 void* std::__1::__thread_proxy[abi:v15006]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (*)(void*), operations_research::ThreadPool*>>(void*) + 52
14  libsystem_pthread.dylib                0x188731034 _pthread_start + 136
15  libsystem_pthread.dylib                0x18872be3c thread_start + 8

Anything else we should know about your project / environment

I don't think I can reproduce it with OR-Tools v9.7.2996 (though it's hard to tell because the crash isn't deterministic).

lperron commented 7 months ago

Thanks. We will investigate. I can reproduce the crash.

Note that you should just not use this flag. There are better techniques for the same effect.

jawbroken commented 7 months ago

Thanks for looking at it. I mainly reported it because it seemed from the stack trace like it might be a deeper issue that was just more likely to trigger with this flag.

I was just playing around with the parameter because it seemed to help improve lower bounds faster on my problem, but I haven't done any rigorous testing. What are the better techniques?

lperron commented 7 months ago

With more workers, objective_lb_search and objective_shaving_search are very good. Laurent Perron | Operations Research | @.*** | (33) 1 42 68 53 00

Le lun. 27 nov. 2023 à 13:15, jawbroken @.***> a écrit :

Thanks for looking at it. I mainly reported it because it seemed from the stack trace like it might be a deeper issue that was just more likely to trigger with this flag.

I was just playing around with the parameter because it seemed to help improve lower bounds faster on my problem, but I haven't done any rigorous testing. What are the better techniques?

— Reply to this email directly, view it on GitHub https://github.com/google/or-tools/issues/4006#issuecomment-1827719332, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACUPL3MSHK7LZE54FF3PLG3YGR75JAVCNFSM6AAAAAA72BCLRWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMRXG4YTSMZTGI . You are receiving this because you commented.Message ID: @.***>

jawbroken commented 7 months ago

Great, thanks. I only have 12 cores and didn't appreciate that increasing the number of workers (or just customising the subsolvers) might be a big improvement despite the increased contention. No wonder binary_search_num_conflicts seemed helpful.

For reference, if anyone else stumbles across this, I get lb_tree_search at 13 workers, probing at 14, objective_lb_search at 15, objective_shaving_search_no_lp at 17, objective_shaving_search_max_lp at 19, probing_max_lp at 21, objective_lb_search_no_lp at 23 and objective_lb_search_max_lp at 25, in OR-Tools v9.8.3296. I'll look into perhaps customising the list of subsolvers to the ones that seem most useful for my problem.