Closed lothran closed 8 months ago
Hi @lothran,
Can you give us more information on the instance sizes (nodes, edges, and number of pins) and configuration that you use. Do you use the C or Python interface? If you use the default configuration, we rarely see running times up to several hours. This could be a potential bug on your side. Is it possible to share one of the instances with us?
We can introduce a time limit for partitioning that terminates our refinement algorithms. However, this only works if we were already able to compute an initial partition. Otherwise, running times could be still higher than the time limit.
Best
@kittobi1992
I am using the C-interface with preset HIGHEST_QUALITY
with imbalance set to 0.3. Instances a generated dynamically on the fly with the goal of finding decent cuts.
One very small bad instance had 17V and 13E which basically never halts,
so its likely that there is just no valid partition. Is there any way to check this beforehand?
For me a timeout based on problem size is an easy fix for now,
but its still annoying that it can just randomly never halt for some problems.
If this sounds like a bug i can try to log instances while they are generated and send them.
Hi @lothran,
this sounds like a bug. An instance with 17 nodes and 13 edges should be partitioned instantly even in our highest-quality configuration. To resolve this, it would be helpful if you can send us an instance with a minimal example code how you call the library interface.
@kittobi1992
I will try to construct a minimal example.
Though the problem does not occur on individual instances, rather on multiple rapid invocations so its difficult to reproduce since it seems to be synchronization related and happens randomly.
Waiting between partitions fixes it mostly. The mt_kahypar_partition call gets stuck on tbb::detail::d1::task_group_base::wait().
Until I find a small enough example;
Full stack trace when it gets stuck:
#7 0x00007ffff5012817 in tbb::detail::r1::wait(tbb::detail::d1::wait_context&, tbb::detail::d1::task_group_context&) ()
from /opt/intel/oneapi/tbb/latest/lib/intel64/gcc4.8/libtbb.so.12
#8 0x00007ffff79cc1de in tbb::detail::d1::task_group_base::wait() [clone .isra.0] ()
from /home/carad/projects/d4v2/3rdParty/mt-kahypar/build/lib/libmtkahypar.so
#9 0x00007ffff7a00b99 in mt_kahypar::Pool<mt_kahypar::DynamicHypergraphTypeTraits>::bipartition(mt_kahypar::ds::PartitionedHypergraph<mt_kahypar::ds::Dyn
amicHypergraph, mt_kahypar::ds::ConnectivityInfo>&, mt_kahypar::Context const&, bool) ()
from /home/carad/projects/d4v2/3rdParty/mt-kahypar/build/lib/libmtkahypar.so
#10 0x00007ffff7b612c6 in mt_kahypar::DynamicHypergraphTypeTraits::PartitionedHypergraph mt_kahypar::(anonymous namespace)::multilevel_partitioning<mt_kah
ypar::DynamicHypergraphTypeTraits>(mt_kahypar::DynamicHypergraphTypeTraits::Hypergraph&, mt_kahypar::Context const&, mt_kahypar::TargetGraph const*, bool)
() from /home/carad/projects/d4v2/3rdParty/mt-kahypar/build/lib/libmtkahypar.so
#11 0x00007ffff7b7334e in mt_kahypar::Multilevel<mt_kahypar::DynamicHypergraphTypeTraits>::partition(mt_kahypar::ds::DynamicHypergraph&, mt_kahypar::Conte
xt const&, mt_kahypar::TargetGraph const*) () from /home/carad/projects/d4v2/3rdParty/mt-kahypar/build/lib/libmtkahypar.so
#12 0x00007ffff7bb6495 in void mt_kahypar::rb::recursive_bipartitioning<mt_kahypar::DynamicHypergraphTypeTraits>(mt_kahypar::DynamicHypergraphTypeTraits::
PartitionedHypergraph&, mt_kahypar::Context const&, int, int, mt_kahypar::OriginalHypergraphInfo const&, std::vector<unsigned char, tbb::detail::d1::scala
ble_allocator<unsigned char> >&) () from /home/carad/projects/d4v2/3rdParty/mt-kahypar/build/lib/libmtkahypar.so
#13 0x00007ffff7bb6f33 in mt_kahypar::RecursiveBipartitioning<mt_kahypar::DynamicHypergraphTypeTraits>::partition(mt_kahypar::ds::PartitionedHypergraph<mt
_kahypar::ds::DynamicHypergraph, mt_kahypar::ds::ConnectivityInfo>&, mt_kahypar::Context const&, mt_kahypar::TargetGraph const*) ()
from /home/carad/projects/d4v2/3rdParty/mt-kahypar/build/lib/libmtkahypar.so
#14 0x00007ffff7b6126f in mt_kahypar::DynamicHypergraphTypeTraits::PartitionedHypergraph mt_kahypar::(anonymous namespace)::multilevel_partitioning<mt_kah
ypar::DynamicHypergraphTypeTraits>(mt_kahypar::DynamicHypergraphTypeTraits::Hypergraph&, mt_kahypar::Context const&, mt_kahypar::TargetGraph const*, bool)
() from /home/carad/projects/d4v2/3rdParty/mt-kahypar/build/lib/libmtkahypar.so
#15 0x00007ffff7b7334e in mt_kahypar::Multilevel<mt_kahypar::DynamicHypergraphTypeTraits>::partition(mt_kahypar::ds::DynamicHypergraph&, mt_kahypar::Conte
xt const&, mt_kahypar::TargetGraph const*) () from /home/carad/projects/d4v2/3rdParty/mt-kahypar/build/lib/libmtkahypar.so
#16 0x00007ffff7b305f5 in mt_kahypar::Partitioner<mt_kahypar::DynamicHypergraphTypeTraits>::partition(mt_kahypar::ds::DynamicHypergraph&, mt_kahypar::Cont
--Type <RET> for more, q to quit, c to continue without paging--c
ext&, mt_kahypar::TargetGraph*) () from /home/carad/projects/d4v2/3rdParty/mt-kahypar/build/lib/libmtkahypar.so
#17 0x00007ffff7b56f4c in mt_kahypar_partitioned_hypergraph_t mt_kahypar::internal::partition<mt_kahypar::DynamicHypergraphTypeTraits>(mt_kahypar_hypergra
ph_t, mt_kahypar::Context&, mt_kahypar::TargetGraph*) [clone .isra.0] () from /home/carad/projects/d4v2/3rdParty/mt-kahypar/build/lib/libmtkahypar.so
#18 0x00007ffff6dfc42c in mt_kahypar_partition () from /home/carad/projects/d4v2/3rdParty/mt-kahypar/build/lib/libmtkahypar.so
Full kahypar log:
*******************************************************************************
* Partitioning Context *
*******************************************************************************
Partitioning Parameters:
Hypergraph:
Mode: direct_kway
Objective: cut
Gain Policy: cut
Input File Format: hMetis
Instance Type: hypergraph
Preset Type: highest_quality
Partition Type: n_level_hypergraph_partitioning
k: 2
epsilon: 0.05
seed: 0
Number of V-Cycles: 0
Ignore HE Size Threshold: 1000
Large HE Size Threshold: 100
-------------------------------------------------------------------------------
Preprocessing Parameters:
Use Community Detection: true
Disable C. D. for Mesh Graphs: true
Community Detection Parameters:
Edge Weight Function: degree
Maximum Louvain-Pass Iterations: 5
Minimum Vertex Move Fraction: 0.01
Vertex Degree Sampling Threshold: 200000
Number of subrounds (deterministic): 16
-------------------------------------------------------------------------------
Coarsening Parameters:
Algorithm: nlevel_coarsener
Use Adaptive Edge Size: true
Max Allowed Weight Multiplier: 1
Maximum Allowed Hypernode Weight: 1
Contraction Limit Multiplier: 160
Deep ML Contraction Limit Multi.: 4294967295
Contraction Limit: 320
Minimum Shrink Factor: 1.01
Maximum Shrink Factor: 100
Vertex Degree Sampling Threshold: 200000
Number of subrounds (deterministic):16
Rating Parameters:
Rating Function: heavy_edge
Heavy Node Penalty: no_penalty
Acceptance Policy: best_prefer_unmatched
-------------------------------------------------------------------------------
Initial Partitioning Parameters:
Initial Partitioning Mode: recursive_bipartitioning
Number of Runs: 20
Use Adaptive IP Runs: true
Min Adaptive IP Runs: 5
Perform Refinement On Best: true
Fm Refinement Rounds: 2147483647
Remove Degree-Zero HNs Before IP: true
Maximum Iterations of LP IP: 20
Initial Block Size of LP IP: 5
Initial Partitioning Refinement Parameters:
Rebalancing Algorithm: do_nothing
Refine Until No Improvement: true
Relative Improvement Threshold: 0
Maximum Batch Size: 1000
Min Border Vertices Per Thread: 0
Label Propagation Parameters:
Algorithm: label_propagation
Maximum Iterations: 5
Rebalancing: true
HE Size Activation Threshold: 100
FM Parameters:
Algorithm: kway_fm
Multitry Rounds: 5
Perform Moves Globally: false
Parallel Global Rollbacks: false
Rollback Bal. Violation Factor: 1
Num Seed Nodes: 5
Enable Random Shuffle: true
Obey Minimal Parallelism: false
Minimum Improvement Factor: -1
Release Nodes: true
Time Limit Factor: 0.25
Flow Parameters:
Algorithm: do_nothing
-------------------------------------------------------------------------------
Refinement Parameters:
Rebalancing Algorithm: simple_rebalancer
Refine Until No Improvement: true
Relative Improvement Threshold: 0.0025
Maximum Batch Size: 1000
Min Border Vertices Per Thread: 50
Label Propagation Parameters:
Algorithm: label_propagation
Maximum Iterations: 5
Rebalancing: true
HE Size Activation Threshold: 100
FM Parameters:
Algorithm: kway_fm
Multitry Rounds: 10
Perform Moves Globally: false
Parallel Global Rollbacks: false
Rollback Bal. Violation Factor: 1.25
Num Seed Nodes: 5
Enable Random Shuffle: true
Obey Minimal Parallelism: false
Minimum Improvement Factor: -1
Release Nodes: true
Time Limit Factor: 0.25
Boundary FM Parameters:
Refine Until No Improvement: true
Num Seed Nodes: 5
Obey Minimal Parallelism: true
Flow Parameters:
Algorithm: flow_cutter
Flow Scaling: 16
Maximum Number of Pins: 4294967295
Find Most Balanced Cut: true
Determine Distance From Cut: true
Parallel Searches Multiplier: 1
Number of Parallel Searches: 1
Maximum BFS Distance: 2
Min Rel. Improvement Per Round: 0.001
Time Limit Factor: 8
Skip Small Cuts: true
Skip Unpromising Blocks: true
Pierce in Bulk: true
Steiner Tree Policy: lower_bound
-------------------------------------------------------------------------------
Shared Memory Parameters:
Number of Threads: 16
Number of used NUMA nodes: 2
Use Localized Random Shuffle: false
Random Shuffle Block Size: 2
-------------------------------------------------------------------------------
********************************************************************************
* Input *
********************************************************************************
Hypergraph Information
Name :
# HNs : 72 # HEs : 94 # pins: 185 # graph edges: 27
HE size HE weight HN degree HN weight
| min= 1 | min= 1 | min= 1 | min= 1
| Q1 = 1 | Q1 = 4050 | Q1 = 2 | Q1 = 1
| med= 1 | med= 4050 | med= 2 | med= 1
| Q3 = 2 | Q3 = 4050 | Q3 = 3 | Q3 = 1
| max= 7 | max= 4050 | max= 7 | max= 1
| avg= 1.96809 | avg= 3274.66 | avg= 2.56944 | avg= 1
| sd = 1.53421 | sd = 1601.72 | sd = 1.31969 | sd = 0
********************************************************************************
* Preprocessing... *
********************************************************************************
# Communities : 24
# HNs Per Community # Internal Pins Internal Degree Sum
| min= 1 | min= 1 | min= 1
| Q1 = 1 | Q1 = 3 | Q1 = 3
| med= 2 | med= 4 | med= 4
| Q3 = 4 | Q3 = 11 | Q3 = 11
| max= 11 | max= 27 | max= 27
| avg= 3 | avg= 7.70833 | avg= 7.70833
| sd = 2.51949 | sd = 6.27625 | sd = 6.27625
********************************************************************************
* Coarsening... *
********************************************************************************
Hypergraph Information
Name : Coarsened Hypergraph
# HNs : 72 # HEs : 94 # pins: 185 # graph edges: 27
HE size HE weight HN degree HN weight
| min= 1 | min= 1 | min= 1 | min= 1
| Q1 = 1 | Q1 = 4050 | Q1 = 2 | Q1 = 1
| med= 1 | med= 4050 | med= 2 | med= 1
| Q3 = 2 | Q3 = 4050 | Q3 = 3 | Q3 = 1
| max= 7 | max= 4050 | max= 7 | max= 1
| avg= 1.96809 | avg= 3274.66 | avg= 2.56944 | avg= 1
| sd = 1.53421 | sd = 1601.72 | sd = 1.31969 | sd = 0
********************************************************************************
* Initial Partitioning... *
********************************************************************************
Thanks!
Ok, I did some more digging turns out not using the new oneTBB fixes it (-DKAHYPAR_DOWNLOAD_TBB=On).
Hi @lothran,
thanks for the information. I will try to reproduce the bug today. I will give you an update if I come up with a fix for it.
Hi @lothran,
I was not able to reproduce your bug yet. However, I found a bug in our library interface that causes undefined behaviour on our side. I am confident that this also fixes the deadlock in your application. If you find time you can checkout the latest state on master and test if this fixes your issue.
Hey, I am trying to use this as part of a larger application that is somewhat time constraint. The partitioning is usually fast enough but sometimes stalls for hours. Is there a way to set a timeout and just use the best partition found so far? This can even happen on different runs when the input graph is the same, is that normal behavior or am I doing anything wrong? Thanks!