kahypar / mt-kahypar

Mt-KaHyPar (Multi-Threaded Karlsruhe Hypergraph Partitioner) is a shared-memory multilevel graph and hypergraph partitioner equipped with parallel implementations of techniques used in the best sequential partitioning algorithms. Mt-KaHyPar can partition extremely large hypergraphs very fast and with high quality.
MIT License
127 stars 26 forks source link

Best effort partitions? #157

Closed lothran closed 8 months ago

lothran commented 1 year ago

Hey, I am trying to use this as part of a larger application that is somewhat time constraint. The partitioning is usually fast enough but sometimes stalls for hours. Is there a way to set a timeout and just use the best partition found so far? This can even happen on different runs when the input graph is the same, is that normal behavior or am I doing anything wrong? Thanks!

kittobi1992 commented 1 year ago

Hi @lothran,

Can you give us more information on the instance sizes (nodes, edges, and number of pins) and configuration that you use. Do you use the C or Python interface? If you use the default configuration, we rarely see running times up to several hours. This could be a potential bug on your side. Is it possible to share one of the instances with us?

We can introduce a time limit for partitioning that terminates our refinement algorithms. However, this only works if we were already able to compute an initial partition. Otherwise, running times could be still higher than the time limit.

Best

lothran commented 1 year ago

@kittobi1992 I am using the C-interface with preset HIGHEST_QUALITY with imbalance set to 0.3. Instances a generated dynamically on the fly with the goal of finding decent cuts. One very small bad instance had 17V and 13E which basically never halts, so its likely that there is just no valid partition. Is there any way to check this beforehand? For me a timeout based on problem size is an easy fix for now, but its still annoying that it can just randomly never halt for some problems. If this sounds like a bug i can try to log instances while they are generated and send them.

kittobi1992 commented 1 year ago

Hi @lothran,

this sounds like a bug. An instance with 17 nodes and 13 edges should be partitioned instantly even in our highest-quality configuration. To resolve this, it would be helpful if you can send us an instance with a minimal example code how you call the library interface.

lothran commented 1 year ago

@kittobi1992
I will try to construct a minimal example. Though the problem does not occur on individual instances, rather on multiple rapid invocations so its difficult to reproduce since it seems to be synchronization related and happens randomly. Waiting between partitions fixes it mostly. The mt_kahypar_partition call gets stuck on tbb::detail::d1::task_group_base::wait(). Until I find a small enough example; Full stack trace when it gets stuck:

#7  0x00007ffff5012817 in tbb::detail::r1::wait(tbb::detail::d1::wait_context&, tbb::detail::d1::task_group_context&) ()
   from /opt/intel/oneapi/tbb/latest/lib/intel64/gcc4.8/libtbb.so.12
#8  0x00007ffff79cc1de in tbb::detail::d1::task_group_base::wait() [clone .isra.0] ()
   from /home/carad/projects/d4v2/3rdParty/mt-kahypar/build/lib/libmtkahypar.so
#9  0x00007ffff7a00b99 in mt_kahypar::Pool<mt_kahypar::DynamicHypergraphTypeTraits>::bipartition(mt_kahypar::ds::PartitionedHypergraph<mt_kahypar::ds::Dyn
amicHypergraph, mt_kahypar::ds::ConnectivityInfo>&, mt_kahypar::Context const&, bool) ()
   from /home/carad/projects/d4v2/3rdParty/mt-kahypar/build/lib/libmtkahypar.so
#10 0x00007ffff7b612c6 in mt_kahypar::DynamicHypergraphTypeTraits::PartitionedHypergraph mt_kahypar::(anonymous namespace)::multilevel_partitioning<mt_kah
ypar::DynamicHypergraphTypeTraits>(mt_kahypar::DynamicHypergraphTypeTraits::Hypergraph&, mt_kahypar::Context const&, mt_kahypar::TargetGraph const*, bool)
 () from /home/carad/projects/d4v2/3rdParty/mt-kahypar/build/lib/libmtkahypar.so
#11 0x00007ffff7b7334e in mt_kahypar::Multilevel<mt_kahypar::DynamicHypergraphTypeTraits>::partition(mt_kahypar::ds::DynamicHypergraph&, mt_kahypar::Conte
xt const&, mt_kahypar::TargetGraph const*) () from /home/carad/projects/d4v2/3rdParty/mt-kahypar/build/lib/libmtkahypar.so
#12 0x00007ffff7bb6495 in void mt_kahypar::rb::recursive_bipartitioning<mt_kahypar::DynamicHypergraphTypeTraits>(mt_kahypar::DynamicHypergraphTypeTraits::
PartitionedHypergraph&, mt_kahypar::Context const&, int, int, mt_kahypar::OriginalHypergraphInfo const&, std::vector<unsigned char, tbb::detail::d1::scala
ble_allocator<unsigned char> >&) () from /home/carad/projects/d4v2/3rdParty/mt-kahypar/build/lib/libmtkahypar.so
#13 0x00007ffff7bb6f33 in mt_kahypar::RecursiveBipartitioning<mt_kahypar::DynamicHypergraphTypeTraits>::partition(mt_kahypar::ds::PartitionedHypergraph<mt
_kahypar::ds::DynamicHypergraph, mt_kahypar::ds::ConnectivityInfo>&, mt_kahypar::Context const&, mt_kahypar::TargetGraph const*) ()
   from /home/carad/projects/d4v2/3rdParty/mt-kahypar/build/lib/libmtkahypar.so
#14 0x00007ffff7b6126f in mt_kahypar::DynamicHypergraphTypeTraits::PartitionedHypergraph mt_kahypar::(anonymous namespace)::multilevel_partitioning<mt_kah
ypar::DynamicHypergraphTypeTraits>(mt_kahypar::DynamicHypergraphTypeTraits::Hypergraph&, mt_kahypar::Context const&, mt_kahypar::TargetGraph const*, bool)
 () from /home/carad/projects/d4v2/3rdParty/mt-kahypar/build/lib/libmtkahypar.so
#15 0x00007ffff7b7334e in mt_kahypar::Multilevel<mt_kahypar::DynamicHypergraphTypeTraits>::partition(mt_kahypar::ds::DynamicHypergraph&, mt_kahypar::Conte
xt const&, mt_kahypar::TargetGraph const*) () from /home/carad/projects/d4v2/3rdParty/mt-kahypar/build/lib/libmtkahypar.so
#16 0x00007ffff7b305f5 in mt_kahypar::Partitioner<mt_kahypar::DynamicHypergraphTypeTraits>::partition(mt_kahypar::ds::DynamicHypergraph&, mt_kahypar::Cont
--Type <RET> for more, q to quit, c to continue without paging--c
ext&, mt_kahypar::TargetGraph*) () from /home/carad/projects/d4v2/3rdParty/mt-kahypar/build/lib/libmtkahypar.so
#17 0x00007ffff7b56f4c in mt_kahypar_partitioned_hypergraph_t mt_kahypar::internal::partition<mt_kahypar::DynamicHypergraphTypeTraits>(mt_kahypar_hypergra
ph_t, mt_kahypar::Context&, mt_kahypar::TargetGraph*) [clone .isra.0] () from /home/carad/projects/d4v2/3rdParty/mt-kahypar/build/lib/libmtkahypar.so
#18 0x00007ffff6dfc42c in mt_kahypar_partition () from /home/carad/projects/d4v2/3rdParty/mt-kahypar/build/lib/libmtkahypar.so

Full kahypar log:

*******************************************************************************
*                            Partitioning Context                             *
*******************************************************************************
Partitioning Parameters:
  Hypergraph:                         
  Mode:                               direct_kway
  Objective:                          cut
  Gain Policy:                        cut
  Input File Format:                  hMetis
  Instance Type:                      hypergraph
  Preset Type:                        highest_quality
  Partition Type:                     n_level_hypergraph_partitioning
  k:                                  2
  epsilon:                            0.05
  seed:                               0
  Number of V-Cycles:                 0
  Ignore HE Size Threshold:           1000
  Large HE Size Threshold:            100
-------------------------------------------------------------------------------
Preprocessing Parameters:
  Use Community Detection:            true
  Disable C. D. for Mesh Graphs:      true

  Community Detection Parameters:
    Edge Weight Function:                degree
    Maximum Louvain-Pass Iterations:     5
    Minimum Vertex Move Fraction:        0.01
    Vertex Degree Sampling Threshold:    200000
    Number of subrounds (deterministic): 16
-------------------------------------------------------------------------------
Coarsening Parameters:
  Algorithm:                          nlevel_coarsener
  Use Adaptive Edge Size:             true
  Max Allowed Weight Multiplier:      1
  Maximum Allowed Hypernode Weight:   1
  Contraction Limit Multiplier:       160
  Deep ML Contraction Limit Multi.:   4294967295
  Contraction Limit:                  320
  Minimum Shrink Factor:              1.01
  Maximum Shrink Factor:              100
  Vertex Degree Sampling Threshold:   200000
  Number of subrounds (deterministic):16

  Rating Parameters:
    Rating Function:                  heavy_edge
    Heavy Node Penalty:               no_penalty
    Acceptance Policy:                best_prefer_unmatched
-------------------------------------------------------------------------------
Initial Partitioning Parameters:
  Initial Partitioning Mode:          recursive_bipartitioning
  Number of Runs:                     20
  Use Adaptive IP Runs:               true
  Min Adaptive IP Runs:               5
  Perform Refinement On Best:         true
  Fm Refinement Rounds:               2147483647
  Remove Degree-Zero HNs Before IP:   true
  Maximum Iterations of LP IP:        20
  Initial Block Size of LP IP:        5

Initial Partitioning Refinement Parameters:
  Rebalancing Algorithm:              do_nothing
  Refine Until No Improvement:        true
  Relative Improvement Threshold:     0
  Maximum Batch Size:                 1000
  Min Border Vertices Per Thread:     0

  Label Propagation Parameters:
    Algorithm:                        label_propagation
    Maximum Iterations:               5
    Rebalancing:                      true
    HE Size Activation Threshold:     100

  FM Parameters: 
    Algorithm:                        kway_fm
    Multitry Rounds:                  5
    Perform Moves Globally:           false
    Parallel Global Rollbacks:        false
    Rollback Bal. Violation Factor:   1
    Num Seed Nodes:                   5
    Enable Random Shuffle:            true
    Obey Minimal Parallelism:         false
    Minimum Improvement Factor:       -1
    Release Nodes:                    true
    Time Limit Factor:                0.25

  Flow Parameters: 
    Algorithm:                        do_nothing

-------------------------------------------------------------------------------
Refinement Parameters:
  Rebalancing Algorithm:              simple_rebalancer
  Refine Until No Improvement:        true
  Relative Improvement Threshold:     0.0025
  Maximum Batch Size:                 1000
  Min Border Vertices Per Thread:     50

  Label Propagation Parameters:
    Algorithm:                        label_propagation
    Maximum Iterations:               5
    Rebalancing:                      true
    HE Size Activation Threshold:     100

  FM Parameters: 
    Algorithm:                        kway_fm
    Multitry Rounds:                  10
    Perform Moves Globally:           false
    Parallel Global Rollbacks:        false
    Rollback Bal. Violation Factor:   1.25
    Num Seed Nodes:                   5
    Enable Random Shuffle:            true
    Obey Minimal Parallelism:         false
    Minimum Improvement Factor:       -1
    Release Nodes:                    true
    Time Limit Factor:                0.25

  Boundary FM Parameters: 
    Refine Until No Improvement:      true
    Num Seed Nodes:                   5
    Obey Minimal Parallelism:         true

  Flow Parameters: 
    Algorithm:                        flow_cutter
    Flow Scaling:                     16
    Maximum Number of Pins:           4294967295
    Find Most Balanced Cut:           true
    Determine Distance From Cut:      true
    Parallel Searches Multiplier:     1
    Number of Parallel Searches:      1
    Maximum BFS Distance:             2
    Min Rel. Improvement Per Round:   0.001
    Time Limit Factor:                8
    Skip Small Cuts:                  true
    Skip Unpromising Blocks:          true
    Pierce in Bulk:                   true
    Steiner Tree Policy:              lower_bound
-------------------------------------------------------------------------------
Shared Memory Parameters:             
  Number of Threads:                  16
  Number of used NUMA nodes:          2
  Use Localized Random Shuffle:       false
  Random Shuffle Block Size:          2
------------------------------------------------------------------------------- 

******************************************************************************** 
*                                    Input                                     * 
******************************************************************************** 
Hypergraph Information 
Name :  
# HNs : 72 # HEs : 94 # pins: 185 # graph edges: 27 
HE size             HE weight           HN degree           HN weight 
| min= 1            | min= 1            | min= 1            | min= 1           
| Q1 = 1            | Q1 = 4050         | Q1 = 2            | Q1 = 1           
| med= 1            | med= 4050         | med= 2            | med= 1           
| Q3 = 2            | Q3 = 4050         | Q3 = 3            | Q3 = 1           
| max= 7            | max= 4050         | max= 7            | max= 1           
| avg= 1.96809      | avg= 3274.66      | avg= 2.56944      | avg= 1           
| sd = 1.53421      | sd = 1601.72      | sd = 1.31969      | sd = 0           

******************************************************************************** 
*                              Preprocessing...                                * 
******************************************************************************** 
# Communities : 24 
# HNs Per Community # Internal Pins     Internal Degree Sum 
| min= 1            | min= 1            | min= 1           
| Q1 = 1            | Q1 = 3            | Q1 = 3           
| med= 2            | med= 4            | med= 4           
| Q3 = 4            | Q3 = 11           | Q3 = 11          
| max= 11           | max= 27           | max= 27          
| avg= 3            | avg= 7.70833      | avg= 7.70833     
| sd = 2.51949      | sd = 6.27625      | sd = 6.27625     
******************************************************************************** 
*                                Coarsening...                                 * 
******************************************************************************** 
Hypergraph Information 
Name : Coarsened Hypergraph 
# HNs : 72 # HEs : 94 # pins: 185 # graph edges: 27 
HE size             HE weight           HN degree           HN weight 
| min= 1            | min= 1            | min= 1            | min= 1           
| Q1 = 1            | Q1 = 4050         | Q1 = 2            | Q1 = 1           
| med= 1            | med= 4050         | med= 2            | med= 1           
| Q3 = 2            | Q3 = 4050         | Q3 = 3            | Q3 = 1           
| max= 7            | max= 4050         | max= 7            | max= 1           
| avg= 1.96809      | avg= 3274.66      | avg= 2.56944      | avg= 1           
| sd = 1.53421      | sd = 1601.72      | sd = 1.31969      | sd = 0           

******************************************************************************** 
*                           Initial Partitioning...                            * 
********************************************************************************

Thanks!

lothran commented 1 year ago

Ok, I did some more digging turns out not using the new oneTBB fixes it (-DKAHYPAR_DOWNLOAD_TBB=On).

kittobi1992 commented 1 year ago

Hi @lothran,

thanks for the information. I will try to reproduce the bug today. I will give you an update if I come up with a fix for it.

kittobi1992 commented 1 year ago

Hi @lothran,

I was not able to reproduce your bug yet. However, I found a bug in our library interface that causes undefined behaviour on our side. I am confident that this also fixes the deadlock in your application. If you find time you can checkout the latest state on master and test if this fixes your issue.