mossblaser commented 8 years ago

Routing algorithm performance improvement

While performing various experiments I've been bitten by the routing process taking a ridiculous amount of time while performing large experiments. This PR is an effort to reduce this runtime...

Headline outcomes:

Routing is now 2-3x faster
The returned RoutingTrees now take up half as much memory as before.

Implemented:

[x] Speed up common-case where no links are broken (2f99746: reduces runtime to 2/3 of original)
[x] Optimise concentric_hexagons: consider non-generator, memoize concentric hexagon list (d8c05a8: very minor speedup)
[x] Optimise longest_dimension_first: just return a list (38c0662: not noticable speedup but cleans up code a little)
[x] Optimise RoutingTree memory usage (2c48ad6: halves memory consumption and yeilds minor speedup)
[x] Optimise neighbourhood exploration (294debd: reduces runtime to 3/5ths of original when routing long-distance nets)

Also under consideration for the future, probably not this PR:

Redefine RoutingTree to reduce memory use further: would be a quite breaking change but should mostly only impact routers
Build C implementation of NER kernel
Optimise shortest_torus_paths: low-hanging fruit but low-impact
Benchmark

To benchmark the performance of the router, a uniform randomly-connected network is created and routed by the script below. This benchmark reflects a 'bad' netlist in which most routes are quite long. This is currently something the router deals with very badly in terms of performance.

import random
import time

from math import sqrt, ceil

from rig.place_and_route import allocate, route, Cores, Machine
from rig.place_and_route.place.rand import place
from rig.netlist import Net

# Generate a random network
num_vertices = 10000
nets_per_vertex = 2
fan_out = 4
vertices = [object() for _ in range(num_vertices)]
nets = [Net(v, random.sample(vertices, fan_out))
        for v in vertices
        for _ in range(nets_per_vertex)]

# Generate P&R data-structures
vertices_resources = {v: {Cores: 18} for v in vertices}
w = h = int(ceil(sqrt(num_vertices)))
machine = Machine(w, h)

# Perform a few random placement runs and time the routing time
for _ in range(5):
    placements = place(vertices_resources, nets, machine, [])
    allocations = allocate(vertices_resources, nets, machine, [], placements)
    before = time.time()
    routes = route(vertices_resources, nets, machine, [], placements, allocations)
    print("Routed in {} seconds".format(time.time() - before))

On my laptop, using Python 3 and the existing implementation (2f0018a), the results are as follows:

Routed in 85.41314220428467 seconds
Routed in 87.82996225357056 seconds
Routed in 89.84850525856018 seconds
Routed in 88.47845983505249 seconds
Routed in 88.14478492736816 seconds

Tweak 1: Optimise common-case where no bad links are encountered

Running under cProfile the router was spending around 1/3rd of its time in the avoid_dead_links method which attempts to modify initially generated routes such that they route around any dead links. This function is rather naively implemented on the assumption that it will usually find a dead link and generates a working-copy of the routing tree as it checks it. Since most routes are not affected by dead links, a new function route_has_dead_links is used to quickly check whether dead links are actually present in a routing tree before trying to fix it.

After making this change (2f99746), the results are as follows:

Routed in 62.89778780937195 seconds
Routed in 66.00230073928833 seconds
Routed in 69.11281132698059 seconds
Routed in 63.75988221168518 seconds
Routed in 68.9121150970459 seconds

Tweak 2: Cache concentric hexagons sequence

The concentric_hexagons function is called once per destination. As a very simple generator the overhead of the generator is measurable. By memoizing this generator's output this overhead can be saved. A simple change resulting in a minor but measurable performance improvement.

After making this change (d8c05a8), the results are as follows:

Routed in 55.96174383163452 seconds
Routed in 65.61859607696533 seconds
Routed in 57.02670741081238 seconds
Routed in 55.88467574119568 seconds
Routed in 65.95263361930847 seconds

Tweak 3: Make `longest_dimension_first` just return a list

Avoid the need for a generator in a lot loop.

After making this change (38c0662), the results are as follows:

Routed in 55.16030263900757 seconds
Routed in 64.66953659057617 seconds
Routed in 55.68718242645264 seconds
Routed in 55.36920404434204 seconds
Routed in 58.20105767250061 seconds

So... probably not significant... That said since the generator's output was always converted to a list anyway this change is beneficial from a code-clarity point of view at least...

Tweak 4: Optimise `RoutingTree` memory usage

Before diving in and making a whole new complex data type for RoutingTrees that optimises out strings of singleton nodes, I've attempted to shrink the structure which already exists.

Approximate peak memory usage numbers on the benchmark (as measured using memory_profiler) for each of the changes made are enumerated below:

Original memory consumption: 1470 MB
Use a list rather than a set to store children of a RoutingTree node: 1150 MB
Use __slots__: 880 MB
Store chip coordinates as two numbers, not a tuple: 740 MB

After making this change (2c48ad6), the runtime also improves slightly as follows:

Routed in 51.36526560783386 seconds
Routed in 53.47503447532654 seconds
Routed in 51.021536111831665 seconds
Routed in 53.68679237365723 seconds
Routed in 50.90719032287598 seconds

Tweak 5: Optimise neighbourhood exploration

When generating routing trees, the NER algorithm searches for nearby vertices (or pieces of route) to route to rather than always routing back to the source of the net. Originally this was achieved by searching in a spiral pattern outward from the destination vertex looking for pieces of route to attach to. With the default search radius of 20 this requires 1261 checks, all of which are carried out when the destination is more than 20 hops from another part of the route.

A new, alternative implementation instead iterates over every known route segment to find out if any are within the specified search radius. In practice this often results in substantially fewer checks being required. To reap the benefits of both approaches, a simple heuristic automatically selects the approach to use on a route-by-route basis.

A substantial comment in the implementation (near the top of ner_net() in ner.py) explains this change in greater detail.

After making this change (294debd), the results are as follows:

Routed in 33.20937418937683 seconds
Routed in 33.3547580242157 seconds
Routed in 30.7742702960968 seconds
Routed in 34.18379735946655 seconds
Routed in 30.957439184188843 seconds

Regression check for the common case

As a regression check, the following alternative benchmark models a more well-behaved application in which most routes are very short.

import time

from math import sqrt, ceil

from rig.place_and_route import allocate, route, Cores, Machine
from rig.place_and_route.place.rand import place
from rig.netlist import Net

# Generate a network with regular nearest-neighbour torus connectivity
num_vertices = 10000
w = h = int(ceil(sqrt(num_vertices)))
vertices = [(x, y) for x in range(w) for y in range(h)]
nets = [Net((x, y), [((x+dx)%w, (y+dy)%h) for dx, dy in offsets])
        for x, y in vertices
        for offsets in ([(3, 0), (0, 3), (-3, 0), (0, -3)],
                        [(2, 2), (-2, 2), (-2, -2), (2, -2)])]

# Generate P&R data-structures and manually place the network
vertices_resources = {v: {Cores: 18} for v in vertices}
placements = {v: v for v in vertices}
machine = Machine(w, h)

# Perform a few runs and time the routing time
for _ in range(5):
    allocations = allocate(vertices_resources, nets, machine, [], placements)
    before = time.time()
    routes = route(vertices_resources, nets, machine, [], placements, allocations)
    print("Routed in {} seconds".format(time.time() - before))

Before these changes (2f0018a):

Routed in 10.543296813964844 seconds
Routed in 10.643112182617188 seconds
Routed in 10.55341911315918 seconds
Routed in 10.619398832321167 seconds
Routed in 10.859185457229614 seconds

After applying tweaks 1-5 (294debd) things have sped up for the common case too!

Routed in 7.551417112350464 seconds
Routed in 7.183601140975952 seconds
Routed in 7.2857115268707275 seconds
Routed in 7.339511156082153 seconds
Routed in 6.861915111541748 seconds

mossblaser commented 8 years ago

Realistically, how big might this become? Is it worth investigating a more cache-like structure?

In practice: 1 entry (with a list of 1261 2-element tuples).

New entries are only ever added when the NER router is called with a new radius value. In practice the only reason someone would change this (or call route more than once anyway for that matter) is for the purposes of doing routing experiments. In such cases I've been doing order tens of runs and so the cache size should still be very small in practice.

I considered making things get evicted but basically I couldn't be bothered to implement it due to the above. What do you think?

mundya commented 8 years ago

I considered making things get evicted but basically I couldn't be bothered to implement it due to the above. What do you think?

That seems entirely fair to me.

mossblaser commented 8 years ago

OK so I think this is about as far with this as I'm happy to go at the moment... Headline outcomes of this PR:

Routing is now 2-3x faster
The returned RoutingTrees now take up half as much memory as before.

The most controversial part of this patch set is that swapping sets for lists in the RoutingTree structure is strictly a breaking change. In practice, the only code which creates RoutingTrees outside the router is probably in test code for the routing paper. I think that thanks to the wonders of duck typing these tests probably will continue to work though they ideally would be updated...

mundya commented 8 years ago

This LGTM, apart from wanting an extra newline.

I think I'm happy to let this breaking change pass as (a) I agree that duck-typing should probably, generally, mean it's not an issue, (b) it's a mostly internal change.

project-rig / rig