sharkdp / pastel

A command-line tool to generate, analyze, convert and manipulate colors
Apache License 2.0

Optimize distinct #83

Closed danieledapo closed 4 years ago

danieledapo commented 4 years ago

In this PR I've tried to speed up the distinct command by optimizing the simulated annealing process and the rearranging of colors.

Non-scientific benchmarks show that the proposed changes greatly improve performance. For example, I'm now able to run pastel distinct 200 | cat in <1 sec, while without these changes it took ~25 secs on my machine. However, on small inputs the boost is not as big: pastel distinct 20 | cat now runs in ~0.2 secs vs ~0.4 secs, which is better but still a bit slow imho.

The simulated annealing was optimized mainly by avoiding recalculating all distances from scratch every time and instead only recalculating the distances to the color that changed. This is the main contributing factor to the speedup.
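
To illustrate the idea (this is a hypothetical sketch with made-up names, not the code from this PR): after a single color is mutated, only its row and column of the pairwise distance matrix need to be refreshed.

```rust
// Hypothetical sketch of the incremental update; `Lab` and `distance` are
// illustrative stand-ins, not pastel's actual types.
#[derive(Clone, Copy)]
struct Lab {
    l: f64,
    a: f64,
    b: f64,
}

fn distance(x: &Lab, y: &Lab) -> f64 {
    // Plain Euclidean distance in Lab space as a placeholder metric.
    ((x.l - y.l).powi(2) + (x.a - y.a).powi(2) + (x.b - y.b).powi(2)).sqrt()
}

/// After mutating the color at index `changed`, refresh only its row/column
/// of the pairwise distance matrix instead of recomputing everything.
fn update_distances(colors: &[Lab], distances: &mut [Vec<f64>], changed: usize) {
    for i in 0..colors.len() {
        if i != changed {
            let d = distance(&colors[i], &colors[changed]);
            distances[i][changed] = d;
            distances[changed][i] = d;
        }
    }
}
```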

After optimizing that, cargo flamegraph showed that a lot of time was spent inside the rearrange_sequence function. A first optimization was to simply switch to sort_by_cached_key, because recalculating the key at each comparison is far too expensive. Then I went ahead and removed the sorting altogether, because I think there was no need for it, and introduced a cache for the minimum distances.
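
As a side note on the first step (a generic illustration, not the pastel code): sort_by_key may evaluate the key closure at every comparison, while sort_by_cached_key evaluates it exactly once per element, which is what matters when the key itself is expensive to compute.

```rust
// Generic illustration of why sort_by_cached_key helps when keys are expensive.
fn expensive_key(x: &u32) -> u64 {
    // Stand-in for an expensive key, e.g. "minimal distance to all colors placed so far".
    (0..10_000u64).map(|i| i ^ u64::from(*x)).sum()
}

fn main() {
    let mut values: Vec<u32> = (0..1_000).rev().collect();

    // Evaluates the key once per element, then sorts by the cached keys:
    values.sort_by_cached_key(|v| expensive_key(v));

    // The naive variant gives the same result but may recompute the key
    // for every comparison:
    // values.sort_by_key(|v| expensive_key(v));

    assert!(values.windows(2).all(|w| expensive_key(&w[0]) <= expensive_key(&w[1])));
}
```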

At this point I don't see any other low-hanging fruit, and I'm likely to stop here as I'm OK with the results.

Let me know what you think.

(also, your implementation of simulated annealing was really cool! simulated annealing is one of my favourite algorithms :smile: )

sharkdp commented 4 years ago

Thank you very much for working on this!

The speed improvements are impressive. Unfortunately, the behavior isn't quite the same and I cannot reproduce the same results. This is a bit difficult to measure, as there is randomness involved, but you can try the following: run pastel distinct -v 6 and take a look at the optimized D_min score. With the old version, it is consistently around 103.2. With your version, it sometimes reaches 103.2, but it also yields results that are much worse.

10 runs of the old version:

103.23
103.23
103.22
103.22
103.23
103.22
103.23
103.23
103.19
103.22

10 runs of the new version:

98.90
103.23
103.22
98.90
103.22
103.22
98.90
94.28
98.31
93.48

What I can also observe from running pastel distinct with the -v/--verbose option is that the D_mean score is excessively high in your version. It looks like some scalar::MAX values accidentally made it into the computation of the mean.
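
For what it's worth, a tiny hypothetical illustration of that symptom (made-up values, nothing to do with the actual pastel code): if f64::MAX is used as a "not yet computed" sentinel and one entry is never refreshed, the minimum still looks plausible while the mean explodes.

```rust
fn main() {
    // Hypothetical: f64::MAX used as an "uninitialized" sentinel in a distance cache.
    let mut distances = vec![f64::MAX; 4];
    distances[0] = 10.0;
    distances[1] = 12.0;
    distances[2] = 11.0;
    // distances[3] was never updated and still holds the sentinel.

    let min = distances.iter().cloned().fold(f64::INFINITY, f64::min);
    let mean = distances.iter().sum::<f64>() / distances.len() as f64;

    println!("min  = {min}");  // 10: looks fine
    println!("mean = {mean}"); // astronomically large: the sentinel leaked into the mean
}
```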

I will add some inline comments and see if I can find any mistakes.

Thanks!

sharkdp commented 4 years ago

> After optimizing that, cargo flamegraph showed that a lot of time was spent inside the rearrange_sequence function. A first optimization was to simply switch to sort_by_cached_key, because recalculating the key at each comparison is far too expensive. Then I went ahead and removed the sorting altogether, because I think there was no need for it, and introduced a cache for the minimum distances.

Using sort_by_cached_key sounds very reasonable, but I don't understand the second part. What do you mean by "there was no need for it"? I didn't really understand your new implementation when taking a first look, but I can try to take a closer look again.

danieledapo commented 4 years ago

> > After optimizing that, cargo flamegraph showed that a lot of time was spent inside the rearrange_sequence function. A first optimization was to simply switch to sort_by_cached_key, because recalculating the key at each comparison is far too expensive. Then I went ahead and removed the sorting altogether, because I think there was no need for it, and introduced a cache for the minimum distances.
>
> Using sort_by_cached_key sounds very reasonable, but I don't understand the second part. What do you mean by "there was no need for it"? I didn't really understand your new implementation when taking a first look, but I can try to take a closer look again.

I'm sorry it wasn't clear; I could have explained it better.

My understanding of the algorithm is that it has to rearrange the sequence of colors in such a way that the distance between a color and its predecessors is maximized.

My idea was to do something similar to insertion sort, where you search for the element to insert at position i by finding the minimum among the elements from i+1 to the end of the input. I said there is no need to sort the "right" part of the sequence because I think you really only need to find the minimum, which is O(n) vs O(n log n).

Moreover, at each step we can be clever and avoid recomputing the minimal distance from each element in the "right" part to each element in the "left" one, because we know which element we inserted in the previous iteration of the loop, so we can update the previously computed distances with only the new value.
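
A rough sketch of that idea (hypothetical code, not the actual patch; it reuses the Lab/distance stand-ins from the earlier sketch): a selection-style loop that at each position picks the remaining element whose cached minimal distance to the already-placed prefix is largest, and then refreshes the cache against only the newly placed element.

```rust
// Hypothetical sketch (reusing the `Lab` / `distance` stand-ins from above):
// min_dist[j] caches the minimal distance from colors[j] to the already-placed
// prefix, so each step only needs an O(n) scan plus an O(n) cache update.
fn rearrange(colors: &mut [Lab]) {
    let n = colors.len();
    if n < 2 {
        return;
    }
    let first = colors[0];
    let mut min_dist: Vec<f64> = colors.iter().map(|c| distance(c, &first)).collect();

    for i in 1..n {
        // O(n): pick the remaining element that is farthest from the prefix...
        let best = (i..n)
            .max_by(|&a, &b| min_dist[a].partial_cmp(&min_dist[b]).unwrap())
            .unwrap();
        colors.swap(i, best);
        min_dist.swap(i, best);

        // ...and O(n): refresh the cache against the element we just placed,
        // instead of recomputing distances to the whole prefix.
        for j in i + 1..n {
            min_dist[j] = min_dist[j].min(distance(&colors[j], &colors[i]));
        }
    }
}
```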

sharkdp commented 4 years ago

Thank you for the clarification. I somehow missed that the outer loop was still there and got confused as to how that could still work :smile:. Yes, that makes sense!

If I understand it correctly, your changes bring the complexity down from O(N² log N) to O(N²).

> Moreover, at each step we can be clever and avoid recomputing the minimal distance from each element in the "right" part to each element in the "left" one, because we know which element we inserted in the previous iteration of the loop, so we can update the previously computed distances with only the new value.

~~This part I still don't quite understand. We did that before as well, right? I only called sort_unstable_by_key on the right part. In your implementation, wouldn't it be the same if we just used min_by_key on the right part? All min_distances in the right part are recomputed anyway.~~

I think I get it now :smile:

danieledapo commented 4 years ago

> If I understand it correctly, your changes bring the complexity down from O(N² log N) to O(N²).

This is correct if we ignore how the calculation of the keys was done. Previously, for each element in "right" it iterated over every element in "left" to find the maximum, which I believe is O(N²) (but I never really understood big O :) ). So the final complexity of the previous algorithm was, I think, O(N² log N), but now it's O(N²), because this calculation is cached and then updated at each step of the outer loop.
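
Put differently, a rough accounting (under the assumption stated above that each key evaluation scans the whole "left" part):

$$
\underbrace{O(N \log N)}_{\text{comparisons}} \cdot \underbrace{O(N)}_{\text{key recomputed each time}} = O(N^2 \log N)
\qquad \longrightarrow \qquad
\underbrace{O(N)}_{\text{outer steps}} \cdot \underbrace{O(N)}_{\text{min search + cache update}} = O(N^2)
$$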

sharkdp commented 4 years ago

Ah, right. That explains why it was so slow for larger N :smile:

sharkdp commented 4 years ago

Thank you!

danieledapo commented 4 years ago

Thank you for merging so quickly! :rocket: :tada:

sharkdp commented 4 years ago

Released in v0.6.0.