re-rxjs / react-rxjs

React bindings for RxJS
https://react-rxjs.org
MIT License

Improve performance of partitionByKey with a big number of elements #232

Closed · voliva closed this 2 years ago

voliva commented 2 years ago

From a benchmark I realised that partitionByKey had a time complexity of O(n^2) on an example like this:

import { from } from "rxjs";
import { partitionByKey } from "@react-rxjs/utils";

function partitionByTest(n: number) {
  const array = new Array(n).fill(0).map((_, i) => i);

  const start = Date.now();
  const [, keys$] = partitionByKey(from(array), (v) => v);

  let result = 0;
  keys$.subscribe({
    next: () => {},
    complete: () => {
      // from(array) emits synchronously, so `result` is set before we return
      result = Date.now() - start;
    },
  });

  return result;
}

That's because we were internally copying the inner map so that it could be consumed by getGroupedObservable, a function that was exposed in the old split + collect API.

Now that we no longer expose that function, partitionByKey can take the original map directly, without mapping it into something else.
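A toy sketch (my own illustration, not the library's actual code) of why copying the inner map on every emission is quadratic, while reusing the original map is linear:

```typescript
// Old behaviour (assumed): every new key triggers an emission that copies
// the whole inner map, so emission k does O(k) work -> O(n^2) total.
function copyPerEmission(n: number): number {
  const inner = new Map<number, number>();
  let work = 0;
  for (let key = 0; key < n; key++) {
    inner.set(key, key);
    const snapshot = new Map(inner); // O(k) copy on the k-th emission
    work += snapshot.size;
  }
  return work; // 1 + 2 + ... + n = n(n+1)/2
}

// New behaviour (assumed): hand out the original map, O(1) per emission.
function reuseOriginal(n: number): number {
  const inner = new Map<number, number>();
  let work = 0;
  for (let key = 0; key < n; key++) {
    inner.set(key, key);
    work += 1; // constant work per emission
  }
  return work;
}
```

For n = 10 the copying version does 55 units of work versus 10 for the reusing one, and the gap grows quadratically.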

codecov[bot] commented 2 years ago

Codecov Report

Merging #232 (30094da) into main (73776b4) will not change coverage. The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main      #232   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           24        24           
  Lines          313       343   +30     
  Branches        40        49    +9     
=========================================
+ Hits           313       343   +30     
Impacted Files Coverage Δ
packages/utils/src/combineKeys.ts 100.00% <100.00%> (ø)
packages/utils/src/partitionByKey.ts 100.00% <100.00%> (ø)


voliva commented 2 years ago

I later realised that the example is still O(n^2), because keys$ still goes through the whole map to serialize it into an Array, but it's faster nonetheless.

I'll try to think about whether we can avoid the O(n^2) somehow.

voliva commented 2 years ago

I came up with something, and fancy charts 🤩

I thought we could internally keep an array of the active keys, kept in sync with the map's keys, so that we don't have to map over them on every emission. However, this solution still behaves as O(n^2):

[chart: number of elements vs. blocking time, still a quadratic curve]

Why is that? I think it's an effect of the test: because the source is an observable that completes (in order to finish the test and measure how long it took), the keys need to be removed from the array one by one, for each key. And because it's an array, I had to look up the index of that key and splice it out, which is O(n) per key, so O(n^2) overall. I think if the source didn't complete it would be O(n log n). Essentially, when the inner group completed, I had this bit of code:

            const keyIdx = keys.indexOf(key)
            keys.splice(keyIdx, 1)

An alternative that fixes this requires a change to the API: if internally we use a Set to keep track of the active keys, and return that without transforming it into an array, then it finally becomes linear 🎉
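The difference between the two removal strategies can be sketched with this toy code (assumed names, not the actual implementation):

```typescript
// Array strategy: indexOf scans the array and splice shifts every remaining
// element, so each removal is O(n) and removing all n keys is O(n^2).
function removeAllFromArray(n: number): number {
  const keys = Array.from({ length: n }, (_, i) => i);
  for (let key = 0; key < n; key++) {
    const keyIdx = keys.indexOf(key); // linear scan
    keys.splice(keyIdx, 1); // shifts the remaining elements
  }
  return keys.length; // 0 once everything is removed
}

// Set strategy: delete is O(1) on average, so removing all n keys is O(n).
function removeAllFromSet(n: number): number {
  const keys = new Set<number>(Array.from({ length: n }, (_, i) => i));
  for (let key = 0; key < n; key++) {
    keys.delete(key); // constant time on average
  }
  return keys.size; // 0 once everything is removed
}
```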

[chart: number of elements vs. blocking time, now a linear curve]

A quadratic regression gives a negative coefficient for the quadratic term, meaning it's essentially 0. It's now probably O(n log n).

Would it be worth adding in this breaking change (keys$ is now Observable<Set<K>> instead of Observable<Array<K>>)?
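For consumers, the migration would look roughly like this (hypothetical call-site code; the helper names are mine, not the library's):

```typescript
// keys$ would emit Set<K> instead of K[], so membership checks and
// iteration at the call sites change like this:

// Before: O(n) scan over the emitted array
function hasKeyOld(keys: number[], k: number): boolean {
  return keys.includes(k);
}

// After: O(1) lookup on the emitted set
function hasKeyNew(keys: Set<number>, k: number): boolean {
  return keys.has(k);
}

// Where an array is still needed (e.g. to .map over keys in a view),
// spread the set; Set preserves insertion order:
function toArray(keys: Set<number>): number[] {
  return [...keys];
}
```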

josepot commented 2 years ago

Amazing work @voliva ! Thanks a lot for the amount of work and research that you've put into this!

Would it be worth adding in this breaking change (keys$ is now Observable<Set<K>> instead of Observable<Array<K>>)?

💯

voliva commented 2 years ago

I needed to add one last commit, otherwise PartitionByKey + CombineLatest would also have O(n^2) behaviour, but it's ugly :( I needed to do some sort of refcount. I still think it needs more thought.

I reran the benchmark and cleaned it up a bit to focus on what's most relevant: blocking time up to 1 second instead of 8, for 3 different versions: the current version on the main branch, a version of this PR without any API change, and the latest version of this PR:

[charts: number of elements on source from(array) (x-axis) vs. blocking time on initial subscription in ms (y-axis), for the three versions]

I think those big jumps when dealing with big arrays are Node's GC kicking in at some point.

Without that last commit, this PR has a slightly better curve than "Without deltas", but with the same parabolic shape.