vidartf opened this issue 4 years ago
I have spent some time studying how we are generating these ids. I have been generating ids using some simple Python code that follows our algorithm, along the lines of:
```python
import math
import random

def random_path(x1, x2):
    return x1 + round(random.random() * math.sqrt(x2 - x1))

def gen_paths(n, x1, x2):
    ids = []
    lower = x1
    while len(ids) < n:
        y = random_path(lower, x2)
        ids.append(y)
        lower = y
    return ids
```
To understand this I have run simulations and derived some approximate results for how this method generates sequences of random numbers. In the limit where $x_2 \gg x_1$ you can cast this as a monotonic random walk:

$$X_i = U_i \sqrt{x_2}$$

where $U_i$ is a uniform random variate on $[0, 1]$. When you call `gen_paths(n, x1, x2)` to generate a list of $n$ random paths in the range $[x_1, x_2]$, the $j$th value generated will be the sum:

$$S_j = \sum_{i=1}^{j} X_i = \sqrt{x_2} \sum_{i=1}^{j} U_i$$

Because $\langle U_i \rangle = 1/2$:

$$\langle S_j \rangle = \sqrt{x_2}\, j / 2$$

This expression for $\langle S_j \rangle$ is the expected value of the $j$th random path generated by this algorithm. We can also compute the mean spacing between consecutive values:

$$\langle S_j \rangle - \langle S_{j-1} \rangle = \sqrt{x_2} / 2$$
These two results show the following:

- The mean spacing between values is constant, independent of the number of paths $n$ you are generating. Thus, for small values of $n$, the points won't fill the interval $[x_1, x_2]$, and for large values of $n$, only a subset of the values will fill that interval.
- How many paths $n$ can this algorithm generate in $[x_1, x_2]$? That can be found by setting $\langle S_n \rangle = x_2$ in the above expression: $\sqrt{x_2}\, n / 2 = x_2$, which gives $n = 2\sqrt{x_2}$.
Let's make this more concrete and ask how many random paths this algorithm can generate in the range $[0, 2^{10}]$. In that case $n = 2\sqrt{2^{10}} = 64$. Thus, even though there are 1024 possible values in this range, our algorithm is only able to fill 64 values on average. I have done numerical simulations to validate these calculations and it all holds together.
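As a rough check of the mean-spacing result, here is a TypeScript port of the Python simulation above (a sketch, not code from the library):

```typescript
// Port of random_path above: a uniform step in the first sqrt(x2 - x1)
// of the remaining range.
function randomPath(x1: number, x2: number): number {
  return x1 + Math.round(Math.random() * Math.sqrt(x2 - x1));
}

// While the walk is far from x2, each gap is ~ U * sqrt(x2), so the mean
// gap over the first steps should be close to sqrt(x2) / 2.
const x2 = 2 ** 20;     // predicted mean gap: sqrt(2 ** 20) / 2 = 512
const steps = 10;
const trials = 10000;
let sum = 0;
for (let t = 0; t < trials; t++) {
  let lower = 0;
  for (let i = 0; i < steps; i++) {
    lower = randomPath(lower, x2);
  }
  sum += lower / steps;  // average gap over this trial's steps
}
console.log(sum / trials);  // comes out near 512
```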
This makes it clear why we are seeing memory consumption increase rapidly. Our path space is 6 bytes, or $2^{48}$ possible values (before an id has to grow by another triplet). But in practice we get fewer values in this space. Because the mean spacing between values is the constant $\sqrt{x_2}/2$, even a list with a small number of elements will have closely spaced paths. When $x_2 = 2^{48}$ the spacing is only 8388608. The max number of items you can insert in a gap of that size with our algorithm is $2\sqrt{8388608} \approx 5792$. Thus, if you paste text longer than 5792 characters repeatedly, every paste will require allocating longer and longer triplex ids, using more and more memory.
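Spelling out that arithmetic:

$$\frac{\sqrt{2^{48}}}{2} = 2^{23} = 8388608, \qquad 2\sqrt{2^{23}} = 2^{12.5} \approx 5792.6,$$

i.e. about 5792 items per gap.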
The fix will be to alter our path generation to vary the spacing with the number of values. The simplest approach is to have the average spacing between $n$ values be $(x_2 - x_1)/n$. This will be challenging to implement, but let's work through the benefit. Let's say we have 5000 characters in the path space $[0, 2^{48}]$. The initial path spacing will be 56294995342. If we paste another 5000 characters between 2 of the original characters, the path spacing will be 11258999. If we paste another 5000 between 2 of those, the spacing will be 2251. Thus, we should be able to repeatedly paste a string of 5000 characters into an empty document 3 times before needing longer triplex ids. If we increase our path space to 8 bytes, you can paste it 5 times. We will have to play around with the tradeoffs to work out the best overall performance.
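Spelling out the cascade (each nested paste divides the available spacing by 5000):

$$\frac{2^{48}}{5000} \approx 5.6 \times 10^{10}, \quad \frac{2^{48}}{5000^2} \approx 1.1 \times 10^{7}, \quad \frac{2^{48}}{5000^3} \approx 2252, \quad \frac{2^{48}}{5000^4} \approx 0.45 < 1,$$

so the fourth nested paste has no integer spacing left. With an 8 byte path space, $2^{64}/5000^5 \approx 5.9$, so the spacing survives five pastes.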
I mostly agree with your points, but a few comments:
> Because the mean spacing between values is the constant $\sqrt{x_2}/2$, even a list with a small number of elements will have closely spaced paths. When $x_2 = 2^{48}$ the spacing is only 8388608. The max number of items you can insert in a gap of that size with our algorithm is $2\sqrt{8388608} \approx 5792$. Thus, if you paste text longer than 5792 characters repeatedly, every paste will require allocating longer and longer triplex ids, using more and more memory.
Note that the random path density has two asymptotic behaviors: roughly linear (following your constant spacing) at the start, then exponential towards infinity when approaching the end of the range (but of course limited by binning to integers). Consequently, I see that the number of IDs that are typically generated before the path is expanded is around 67 million (4 orders of magnitude higher):
```typescript
// Count how many ids can be generated before an id outgrows a single
// triplet (length > 8 chars, i.e. 16 bytes).
const N = 2 ** 26;  // a large bound; the break below fires around 67 million
let last = '';
for (let i = 0; i < 2 * N; ++i) {
  // createTriplexId(version, store, lower): each new id follows the previous.
  last = createTriplexId(0, 0, last);
  if (last.length > 8) {
    console.log([i, last]);
    break;
  }
}
```
I have been picking through the details of how we can improve the triplex id generation. There are two main ideas I have been exploring:

1. Replace `randomPath(lp, hp)` by a function that just returns the average of `lp` and `hp` (see the sketch after this list). I believe this will be optimal for inserts of single list elements at random positions. It will perform worse than our current implementation for pasting a large block of text, as the number of consecutive inserts you can do through bisection is $\log_2(hp - lp)$. But it will be better for the random position inserts.
2. Generate $n$ evenly spaced ids in a path interval. That reduces to case (1) when $n = 1$ and will lead to evenly spaced paths for $n > 1$. The implementation will be more complex, but I think I see my way through it.
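For concreteness, here is a sketch of idea (1); `bisectPath` is a hypothetical replacement for `randomPath`, not existing code:

```typescript
// Idea (1): deterministic bisection of the available gap instead of a
// random path. Consecutive inserts at one position halve the gap each
// time, so a gap of size g supports about log2(g) inserts before the id
// must grow by a triplet.
function bisectPath(lp: number, hp: number): number {
  return lp + Math.floor((hp - lp) / 2);
}
```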
It's hard to give a proper review of the proposed methods without any (pseudo) code. However, I want to note that the current algorithm behaves pretty well for a somewhat typical editing scenario (see the loop and numbers above).
Maybe there exists a body of captured user edit patterns somewhere that could be used as a profiling baseline?
Here is a draft version of idea (2) above:
```typescript
export
function createTriplexIds(n: number, version: number, store: number, lower: string, upper: string): string[] {
  const MAX_PATH = 0xFFFFFFFFFFFF;
  let ids: string[] = [];
  whileIds: while (ids.length < n) {
    let id = '';
    let lowerCount = lower ? Private.idTripletCount(lower) : 0;
    let upperCount = upper ? Private.idTripletCount(upper) : 0;
    forCount: for (let i = 0, p = Math.max(lowerCount, upperCount); i < p; ++i) {
      // Extract the triplet at level i of the lower boundary id.
      let lp: number;
      let lc: number;
      let ls: number;
      if (i >= lowerCount) {
        lp = 0;
        lc = 0;
        ls = 0;
      } else {
        lp = Private.idPathAt(lower, i);
        lc = Private.idVersionAt(lower, i);
        ls = Private.idStoreAt(lower, i);
      }
      // Extract the triplet at level i of the upper boundary id.
      let up: number;
      let uc: number;
      let us: number;
      if (i >= upperCount) {
        up = upperCount === 0 ? MAX_PATH + 1 : 0;
        uc = 0;
        us = 0;
      } else {
        up = Private.idPathAt(upper, i);
        uc = Private.idVersionAt(upper, i);
        us = Private.idStoreAt(upper, i);
      }
      // lower === upper at this level: copy the triplet and descend.
      if (lp === up && lc === uc && ls === us) {
        id += Private.createTriplet(lp, lc, ls);
        continue forCount;
      }
      // Enough room at this level for all remaining ids: space them evenly.
      if ((up - lp - 1) >= (n - ids.length)) {
        let paths = Private.generatePaths(n - ids.length, lp, up);
        for (let j = 0, m = n - ids.length; j < m; j++) {
          ids.push(id + Private.createTriplet(paths[j], version, store));
        }
        return ids;
      }
      // Not enough room: copy the lower triplet and descend, letting deeper
      // levels use the full path range as the upper bound.
      id += Private.createTriplet(lp, lc, ls);
      upperCount = 0;
    } // forCount
    // Append a fresh triplet whose path bisects the full path range.
    let np = Private.generatePaths(1, 1, MAX_PATH)[0];
    id += Private.createTriplet(np, version, store);
    ids.push(id);
  } // whileIds
  return ids;
}
```
The main idea is that when it gets to the point where it used to call `Private.randomPath` to insert a new path, it now tests to see if there is room to insert the remaining $n$ evenly spaced paths. If not, it will copy the left triplet and allocate a new triplet. Some pros and cons of this:
Pros:

- The $n = 1$ case reduces to always taking the average of the upper and lower path (bisection). This should work well for random edits.

Cons:

- If there isn't room to insert the $n$ new paths between `lp` and `up` of the current triplet, it will always add a new triplet, rather than fitting what it can before adding a new triplet. This could be changed, but it was simpler to implement this way.

Here is the `generatePaths` function:
```typescript
namespace Private {
  export
  function generatePaths(n: number, min: number, max: number): number[] {
    // Divide (min, max) into n + 1 equal gaps and return the n interior points.
    let m = max - min;
    let delta = m / (n + 1);
    let paths: number[] = [];
    for (let i = 1; i <= n; i++) {
      paths.push(Math.floor(min + i * delta));
    }
    return paths;
  }
}
```
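As a quick sanity check of the even spacing (a usage sketch, called from inside the module where `Private` is visible):

```typescript
// Evenly split the open interval (0, 100) into 3 paths:
// delta = (100 - 0) / (3 + 1) = 25, giving the interior points below.
console.log(Private.generatePaths(3, 0, 100));  // [25, 50, 75]
```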
Also: one thing this work has pointed to is that we likely need a benchmark suite to test the performance of different algorithms under different conditions.
I have a version working now with a benchmark notebook that does $N$ inserts at random positions in an initially empty text, using a Poisson distribution to model the number of characters inserted at each edit (average of 1). Here is a legend to interpret the results:

- `MB`: total memory in MB of the ids.
- `total`: total number of 16 byte triplets across all ids.
- `length`: final length of text.
- `counts`: keys are the number of triplets in an id, and values are the number of ids with that number of triplets.

For $N = 2^{18}$:
Existing algorithm:

```
{
  MB: 20.768496,
  total: 1298031,
  length: 414683,
  counts: {
    '1': 4056,
    '2': 82462,
    '3': 203883,
    '4': 105496,
    '5': 17330,
    '6': 1424,
    '7': 32
  }
}
```

New algorithm:

```
{
  MB: 6.64032,
  total: 415020,
  length: 414772,
  counts: { '1': 414552, '2': 197, '3': 18, '4': 5 }
}
```
The main findings here are that the memory usage is 3x less and the generated ids are able to remain shorter for a given number of list elements and edits. This is promising, so I will continue to work on improving the benchmark to cover different types of usage cases (copy/paste, deleting, etc.).
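For reference, the core of such a benchmark might look like the following sketch. It assumes the `createTriplexIds` draft above; `samplePoisson` and the per-edit version clock are my own fill-ins, and resampling zero-length edits makes the expected final length about $1.58N$, consistent with the `length` numbers above.

```typescript
// Knuth's method for sampling Poisson(lambda); lambda = 1 models an
// average of one character per edit.
function samplePoisson(lambda: number): number {
  const limit = Math.exp(-lambda);
  let k = 0;
  let p = 1;
  do {
    k++;
    p *= Math.random();
  } while (p > limit);
  return k - 1;
}

// N edits, each inserting a nonzero Poisson(1)-sized run of characters
// at a random position in an initially empty list of ids.
function runBenchmark(N: number): string[] {
  const ids: string[] = [];
  for (let edit = 0; edit < N; edit++) {
    let count = 0;
    while (count === 0) {
      count = samplePoisson(1);  // resample zero-length edits
    }
    const pos = Math.floor(Math.random() * (ids.length + 1));
    const lower = pos > 0 ? ids[pos - 1] : '';
    const upper = pos < ids.length ? ids[pos] : '';
    // Use the edit index as a monotonically increasing version clock so
    // ids stay unique even when paths collide; store id fixed at 1.
    ids.splice(pos, 0, ...createTriplexIds(count, edit + 1, 1, lower, upper));
  }
  return ids;
}
```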
The current behavior skews new ids to the end of the range for large inserts.
Context: For creating a single triplex ID, a random path is generated within the start of the available range (uniformly in the first `Math.sqrt(max - min)` part of the range). Currently, when inserting a range, this ID generation is called sequentially, each time using the previous random ID as the start of the range.

Behavior: With this logic, any sequence insert will have an ID distribution that is heavy towards the end of the insertion range (the density increases as more of the range is consumed). This effect is further pronounced if the range is already small, i.e. for insertions within a previous insertion.
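A small sketch that reproduces this behavior; `randomPathLike` mimics the generator described above and is not the actual library code:

```typescript
// Mimic the current generator: a path uniformly within the first
// sqrt(max - min) of the range, chaining each call off the previous id.
function randomPathLike(min: number, max: number): number {
  return min + Math.round(Math.random() * Math.sqrt(max - min));
}

const max = 2 ** 16;
const paths: number[] = [];
let prev = 0;
while (prev < max - 1) {
  prev = randomPathLike(prev + 1, max);  // +1 keeps paths strictly increasing
  paths.push(prev);
}
// The first gaps average ~sqrt(max) / 2 = 128; near the end of the range
// the remaining span shrinks, so gaps approach 1 and the id density spikes.
```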
Proposed solution: For range inserts: