jupyterlab / lumino

Lumino is a library for building interactive web applications
https://lumino.readthedocs.io/

Make triplex id distribution for ranges more homogenous #13

Open vidartf opened 4 years ago

vidartf commented 4 years ago

The current behavior skews new ids to the end of the range for large inserts.

Context: To create a single triplex ID, a random path is generated near the start of the available range (uniformly in the first Math.sqrt(max - min) part of the range). Currently, when inserting a range, this ID generation is called sequentially, each time using the previous random ID as the start of the range.

Behavior: With this logic, any sequence insert will have an ID distribution that is skewed towards the end of the insertion range (the density increases as more of the range is consumed). The effect is further pronounced if the range is already small, e.g. for insertions within a previous insertion.

Proposed solution: For range inserts:

ellisonbg commented 4 years ago

I have spent some time studying how we are generating these ids, using some simple Python code that follows our algorithm, along the lines of:

import math
import random

def random_path(x1, x2):
    # Pick a path uniformly within the first sqrt(x2 - x1) of the range.
    return x1 + round(random.random() * math.sqrt(x2 - x1))

def gen_paths(n, x1, x2):
    # Generate n paths sequentially; each new path becomes the lower bound
    # for the next one, mirroring the sequential range-insert behavior.
    ids = []
    lower = x1
    while len(ids) < n:
        y = random_path(lower, x2)
        ids.append(y)
        lower = y
    return ids
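
A sample run makes the constant spacing concrete: with x2 = 2**20 the gaps hover around \sqrt{x_2}/2 = 512 regardless of how much of the range remains, so 10 paths consume only about 5000 of the roughly 10^6 available values:

print(gen_paths(10, 0, 2**20))
# typical output climbs in steps of ~512, topping out near 5000 rather
# than spreading across the full [0, 1048576] range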

To understand this, I have run simulations and derived some approximate results for how this method generates sequences of random numbers. In the limit where x_2 >> x_1, you can cast this as a monotonic random walk with steps:

X_i = U_i \sqrt{x_2}

where U_i is a uniform random variate on [0, 1]. When you call gen_paths(n, x1, x2) to generate a list of n random paths in the range [x1, x2], the jth value generated will be the sum:

S_j = \sum_{i=1}^{j} X_i = \sqrt{x_2} \sum_{i=1}^{j} U_i

Because <U_i> = 1/2:

<S_j> = j \sqrt{x_2} / 2

This expression for <S_j> is the expected value of the jth random path generated by this algorithm. We can also compute the mean spacing between consecutive values:

<S_j> - <S_{j-1}> = \sqrt{x_2} / 2

These two results show how many values, on average, the algorithm can generate before exhausting the range: setting <S_n> = x_2 and solving for n gives

n = 2 \sqrt{x_2}

Let's make this more concrete and ask how many random paths this algorithm can generate in the range [0, 2**10]. In that case n = 2\sqrt{2**10} = 64. Thus, even though there are 1024 possible values in this range, our algorithm is only able to fill 64 values on average. I have done numerical simulations to validate these calculations and it all holds together.
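
A minimal version of that check, reusing gen_paths from above (the observed mean comes out slightly below the prediction, since \sqrt{x_2 - x_1} shrinks a little as the walk advances):

import math
import statistics

# Average the jth generated path over many trials, while the walk is still
# far from the end of the range, and compare against <S_j> = j*\sqrt{x_2}/2.
x2, j, trials = 2**20, 32, 1000
samples = [gen_paths(j, 0, x2)[-1] for _ in range(trials)]
print(statistics.mean(samples))  # observed, close to the prediction
print(j * math.sqrt(x2) / 2)     # predicted: 32 * 1024 / 2 = 16384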

This makes it clear why we are seeing memory consumption increase rapidly. Our path space is 6 bytes, or 2**48 possible values (before allocating another triplex id). But in practice we get far fewer values in this space. Because the mean spacing between values is the constant \sqrt{x_2}/2, even a list with a small number of elements will have closely spaced paths. When x2 = 2**48 the spacing is only 8388608. The max number of items you can then insert in such a gap with our algorithm is 2\sqrt{8388608} = 5792. Thus, if you paste text longer than 5792 characters repeatedly, every paste will require allocating longer and longer triplex ids, consuming more and more memory.

The fix will be to alter our path generation to vary the spacing with the number of values. The simplest approach is to have the average spacing between n values be (x2 - x1)/n. This will be challenging to implement, but let's work through the benefit. Let's say we have 5000 characters in the path space [0, 2**48]. The initial path spacing will be 56294995342. If we paste another 5000 characters between 2 of the original ids, the path spacing will be 11258999. If we paste another 5000 between 2 of those, the spacing will be 2251. Thus, we should be able to repeatedly paste a string of 5000 characters into an empty document 3 times before getting longer triplex ids. If we increase our path space to 8 bytes, you can paste it 5 times. We will have to play around with the tradeoffs to work out the best overall performance.
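
The nested-paste numbers above follow from dividing the gap by the paste size at each level; a quick check, assuming the spacing at each level is roughly (x2 - x1)/n:

space = 2**48
for paste in range(1, 5):
    space = space // 5000
    print(paste, space)
# 1 56294995342
# 2 11258999
# 3 2251
# 4 0   <- spacing below 1, so the 4th nested paste forces longer ids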

vidartf commented 4 years ago

I mostly agree with your points, but a few comments:

Because the mean spacing between values is the constant \sqrt{x_2}/2, even a list with a small number of elements will have closely spaced paths. When x2 = 2**48 the spacing is only 8388608. The max number of items you can then insert in such a gap with our algorithm is 2\sqrt{8388608} = 5792. Thus, if you paste text longer than 5792 characters repeatedly, every paste will require allocating longer and longer triplex ids, consuming more and more memory.

Note that the random path density has two asymptotic behaviors: roughly linear, matching your constant spacing, at the start; and blowing up towards the end of the range as the remaining interval shrinks (though of course limited by binning to integers). Consequently, I see that the number of IDs that are typically generated before the path is expanded is around 67 million (4 orders of magnitude higher):

// N is just an iteration budget; in practice the first id longer than
// one triplet shows up around i = 67 million.
const N = 2 ** 26;
let last = '';
for (let i = 0; i < 2 * N; ++i) {
    last = createTriplexId(0, 0, last);
    if (last.length > 8) {
        console.log([i, last]);
        break;
    }
}

ellisonbg commented 4 years ago

I have been picking through the details of how we can improve the triplex id generation. There are two main ideas I have been exploring:

  1. Replace randomPath(lp, hp) with a function that just returns the average of lp and hp. I believe this will be optimal for inserts of single list elements at random positions. It will perform worse than our current implementation for pasting a large block of text, since the number of consecutive inserts you can do through bisection is log2(hp - lp) (see the sketch after this list). But it will be better for the random position inserts.
  2. Better than both our current and the bisection approach is to combine the triplex id functions into a single one that can generate n evenly spaced ids in a path interval. That reduces to case (1) when n = 1 and leads to evenly spaced paths for n > 1. The implementation will be more complex, but I think I see my way through it.
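
A quick sketch of the bisection depth mentioned in (1), in the same Python style as the analysis above (bisect_path is hypothetical, standing in for the proposed averaging function):

def bisect_path(lp, hp):
    return (lp + hp) // 2

# Repeatedly inserting at the same spot halves the gap each time, so a gap
# of size g allows about log2(g) consecutive inserts before paths collide.
lo, hi, depth = 0, 2**48, 0
while hi - lo > 1:
    hi = bisect_path(lo, hi)  # keep inserting just before the last insert
    depth += 1
print(depth)  # 48 consecutive inserts in a 2**48 path space
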
vidartf commented 4 years ago

It's hard to give a proper review of the proposed methods without any (pseudo)code. However, I want to note that the current algorithm behaves pretty well for a somewhat typical editing scenario:

  1. An initial range insert, when existing text is used as the initial state (e.g. opening a document)
  2. Subsequent random smaller edits, potentially with a higher density of inserts at the end of the document (e.g. writing new paragraphs in prose text).

Maybe there exists a body of captured user edit patterns somewhere that could be used as a profiling baseline?

ellisonbg commented 4 years ago

Here is a draft version of idea (2) above:

export
function createTriplexIds(n: number, version: number, store: number, lower: string, upper: string): string[] {
  const MAX_PATH = 0xFFFFFFFFFFFF;
  let ids: string[] = [];

  whileIds: while (ids.length < n) {
    let id = '';
    let lowerCount = lower ? Private.idTripletCount(lower) : 0;
    let upperCount = upper ? Private.idTripletCount(upper) : 0;

    forCount: for (let i = 0, p = Math.max(lowerCount, upperCount); i < p; ++i) {
      // Path, version, and store of the lower boundary triplet at level i,
      // defaulting to zeros once we run past the end of `lower`.
      let lp: number;
      let lc: number;
      let ls: number;
      if (i >= lowerCount) {
        lp = 0;
        lc = 0;
        ls = 0;
      } else {
        lp = Private.idPathAt(lower, i);
        lc = Private.idVersionAt(lower, i);
        ls = Private.idStoreAt(lower, i);
      }
      // Same for the upper boundary triplet at level i.
      let up: number;
      let uc: number;
      let us: number;
      if (i >= upperCount) {
        up = upperCount === 0 ? MAX_PATH + 1 : 0;
        uc = 0;
        us = 0;
      } else {
        up = Private.idPathAt(upper, i);
        uc = Private.idVersionAt(upper, i);
        us = Private.idStoreAt(upper, i);
      }

      // lower === upper at this level: copy the shared triplet and descend.
      if (lp === up && lc === uc && ls === us) {
        id += Private.createTriplet(lp, lc, ls);
        continue forCount;
      }

      // Enough room at this level for all remaining ids: spread them
      // evenly across the gap and finish.
      if ((up - lp - 1) >= (n - ids.length)) {
        let paths = Private.generatePaths(n - ids.length, lp, up);
        for (let j = 0, m = n - ids.length; j < m; j++) {
          ids.push(id + Private.createTriplet(paths[j], version, store));
        }
        return ids;
      }

      // Not enough room: copy the lower triplet and keep descending; the
      // upper bound is no longer a constraint below this level.
      id += Private.createTriplet(lp, lc, ls);
      upperCount = 0;
    } // forCount

    // Descended past all existing triplets: append a fresh triplet.
    let np = Private.generatePaths(1, 1, MAX_PATH)[0];
    id += Private.createTriplet(np, version, store);
    ids.push(id);
    id = '';
  } // whileIds

  return ids;
}

The main idea is that when it gets to the point where it used to call Private.randomPath to insert a new path, it now tests whether there is room to insert the remaining evenly spaced paths. If not, it copies the lower triplet and allocates a new triplet at the next level. Some pros and cons of this:

Pros:

Cons:

ellisonbg commented 4 years ago

Here is the generatePaths function:

namespace Private {
  export
  function generatePaths(n: number, min: number, max: number): number[] {
    // Spread n paths evenly over the open interval (min, max).
    let delta = (max - min) / (n + 1);
    let paths: number[] = [];
    for (let i = 1; i <= n; i++) {
      paths.push(Math.floor(min + i * delta));
    }
    return paths;
  }
}

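For quick experimentation outside the library, here is a Python mirror of the same spacing logic (just a translation for checking values, not part of the codebase):

import math

def generate_paths(n, lo, hi):
    # n paths evenly spaced over (lo, hi), endpoints excluded
    delta = (hi - lo) / (n + 1)
    return [math.floor(lo + i * delta) for i in range(1, n + 1)]

print(generate_paths(3, 0, 100))  # [25, 50, 75]
print(generate_paths(1, 0, 100))  # [50] -- reduces to bisection when n = 1
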
ellisonbg commented 4 years ago

Also: one thing this work points to is that we likely need a benchmark suite to test the performance of different algorithms under different conditions.

ellisonbg commented 4 years ago

I have a version working now with a benchmark notebook that does N inserts at random positions in an initially empty text, using a Poisson distribution to model the number of characters inserted at each edit (average of 1). Here is a legend to interpret the results:

For N=2**18:

Existing algorithm:

{
  MB: 20.768496,
  total: 1298031,
  length: 414683,
  counts: {
    '1': 4056,
    '2': 82462,
    '3': 203883,
    '4': 105496,
    '5': 17330,
    '6': 1424,
    '7': 32
  }
}

New algorithm:

{
  MB: 6.64032,
  total: 415020,
  length: 414772,
  counts: { '1': 414552, '2': 197, '3': 18, '4': 5 }
}

The main findings here are that memory usage is 3x lower and the generated ids remain shorter for a given number of list elements and edits. This is promising, so I will continue to work on improving the benchmark to cover different types of use cases (copy/paste, deleting, etc.).
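
For anyone who wants to reproduce the setup, here is a rough sketch of the workload described above (insert is a placeholder for the actual list edit; the benchmark notebook itself is not included here):

import math
import random

def poisson(lam=1.0):
    # Draw from Poisson(lam) via Knuth's method, avoiding external deps.
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        k += 1
        p *= random.random()
        if p <= L:
            return k - 1

def run_workload(N, insert):
    # N edits at random positions in an initially empty text; each edit
    # inserts a Poisson(1)-distributed number of characters.
    doc_len = 0
    for _ in range(N):
        n_chars = poisson(1.0)
        pos = random.randint(0, doc_len)
        insert(pos, n_chars)
        doc_len += n_chars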