vidartf opened this issue 4 years ago
I have spent some time studying how we are generating these ids. I have been generating ids using some simple Python code that follows our algorithm, along the lines of:
```python
import math
import random

def random_path(x1, x2):
    return x1 + round(random.random() * math.sqrt(x2 - x1))

def gen_paths(n, x1, x2):
    ids = []
    lower = x1
    while len(ids) < n:
        y = random_path(lower, x2)
        ids.append(y)
        lower = y
    return ids
```
To understand this I have run simulations and derived some approximate results for how this method generates sequences of random numbers. In the limit where $x_2 \gg x_1$ you can cast this as a monotonic random walk:

$$X_i = U_i \sqrt{x_2}$$

where $U_i$ is a uniform random variate on $[0, 1]$. When you call `gen_paths(n, x1, x2)` to generate a list of $n$ random paths in the range $[x_1, x_2]$, the $j$th value generated will be the sum:

$$S_j = \sum_{i=1}^{j} X_i = \sqrt{x_2} \sum_{i=1}^{j} U_i$$

Because $\langle U_i \rangle = 1/2$:

$$\langle S_j \rangle = \sqrt{x_2}\, j / 2$$

This expression for $\langle S_j \rangle$ is the expected value of the $j$th random path generated by this algorithm. We can also compute the mean spacing between consecutive values:

$$\langle S_j \rangle - \langle S_{j-1} \rangle = \sqrt{x_2} / 2$$
These two results show the following:

- The mean spacing between values is constant, independent of the number of paths $n$ you are generating. Thus, for small values of $n$, the points won't fill the interval $[x_1, x_2]$, and for large values of $n$, only a subset of the values will fill that interval.
- How many paths $n$ can this algorithm generate in $[x_1, x_2]$? That can be found by setting $\langle S_n \rangle = x_2$ in the above expression: $\sqrt{x_2}\, n / 2 = x_2$, which gives $n = 2\sqrt{x_2}$.
Let's make this more concrete and ask how many random paths this algorithm can generate in the range $[0, 2^{10}]$. In that case $n = 2\sqrt{2^{10}} = 64$. Thus, even though there are 1024 possible values in this range, our algorithm is only able to fill 64 values on average. I have done numerical simulations to validate these calculations and it all holds together.
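As a rough check of the mean-spacing result, here is a TypeScript port of the Python simulation above (a sketch, not code from the library):

```typescript
// Port of random_path above: a uniform step in the first sqrt(x2 - x1)
// of the remaining range.
function randomPath(x1: number, x2: number): number {
  return x1 + Math.round(Math.random() * Math.sqrt(x2 - x1));
}

// While the walk is far from x2, each gap is ~ U * sqrt(x2), so the mean
// gap over the first steps should be close to sqrt(x2) / 2.
const x2 = 2 ** 20;     // predicted mean gap: sqrt(2 ** 20) / 2 = 512
const steps = 10;
const trials = 10000;
let sum = 0;
for (let t = 0; t < trials; t++) {
  let lower = 0;
  for (let i = 0; i < steps; i++) {
    lower = randomPath(lower, x2);
  }
  sum += lower / steps;  // average gap over this trial's steps
}
console.log(sum / trials);  // comes out near 512
```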
This makes it clear why we are seeing memory consumption increase rapidly. Our path space is 6 bytes, or $2^{48}$ possible values (before an id has to grow by another triplet). But in practice we get fewer values in this space. Because the mean spacing between values is the constant $\sqrt{x_2}/2$, even a list with a small number of elements will have closely spaced paths. When $x_2 = 2^{48}$ the spacing is only 8388608. The max number of items you can insert in a gap of that size with our algorithm is $2\sqrt{8388608} \approx 5792$. Thus, if you paste text longer than 5792 characters repeatedly, every paste will require allocating longer and longer triplex ids, using more and more memory.
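Spelling out that arithmetic:

$$\frac{\sqrt{2^{48}}}{2} = 2^{23} = 8388608, \qquad 2\sqrt{2^{23}} = 2^{12.5} \approx 5792.6,$$

i.e. about 5792 items per gap.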
The fix will be to alter our path generation to vary the spacing with the number of values. The simplest approach is to have the average spacing between $n$ values be $(x_2 - x_1)/n$. This will be challenging to implement, but let's work through the benefit. Let's say we have 5000 characters in the path space $[0, 2^{48}]$. The initial path spacing will be 56294995342. If we paste another 5000 characters between 2 of the original characters, the path spacing will be 11258999. If we paste another 5000 between 2 of those, the spacing will be 2251. Thus, we should be able to repeatedly paste a string of 5000 characters into an empty document 3 times before needing longer triplex ids. If we increase our path space to 8 bytes, you can paste it 5 times. We will have to play around with the tradeoffs to work out the best overall performance.
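Spelling out the cascade (each nested paste divides the available spacing by 5000):

$$\frac{2^{48}}{5000} \approx 5.6 \times 10^{10}, \quad \frac{2^{48}}{5000^2} \approx 1.1 \times 10^{7}, \quad \frac{2^{48}}{5000^3} \approx 2252, \quad \frac{2^{48}}{5000^4} \approx 0.45 < 1,$$

so the fourth nested paste has no integer spacing left. With an 8 byte path space, $2^{64}/5000^5 \approx 5.9$, so the spacing survives five pastes.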
I mostly agree with your points, but a few comments:
> Because the mean spacing between values is the constant $\sqrt{x_2}/2$, even a list with a small number of elements will have closely spaced paths. When $x_2 = 2^{48}$ the spacing is only 8388608. The max number of items you can insert in a gap of that size with our algorithm is $2\sqrt{8388608} \approx 5792$. Thus, if you paste text longer than 5792 characters repeatedly, every paste will require allocating longer and longer triplex ids, using more and more memory.
Note that the random path density has two asymptotic behaviors: roughly linear (following your constant spacing) at the start, then exponential towards infinity when approaching the end of the range (but of course limited by binning to integers). Consequently, I see that the number of IDs that are typically generated before the path is expanded is around 67 million (4 orders of magnitude higher):
```typescript
// Count how many ids can be generated before an id outgrows a single
// triplet (length > 8 chars, i.e. 16 bytes).
const N = 2 ** 26;  // a large bound; the break below fires around 67 million
let last = '';
for (let i = 0; i < 2 * N; ++i) {
  // createTriplexId(version, store, lower): each new id follows the previous.
  last = createTriplexId(0, 0, last);
  if (last.length > 8) {
    console.log([i, last]);
    break;
  }
}
```
I have been picking through the details of how we can improve the triplex id generation. There are two main ideas I have been exploring:

1. Replace `randomPath(lp, hp)` by a function that just returns the average of `lp` and `hp` (see the sketch after this list). I believe this will be optimal for inserts of single list elements at random positions. It will perform worse than our current implementation for pasting a large block of text, as the number of consecutive inserts you can do through bisection is $\log_2(hp - lp)$. But it will be better for the random position inserts.
2. Generate $n$ evenly spaced ids in a path interval. That reduces to case (1) when $n = 1$ and will lead to evenly spaced paths for $n > 1$. The implementation will be more complex, but I think I see my way through it.
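For concreteness, here is a sketch of idea (1); `bisectPath` is a hypothetical replacement for `randomPath`, not existing code:

```typescript
// Idea (1): deterministic bisection of the available gap instead of a
// random path. Consecutive inserts at one position halve the gap each
// time, so a gap of size g supports about log2(g) inserts before the id
// must grow by a triplet.
function bisectPath(lp: number, hp: number): number {
  return lp + Math.floor((hp - lp) / 2);
}
```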
It's hard to give a proper review of the proposed methods without any (pseudo) code. However, I want to note that the current algorithm behaves pretty well for a somewhat typical editing scenario (see the loop and numbers above).
Maybe there exists a body of captured user edit patterns somewhere that could be used as a profiling baseline?
Here is a draft version of idea (2) above:
```typescript
export
function createTriplexIds(n: number, version: number, store: number, lower: string, upper: string): string[] {
  const MAX_PATH = 0xFFFFFFFFFFFF;
  let ids: string[] = [];
  whileIds: while (ids.length < n) {
    let id = '';
    let lowerCount = lower ? Private.idTripletCount(lower) : 0;
    let upperCount = upper ? Private.idTripletCount(upper) : 0;
    forCount: for (let i = 0, p = Math.max(lowerCount, upperCount); i < p; ++i) {
      // Extract the triplet at level i of the lower boundary id.
      let lp: number;
      let lc: number;
      let ls: number;
      if (i >= lowerCount) {
        lp = 0;
        lc = 0;
        ls = 0;
      } else {
        lp = Private.idPathAt(lower, i);
        lc = Private.idVersionAt(lower, i);
        ls = Private.idStoreAt(lower, i);
      }
      // Extract the triplet at level i of the upper boundary id.
      let up: number;
      let uc: number;
      let us: number;
      if (i >= upperCount) {
        up = upperCount === 0 ? MAX_PATH + 1 : 0;
        uc = 0;
        us = 0;
      } else {
        up = Private.idPathAt(upper, i);
        uc = Private.idVersionAt(upper, i);
        us = Private.idStoreAt(upper, i);
      }
      // lower === upper at this level: copy the triplet and descend.
      if (lp === up && lc === uc && ls === us) {
        id += Private.createTriplet(lp, lc, ls);
        continue forCount;
      }
      // Enough room at this level for all remaining ids: space them evenly.
      if ((up - lp - 1) >= (n - ids.length)) {
        let paths = Private.generatePaths(n - ids.length, lp, up);
        for (let j = 0, m = n - ids.length; j < m; j++) {
          ids.push(id + Private.createTriplet(paths[j], version, store));
        }
        return ids;
      }
      // Not enough room: copy the lower triplet and descend, letting deeper
      // levels use the full path range as the upper bound.
      id += Private.createTriplet(lp, lc, ls);
      upperCount = 0;
    } // forCount
    // Append a fresh triplet whose path bisects the full path range.
    let np = Private.generatePaths(1, 1, MAX_PATH)[0];
    id += Private.createTriplet(np, version, store);
    ids.push(id);
  } // whileIds
  return ids;
}
```
The main idea is that when it gets to the point where it used to call `Private.randomPath` to insert a new path, it now tests to see if there is room to insert the remaining $n$ evenly spaced paths. If not, it will copy the left triplet and allocate a new triplet. Some pros and cons of this:
Pros:

- The $n = 1$ case reduces to always taking the average of the upper and lower path (bisection). This should work well for random edits.

Cons:

- If there isn't room to insert the $n$ new paths between `lp` and `up` of the current triplet, it will always add a new triplet, rather than fitting what it can before adding a new triplet. This could be changed, but it was simpler to implement this way.

Here is the `generatePaths` function:
```typescript
namespace Private {
  export
  function generatePaths(n: number, min: number, max: number): number[] {
    // Divide (min, max) into n + 1 equal gaps and return the n interior points.
    let m = max - min;
    let delta = m / (n + 1);
    let paths: number[] = [];
    for (let i = 1; i <= n; i++) {
      paths.push(Math.floor(min + i * delta));
    }
    return paths;
  }
}
```
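As a quick sanity check of the even spacing (a usage sketch, called from inside the module where `Private` is visible):

```typescript
// Evenly split the open interval (0, 100) into 3 paths:
// delta = (100 - 0) / (3 + 1) = 25, giving the interior points below.
console.log(Private.generatePaths(3, 0, 100));  // [25, 50, 75]
```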
Also: one thing this work has pointed to is that we likely need a benchmark suite to test the performance of different algorithms under different conditions.
I have a version working now with a benchmark notebook that does $N$ inserts at random positions in an initially empty text, using a Poisson distribution to model the number of characters inserted at each edit (average of 1). Here is a legend to interpret the results:

- `MB`: total memory in MB of the ids.
- `total`: total number of 16 byte triplets across all ids.
- `length`: final length of text.
- `counts`: keys are the number of triplets in an id, and values are the number of ids with that number of triplets.

For $N = 2^{18}$:
Existing algorithm:

```
{
  MB: 20.768496,
  total: 1298031,
  length: 414683,
  counts: {
    '1': 4056,
    '2': 82462,
    '3': 203883,
    '4': 105496,
    '5': 17330,
    '6': 1424,
    '7': 32
  }
}
```

New algorithm:

```
{
  MB: 6.64032,
  total: 415020,
  length: 414772,
  counts: { '1': 414552, '2': 197, '3': 18, '4': 5 }
}
```
The main findings here are that the memory usage is 3x less and the generated ids are able to remain shorter for a given number of list elements and edits. This is promising, so I will continue to work on improving the benchmark to cover different types of usage cases (copy/paste, deleting, etc.).
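For reference, the core of such a benchmark might look like the following sketch. It assumes the `createTriplexIds` draft above; `samplePoisson` and the per-edit version clock are my own fill-ins, and resampling zero-length edits makes the expected final length about $1.58N$, consistent with the `length` numbers above.

```typescript
// Knuth's method for sampling Poisson(lambda); lambda = 1 models an
// average of one character per edit.
function samplePoisson(lambda: number): number {
  const limit = Math.exp(-lambda);
  let k = 0;
  let p = 1;
  do {
    k++;
    p *= Math.random();
  } while (p > limit);
  return k - 1;
}

// N edits, each inserting a nonzero Poisson(1)-sized run of characters
// at a random position in an initially empty list of ids.
function runBenchmark(N: number): string[] {
  const ids: string[] = [];
  for (let edit = 0; edit < N; edit++) {
    let count = 0;
    while (count === 0) {
      count = samplePoisson(1);  // resample zero-length edits
    }
    const pos = Math.floor(Math.random() * (ids.length + 1));
    const lower = pos > 0 ? ids[pos - 1] : '';
    const upper = pos < ids.length ? ids[pos] : '';
    // Use the edit index as a monotonically increasing version clock so
    // ids stay unique even when paths collide; store id fixed at 1.
    ids.splice(pos, 0, ...createTriplexIds(count, edit + 1, 1, lower, upper));
  }
  return ids;
}
```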
The current behavior skews new ids to the end of the range for large inserts.
Context: For creating a single triplex ID, a random path is generated within the start of the available range (uniformly in the first `Math.sqrt(max - min)` part of the range). Currently, when inserting a range, this ID generation is called sequentially, each time using the previous random ID as the start of the range.

Behavior: With this logic, any sequence insert will have an ID distribution that is heavy towards the end of the insertion range (the density increases as more of the range is consumed). This effect is further pronounced if the range is already small, i.e. for insertions within a previous insertion.
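A small sketch that reproduces this behavior; `randomPathLike` mimics the generator described above and is not the actual library code:

```typescript
// Mimic the current generator: a path uniformly within the first
// sqrt(max - min) of the range, chaining each call off the previous id.
function randomPathLike(min: number, max: number): number {
  return min + Math.round(Math.random() * Math.sqrt(max - min));
}

const max = 2 ** 16;
const paths: number[] = [];
let prev = 0;
while (prev < max - 1) {
  prev = randomPathLike(prev + 1, max);  // +1 keeps paths strictly increasing
  paths.push(prev);
}
// The first gaps average ~sqrt(max) / 2 = 128; near the end of the range
// the remaining span shrinks, so gaps approach 1 and the id density spikes.
```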
Proposed solution: For range inserts: