Would using UInt32 arrays be more memory efficient?

@mweidner037 It was great meeting yesterday! After our talk, I've been thinking about two things you said: (a) that translating the position to the run length encoding is the trickiest part of the code and (b) it's a linked list.

I started by trying to understand what the linked list implementation looked like and discovered that each node is a JS object. I should have assumed as much, since JS doesn't have native linked lists and I couldn't picture any other way to do it. That got me wondering -- what is the memory overhead of an object? And then, I wondered -- if this were just a list of bytes, would it even have to be a linked list? Could simplifying the data structure into a typed array (a Uint32Array) reduce the memory pressure AND reduce the need for RLE?

And, after a day of working on it, I present to you...an idea! I don't know if this actually solves the problem -- I could be way off in the wilderness here. But! It was really interesting to learn! I settled on a 32-bit unsigned array (Uint32Array) with one element per grapheme (an emoji can take two UTF-16 code units as a surrogate pair, and every element needs to be the same width). Then, if a character was deleted, its slot could be zeroed out. The hard part was figuring out how to convert a string into a list of graphemes so the index would be the same as the ProseMirror position. Eventually, I came across a library which does this efficiently. Then, I just had to figure out how to get the 32-bit value for the grapheme (turns out, the easiest part).
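To make the layout concrete, here is a minimal sketch of the idea. It assumes the built-in `Intl.Segmenter` in place of the library I actually used, and the helper names `toGraphemeArray` and `deleteAt` are made up for illustration. It also assumes 0 is free to use as a "deleted" marker, which only holds if the text never contains U+0000.

```javascript
// Sketch only: Intl.Segmenter stands in for unicode-segmenter here.
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

// Hypothetical helper: one Uint32 slot per grapheme, holding its FIRST code point.
function toGraphemeArray(text) {
  const segments = Array.from(segmenter.segment(text), (s) => s.segment);
  const slots = new Uint32Array(segments.length);
  segments.forEach((segment, i) => {
    slots[i] = segment.codePointAt(0);
  });
  return slots;
}

// Hypothetical helper: mark a deleted position by zeroing its slot.
// Assumes U+0000 never appears in the document text.
function deleteAt(slots, index) {
  slots[index] = 0;
}

// The facepalm emoji is four code points: U+1F926, ZWJ, U+2642, U+FE0F.
const facepalm = "\u{1F926}\u200D\u2642\uFE0F";
const graphemeSlots = toGraphemeArray("hi" + facepalm + "!");

deleteAt(graphemeSlots, 1); // delete the "i"

console.log(graphemeSlots.length); // one slot per grapheme, not per code unit
console.log(graphemeSlots[2].toString(16)); // only the first code point survives
```

One caveat this surfaces: `codePointAt(0)` keeps only the first code point of a grapheme, so a multi-code-point emoji like the one above would not round-trip from a single 32-bit slot. Whether that matters depends on whether the array needs to store the text or only track which positions are present.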

TL;DR: if this actually supports the requirements, it's ~1/2 the memory and ~2x the speed.

Here's the memory test I put together (if you go to this page and open devtools, you can run it):

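import { graphemeSegments } from 'unicode-segmenter/grapheme';
// SparseArray comes from this repo's package (sparse-array-rled).
import { SparseArray } from 'sparse-array-rled';

// 32 graphemes per string x 100,000 iterations = 3,200,000 slots.
let graphemeBunch = new Uint32Array(3200000)
let sparseArray = SparseArray.new()

let j = 0
for (let i = 0; i < 100000; i++) {
    for (let {segment} of graphemeSegments("hello🤦‍♂️!!hello🤦‍♂️!!hello🤦‍♂️!!hello🤦‍♂️!!")) {
        // The typed array stores the grapheme's first code point...
        graphemeBunch.set([segment.codePointAt(0)], j)
        // ...while the sparse array stores the grapheme string itself.
        sparseArray.set(j, segment)
        j++
    }
}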

When I use the Chrome Devtools heap snapshot:

graphemeBunch retains 12 800 112 bytes
sparseArray retains   27 798 232 bytes

If I'm understanding how this is used in the code, the user would have to delete half the characters they type for the sparseArray to be more memory efficient. Is that correct?
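As a sanity check on the snapshot numbers, the typed-array figure can be reproduced by hand: 32 graphemes per string times 100,000 iterations gives 3,200,000 slots at 4 bytes each. The sketch below only does arithmetic on the sizes reported above; the variable names are mine.

```javascript
const graphemesPerString = 32; // "hello" + emoji + "!!" is 8 graphemes, repeated 4 times
const iterations = 100_000;
const slotCount = graphemesPerString * iterations;
const payloadBytes = slotCount * Uint32Array.BYTES_PER_ELEMENT;

// Retained sizes copied from the heap snapshot above.
const reportedTypedArrayBytes = 12_800_112;
const reportedSparseArrayBytes = 27_798_232;

console.log(slotCount); // 3200000
console.log(payloadBytes); // 12800000
console.log(reportedTypedArrayBytes - payloadBytes); // 112 bytes of non-payload overhead
console.log((reportedSparseArrayBytes / reportedTypedArrayBytes).toFixed(2)); // 2.17
```

So the typed array is almost entirely payload, and the sparse array retains roughly 2.17 times as much in this particular test.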

In terms of speed, writing to the Uint32Array appears to be about twice as fast:

let graphemeBunch = new Uint32Array(3200000)
let sparseArray = SparseArray.new()
let bunchIndex = 0
let sparseIndex = 0

// Both timed loops include the cost of grapheme segmentation.
console.time("bunches")
for (let i = 0; i < 100000; i++) {
    for (let {segment} of graphemeSegments("hello🤦‍♂️!!hello🤦‍♂️!!hello🤦‍♂️!!hello🤦‍♂️!!")) {
        graphemeBunch.set([segment.codePointAt(0)], bunchIndex)
        bunchIndex++
    }
}
console.timeEnd("bunches")

console.time("sparse")
for (let i = 0; i < 100000; i++) {
    for (let {segment} of graphemeSegments("hello🤦‍♂️!!hello🤦‍♂️!!hello🤦‍♂️!!hello🤦‍♂️!!")) {
        sparseArray.set(sparseIndex, segment.codePointAt(0))
        sparseIndex++
    }
}
console.timeEnd("sparse")

produces:

bunches: 228.232177734375 ms
sparse:  458.48681640625 ms
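One detail of the typed-array loop worth flagging: `set([value], index)` wraps each value in a one-element array before copying it in, whereas a plain indexed write does not. I have not measured whether that changes the timing, so the sketch below only shows that the two forms store the same values.

```javascript
const viaSet = new Uint32Array(3);
const viaIndex = new Uint32Array(3);
// "h", the first code point of the facepalm emoji, and "!".
const sampleCodePoints = [0x68, 0x1f926, 0x21];

sampleCodePoints.forEach((codePoint, i) => {
  viaSet.set([codePoint], i); // form used in the benchmark: copies from a temporary array
  viaIndex[i] = codePoint;    // direct indexed write, no temporary array
});

console.log(viaSet.join(",") === viaIndex.join(",")); // true
```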

I have no idea if this is a viable avenue, but I'd love to hear your thoughts! If nothing else, this should help me have a much deeper understanding of everything behind the scenes of list-positions.

mweidner037 / sparse-array-rled
