Open scottmessinger opened 4 days ago
I was so fixated on strings that I didn't consider the other inline content which you mention in the Readme:
(e.g., an image inline in a text document)
So, this approach is probably only limited to only text. If the approach explored in this issue has potential, one option might be to have a bag of non-text things that are referenced by a number that's outside the utf-16 range (so as not to conflict).
@mweidner037 It was great meeting yesterday! After our talk, I've been thinking about two things you said: (a) that translating the position to the run length encoding is the trickiest part of the code and (b) it's a linked list.
I started by trying to understand what what the linked list implementation looked like and discovered each node was a JS object. I should have assumed that it had to be an object as JS doesn't have native linked lists and I could picture any other way to do it. That got me wondering -- what is the memory overhead of an object? And then, I wondered -- if this was just a list of bytes, would it even have to be a linked list? Could simplifying the data structure into a Uint array reduce the memory pressure AND reduce the need for rle?
And, after a day of working on it, I present to you...an idea! I don't know if this actually solves the problem -- I could be way off in the wilderness here. But! It was really interesting to learn! I settled on a 32 bit unsigned array where I'd put each grapheme (as emoji surrogate pairs can be two utf-16 characters and each element needs to be the same width). Then, if a character was deleted, it could be nulled out. The hard part was figuring out how to convert a string into a list of graphemes so the index would be the same as the ProseMirror position. Eventually, I came across a library which does this performantly. Then, I just had to figure out how to get the 32 bit value for the grapheme (turns out, the easiest part).
TL;DR: if this actually supports the requirements, it's ~1/2 the memory and ~2x the speed.
Here's the test I put together (if you go to this page and open devtools, you can run them):
When I use the Chrome Devtools heap snapshot:
If I'm understanding how this is used in the code, the user would have to delete half the characters they type for the sparseArray to be more memory efficient. Is that correct?
In terms of performance, it appears using a Uint32 array is 2x as performant:
produces:
I have no idea if this is a viable avenue, but I'd love to hear your thoughts! If nothing else, this should help me have a much deeper understanding of everything behind the scenes of list-positions.