What could be the direction for making Peritext support block elements

inkandswitch / peritext

A CRDT for asynchronous rich-text collaboration, where authors can work independently and then merge their changes.

https://www.inkandswitch.com/peritext

MIT License

656 stars 28 forks source link

What could be the direction for making Peritext support block elements #27

Open joelewis opened 2 years ago

joelewis commented 2 years ago

This is not an issue. Just want to pick the team's brain on what's the perspective on dealing with rich text elements with "layouts" like tables and lists. Kudos to the team, Peritext has come up with an efficient representation for holding both character data as well as formatting boundaries.

Handling block elements is a different beast altogether. I'll take tables for clarity as it's the most common block element, yet supports most complex operations. Here are the options I see with Peritext's current representation.

Encode blocks as special control character. For example a table will have control characters for <table-start>, <table-end>, <row-start>, <column-start>, .... etc.


[

{ char: "T", opId: "9@B", deleted: false, markOpsBefore: [
{
  action: "addMark",
  opId: "19@A",
  start: { type: "before", opId: "9@B" },
  end:   { type: "before", opId: "10@B" },
  markType: "bold"
}
] },
{ char: "t", opId: "1@A", deleted: true },
{ char: "h", opId: "2@A", deleted: false },
{ char: "e", opId: "3@A", deleted: false },
{ char: " ", opId: "4@A", deleted: false },
{ char: "f", opId: "5@A", deleted: false },
{ char: "o", opId: "6@A", deleted: false },
{ char: "x", opId: "7@A", deleted: false },
{ char: " ", opId: "10@B", deleted: false, markOpsBefore: [] },

{ char: "<table-start-char>", opId: "11@A", deleted: false },
//..... (rest of the table body with content interweaving table control characters)
{ char: "<table-end-char>", opId: "12@A", deleted: false },

]


- Makes formatting model play along nicely. But as quoted in the original article it takes significant complexity to tailor it for our desired outcomes. Keeps the base model simple but shifts the complexity to handling outcomes. 

2. Encode tables as trees (i.e semantic JSON) and use automerge's/custom JSON capabilities.

[

{ char: "T", opId: "9@B", deleted: false, markOpsBefore: [ { action: "addMark", opId: "19@A", start: { type: "before", opId: "9@B" }, end: { type: "before", opId: "10@B" }, markType: "bold" } ] }, { char: "t", opId: "1@A", deleted: true }, { char: "h", opId: "2@A", deleted: false }, { char: "e", opId: "3@A", deleted: false }, { char: " ", opId: "4@A", deleted: false }, { char: "f", opId: "5@A", deleted: false }, { char: "o", opId: "6@A", deleted: false }, { char: "x", opId: "7@A", deleted: false }, { char: " ", opId: "10@B", deleted: false, markOpsBefore: [] },

{layout: "table", opId: "11@B", deleted: false, rows: [/ other table components /]} // special layout JSON ]



- Adds additional basic units to the model but plays along well with existing CRDTs that support trees/JSON. Doesn't mean handling outcomes is any easier. Might take significant complexity to deal with operations that might result in an un-renderable table structure. 

None of the options might make sense actually. It's highly possible there are other better directions to take that handles semantic tree elements like tables and lists, which the team might already have a hang on. I'd be happy to wrap my head around what's cooking and maybe help in someway. 

CRDTs for working with non-schematic JSON is fairly straightforward. CRDTs for schematic JSON like a SQL database is slightly tricky but the operation types (insert row, insert column, update cell) are too simple to warrant a complex solution. But tables in a WYIWYG editors are the most complex of them all with operations like cell-merge and cell-splitting. As far as I've studied none of the existing CRDT literature touches on this particular topic. Given Peritext's direction this project might be the only one, that throws light in this area.

joelewis commented 2 years ago

I've been thinking about the nature of this problem and stumbled upon this generalization. What makes WYIWYG tables so difficult to handle? They are JSON objects with a grammar. i.e the arrangements of nodes within this tree has to follow a pattern to be valid. Now, what other trees with grammar do we deal with everyday as a programmer? you might have guessed! ASTs.

ASTs of programming languages follow a specific grammar, which makes the problem even more interesting. This means any CRDT that solves this problem will be portable to code editors and IDEs as well. Simply put, conflict-less merging strategy that always keep the source code syntactically valid at all times. Hopefully this puts a bigger picture for this problem and doesn't distract the goals of Peritext :)

kythyria commented 2 years ago

Now, what other trees with grammar do we deal with everyday as a programmer? you might have guessed! ASTs.

ASTs have the significant advantage that–unless you work in a language that likes macros–you may be able to tolerate a fixed grammar (at least, fixed for the lifetime of the document). As well, constraints like the table being rectangular after merged cells are considered are rarely handled as a grammar, assuming you want to have that as a constraint (maybe not; even if editing goes a little wonky one can certainly describe how to display such a table)

Oh, and it's sometimes useful to refer to columns of a row-major table (or rows of a column-major table) as a thing in themselves.

I've generally assumed that tables are an example of something that's just handled as a reference to an entirely different object held alongside the rich text rather than trying to hide it inside the rich text. Although this removes a possibly desired behaviour: tables that fuse together if nothing separates them. Conversely, one might want blockquotes to not fuse when abutting. Which of these is the more niche behaviour, and whether either is niche enough to not support, I don't know.

justjake commented 2 years ago

Are tables really that troublesome? Trying to model tabular data as a stream of tokens or a JSON structure that looks like a table (like an array-of-arrays) is troublesome, but this is over-constraining the problem. I helped design a convergent data model for tables at Notion recently that would work well using 3 convergent data types: Map (to group and address fields), Ordered Set (for defining the order of rows and the order of columns), and Rich Text (for defining the contents of cells).

The structure looks like this, in pseudocode:

type Table = {
  rowOrder: OrderedSet<RowId>
  columnOrder: OrderedSet<ColumnId>
  rows: Map<RowId, Row>
}

type Row = {
  cells: Map<ColumnId, RichText>
}

Map here is used for addressing, not for any kind of convergence property. You can sprinkle on some extra last-write-wins maps w/ Lamport clocks for row, column, and cell format. Use Lamport clocks for row and column IDs so that rows/columns not included in the ...Order fields sort reasonably.

kythyria commented 2 years ago

... I feel scooped, I was just in the middle of an edit to describe a similar structure. I guess it is obvious after all. That it works for merged cells isn't to me: if you go with rowspan and colspan, you can have two cells trying to occupy the same space, and merging has to fuse the cell contents without edits being lost or resurrecting the removed cell.

joelewis commented 2 years ago

type Table = {
  rowOrder: OrderedSet<RowId>
  columnOrder: OrderedSet<ColumnId>
  rows: Map<RowId, Row>
}

type Row = {
  cells: Map<ColumnId, RichText>
}

@justjake Interesting take. But I'm trying to wrap my head around how will a merged cell or a split-cell look like with this data model. Something like this

Any hints?

joelewis commented 2 years ago

I've generally assumed that tables are an example of something that's just handled as a reference to an entirely different object held alongside the rich text rather than trying to hide it inside the rich text.

@kythyria This is not an entirely bad idea! This solves a bunch of problems as well, like table operations need not strictly be a CRDT, idk. Like while contents and formatting compulsively need a sensible resolving algorithm, layout operations like table and lists might live with a simple LWW strategy. Just thinking out loud. I might be ridiculously wrong.

ept commented 2 years ago

I have written up a proposal for handling block elements in Peritext. It's not yet implemented or tested on users, so it may not be ideal, but I think it's a reasonably plausible way forward.

justjake commented 2 years ago

@joelewis

@justjake Interesting take. But I'm trying to wrap my head around how will a merged cell or a split-cell look like with this data model. Something like this

Any hints?

I consider cell "merge" a format property of the cell - as rowspan and colspan format properties at @kythyria suggests. To merge a [ left cell, right cell ] into a [ two column cell ], you copy the text content of right cell to the end of left cell's RichText, and then increment left cell's colspan format. Our data model needs this extension to support format attributes:

type Row = {
  cellFormat: Map<ColumnId, CellFormat>
  cells: Map<ColumnId, RichText>
}

// Each of the properties of the CellFormat object is a last-write-wins register defined by
// the write with the "latest" lamport clock
type CellFormat = LastWriteWinsObject<{
  rowspan?: number
  colspan?: number
  // You can store other stylistic info for the whole cell here, too.
  color?: string
  background?: string
}>

@kythyria

That it works for merged cells isn't to me: if you go with rowspan and colspan, you can have two cells trying to occupy the same space, and merging has to fuse the cell contents without edits being lost or resurrecting the removed cell [...]

A naive renderer might do something strange when cells with >2 rowspan or colspan overlap. I think this is solvable with some precedence rules for stacking the rendered cells:

Cells stack top to bottom with the first row the highest, and left to right with the first column the highest. The cell at { row: 0, col: 0 } stacks above all other cells. The cell at { row: 0, col: 1, rowspan: 2 } would stack over the cell { row: 1, col: 0, colspan: 2 }.
If a cell's origin point (it's RowId and ColumnId) is covered by another cell, don't render the covered cell at all. This resolves a confusing ambiguity where you can see some data, but it's not clear how to control the cell.

Overall, this scheme does allow cells that were previously "covered" to be "resurrected", but I don't think that's necessarily a bad thing.

@ept

I have written up a proposal for handling block elements in Peritext. It's not yet implemented or tested on users, so it may not be ideal, but I think it's a reasonably plausible way forward.

The tables in Martin's design (https://martinkl.notion.site/Block-elements-in-rich-text-CRDT-a3b69f886dbc4ad1abe81cea0b3e6623#56ec0311dc86439fa1fa791607e05b5b) basically equivalent to the design we're discussing - the only difference is the addressing scheme; I intended my Row/Column addressing to be equivalent to Map<(RowId, ColumnId), Cell>.

But, Martin suggests using a different scheme for larger cells:

It's not immediately obvious how to handle table cells that span several rows or columns. Perhaps they should be identified by a two-dimensional range (startRowId, startColumnId, endRowId, endColumnId)? However, that structure would risk users concurrently creating cells that partially overlap (for example, one cell that spans columns A and B, and another cell that spans columns B and C). Perhaps this is a sufficiently niche edge-case that it's okay to resolve such conflicts by last-write-wins.

I think this would lead to counterintuitive rendering for larger cells if rows or columns are re-ordered - if a column goes from being the second column to the last column, a colspan=2 cell in the first column suddenly covers the whole table. I think rendering stacking cells with a subtle UI affordance might be more intuitive.

kythyria commented 2 years ago

I have written up a proposal for handling block elements in Peritext. It's not yet implemented or tested on users, so it may not be ideal, but I think it's a reasonably plausible way forward.

@ept That is a lot of tag inference! It might do something odd with <details> and equivalently-shaped things (like many uses of <blockquote>). I'm reminded a little bit of Google Wave though: putting splitBlock before the text its formats apply to rather than the end. I kind of wonder if the consecutive-asides problem (and in some contexts users will happily trigger it) might be solved by–seeing as this refers to a schema in any case–allowing that to be represented by consecutive asides:

{splitblock blockType: "aside", parents: []}
{splitblock blockType: "p", parents: ["aside"]}
Text in the first aside
{splitblock blockType: "aside", parents: []}
{splitblock blockType: "p", parents: ["aside"]}
Text in the second aside

It's actually a bit easier for <details> since there's only one <summary> and it's at the start, but that might be getting into SGML-grade tag inference shenanigans.

@justjake The start/end scheme for tables is almost exactly how CSS Grid works if you don't let it automatically place cells, which might be considered a use case of tables distinct from tabular data: layout. Such layouts sometimes can be unrolled into flat text without confusion, too. And yes, it does allow you to overlap cells that are given explicit boundaries.

HTML handles "overlapping" table cells by forbidding it: cells go into the next available space, after row/column spanning are accounted for (of course, it does this by cells not having coordinates at all, which makes rearranging them in a CRDT way annoying at best). Unintuitive, admittedly (and you need some odd commands to repair it, such as "delete just this one cell, shifting the rest of just this row left"), but it never hides a cell and is at least somewhat explainable.

kythyria commented 2 years ago

Oh, and being reminded of Google Wave reminded me of something else of its data model: At one level, it was basically an outliner: a tree of nodes each of whose data was a rich text document. Which can be used for a more hierarchical version of Jupyter-esque notebooks when you think about it: a mix of nodes with rich text, and nodes that hold data or code for a computation.

Not very traditional-wysiwyg-editor like, though, even if it does potentially let you casually drag the nodes around or skin them as a forum thread or use them as an outliner.

josephg commented 2 years ago

My general opinion on this is to do what @kythyria suggested and use a separate (embedded) data structure. This is also what we've been talking about in the braid working group. And its what we ended up doing for sharedb, which was inspired by sharejs, which was inspired by my work on Wave (and talking to some of the engineers on that after the project was over).

So I imagine something like this: We have a few 'primitive' data structures:

CRDT LWW (or, arbitrary-writer wins) register and keep-all-conflicting-values register
CRDT map / set / JSON objects.
List / plaintext CRDT (what I'm building at the moment in diamond types )
Rich text CRDT (this is a list CRDT with support for annotation ranges)
... And maybe others.

Then the idea is that each value in the system is another CRDT value. Registers, maps, lists and rich text CRDTs can embed children inside, which are themselves registers, maps, lists and more rich text documents. And so on. And you could do all this either in a dynamically typed way (like JSON) or with a schema.

Then if you have a list of todo items, each item can be JSON value with a few fields. One of the fields is a string description, which is implemented using a CRDT designed for editing text. If the UI supports it, the todo list itself could be embedded or linked into a rich text document. It would take up length=1 like any other item (character) in the rich text document.

If tables need special handling, they can be implemented using any of the above tools. Or someone could build special table semantics using a different construction if they want. And then tables can recursively support the same feature set - because what do table cells contain? Another embedded rich text CRDT.

kythyria commented 2 years ago

This brings to mind a more upper-layer question: At what point does structure become "too complex" for putting inside a richtext and warrant an embed/link? There's several answers of varying flavours of dogma that all probably produce a noticeably different UI.

You might try to fit everything into the rich text type. Or you might go the other extreme, and have nothing more complicated than paragraphs inside the rich text. That latter likely wouldn't resemble a "traditional" RTE so much: outliner or notebook might be more like it, with a departure from conventional selection, cursor motion, etc.

If even simple headings and lists are embeds/alternate cell types, that would... I actually don't know how awkward it would be to use, or if the desire for explicit handling for moves becomes much more urgent.

josephg commented 2 years ago

My instinct is to have the rich text type be a list of either characters or embedded items. And support annotated ranges (bold, italics, link, etc). I think that gets you 90% of the way there. You still need a convention / common schema for how tables and embedded images and so on are expressed, but I think thats fine.

As for what should be an embed vs a link, I feel like that decision is as old as time. And I don't have any special answers. Maybe transclusion would work well, where embeds are links - but they can be fetched and embedded automatically by the containing system. But in a word doc, you expect any embedded images to travel with the .docx file. (And be deleted when the doc is deleted). So maybe how links work might simply depend on the use case.

geoffreylitt commented 2 years ago

When considering tradeoffs of various approaches here, I think it's worth highlighting that splitting/joining across blocks seems like an important requirement, at least for a "word processor" use case where a document feels like one contiguous thing. (A blocks-style UI like Notion might have different requirements.)

Let's say we treat paragraphs as a kind of block element. Imagine we start with a one-paragraph document:

I saw a quick fox. It ran across the street.

Alice and Bob are concurrently editing. Alice splits the paragraph between the two sentences:

I saw a quick fox.
---
It ran across the street.

Concurrently, Bob adds some text to the end of the paragraph:

I saw a quick fox. It ran across the street and jumped across the fence.

The best merge result is pretty clearly:

I saw a quick fox.
---
It ran across the street and jumped across the fence.

However, if we interpret "paragraph split" as "delete some characters from the paragraph and insert a new paragraph containing that text", then we'd probably end up with Bob's insertion showing up in the original first paragraph, like this:

I saw a quick fox. and jumped across the fence
---
It ran across the street.

This is kind of a similar issue to the problems with a JSON object of format spans in the Peritext essay. It seems relevant for paragraphs, list elements, headings--anything that we might treat as a block element. Anytime two users concurrently split a block element in different places, or concurrently split + insert, it seems like we'll run into issues. FWIW, the yjs prosemirror integration demonstrates these kinds of problems because it treats each paragraph as a separate object.

I think the key advantage of Martin's proposal is that it handles these kinds of cases nicely by working with boundary characters rather than isolated objects.

ASTs of programming languages follow a specific grammar, which makes the problem even more interesting. This means any CRDT that solves this problem will be portable to code editors and IDEs as well.

There is indeed a connection here; e.g. Prosemirror allows defining schemas governing what kinds of configurations of elements comprise a valid document. However I'm quite wary of trying to support complicated schemas on block elements because many constraints seem very difficult to support in a CRDT. For example, what if a parent can only contain a single child of a particular type, and two users concurrently add a child of that type? Maybe there are ways to solve that particular problem, but the general case seems hard.

kythyria commented 2 years ago

For example, what if a parent can only contain a single child of a particular type, and two users concurrently add a child of that type? Maybe there are ways to solve that particular problem, but the general case seems hard.

I think if I were seriously trying to implement it I'd start with a schema that only records element names but not how many of each or in what order. That would probably be simpler to implement than one that can figure out that

{splitBlock blockType: "p" parents: []}
Zero
{splitBlock blockType: "summary" parents: ["details"]}
One
{splitBlock blockType: "p" parents: ["details"]}
Two
{splitBlock blockType: "summary" parents: ["details"]}
Three
{splitBlock blockType: "p" parents: ["details"]}
Four
{splitBlock blockType: "p" parents: []}
Five

should decode to

<p>Zero</p>
<details>
  <summary>One</summary>
  <p>Two</p>
</details>
<details>
  <summary>Three</summary>
  <p>Four</p>
</details>
<p>Five</p>

especially since if <summary> is made mandatory that complicates getting rid of the <details> entirely. Also you'd indeed have to deal with concurrent adds violating the schema, at least for long enough for the users who did it to agree on how to fix it.

kythyria commented 2 years ago

My instinct is to have the rich text type be a list of either characters or embedded items. And support annotated ranges (bold, italics, link, etc). I think that gets you 90% of the way there. You still need a convention / common schema for how tables and embedded images and so on are expressed, but I think thats fine.

As for what should be an embed vs a link, I feel like that decision is as old as time. And I don't have any special answers. Maybe transclusion would work well, where embeds are links - but they can be fetched and embedded automatically by the containing system. But in a word doc, you expect any embedded images to travel with the .docx file. (And be deleted when the doc is deleted). So maybe how links work might simply depend on the use case.

It probably does get 90% of the way there, the problem is splitting and joining embeds (and moving, if they're entirely contained within a conventional richtext). FWIW I'd say the distinction is that it's a link if it points outside of whatever the unit of security and subscription is, and you shouldn't use a richtext as the top level unless it can in fact contain embeds somehow.

josephg commented 2 years ago

When considering tradeoffs of various approaches here, I think it's worth highlighting that splitting/joining across blocks seems like an important requirement ... I think the key advantage of Martin's proposal is that it handles these kinds of cases nicely by working with boundary characters rather than isolated objects.

Yeah; This is also what we did for wave. A wavelet's contents were an XML-like structure with annotations. The team spent a lot of time trying to make waves work like this:

<wavelet>
  <p>Paragraph one</p>
  <p>Paragraph two</p>
</wavelet>

But making split and join operations work in a concurrent way eluded us. Months of work were put in trying to make it work correctly, until we restructured the document to look like this:

<wavelet>
    Paragraph one
  <br />
    Paragraph two
</wavelet>

... And then the algorithms got way simpler (because its just a list). I haven't seen anyone get splitting and joining working properly - though I haven't been looking. But yeah, I suspect its trouble.

the problem is splitting and joining embeds (and moving, if they're entirely contained within a conventional richtext)

Moving is definitely important for some applications. Its much easier if what you're moving around is a link, or moves are "flat" - ie, "move 1 item from index 10 to index 20". I added arbitrary tree based moves to the JSON1 OT code, and its fiendishly complex. And you get all sorts of awful situations, like this:

doc = {a: {}, b: {}}
op1 = (move a inside b)
op2 = (move b inside a)

What should it even do in that case? I don't know. I think the best case is for it to enter some sort of conflict state.

I'd say the distinction is that it's a link if it points outside of whatever the unit of security and subscription is, and you shouldn't use a richtext as the top level unless it can in fact contain embeds somehow.

I like this; though this design is very application-specific.

kythyria commented 2 years ago

But making split and join operations work in a concurrent way eluded us. Months of work were put in trying to make it work correctly, until we restructured the document to look like this: Nitpick: wasn't it
<wavelet>
<line/>Paragraph one
<line/>Paragraph two
</wavelet>
ie, a flatter predecessor of Martin's proposal, with different, more application-specific rules for how you figure out <ol> and friends.

Not gonna disagree on markup trees being a giant pain though. I've tried to figure it out for OT a few times and in hindsight made horrible mistakes every time. Along with identifying the problem that any schema enforcement by generating extra edits will be its own barrel of laughs to implement.

I'm actually surprised CKEditor went with that latter part, seeing as they wanted collaborative editing. Such a mechanism would need a convergence guarantee of its own, and that doesn't look at all easy to achieve, whether or not the edits are transmitted anywhere. (edit: whereas Martin's proposal makes every state valid, if possibly silly)

josephg commented 2 years ago

Yeah your nitpick is correct. Same basic idea either way though - the core idea being essentially making newlines as characters.

Not gonna disagree on markup trees being a giant pain though.

The worst part of markup trees is doing formatting annotations - which in wave were also done separately from the XML structure. Peritext has a markup system quite similar to wave's approach.

The other awful part of wave's XML approach was that we tried to embed lots of different data types into / on top of our XML structure. For example, 3rd party widgets ("gadgets") would be handed a subtree of a wave to use to store their data. But gadgets wanted JSON, and doing concurrent maps and sets on top of a concurrent XML tree was awful. Much better to just (as I suggested upthread) embed an OT / CRDT object specific to the desired data semantics in place.

I don't know much about CKEditor, but I'm convinced all sorts of OT / CRDT design mistakes have been made over the years by people who haven't bothered with fuzzers. All OT / CRDT algorithms I've ever designed or implemented have had mistakes in various ways that fuzzers have found. Many of these problems were quite fundamental (like the splitting / merging lines issue).

kythyria commented 2 years ago

But gadgets wanted JSON, and doing concurrent maps and sets on top of a concurrent XML tree was awful.

Yikes, that's sort of the opposite of using JSON to do rich text! XML with rich text nodes seems to be quite popular though: CKEditor, Slate, and ProseMirror all do it in some fashion. So is just the annotations (eg, Quill).

I kind of wonder if Wave could have gotten away with inlining blips into one big one; certainly the UI never supported moving them.

More directly, I'm on the fence about what should be part of a richtext CRDT and what should be embeds/links. Tables should probably be embeds, grids could go either way but probably need the grid layouts to be outside the richtext, headings and lists can easily-ish be inline; it's blockquote, aside, and details that are really vexing. And how selections should behave around embeds: does it make sense to allow

<blockquote>
  Foo
  <?start-selection?>
  Bar
</blockquote>
<blockquote>
  Baz
  <?end-selection?>
  Barrow
</blockquote>

or should the selection snap to encompass the entire of both blocks? Quassel does the latter for messages in a chat scrollback and it works well there, but that's a different use case.

Where lists fall partly depends on if entire paragraphs can be in a list. If not, then startListOfType and indentLevel properties on paragraphs/lines don't have obvious defects regarding editing behaviour. "Obvious" in that I can't think of some, if you can formalise "defective" a fuzzer or proof might still find it.

chriseidhof commented 2 years ago

@ept I read your proposal and really like it.

I've been thinking about lists and there is a caveat. Let's say user A creates three unordered list items, and adds them to a document. A <ul> is inferred automatically. Concurrently, user A changes the list type to <ol>, and user B adds a fourth item. What should the result be after merging? I would intuitively say an ordered list with four items, but as per your proposal it would be two lists (an ordered one with three items followed by an unordered list with a single item).

To be more precise, there are a few ways you can change a list type. You could move your cursor to one of the items, and change the state from unordered to ordered. In Apple Notes / Pages this changes the current list item (e.g. it renders as a single list with mixed bullets/numbers). In Microsoft Word, this toggles the entire list's type. Your proposal adds yet another behavior.

This isn't a big problem per se. However, it'd be nice to be able to choose the behavior. I haven't thought of a solution yet, I wonder if it could be something along the lines of list items inheriting from their previous items, or if it would be something similar to addMark but at the block-level. Either of these has some obvious drawbacks.