Define git merge compatible storage format

martin-lysk commented 4 weeks ago

MESDK-114 See loom for more infos: https://www.loom.com/share/2927ef8417a4456589f8e2a5fc0a186a?sid=100077c8-01aa-4c83-9417-44ba577046d8

Term definition: A record is an atomic entity in a database. Operations on such an entity can happen sequencialy but not in parallel. Updates in parallel should raise a conflict to the application level.

What represents an entity and should be atomic needs to be carefully defined during the design of the data model.

In terms of inlang an atom should most likely be a message:

A change in one part of a message, like editing/removing or adding a variant can usually not happen independently. Changes on separate languages - namely spoken, in separate messages - often do not relate.

An ideal storage format should therefore full fill the following properties:

1. A Merge of edits in different records should not lead to conflicts

Example:

User Alice changes recordA

 recordA.name = "Max Mustermann"

User Bob changes recordB

 recordB.name = "Maria Musterman"

Should be merged on database level without conflicts.

In inlang:

If one translator changes a variant in spanish and another changes a variant in german in the same message bundle auto merge should be possible - no conflict expected

2. Parallel creation of records should not lead to Conflicts

Example:

User Alice creates recordA

 recordA = new User("Max Mustermann")

User Bob creates recordB

 recordA = new User("Maria Mustermann")

we expect no conflicts on merge

In inlang:

Two translators a german and an english one should be able to add a message without a conflicts.

3. A deletion and a creation happening in parallel should not lead to conflicts

In inlang:

if a translator deletes a message (german one) while another inserts a spanish one

4. Parallel edits should lead to a conflict - with conflict marker on the specific record!

4.1. It should raise a conflict when editing the same property

Examples

User Alice changes the name property on recordA and recordB

recordA.name = "Max Moritz Mustermann"
recordB.name = "Maria Magdalena Musterman"

User Bob changes recordA and recordB on the same property

 recordA.name = "Max, Mustermann"
 recordB.name = "Maria, Musterman"

we expect two conflicts, one on recordA and one on recordB

4.2. It should raise a conflict when editing different properties

User Alice changes name on recordA

recordA.name = "Max Moritz Mustermann"

User Bob changes age (a different property) on recordA

 recordA.age = 25

We expect one conflict, on recordA - needs to be resolved by the application level
If two messages are edited in parallel each message should have its dedicated conflict marker

5. A conflicting file, merged by the default git tools, should be readable by our tools

If a merge leads to a conflict application level needs to be able to handle it. A conflict in json will lead to an unreadable file:

 {
    "id": "crazy_frog_jumping",
    "alias": {},
    "messages": [
      {
<<<<<<< ./left.txt
        "locale": "de",
=======
        "locale": "es",
>>>>>>> ./right.txt
        "declarations": [],
        "selectors": [],
        "variants": [
          {
            "match": [],
            "pattern": [
              {
                "type": "text",
<<<<<<< ./left.txt
                "value": "Guten Morgen"
=======
                "value": "Buenas dias"
>>>>>>> ./right.txt
              }
            ]
          }
        ]
      }
    ]
}

- if message file conflicts because of simulatanous edits fink should still function

Proposal

A first draft of a format definition:

Screenshot 2024-05-29 at 15.09.25.png

jldec commented 4 weeks ago

@martin.lysk1 Is there a difference between a conflict in vanilla git, and a conflict in lix? If so, do the requirements listed above apply to git conflicts or lix conflicts?

martin-lysk commented 4 weeks ago

lix conflicts and git conflicts are curently the same.

jldec commented 4 weeks ago

I think this relates to @jan.johannes feedback in this thread. If lix can remove any unwanted conflicts and find all missing conflicts automatically, then you could argue that the new format is not required in order to meet the requirements listed above.

samuelstroschein commented 3 weeks ago

Seeking alignment. Different paths exists to "solve" merge conflicts:

Ignore lix and build the inlang file format for git
1. risk of developing custom inlang merging solutions that don't generalize
Push merge resolution into lix already via api calls
1. risks merge conflicts if data is merged via git cli/outside the control of lix
Wait with merge conflict resolution (for non devs) until we have lixnet
1. degraded UX for everyone involved (devs, translators, designers)

I think I favor option 1 (ignore lix for now, design for git) while trying to delay "app scoped" merge resolution in inlang until users run into the problem.

The other options seems too early. Having a generalizable lix merge api seems out of scope as long as we have to be git compatible (aka as long as lixnet doesn't exist and is widely adopted).

@jan.johannes @martin.lysk1 @jldec do we have alignment for option 1 (basically what martin is exploring here)?

jldec commented 3 weeks ago

@samuel.stroschein I agree with the overall strategy (1.) of making the inlang file format work nicely with vanilla git.

My concern is that it feels like we are jumping directly into a new non-JSON format when a simpler JSON-based solution might cover many of the requirements above, like avoiding merge conflicts in most cases, and triggering merge conflicts in those cases we want.

You mentioned:

while trying to delay "app scoped" merge resolution in inlang until users run into the problem.

Do I interpret ☝️ correctly, that (for now) we should try to rely on developers to use available git merge conflict resolution tools, instead of extending the SDK to parse files with merge conflicts, and sending conflicts to the Fink UI?

If the simple solution doesn't work, I'll be fine exploring new formats as well, but I'd prefer if we validated that first. To help with that, I've adjusted the v2-persistence serializer to add linefeeds between bundles and messages in the JSON, so we can try out how that works with git merge conflict detection and resolution. The change is in the PR, screenshot added below (using GitHub)

Screenshot 2024-05-30 at 14 51 52

samuelstroschein commented 3 weeks ago

@jldec uh i like the idea of combining @martin.lysk1 idea with plain json but it seems like @martin.lysk1 approach of a file format that stores inline json is the way to go:

how do you deal with merge markers from git that automatically break json? Any JSOn containing merge markers in invalid json

{ 
<<<<<<<
  "hello" : "world"
-------
-------
  "hello" : "jurgen"
>>>>>>>
}

samuelstroschein commented 3 weeks ago

Do I interpret ☝️ correctly, that (for now) we should try to rely on developers to use available git merge conflict resolution tools, instead of extending the SDK to parse files with merge conflicts, and sending conflicts to the Fink UI?

Not quite. I am saying we should wait until users run into merge conflicts to see how we should design merge resolution [and not think about it now (except for minimizing merge conflicts in the first place)]

martin-lysk commented 3 weeks ago

I understand introducing another parser for a format and not storing json directly sounds a bit far off. I did want to solve the properties with normal json as well and also tinkered around with various kinds of json formatting variants like identation linefeeds and eventually ended up in git's source and i doubt we can tackle this with plain json.

If you want to tinker around and give it a further run i found it usefull to run gits merge directly on files in your cli instead of fighting json stings with replace operations.

Create a files like:

base.txt

``` { "id": "crazy_frog_jumping", "alias": {}, "messages": [ { "locale": "en", "declarations": [], "selectors": [], "variants": [ { "match": [], "pattern": [ { "type": "text", "value": "Good morning" } ] } ] } ] } ```

left.txt (translator adds an 'de' message)

``` { "id": "crazy_frog_jumping", "alias": {}, "messages": [ { "locale": "en", "declarations": [], "selectors": [], "variants": [ { "match": [], "pattern": [ { "type": "text", "value": "Good morning" } ] } ] }, { "locale": "de", "declarations": [], "selectors": [], "variants": [ { "match": [], "pattern": [ { "type": "text", "value": "Guten Morgen" } ] } ] } ] } ```

right.txt (translator adds an 'es' message)

``` { "id": "crazy_frog_jumping", "alias": {}, "messages": [ { "locale": "en", "declarations": [], "selectors": [], "variants": [ { "match": [], "pattern": [ { "type": "text", "value": "Good morning" } ] } ] }, { "locale": "es", "declarations": [], "selectors": [], "variants": [ { "match": [], "pattern": [ { "type": "text", "value": "Buenas dias" } ] } ] } ] } ```

and just run git merge-file -p ./left.txt ./base.txt ./right.txt

The three JSON's also outline one of the key properties we will need to solve here:

Parallel creation of records should not lead to Conflicts

This is a false positive that we should not have to deal with and we should no get merge conflicts with.

Please also comment the properties i outlineds in the top - i expect this list to be NOT complete yet...

jldec commented 3 weeks ago

how do you deal with merge markers from git that automatically break json?

For now, the sdk won't load broken JSON - If a merge was initiated locally, and the inlang messages file contains merge markers, then tools like paraglide would should show an error - and someone needs to resolve using non-inlang tools before the final merge can complete and the changes appear on the branch.

I'm not sure how fink users could get into this situation - if at all?

My gut says that if we do a good job avoiding merge conflicts, then this situation doesn't feel too unreasonable and other more-commonly used features like comments would add more value

martin-lysk commented 3 weeks ago

I'm not sure how fink users could get into this situation - if at all?

Its multiple properties (see description) but regarding conflicts the one that freaks me out the most is number 2.

Parallel creation of records should not lead to Conflicts

Its not fink that i am worried about - its parglide, and sherlock and devs that have to solve conflicts for translators - that should actually be no merge conflicts.

the result of the json's i pasted earlier looks like this, added '\n\n\n\n' won't change this.

{
    "id": "crazy_frog_jumping",
    "alias": {},
    "messages": [
      {
        "locale": "en",
        "declarations": [],
        "selectors": [],
        "variants": [
          {
            "match": [],
            "pattern": [
              {
                "type": "text",
                "value": "Good morning"
              }
            ]
          }
        ]
      },
      {
<<<<<<< ./left.txt
        "locale": "de",
=======
        "locale": "es",
>>>>>>> ./right.txt
        "declarations": [],
        "selectors": [],
        "variants": [
          {
            "match": [],
            "pattern": [
              {
                "type": "text",
<<<<<<< ./left.txt
                "value": "Guten Morgen"
=======
                "value": "Buenas dias"
>>>>>>> ./right.txt
              }
            ]
          }
        ]
      }
    ]
}

martin-lysk commented 3 weeks ago

Addition:

For now, the sdk won't load broken JSON

A reason why I like the outcome of the research so far more and more - it doesn‘t only load in conflicting state - it even gives the conflict marker a meaning within the format.

samuelstroschein commented 3 weeks ago

Its not fink that i am worried about - its parglide, and sherlock and devs that have to solve conflicts for translators - that should actually be no merge conflicts.

This is a good argument.

For things that are no merge conflict, no merge conflict should arise. If we need a custom file format for it, we probably should do it as two fink users shouldn't face a merge conflict that needs to be resolved by a dev for something that shouldn't be a merge conflict anyways.

Handing off "real" merge conflicts for developers to solve until we know how we are going to tackle merge conflicts i.e. in inlang apps or lix, is also reasonable.

jldec commented 3 weeks ago

a file format that stores inline json is the way to go:

@samuel.stroschein why do you think this is the way to go?

Wrapping bits of JSON inside a non-JSON file format implies that we lose the ability to leverage all the tools which already understand JSON to operate directly on stored files.

Little things that we take for granted like validation, syntax highlighting, jq, vscode language support, fast parsers in browsers etc. now will stop working or require that we run files through a preprocessor first, to extract the fragments.

To me, that seems like a high price to pay when we could solve issues like the lack of slots for untranslated messages in other ways, e.g. by adding those slots into the persisted JSON structure as part of normalization. (I'll push another change to demo this 🙂)

samuelstroschein commented 3 weeks ago

@samuel.stroschein why do you think this is the way to go?

Wrapping bits of JSON inside a non-JSON file format implies that we lose the ability to leverage all the tools which already understand JSON to operate directly on stored files.

If we don't find a way to make the merge conflict scenarios like 2 (Parallel creation of records should not lead to Conflicts) work with plain json, a custom format that stores json is the way to go.

the value of avoiding merge conflicts seems higher than syntax highlighting, vscode language support, etc. people shouldn't edit internal files anyways.

@jldec For now, the sdk won't load broken JSON - If a merge was initiated locally, and the inlang messages file contains merge markers, then tools like paraglide would should show an error -

the sdk should load files with merge markers if no merge conflict should have been raised (scenario 2 again, MESDK-114). a mix between "plain json" and fallback to a custom parser with merge markers could be the solution?

@jldec @martin.lysk1 can you research if a a json format is possible that avoids merge conflicts like scenario 2?

samuelstroschein commented 3 weeks ago

actually……. a .json ending that external tooling is operating on is out question anyways to minimize merge conflicts: external formatters/tooling will mess up the formatting of the json, leading to merge conflicts. we had this multiple times already.

(it can be a json under the hood but must not end with .json like ._json for example)

jldec commented 3 weeks ago

The latest commit in the v2-persistence PR stores JSON with spaces between messages and adds "slots" for untranslated langagues e.g.

[

{"id":"message_key_1","alias":{},"messages":[

{"locale":"da","slot":true},

{"locale":"de","slot":true},

{"locale":"en","declarations":[],"selectors":[],"variants":[{"match":[],"pattern":[{"type":"text","value":"Generated message (1)"}]}]},

{"locale":"es","slot":true},

{"locale":"fr","slot":true}]},

{"id":"message_key_2","alias":{},"messages":[

{"locale":"da","slot":true},

{"locale":"de","slot":true},

{"locale":"en","declarations":[],"selectors":[],"variants":[{"match":[],"pattern":[{"type":"text","value":"Generated message (2)"}]}]},

{"locale":"es","slot":true},

{"locale":"fr","slot":true}]}]

This format appears to prevent merge conflicts when multiple language translations are merged at the same time. Test files in the zip if you want to try.

temp.zip

martin-lysk commented 3 weeks ago

…sorry for the long reponse
TLDR; I propose to invest two days into the slot based storage for atoms and build a prototype using plain JSON. Again: Please let me know if i miss some important aspects

With the original proposal in the grapic above, I took inspiration form packfile format - that can be read comparable fast and allows to seek inside of the stream. But I consider this as a premature optimization (and therefore may even into the wrong direction).

With your input @jldec we can also model a pure json version of the slot file without the binary parsing that I described earlier.

[

{ "slotBegin": "1", "data":
{ "id": "crazy_frog_jumping" "name": "Max", "lastname": "Mustermann" },
"slotEnd": "1" },

{ "slotBegin": "2", "data":
null
"slotEnd": "2" },

{ "slotBegin": "3", "data":
null
"slotEnd": "3" },
...
null]

One important aspect of the slot approach - I didn't outlined in the video yet - is the separation of schema stored inside of the slot and the storage/slot managment themself.

Instead of using language tags to pre populate slots i would pre populate slots for objets of any kind - we would have to define on what level we want the file be mergable without conflicts in other words - at what level bundle/message/variants we want parallel inserts to work.

Why does this matter? Lets take the current iteration on the json here - I know this is just for tinkering around @jldec - but it outlines the problem pretty well:

The structure currently models multiple message in one file and uses slots for languages inside of messages:


[

{"id":"message_key_1","alias":{},"messages":[

{"locale":"da","slot":true},

{"locale":"de","slot":true},
....

This has two problems:

In case you add a new language to the project you would need to add a new language slot to every message and we would face the same merging issues again.
Adding a new message (two tanslator case) in parallel will lead to merge conflicts

@jan.johannes pointed to the fact that - in conflict state we can load the orignal left and right versions that don't contain conflict markers via lix and do the conflict resolution with two valid json files on application level. We will have to inform lix about our conflict resolution anyway to get the git index out of conflict state anyway - spinning another loop to get the left and right file states and marking those atoms as conflicting state in the application state (SDK) sounds way better than parsing conflict markers.

The central question for this topic is:

When is it ok to get into conflicting state in git because of inlang files.

So lets see when this could happen:

Only when the messages are changed:

Fink is open, translator has some uncommited changes, pulls the last state and a parallel edit that conflicts happend - sdk is running and can take care of the conflicts (using the left and right state from lix)
Dev edited messages in sherlock (so he use our tool in is his local env already!) and edited a message, pulls and gets a conflict - easy he uses the sdk has to open sherlock and sherlock should support him to fix it
A translator edits something, pushes changes on a separate branch - main or target repos state changed because someone else edited the same message. The pr in git hub detects a merge confilct - guess we would need to handle this in fink - link fink in the pr, allow to merge main in - getting into conflict state and fix the conflicts in fink.

To specify when conflicts are expected and when we have to avoid them we should think about Atomicity

Atomicity - MessageBundle vs. Message vs. Variant

Properties of an Atom

inserts of a new object in parallel should not lead to a conflict
parallel edits in distinct atoms should not lead to conflicts
parallel edits within one atom must raise a conflict (set git index into conflicting state) - to be solved within an inlang app

There are two ways to reach atomacity with the usual git client - separate topic just to mention it:

each atom can be stored within one file
an atom can be stored among other atoms within a slot file

Now that we defined properties and conflict behavior we can look at the our MessageBundle, Message, and variant and see what should be an Atom.

Atomicity on Message Bundle

Property 1
(inserts of a new object in parallel should not lead to a conflict)

The initial creation of a message bundle should not lead conflicts

Property 2
(parallel edits in distinct atoms should not lead to conflicts)

Edits in two independet message Bundles should not lead to conflicts

Property 3
(parallel edits within one atom must raise a conflict (set git index into conflicting state) - to be solved within an inlang app)

Edits on metadata (aliases, parameter order) should lead to a conflicts.
But Edits on distinct Messages (locales) should be fine.

Atomacy on Message

Property 1
(inserts of a new object in parallel should not lead to a conflict)

Yes, adding a new language - Message - to a MessageBundle should work, duplicate in languages should can dealt with by lint

Property 2
(parallel edits in distinct atoms should not lead to conflicts)

Yes, two translators should be able to work in different languages in parallel

Property 3
(parallel edits within one atom must raise a conflict (set git index into conflicting state) - to be solved within an inlang app)

Yes, crud opertations on one variant usually affects the meaning of other variants within the same messages. having two translators working in the same message should not be merged automatically and raise a conflict

Atomacy on Variants

Would say no - do anybody see a reason not to conflict when variants are edited in parallel?

With atomicty on Message Level we can do the math on a project like cal.com to check out our storage options:

\~2000 k messages times 32 Languages

File based atomicity:

If we store one file per atom we would store \~64k files

does lix especially lazy loading and git cli scale for such a case?
How does github behave on a commit when one adds a new language using maschine translate - 2000k files.

Slot based atomicity:

With a slot files we could distribute the messages over \~32 files (with 1000k slots each)

samuelstroschein commented 3 weeks ago

@jan.johannes good. then we are all aligned that we "ignore lix for now and design the inlang storage format for git" for now.

jldec commented 3 weeks ago

Agree. 👍
Looking forward to seeing a nice clean outcome for users that minimizes merge conflicts.
Also looking forward to the day when we get change tracking for a DBMS like sqlite instead..
In the meantime let's keep things simple 😎

jldec commented 3 weeks ago

example of sqlite-as-a-file usage in the wild
https://youtu.be/QiocnnlcXIU?si=s2A7jCmsj4ba0mF5&t=71

jldec commented 3 weeks ago

@jan.johannes in this case they treat the sqlite file as a build artifact, but I don't think it would be a huge leap for them to treat it as source as well, if we gave them an easy, git-aligned way to track changes to content inside the file.

I thought it was interesting because it treats a database like just another file, instead of the usual long-lived db as-a-service thing.

samuelstroschein commented 3 weeks ago

I watched the latest loom https://discord.com/channels/897438559458430986/1084748995349446667/1247896920966303814.

@martin.lysk1 did you do research on how to build an index? Looping over all possible files and message, potentially even variants, to perform CRUD seems to be O(n^2) or even O(n^3). Seems like we need indices. So yes, we are building a git optimized database :/ But it seems to be the easiest way forward. So, I am OK with it until lix matures.

opral / inlang-message-sdk