sosuisen / git-documentdb

Offline-first Database that Syncs with Git
https://gitddb.com/
Mozilla Public License 2.0

What does the data format look like? #36

Open chmac opened 1 month ago

chmac commented 1 month ago

Just discovered this by searching for "git crdt". I've been dreaming about this kind of git based approach to data storage for some time now, great to discover somebody has already solved many of the challenges!

What does the data look like inside the git repo? Is there any kind of example repo that shows the data somewhere?

sosuisen commented 1 month ago

Hello!

This is a sample repository. https://github.com/sosuisen/sosuisen-my-inventory/tree/a7f05b51348746664cc185c78c3f7341343fb0b2

There is only one leaf. Commits and merges are made under special constraints.

[Image: serial_consistency_constraints_rightside]

Please try a quick example. https://github.com/sosuisen/git-documentdb/tree/main/examples

An example of synchronization is "sync":

  $ npm run sync

The source of this example is heavily commented and walks through the flow of the process: https://github.com/sosuisen/git-documentdb/blob/main/examples/src/sync.ts

The theory is described below: https://gitddb.com/blog/is-git-crdt https://gitddb.com/blog/how-to-use-git-as-an-offline-database

The basic idea is simple, but there are many practical problems that make it difficult.

chmac commented 1 month ago

Thanks for all the links. I've read the blog post and I generally understand the concept. I also looked over the example sync file. But I'm still not clear on what the box, item, and work folders contain. Can you shed any light on that? Or is there some other code I can read to understand what is stored in these folders?

For additional context, by now I have built several data storage projects on top of git. I'm quite familiar with CRDTs and the theory behind all this. I'm really just curious about how you have solved the specific issues between git commits, merges, and so on. For example, one approach would be to simply store CRDTs in git, each in a separate file, named with a UUID, and then parse them all every time. This should guarantee no merge conflicts. But I imagine there are better strategies than this, and I assume you're doing something more elegant than this. :-)

sosuisen commented 1 month ago

box, item, and work are sub-directories. In git-documentdb, sub-directories are treated as collections that categorize documents.

They are data for an app called inventory-manager. Below, an instance of a collection named item is created and assigned to itemCollection.

https://github.com/sosuisen/inventory-manager/blob/0a52f24ed54966b8d66f605a778f922c476980cf/src/main.ts#L166-L167

  boxCollection = await inventoryDB.collection('box', { namePrefix: 'box' });
  itemCollection = await inventoryDB.collection('item', { namePrefix: 'item' });

New documents are added to itemCollection using the put method.

https://github.com/sosuisen/inventory-manager/blob/0a52f24ed54966b8d66f605a778f922c476980cf/src/main.ts#L415-L420

        itemCollection
          .put(command.data, {
            enqueueCallback: (taskMetadata: TaskMetadata) => {
              resolve(taskMetadata);
            },
          })

The documents under itemCollection are typed with TypeScript. https://github.com/sosuisen/inventory-manager/blob/0a52f24ed54966b8d66f605a778f922c476980cf/src/modules_common/store.types.ts#L23-L29

 export type Item = {
  _id: string;
  name: string;
  takeout: boolean;
  created_date: string;
  modified_date: string;
};

git-documentdb itself does not have a schema, but the documents under the item directory look typed because the app defines them with a TypeScript type. Each document is stored as one JSON file:

https://github.com/sosuisen/sosuisen-my-inventory/blob/a7f05b51348746664cc185c78c3f7341343fb0b2/item/id-1ikt_VnC5dUgd-kNuZ3_.json

 {
  "created_date": "2021-02-14 09:38:22",
  "modified_date": "2021-02-14 09:38:22",
  "name": "黒田日出男, 洛中洛外図・舟木本を読む",
  "takeout": false,
  "_id": "id-1ikt_VnC5dUgd-kNuZ3_"
}
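
As a rough illustration of that layout, the sketch below derives the file path for a document from its collection name and `_id`. `documentPath` and `JsonDoc` are hypothetical names made up for this example, not part of the git-documentdb API; the only thing taken from the sample repository is that the document above lives at `item/id-1ikt_VnC5dUgd-kNuZ3_.json`.

```typescript
// Hypothetical helper, NOT part of the git-documentdb API.
// Assumption: one ".json" file per document, named after its _id,
// stored in a sub-directory named after the collection.
type JsonDoc = { _id: string; [key: string]: unknown };

function documentPath(collectionName: string, doc: JsonDoc): string {
  return `${collectionName}/${doc._id}.json`;
}

// The sample document shown above.
const sample: JsonDoc = {
  created_date: "2021-02-14 09:38:22",
  modified_date: "2021-02-14 09:38:22",
  name: "黒田日出男, 洛中洛外図・舟木本を読む",
  takeout: false,
  _id: "id-1ikt_VnC5dUgd-kNuZ3_",
};

console.log(documentPath("item", sample));
// item/id-1ikt_VnC5dUgd-kNuZ3_.json
```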

I hope this is helpful.

chmac commented 1 month ago

Thanks for taking the time to answer my questions, I appreciate it.

Unfortunately, I'm still not getting it. I see your examples. They seem to be examples of how to use the API. My goal is to understand how the API results in changes on disk.

Have you written a custom git merge? So that merges are always successful? I saw that there's a "last write wins" approach. But I'm not clear how these API calls result in git commits, and how those commits get merged.

For example, could I break the document db by pulling it in a terminal, editing some files, committing and pushing?

I see the work folder which has just 1 file, user01.json. If I update the user object, will that file be updated with the latest data?

Maybe I've totally misunderstood this project. I thought it was an implementation of CRDT like functionality on top of git. So I imagined that I could create a collection, push to it from multiple machines, and then I could use a standard git merge, be guaranteed there were no conflicts, and then my data would be updated. But maybe this is incorrect?

sosuisen commented 1 month ago

Indeed, there might be some misunderstandings.

This project is a wrapper around Git, but it imposes specific restrictions so that the repository behaves as a CRDT.

First, CRDT cannot be achieved by simply using Git freely; certain rules are necessary. In this project, I describe the rules that humans typically use to assess and operate Git synchronization with the goal of achieving CRDT. This project automates Git by implementing these rules, including conflict checking. If a conflict arises, an automatic resolution is performed.

All possible scenarios in Git synchronization are documented as rules. You can find rules 1 to 4 here: How to Use Git as an Offline-First Database (https://gitddb.com/blog/how-to-use-git-as-an-offline-database).

Second, this project aims to utilize Git through the API of a document database. Therefore, only JSON or Markdown with Front Matter can be stored in the Git repository.

Members within a JSON document are evaluated for merging individually. The 3-way merge of JSON arrays and objects is implemented using ot-json1 (this will probably confuse you even more...).

For lengthy texts, a character-level merge strategy is employed, generating a single text that combines both lengthy texts.

In the event of conflicts, the default strategy is "last write wins."
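
To make the member-by-member idea concrete, here is a minimal sketch of a 3-way merge over flat JSON documents. It is an illustration under stated assumptions, not git-documentdb's implementation: it does not use ot-json1, it only handles primitive members, and `mergeMembers`, `FlatDoc`, and `Strategy` are hypothetical names. The `onConflict` argument stands in for "last write wins"; deciding which side was written last is left to the caller.

```typescript
// Hypothetical sketch, NOT git-documentdb's merge code.
type Json = string | number | boolean | null;
type FlatDoc = { [key: string]: Json | undefined };
type Strategy = "ours" | "theirs";

// Merge each member on its own, using the common ancestor (base)
// to tell which side actually changed it.
function mergeMembers(
  base: FlatDoc,
  ours: FlatDoc,
  theirs: FlatDoc,
  onConflict: Strategy
): FlatDoc {
  const merged: FlatDoc = {};
  const keys = Object.keys({ ...base, ...ours, ...theirs });
  for (const key of keys) {
    const b = base[key];
    const o = ours[key];
    const t = theirs[key];
    let value: Json | undefined;
    if (o === t) {
      value = o; // both sides agree (or both deleted the member)
    } else if (o === b) {
      value = t; // only theirs changed it
    } else if (t === b) {
      value = o; // only ours changed it
    } else {
      value = onConflict === "ours" ? o : t; // real conflict
    }
    if (value !== undefined) {
      merged[key] = value; // undefined means the member was deleted
    }
  }
  return merged;
}

const base: FlatDoc = { name: "Book", takeout: false, modified_date: "2021-02-14 09:38:22" };
const ours: FlatDoc = { name: "Book", takeout: true, modified_date: "2021-02-15 10:00:00" };
const theirs: FlatDoc = { name: "Book (2nd ed.)", takeout: false, modified_date: "2021-02-16 08:30:00" };

// Assume theirs was written last, so it wins real conflicts.
const result = mergeMembers(base, ours, theirs, "theirs");
console.log(result);
```

Here `name` comes from theirs and `takeout` from ours, because only one side changed each of them. `modified_date` was changed on both sides, so it is a real conflict and the chosen strategy picks the value from theirs.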

Finally, git-documentdb is an API. Therefore, it must be incorporated into your program. The put API creates a document and commits changes simultaneously. The synchronization timing is determined by the program you created. You can manually modify files in the repository, but you must commit them by hand. The commits are automatically pushed when the program calls the synchronization API.

chmac commented 1 month ago

Okay, now I understand better, thanks.

This package must handle all the merging so that it can apply its own merge logic. I see. The reference to ot-json1 is helpful, I guess your API results in creating ot-json1 operations which in turn get applied to whatever is in the JSON that's stored in git.

So the real history of what has changed is visible inside git also. It's an interesting approach. I had assumed that handling merges would be too complex when I designed my systems, but perhaps this was a mistake. I will contemplate further on this.

Thanks again for taking the time to explain the code. I'm really fond of storing my personal data inside git, I find it to be a great tool for transparency, easily self hosted, and so on. I'll consider if moving some of my applications onto this package makes sense. Or maybe building new ones with it.

sosuisen commented 1 month ago

I also love managing personal data with Git. I would be delighted if you used my project, and I also welcome you to create something new!