evgmik opened this issue 3 years ago
The situation is a bit more complicated. I will try to explain.
I assume you created the 1000 files in the root directory. The reason you see the size increase is that the root directory was again modified in the next commit was copied therefore. This includes the reference to all 1000 + 1 children. This is simply how a Merkle-Tree (or MDAG in our case) works. When creating the 1000 files in a sub-directory and creating newfile
the new commit would be smaller, since we just have to update the root directory and not the big sub-directory. So, this is a pretty pathological example.
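To make the copy-on-write behaviour concrete, here is a minimal Go sketch (toy types, not brig's actual node format) of why a directory node has to be re-serialized with all of its child references whenever any single child changes:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// Dir is a toy MDAG directory node: it stores only the hashes of its
// children, so its own hash changes whenever any child changes.
type Dir struct {
	Name     string
	Children map[string][32]byte // child name -> child hash
}

// Hash derives the node's identity from its (sorted) child references.
func (d *Dir) Hash() [32]byte {
	h := sha256.New()
	h.Write([]byte(d.Name))
	names := make([]string, 0, len(d.Children))
	for name := range d.Children {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		sum := d.Children[name]
		h.Write([]byte(name))
		h.Write(sum[:])
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

func main() {
	// 1000 files referenced directly by the root:
	root := &Dir{Name: "/", Children: map[string][32]byte{}}
	for i := 0; i < 1000; i++ {
		root.Children[fmt.Sprintf("file%04d", i)] = sha256.Sum256([]byte("content"))
	}
	before := root.Hash()

	// Adding one file changes the root's hash, so the whole root node --
	// including all 1001 child references -- must be stored again:
	root.Children["newfile"] = sha256.Sum256(nil)
	fmt.Printf("root changed: %v, now %d references to re-serialize\n",
		before != root.Hash(), len(root.Children))

	// Had the 1000 files lived in a sub-directory, only the tiny root
	// (one sub-directory reference plus newfile) would need re-writing.
}
```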
The point about GC is correct and this is indeed a problem. But the solution is not to store diffs, because that does not make sense for `brig`, which is a tool that stores snapshots. The issue you actually want to solve is that we create so many copies of the big HEAD commit. `git` has this kind of problem too, since it uses the same data structure. Its solution is a special staging area (ever wondered why they did that?) which is not stored in the MDAG, but separately. Changes there just get overwritten and get packed up into the HEAD commit once. But `git` also does not perform perfectly when creating 1k empty files: it stores over 4MB of data in my test case. When additionally creating a new empty file per commit, it grows by roughly 100K per commit. Not as much as we do, but the growth curve will be the same.
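If you want to reproduce that measurement, a rough Go sketch along these lines should do (it assumes `git` is on `$PATH`; the paths, file counts, and the throwaway user config are just for the test):

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// gitSize sums the on-disk size of everything below .git.
func gitSize(repo string) int64 {
	var total int64
	filepath.Walk(filepath.Join(repo, ".git"),
		func(_ string, info os.FileInfo, err error) error {
			if err == nil && !info.IsDir() {
				total += info.Size()
			}
			return nil
		})
	return total
}

// run executes git with a throwaway identity so commits work anywhere.
func run(dir string, args ...string) {
	full := append([]string{"-c", "user.name=test", "-c", "user.email=test@example.com"}, args...)
	cmd := exec.Command("git", full...)
	cmd.Dir = dir
	if out, err := cmd.CombinedOutput(); err != nil {
		panic(fmt.Sprintf("git %v: %v\n%s", args, err, out))
	}
}

func main() {
	dir, _ := os.MkdirTemp("", "git-growth")
	run(dir, "init")

	// 1000 empty files, committed at once:
	for i := 0; i < 1000; i++ {
		os.WriteFile(filepath.Join(dir, fmt.Sprintf("file%04d", i)), nil, 0o644)
	}
	run(dir, "add", "-A")
	run(dir, "commit", "-m", "1000 empty files")
	fmt.Printf("after initial commit: .git holds %d bytes\n", gitSize(dir))

	// One extra empty file per commit; watch the per-commit growth:
	for i := 0; i < 5; i++ {
		os.WriteFile(filepath.Join(dir, fmt.Sprintf("extra%d", i)), nil, 0o644)
		run(dir, "add", "-A")
		run(dir, "commit", "-m", "one more file")
		fmt.Printf("after extra commit %d: .git holds %d bytes\n", i+1, gitSize(dir))
	}
}
```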
Contrary to `git`, we don't have to store indefinite history. We only have to store up to a certain depth at which we still might have a chance to recover the file at that version. I have to say though that the core of `brig` should be robust enough to easily store more metadata than we currently have.
So, what can we do? The root issue is not the size of the serialized commit in the database, it's that we create so many of them during normal operations. `git` solves this with a staging area, which I initially wanted to avoid since it's a special concept that's often a source of confusion to `git` beginners. Back then I opted to make `CURR` an ever-changing commit that reflects what would become `HEAD` when someone types `brig commit`. I think that's a good concept, just the implementation of it sucks (well done @sahib!). I have to do some more research, but right now I see the following solutions:
1. A staging area like `git`'s. I don't really like it, but it's an option. It especially involves touching pretty much... everything. It also makes things harder because sync now has to care about "staged files" and "files in older commits". That's annoying enough in git and does not deserve copying.
2. Compute `Status()` lazily. `core.StageNode()` should not modify the DAG anymore; instead, every modification is written to a special staging-like area in the database. `Status()` would compute a commit object out of that on the fly and present it to the rest of the application (see the sketch after this list). This would trade a little computation and some extra logic for less storage.

All in all I would argue this can wait until 0.7.0, since it's "just" an inefficiency. Long-term we need a solution.
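A very rough sketch of option 2, with made-up types (`NodeUpdate`, the `staging.` key prefix, and the plain maps are illustrative, not brig's real API):

```go
package main

import "fmt"

// NodeUpdate is a hypothetical record of a single pending modification.
type NodeUpdate struct {
	Path string
	Hash string // content hash after the change
}

// DB stands in for the key/value store; brig's real store is more involved.
type DB map[string]NodeUpdate

// StageNode no longer writes into the DAG; it just records the update
// under a staging-like prefix, overwriting any older entry for the path.
func StageNode(db DB, upd NodeUpdate) {
	db["staging."+upd.Path] = upd
}

// Status computes a commit-like view on the fly from HEAD plus the
// staged updates, instead of reading a stored CURR commit.
func Status(head map[string]string, db DB) map[string]string {
	view := make(map[string]string, len(head))
	for path, hash := range head {
		view[path] = hash
	}
	for _, upd := range db {
		view[upd.Path] = upd.Hash // staged change wins over HEAD
	}
	return view
}

func main() {
	head := map[string]string{"/a.txt": "h1", "/b.txt": "h2"}
	db := DB{}

	StageNode(db, NodeUpdate{Path: "/a.txt", Hash: "h1'"})
	StageNode(db, NodeUpdate{Path: "/a.txt", Hash: "h1''"}) // overwrites, no garbage

	fmt.Println(Status(head, db)) // map[/a.txt:h1'' /b.txt:h2]
}
```

The point of the sketch: repeated modifications to the same path overwrite each other in the staging area, so nothing piles up for the GC between commits.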
Hope that brings some clarity.
Thanks. Now I understand why `git` has a staging area. I was always puzzled by its presence. I think this is the nicest explanation of it I have seen.
Now let me repeat back the main point (please correct me if I am wrong):
If I understood it correctly, we follow the `git` way: a commit is a state of the tree, not a change set (as I was naively thinking before you pointed me a while back to a nice reference about it). Diffs are calculated, not stored.
But this way may be the wrong way: `git` has issues with big file trees, but they have an excuse, since it is not designed to be a FS. Still, people abuse `git`'s speed, use it with an insane number of files, and then complain about slowness.
I agree this is a fight for v0.7.0. But let's have a draft of a solution:
1. I also think that a staging area is a bad idea.
2. The lazily computed `Status()` you suggested is a good option.
3. A third option; I think we discussed it in application to `sync`, but maybe it is good to revisit: a commit is a diff (i.e. one of create, remove, move, modify). This way it has a manageable size and very little overhead on construction and storage. To get the current status, we can play forward all diffs and/or have intermediate check points (what we currently call commits) from which we can play forward to get to a state. The new commit is then just the hash of the last diff and the hash of the starting check point, so we can store them in a linked-list format (see the sketch after this list). The downside of this approach is a slower startup (which could be made fast with check points).
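A minimal sketch of that third option (all names hypothetical, not an existing brig API): commits as a linked list of single operations, with occasional full-state checkpoints so that replaying stays cheap:

```go
package main

import "fmt"

// Op is one of the four diff kinds proposed above.
type Op int

const (
	Create Op = iota
	Remove
	Move
	Modify
)

// Diff is one tiny commit: a single operation plus a link to its parent.
type Diff struct {
	Op         Op
	Path, Dest string          // Dest only used for Move
	Parent     *Diff           // previous diff (linked-list format)
	Checkpoint map[string]bool // non-nil: full state snapshot at this point
}

// Replay walks back to the nearest checkpoint, then plays the diffs
// forward to reconstruct the current set of paths.
func Replay(head *Diff) map[string]bool {
	// Collect diffs from head back to the last checkpoint.
	var chain []*Diff
	d := head
	for d != nil && d.Checkpoint == nil {
		chain = append(chain, d)
		d = d.Parent
	}
	state := map[string]bool{}
	if d != nil {
		for path := range d.Checkpoint {
			state[path] = true
		}
	}
	// Play forward, oldest first.
	for i := len(chain) - 1; i >= 0; i-- {
		switch c := chain[i]; c.Op {
		case Create, Modify:
			state[c.Path] = true
		case Remove:
			delete(state, c.Path)
		case Move:
			delete(state, c.Path)
			state[c.Dest] = true
		}
	}
	return state
}

func main() {
	cp := &Diff{Checkpoint: map[string]bool{"/a": true, "/b": true}}
	d1 := &Diff{Op: Create, Path: "/c", Parent: cp}
	d2 := &Diff{Op: Move, Path: "/a", Dest: "/d", Parent: d1}
	fmt.Println(Replay(d2)) // map[/b:true /c:true /d:true]
}
```

Replay cost is then bounded by the checkpoint spacing rather than by the full history length.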
> Now let me repeat back the main point (please correct me if I am wrong): [...]
Both points are correct. I think this drawing illustrates it well:
> If I understood it correctly, we follow the `git` way: a commit is a state of the tree, not a change set (as I was naively thinking before you pointed me a while back to a nice reference about it). Diffs are calculated, not stored.
Also correct. Although it's not the `git` way, even if `git` is the most prominent example of this technique. There are many, many other tools using it, including mercurial, restic, and obviously ipfs. Restic is a good use case since it shows that handling large amounts of files works reasonably well with an MDAG (if you wonder whether it has a staging area: no, because it commits many files at once - it's a backup program). All in all, it's a well-understood technique.
> But this way may be the wrong way: `git` has issues with big file trees, but they have an excuse, since it is not designed to be a FS. Still, people abuse `git`'s speed, use it with an insane number of files, and then complain about slowness.
The reason `git` is slow for big files is that it tries to compute diffs, even for large binary files. The solution there is to use Git LFS, which does something very similar to `brig`: just store pointers to the file content and version those. My point here is that the slowness is not coming from the use of an MDAG, but rather from storing the files themselves in it and diffing them.
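The pointer idea in a nutshell, as a hedged sketch (the field names are made up; this is neither Git LFS's nor brig's actual format): the versioned metadata only records where the content lives and how to verify it, so history stays small no matter how big the file is:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Pointer is what gets versioned instead of the file content itself.
// Comparing two Pointers is a constant-time hash comparison, not a
// byte-by-byte diff of potentially huge binaries.
type Pointer struct {
	Hash [32]byte // content address in the backing store (e.g. IPFS)
	Size int64
}

func makePointer(content []byte) Pointer {
	return Pointer{Hash: sha256.Sum256(content), Size: int64(len(content))}
}

func main() {
	v1 := makePointer([]byte("big binary, version 1"))
	v2 := makePointer([]byte("big binary, version 2"))

	// The history stores two tiny pointers; the MDAG never diffs content.
	fmt.Printf("changed: %v (each pointer is ~40 bytes)\n", v1.Hash != v2.Hash)
	fmt.Println("sizes:", v1.Size, v2.Size)
}
```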
> A third option; I think we discussed it in application to `sync`, but maybe it is good to revisit: a commit is a diff (i.e. one of create, remove, move, modify). This way it has a manageable size and very little overhead on construction and storage. To get the current status, we can play forward all diffs and/or have intermediate check points (what we currently call commits) from which we can play forward to get to a state. The new commit is then just the hash of the last diff and the hash of the starting check point, so we can store them in a linked-list format.
That's roughly what projects like Pijul or Darcs are doing. Both are interesting, but neither found widespread adoption. Darcs especially hit issues with real-world usage, and I still consider Pijul more of a research project. Very interesting, but definitely not as well understood as an MDAG. We might simply run into other problems with this approach.
Also remember that switching to a diff-storage approach would essentially mean a rewrite of large parts of `brig`. A bit much to solve the original problem stated above. :smile:
Aside from the garbage collection issues discussed in #92, we have a bigger problem.
A repo with about 1000 files (with short names) has a commit value size of about 340 kB. Every further commit adds a new commit value of about 340 kB+ to the DB.
So our commits are large and keep growing with the repo size, leading to DB growth that outpaces the size of the stored files themselves.
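A rough back-of-the-envelope, assuming (inferred from the numbers above, not measured separately) roughly 340 bytes of metadata per child reference:

$$
\text{commit size} \approx n_{\text{files}} \times 340\,\mathrm{B},
\qquad
\text{garbage after } k \text{ operations} \approx k \times n_{\text{files}} \times 340\,\mathrm{B}
$$

For 1000 empty files that is ~340 kB of metadata re-written per operation while the file contents contribute nothing, which is why the DB outgrows the stored data.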
There are also additional GC issues: we add large obsolete values to the to-be-GCollected pile, since every `touch` by itself creates a HEAD-type commit (of the above ~340 kB size). It is superseded by the next `touch`, but stays in the DB until GCollected. Also, `commit` re-puts a lot of `tree./filename`-type keys in the DB, as well as `objects.2W9rNbTusHTTf...` keys seemingly related to a filename. So as the number of files grows, the number of keys which are re-put in the DB grows too. Thus we keep stuffing the DB with values waiting to be GCollected. All of this leads to exponential growth of the DB with every file addition, even if we do GC.
We probably need to think about a better commit format which reflects only the difference between states.