sahib / brig

File synchronization on top of ipfs with a git-like interface & web-based UI
https://brig.readthedocs.io
GNU Affero General Public License v3.0

Preventing out of control DB growth #98

Open evgmik opened 3 years ago

evgmik commented 3 years ago

Aside from the garbage collection issues discussed in #92, we have a bigger problem.

A repo with about 1000 files (with short names) has a commit value size of about 340 kB. Doing

ali touch newfile
ali commit -m "msg"

adds a new commit value of roughly 340 kB or more to the DB.

So our commits are large and keep growing with repo size, leading to DB growth that outpaces the size of the stored files.

There are also additional GC issues: we pile up large obsolete values waiting to be garbage collected, since every touch by itself creates a new HEAD-type commit (of the ~340 kB size above). It is superseded by the next touch, but stays in the DB until garbage collected.

A commit also re-puts a lot of tree./filename-type keys into the DB, as well as objects.2W9rNbTusHTTf... keys that seem related to filenames. So as the number of files grows, the number of keys re-put into the DB grows too. Thus we keep stuffing the DB with values waiting to be garbage collected.

All of this leads to DB growth that is far faster than linear in the number of files, even if we do GC: every new file rewrites a commit value whose size is itself proportional to the file count.
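To put rough numbers on it (assuming the ~340 kB commit value above; in reality the value keeps growing as the tree grows, so this understates the problem):

```
   1 single-file change  ×  ~340 kB  ≈  340 kB
  10 single-file changes ×  ~340 kB  ≈  3.4 MB
1000 single-file changes ×  ~340 kB  ≈  340 MB of commit values written
```

So the total amount of commit data written scales roughly with (number of files)², while the useful payload grows only linearly.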

We probably need to think about a better commit format that reflects only the difference between states.

sahib commented 3 years ago

The situation is a bit more complicated. I will try to explain.

So, what can we do? The root issue is not the size of the serialized commit in the database, it's that we create so many of them during normal operations. git solves this with a staging area, which I initially wanted to avoid since it's a special concept that is often a source of confusion for git beginners. Back then I opted to make CURR an ever-changing commit that reflects what HEAD would become when someone types brig commit. I still think that's a good concept, just the implementation of it sucks (well done @sahib!). I have to do some more research, but right now I see the following solutions:

All in all I would argue this can wait until 0.7.0, since it's "just" an inefficiency. Long term we need a solution.
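To make the cost of the current approach concrete, here is a minimal sketch (hypothetical types, not brig's actual data structures) of what an ever-changing snapshot commit does: every touch re-serializes the whole file table into a single DB value.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// Commit models a snapshot-style commit: the whole path -> content-hash
// table is serialized into one DB value. Hypothetical, not brig's format.
type Commit struct {
	Files map[string]string
}

func serialize(c Commit) []byte {
	var buf bytes.Buffer
	gob.NewEncoder(&buf).Encode(c) // error handling omitted for brevity
	return buf.Bytes()
}

func main() {
	curr := Commit{Files: map[string]string{}}
	total := 0
	for i := 0; i < 1000; i++ {
		// Each "touch" rewrites CURR as a full snapshot of the tree.
		curr.Files[fmt.Sprintf("file%04d", i)] = "Qm...contenthash"
		total += len(serialize(curr))
	}
	// Total bytes written grows quadratically with the number of files,
	// even though the final snapshot itself stays comparatively small.
	fmt.Println("total bytes written:", total)
}
```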

Hope that brings some clarity.

evgmik commented 3 years ago

Thanks. Now I understand why git has a staging area; I was always puzzled by its presence. I think this is the nicest explanation of it I have seen.

Now let me repeat back the main points (please correct me if I am wrong):

If I understood it correctly, we follow the git way: a commit is a state of the tree, not a change set (as I was naively thinking before you pointed me to a nice reference on this a while back). Diffs are calculated, not stored.
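In other words (a minimal sketch to check my understanding, assuming a snapshot is just a path → content-hash map; hypothetical code, not brig's):

```go
// diff compares two snapshot states (path -> content hash) and derives
// the change set on the fly; nothing diff-like is kept in the DB.
func diff(prev, curr map[string]string) (added, removed, modified []string) {
	for path, hash := range curr {
		if oldHash, ok := prev[path]; !ok {
			added = append(added, path)
		} else if oldHash != hash {
			modified = append(modified, path)
		}
	}
	for path := range prev {
		if _, ok := curr[path]; !ok {
			removed = append(removed, path)
		}
	}
	return added, removed, modified
}
```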

But this may be the wrong way: git has issues with big file trees, but it has an excuse, since it was not designed to be a filesystem. Still, people abuse git's speed, use it with insane amounts of files, and then complain about slowness.

I agree this is a fight for v0.7.0, but let's have a draft of a solution.

Downside of this approach: a slower startup (which could be made fast with checkpoints).

sahib commented 3 years ago

> Now let me repeat back the main points (please correct me if I am wrong): [...]

Both points are correct. I think this drawing illustrates it well:

[drawing illustrating the two points above]

> If I understood it correctly, we follow the git way: a commit is a state of the tree, not a change set (as I was naively thinking before you pointed me to a nice reference on this a while back). Diffs are calculated, not stored.

Also correct, although it's not just the git way, even if git is the most prominent example of this technique. There are many, many other tools using it, including mercurial, restic and obviously ipfs. restic is a good example since it shows that handling large amounts of files works reasonably well with a MDAG (if you wonder whether it has a staging area: no, because it commits many files at once - it's a backup program). All in all it's a well-understood technique.

> But this may be the wrong way: git has issues with big file trees, but it has an excuse, since it was not designed to be a filesystem. Still, people abuse git's speed, use it with insane amounts of files, and then complain about slowness.

The reason git is slow for big files is that it tries to compute diffs, even for large binary files. The solution there is to use Git LFS, which does something very similar to brig: just store pointers to the file content and version those pointers. My point here is that the slowness does not come from the use of a MDAG, but rather from storing the files themselves in it and diffing them.
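For reference, a Git LFS pointer file is just a tiny text file like the following (oid and size are example values); the repository versions this pointer while the actual blob lives in separate storage:

```
version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 12345
```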

> 3rd. I think we discussed this in the context of sync, but maybe it is good to revisit: a commit is a diff (i.e. one of create, remove, move, modify). This way it has a manageable size and very little overhead for construction and storage. To get the current status, we can replay all diffs and/or keep intermediate checkpoints (what we currently call commits) from which we can replay forward to reach a state. The new commit is then just the hash of the last diff plus the hash of the starting checkpoint, so we can store them in a linked-list format.
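For concreteness, that proposal could look roughly like this (hypothetical Go types, just a sketch of the idea, not a design):

```go
// One commit per change, linked back to its parent and to a checkpoint.
type OpKind int

const (
	OpCreate OpKind = iota
	OpRemove
	OpMove
	OpModify
)

type DiffCommit struct {
	Op         OpKind
	Path       string
	NewPath    string // set only for OpMove
	Hash       string // content hash, for OpCreate/OpModify
	Parent     string // hash of the previous diff commit (linked list)
	Checkpoint string // hash of the snapshot to replay from
}

// replay rebuilds the current state from a checkpoint snapshot by
// applying all diff commits recorded after it, in order.
func replay(snapshot map[string]string, diffs []DiffCommit) map[string]string {
	for _, d := range diffs {
		switch d.Op {
		case OpCreate, OpModify:
			snapshot[d.Path] = d.Hash
		case OpRemove:
			delete(snapshot, d.Path)
		case OpMove:
			snapshot[d.NewPath] = snapshot[d.Path]
			delete(snapshot, d.Path)
		}
	}
	return snapshot
}
```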

That's roughly what projects like Pijul or Darcs are doing. Both are interesting, but neither found widespread adoption. Darcs especially hit issues with real-world usage, and I still consider Pijul more of a research project. Very interesting, but definitely not as well understood as a MDAG. We might simply run into other problems with this approach.

Also remember that switching to a diff-storage approach would essentially mean rewriting large parts of brig. A bit much to solve the original problem stated above. :smile: