mastodon-sc / mastodon-tomancak

BSD 2-Clause "Simplified" License

See if git can be used for collaborative tracking of Mastodon projects #17

Closed maarzt closed 9 months ago

maarzt commented 1 year ago

Is it possible to use git as a backend for mastodon-sc/mastodon-git#12?

Clarify the following questions:

maarzt commented 1 year ago
  • Will git merge / git cherry-pick / git rebase produce a corrupted Mastodon dataset, or is there an error message?

No, they will just complain about conflicting binary files.

  • Is it possible to use a custom merge tool with git?

Yes, gitattributes can be used to do this. https://stackoverflow.com/questions/12356917/how-to-set-difftool-mergetool-for-a-specific-file-extension-in-git

maarzt commented 1 year ago
  • Do we need to be careful about file sizes and repository sizes when versioning Mastodon projects with git?

Git is known to perform poorly on large binary files. The biggest Mastodon file I have seen so far is 50 MB, with roughly 400,000 spots. Storing only 1000 versions of that file without delta compression would produce 50 GB. Saving a copy after every 10 added spots would lead to a theoretical 1 TB. The recommended size for a git repo is less than 1 GB.
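The estimates above can be reproduced with a few lines of arithmetic. This is only a sketch: the 1 TB figure works out if one assumes the file size grows linearly with the number of spots, which is an assumption made here and not something measured. The function names are made up for illustration.

```python
MB = 10**6  # decimal megabytes, as used in the estimates above


def fixed_size_history(file_size_bytes: int, versions: int) -> int:
    """Total bytes when every stored version has the same size (no delta compression)."""
    return file_size_bytes * versions


def growing_history(final_size_bytes: int, total_spots: int, spots_per_version: int) -> int:
    """Total bytes when a full copy is stored after every `spots_per_version` spots.

    Assumption: file size is proportional to the number of spots.
    """
    versions = total_spots // spots_per_version
    return sum(
        final_size_bytes * k * spots_per_version // total_spots
        for k in range(1, versions + 1)
    )


print(fixed_size_history(50 * MB, 1000) / 10**9, "GB")      # 1000 copies of a 50 MB file
print(growing_history(50 * MB, 400_000, 10) / 10**12, "TB")  # a copy every 10 spots
```

Under the linear-growth assumption the average stored version is about 25 MB, and there are 40,000 versions, which is where the roughly 1 TB comes from.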

Also, the maximum file size on GitHub is 100 MB, so we will soon reach this limit. There exist several solutions for storing large files in git:

The Mastodon file format is not friendly with regard to delta compression. I did an experiment: I opened a large dataset, saved it to a.mastodon, added a spot, and saved it to b.mastodon. I then uncompressed the Mastodon files and compared the model.raw file between the two. More than 100,000 bytes differ between those two files. I would expect maybe 1000 bytes. My conclusion, also verified by comparing the two files with vbindiff:

[image]

Most bytes that differ between the two files are probably indices and not actual data. That is likely because Mastodon uses ObjectOutputStream to save the graph.
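The experiment above could be repeated with a small script along these lines. It is a rough sketch that assumes a .mastodon file can be opened as a zip archive containing a member named model.raw; the helper names `read_member` and `count_differing_bytes` are hypothetical and not part of Mastodon.

```python
import zipfile


def read_member(archive_path: str, member: str = "model.raw") -> bytes:
    """Read one member of a zip archive into memory.

    Assumption: the project file is a zip archive containing `member`.
    """
    with zipfile.ZipFile(archive_path) as archive:
        return archive.read(member)


def count_differing_bytes(a: bytes, b: bytes) -> int:
    """Count byte positions at which `a` and `b` differ.

    Bytes beyond the end of the shorter input are counted as differing.
    """
    common = min(len(a), len(b))
    differing = sum(1 for x, y in zip(a[:common], b[:common]) if x != y)
    return differing + abs(len(a) - len(b))


# Example usage, with the file names from the experiment above:
# print(count_differing_bytes(read_member("a.mastodon"), read_member("b.mastodon")))
```

Note that this is a purely positional comparison, like a side-by-side hex view: a single inserted byte shifts everything after it and inflates the count, so the result is an upper bound on what a real delta encoder would have to store.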

Conclusion: The Mastodon storage file format could be hugely improved in terms of "delta compression friendliness". Splitting the files into blocks would further reduce the load on git. We wouldn't even need to use git LFS.

Is a specialized Mastodon file format required?

A specialized file format would greatly reduce bandwidth and storage requirements, as well as the need for git LFS. It would probably also improve performance and offline availability.

SVN is another alternative, but it has the drawbacks of being centralized and not offline capable. It also takes a slightly different approach to branching.

maarzt commented 1 year ago

Open question: