Repository size will grow too much due to compressing textual files

jstammi commented 2 years ago

Git is different from other version control systems: when you clone a repository, you always obtain the complete repository, including all commits/versions of all files it has ever contained. Git never forgets. This is in contrast to e.g. svn, where you obtain only a snapshot of all files at a chosen revision.

Git internally tries to store only the delta of a file to its previous version. This fails to a very high degree for pre-compressed files. Consequently, the repository will grow every day by the size of the XZ files, currently about 65M per day, and the repo will become very unwieldy within a few weeks.

Rule of thumb for git repositories (afaik): avoid frequently changing binary files.

For the csv files, one should simply remove the compression: as only single lines change per day, this would result in very low growth of the repository.

I guess the lines of data within the .fasta file will never change again. If so, one should split them into files on a per-day basis, or into a directory per day with one file per IMS_ID. And as the data does not change again, one can keep its compression.

cuehs commented 2 years ago

Hi @jstammi, the current configuration is a trade-off between ease of use and repository size. Also, git stores snapshots, not diffs (https://willi.am/blog/2014/10/14/for-the-last-time-git-stores-snapshots-not-diffs/).

If you are concerned about the repo size on your device, I suggest making a shallow clone with git clone --depth=1. A word of warning: the fasta file is around 15 GB without compression.

corneliusroemer commented 2 years ago

I think what @jstammi says is not correct. I had these concerns at first too, but if you check the sizes of the commits, they are very small. xz compression seems to be deterministic and/or incremental, meaning earlier blocks don't change.

See the recent commit linked below; the changes are small. If this weren't the case, the current configuration would already have broken, with ~100MB being pushed every day for 5 months.

https://github.com/robert-koch-institut/SARS-CoV-2-Sequenzdaten_aus_Deutschland/commit/76e4e0cd88888d4501150ef9a5f64ae4653b3e4c

This issue can thus be closed to keep things light ;)

jstammi commented 2 years ago

The commits of the xz files are small only as long as you ONLY append lines to them (to the uncompressed content, of course). In that special case most of the binary does indeed stay stable (only the trailing bytes of the previous version change, most probably some footer data). I was not aware of this before and did not check thoroughly enough before reporting, sorry about that.

The moment that ONE existing byte changes, the complete binary xz file changes and its full size is added to the repository size. As far as I can see, this is independent of whether the change happens at the beginning or the end of the file - at least that is what my quick test showed, uppercasing a char early vs late in the csv **).
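The append-versus-modify difference described above can be reproduced outside git. What follows is only a minimal sketch, assuming Python's standard lzma module (which writes the .xz format) and synthetic CSV-like data; the column names, row counts and helper names are made up for illustration, and the sketch compares compressed bytes only, not what git's packing does with them.

```python
import lzma
import random


def make_rows(n, seed):
    """Build n synthetic CSV rows (hypothetical columns, not the real data)."""
    rng = random.Random(seed)
    return "".join(
        f"IMS-{i:07d},{rng.getrandbits(128):032x},B.1.{rng.randrange(1000)}\n"
        for i in range(n)
    ).encode()


def common_prefix_len(a, b):
    """Number of leading bytes that two byte strings share."""
    limit = min(len(a), len(b))
    for i in range(limit):
        if a[i] != b[i]:
            return i
    return limit


header = b"id,hash,lineage\n"
base = header + make_rows(20000, seed=1)

appended = base + make_rows(200, seed=2)    # new lines at the end only
edited = base.replace(b"hash", b"Hash", 1)  # one byte changed in the first line

xz_base = lzma.compress(base)
xz_appended = lzma.compress(appended)
xz_edited = lzma.compress(edited)

# Fraction of the original compressed file that survives unchanged
# at the start of each new compressed file.
append_shared = common_prefix_len(xz_base, xz_appended) / len(xz_base)
edit_shared = common_prefix_len(xz_base, xz_edited) / len(xz_base)

print(f"compressed size of base file: {len(xz_base)} bytes")
print(f"shared prefix after appending lines: {append_shared:.1%}")
print(f"shared prefix after editing one early byte: {edit_shared:.1%}")
```

A large shared prefix gives a delta algorithm something to reuse; a tiny one means the new compressed file has to be stored more or less in full.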

I have not checked how often something has changed in the csv/fasta so far. So it is up to you, who know the data in more detail and can judge the (implicit?) assumption you made about the files' usage, whether you want to keep this approach. Personally, having already seen a lot of degraded git repositories caused by exactly this kind of usage beyond git's concepts, I can only warn. And I stand by my initial recommendation.

And btw, why is it a "good compromise"

to have to uncompress a file before being able to use its contents, versus being able to access the contents directly?
to have to uncompress all data up to the line(s) of interest in the 13G (uncompressed) FASTA file?
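Both questions can be made concrete with a small sketch. A compressed file does not have to be unpacked to disk before use, since it can be read as a stream; but for a plain .xz stream, everything before the wanted line still has to be decompressed on the way. The following assumes Python's standard lzma module; find_lines and the sample file it reads are hypothetical and not part of the repository.

```python
import lzma
import os
import tempfile


def find_lines(xz_path, needle, limit=5):
    """Stream an .xz text file and return up to `limit` (line_no, line) matches.

    Nothing is written to disk, but every line before a match is still
    decompressed on the way, so late matches cost more than early ones.
    """
    matches = []
    with lzma.open(xz_path, mode="rt", encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if needle in line:
                matches.append((line_no, line.rstrip("\n")))
                if len(matches) >= limit:
                    break
    return matches


# Hypothetical sample standing in for a file such as the lineage CSV.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv.xz")
with lzma.open(sample_path, mode="wt", encoding="utf-8") as out:
    out.write("id,hash,lineage\n")
    for i in range(1000):
        out.write(f"IMS-{i:04d},{i:08x},B.1.{i % 7}\n")

print(find_lines(sample_path, "IMS-0500"))
```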

**) Testing each time with a freshly cloned repository:

Test 1: check that my way of xz re-compressing does not change the file - success, 'git status' shows no modifications afterwards.

Test 2: check modification early in the file

uncompress SARS-CoV-2-Entwicklungslinien_Deutschland.csv
uppercase "hash" to "Hash" in line 3
compress again
commit
clone the modified repo again
'.git' dir size increased by 9.644.811 bytes from 375.149.073 to 384.793.884 (SARS-CoV-2-Entwicklungslinien_Deutschland.csv.xz size 9.641.184)

Test 3: check modification late in the file

same way as above, but uppercasing 'alleles' to 'Alleles' in line 425341
'.git' dir size increased by 9.645.067 bytes from 375.149.073 to 384.794.140
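The post does not say how the '.git' sizes were measured. One way to obtain comparable numbers, sketched here in Python under that assumption, is to sum the sizes of all files below the directory before and after the change. dir_size is a hypothetical helper; the last lines simply recompute the two reported increases from the figures above.

```python
import os


def dir_size(path):
    """Total size in bytes of all regular files below `path` (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


# Figures reported in the tests above ('.git' size in bytes).
before = 375_149_073
after_early_edit = 384_793_884
after_late_edit = 384_794_140

print(after_early_edit - before)
print(after_late_edit - before)
```

Calling dir_size(".git") in a fresh clone before and after the commit would give the two values to subtract.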
