tvandijck opened this issue 9 years ago
Oh, and sorry... to clarify: if the scheme is set to zlib, or some other compression library, then obviously the file in the lfs folder is compressed using that scheme, and probably is that way on the server as well. The sha256 is probably that of the compressed data...
We may not even need to add this to the pointer files if all we're doing is compressing them in the .git/lfs/objects directory. As long as we can verify that the inflated object still matches the OID, we should be fine storing the compressed version there and unzipping it in the git lfs smudge command.
Maybe even store the file with an extension, so the smudge command can tell if the file is compressed or packaged in any way, like:
.git/lfs/objects/ab/cd/some-oid.gz
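A minimal sketch of that smudge-side verification in Go; the .gz suffix, the storage layout and the function name are assumptions for illustration, not how git-lfs stores objects today:

```go
// Sketch only: a read path for a locally compressed object. The ".gz"
// suffix and the directory layout are hypothetical.
package sketch

import (
	"compress/gzip"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// smudgeCompressedObject inflates a compressed object file (e.g. a
// hypothetical .git/lfs/objects/ab/cd/<oid>.gz) into out and verifies
// that the inflated bytes still hash to the expected OID.
func smudgeCompressedObject(path, expectedOID string, out io.Writer) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	zr, err := gzip.NewReader(f)
	if err != nil {
		return err
	}
	defer zr.Close()

	h := sha256.New()
	// Stream the inflated content while hashing it; a real implementation
	// would stage to a temp file so a corrupt object never reaches the
	// working tree.
	if _, err := io.Copy(io.MultiWriter(out, h), zr); err != nil {
		return err
	}
	if got := hex.EncodeToString(h.Sum(nil)); got != expectedOID {
		return fmt.Errorf("object %s failed verification (got %s)", expectedOID, got)
	}
	return nil
}
```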
Storing compressed objects on the server is tricky though. Is zlib fully deterministic? It wouldn't be a good idea to go with a non-deterministic compression algorithm, otherwise slight changes between Git LFS clients or compression libs can cause the pointer file to change without changing the contents of the actual file.
We can also probably add gzip encoding to the API somehow to save on transfer times.
Thanks for filing this, I didn't know about these common binary files that are very compressible.
I see, so maybe the sha256 needs to be calculated pre-compression? I'm not sure whether zlib is fully deterministic, although I can't really imagine why not... but from version to version things might change a little, so that would still cause problems. Calculating the sha256 pre-compression would bypass that problem, though.
That said, a lot of content/media files are in fact just raw data. Some are .dll or .exe files, which typically also compress 30-40%, so there are big savings to be had by compressing the on-disk representation.
I really like the idea of compressing objects in .git/lfs, if they're a format where compression makes sense. Compression for the transfers also makes good sense. For server storage, though, I think it makes more sense to leave that up to the server implementation and keep that out of the API and pointer files.
My idea for adding it to the pointer was that in the future it could add 'other' compression schemes without breaking backwards compatibility... but technoweenie's idea of just adding a .gz extension achieves the same result, and is probably easier to implement.
Compression for the transfers also makes good sense.
This may be tricky for Git LFS implementations that directly use S3. But that's fine. Progressive enhancement is a big goal with the Git LFS API. We may need to figure out a way for the server to tell the client that it's ok to send the content gzipped. Downloads can make use of the Accept-Encoding header.
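As a point of reference, Go's default net/http transport (Git LFS is written in Go) already advertises Accept-Encoding: gzip and transparently inflates a gzip-encoded response body, so the download side needs little or no client work. A minimal sketch, with a placeholder URL rather than a real LFS endpoint:

```go
// Sketch: the default Go HTTP transport sends "Accept-Encoding: gzip"
// automatically and decompresses the body when the server responds with
// "Content-Encoding: gzip".
package sketch

import (
	"io"
	"net/http"
)

func downloadObject(url string, dst io.Writer) error {
	resp, err := http.Get(url) // placeholder URL for an object download
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	// resp.Uncompressed reports whether the transport decompressed the
	// body on our behalf.
	_, err = io.Copy(dst, resp.Body)
	return err
}
```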
I'm not sure if zlib is fully deterministic although I can't really imagine why not..
I'm certainly no expert on compression libs, but I think they're mostly non-deterministic, and probably for speed. As long as compatible tools can read/write the content, and the content is preserved, determinism isn't important for a lot of compression uses.
This lzham algorithm has a deterministic mode though:
Supports fully deterministic compression (independent of platform, compiler, or optimization settings), or non-deterministic compression at higher performance.
Another data point is that we get about 80% savings when compressing .pdb files.
It seems like there are 2 ways to go here.
The first way keeps the overall commit SHA normalized and independent of the compression results.
If we then let the pre-push hook do the compression, the upload POST can include the original {SHA, size} plus some scheme info, such as {scheme-name, compressed-SHA, compressed-size}. And then have the server store all 5 fields.
The server would NOT need to understand the compression schemes; but it could "shasum" verify the compressed-SHA/compressed-size of the uploaded data if it wanted to.
The server could then always address the file(s) by the original content SHA. If different clients have different compression schemes, the server could allow each uniquely compressed version to be uploaded. (And only give a 202 if "SHA.scheme-name" or "SHA.compressed-SHA" is not present, for example.)
A subsequent GET API operation on the SHA could return an augmented "_links" section with a "download" section for each compressed variant that the server has. The client would be free to choose which variant to actually download (based upon "scheme-name" or "compressed-size" or whatever).
This would also let the server, if it wanted to, do a single server-side compression for raw data (filling in the 5 fields for the server-created variant, if you will). This would avoid the need/expense of doing the compression on the fly for every request.
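A sketch of that upload metadata as a Go struct with JSON tags; every field name here is hypothetical, simply mirroring the five values described above, and none of it is part of the current LFS API:

```go
// Hypothetical request metadata for uploading a compressed variant. The
// server can always address the object by Oid (the SHA-256 of the
// original content) and, if it wants, re-verify CompressedOid and
// CompressedSize against the bytes it actually received.
package sketch

type CompressedUpload struct {
	Oid            string `json:"oid"`             // SHA-256 of the uncompressed content
	Size           int64  `json:"size"`            // uncompressed size in bytes
	Scheme         string `json:"scheme"`          // e.g. "gzip" or "zlib"; hypothetical
	CompressedOid  string `json:"compressed_oid"`  // SHA-256 of the compressed bytes
	CompressedSize int64  `json:"compressed_size"` // compressed size in bytes
}
```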
The problem with storing compression info in the pointer is that most compression algorithms are non-deterministic. Slight variations in compression algorithms could change the compression info in the pointer, causing it to change for different users even if the content doesn't change.
I think we can almost support server-side compression now. There's no requirement telling how servers should store data, just that they should accept and serve objects according to the SHA-256 signature of the original data. I think it would just have to signal to the client that it accepts compressed contents (similar to Accept-Encoding, but in the _links hypermedia section).
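Purely for illustration, such a signal might be an extra field on the hypermedia entries; the shape and field names below are guesses, not part of the spec:

```go
// Hypothetical hypermedia entry in which the server advertises which
// content encodings it will accept or serve for an object, analogous to
// Accept-Encoding. None of these field names exist in the current API.
package sketch

type link struct {
	Href            string            `json:"href"`
	Header          map[string]string `json:"header,omitempty"`
	AcceptEncodings []string          `json:"accept_encodings,omitempty"` // e.g. ["gzip", "identity"]
}
```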
Yeah, I think it's better to not store compression info in the pointer file. Keep them normalized to reflect the original media file.
Should this request be broken into 3 parts?
In my view all 3 are useful, but discrete, enhancement requests. Each has benefits and costs. Disk space is (relatively) cheap these days and my personal interest is in (2), as I'm stuck with a relatively slow link to the server and many highly compressible, LFS-tracked files. But I can see value in (1) and (3) too. I'd argue that the OID should be that of the uncompressed data - fsck would have to uncompress each file to validate the OID - but the OID would be constant no matter which combination of the above 3 options a given client-server pair implemented.
For each of the above options, the compression algorithm may be "no compression". Clearly, using the same compression algorithm at each stage would give performance benefits and keep the implementation simpler.
I don't believe that the above would require a deterministic compression algorithm either - we are always storing the OID of the uncompressed data.
Different users may choose different combinations. E.g. I may choose compression for 1 & 2 but no compression for 3 on my desktop (to give faster checkout, say), yet choose compression for 1, 2 & 3 on my laptop with less disk space (and checkout will take longer).
+1. Compression of any kind would be nice. We have .sql files of a few hundred MB in our repos; with compression their size could be reduced by about 95%. Is somebody currently working on this feature? Greetings, Jan
For the client/server storage piece, is this something that could be delegated to a filesystem that supports transparent compression? Or is that just passing the buck and would generate other performance issues? ZFS, BTRFS, NTFS??
Maybe it would depend on the filesystem? We could compress the data before sending it over the wire and store it compressed at the far end, but that makes binary diffing tricky and byte ranges unrealistic.
A subsequent diff is certainly a problem, but sending the data compressed should be a default feature. In my opinion, upload speed vs. data storage is the point. Sending an uncompressed 1 GB file over the wire is a lot slower (depending on the connection) than a 50 MB file. By compressing the files (or a selected group like *.sql), LFS storage would give us an alternative. Is it possible to pack the files just for the upload and unpack them on the server side? Would that be a compromise?
I think that feature would make so much sense, I'm actually very surprised this isn't in LFS yet.
+1 ;-)
Adding this to the roadmap.
nice ;-)
I agree with an above comment by @stevenyoungs that this issue could do with decomposition for the roadmap.
I am also cautious about implementing LFS-specific compression for both disk storage and over-the-wire transfers if it can be shown that there are reasonable options for having the underlying infrastructure or protocols provide this: filesystems with compression support, and protocols with deflate support such as HTTP.
The link to the roadmap in @technoweenie's comment is dead, and I can't find anything about compression in git-lfs. What is the status of this feature?
I don't think we currently have a plan to implement it. There are extensions which could be used in this case, but it would of course require a deterministic implementation.
I'll reopen this as a way for us to keep track of it.
I don't think we currently have a plan to implement it. There are extensions which could be used in this case, but it would of course require a deterministic implementation.
I'll reopen this as a way for us to keep track of it.
What do you mean by "There are extensions which could be used in this case..."? Is there something that can be used right now?
Currently I do not care about saving space on either side (as space is cheap), but my biggest pain is transfer time from server to client.
What do you mean by "There are extensions which could be used in this case..."? Is there something that can be used right now?
Currently I do not care about saving space on either side (as space is cheap), but my biggest pain is transfer time from server to client.
Git LFS has an extension mechanism which allows for users to specify other filter mechanisms on top of the LFS one. I'm not aware, however, of any tooling that performs this extra filtering already, and it would be necessary to have a deterministic implementation so that the blob didn't get rewritten differently every time.
Thanks for the info! (y) Would it be reasonable to split over-the-wire transfer compression (as described by @stevenyoungs) into a separate enhancement issue? The thing is that, for me, that would be enough, and it wouldn't require a 3rd-party extension (which could be dangerous for the repo). Gains would be:
From how I see it, this could be implemented by simply specifying which file extensions can be compressed.
Feel free to open a new issue for the transport level compression.
Feel free to open a new issue for the transport level compression.
Thanks! Created issue #3683
Not having built-in support for compression in Git LFS is a real drawback IMO considering many large files do compress very well e.g. libraries, game assets (like geometry), databases, etc... When you have a few GB of LFS files, it makes a real difference.
I spent quite a bit of time experimenting with Git LFS Extensions to add compression:
[lfs "extension.gzip"]
clean = gzip -n --fast
smudge = gunzip
priority = 0
That sounds like a simple solution that ought to work out of the box on Linux, macOS and other *nix - for Windows, people can always use Windows Subsystem for Linux. It seems like only gzip is sufficiently popular to be installed by default on all these platforms.
There are likely 1000+ compressed LFS objects in my test repo. I just discovered that for 3 of them, the Git LFS pointer differs when generated on macOS vs Linux. It's because the underlying call to gzip doesn't return the same output. It works for 99.9% of the files, except these particular 3!
So using gzip goes out the window, and I can't think of any alternative that works out of the box.
In any case, after a few days of testing, using a Git LFS extension is also impractical because users have to pay attention to clone with GIT_LFS_SKIP_SMUDGE=1, then edit the .git/config, and finally check out master, otherwise everything fails with obscure errors.
TL;DR: compressing via a Git LFS extension isn't practical; git lfs dedup is even better for on-disk savings anyway.
Would I be wrong in interpreting the above comment as: if we had a platform-agnostic, deterministic compression tool that we could ship alongside the installation of the lfs client binary, this feature would be trivial to support? (E.g. a binary named "lfs-gzip" based on a popular cross-platform implementation?)
That said there might be dangers in using gzip, as it isn't guaranteed to be deterministic in compression, only in decompression? https://unix.stackexchange.com/a/570554
I'm leaning towards using a different algorithm, one that would compress but is also deterministic somehow. But having just done some quick Google searches, I'm not seeing any popular algorithms or implementations specifically designed to be deterministic.
We could probably build or choose our own implementation and, if it's cross-platform enough, call it deterministic since it would be the only implementation in use. Pako is a JS re-implementation that would be cross-platform; Go would probably also have an implementation with very few platform dependencies to get in the way. Or we could always pick a particular gzip implementation and ship cross-platform builds from it, so that the likelihood of getting a different result on a different platform is greatly reduced.
I'm not sure there's a way to provide a foolproof guarantee that the same inputs produce the same results unless we manually review the compression algorithm and its implementation for likely non-deterministic behaviours though.
I'll ignore for a moment that all programs execute in a non-deterministic fashion due to the many possible errors and variances that could occur, as that isn't really practical to consider here. After all, if the result differs one time in a billion, all that happens is additional data is stored... right? Unless the gzip process corrupts the data in storage, but we could maybe optionally add a validation step after compression. That might be a good option to provide, if not already built in to a particular gzip binary.
This would be substantially easier if we had a deterministic algorithm, yes. We can't just pick a single implementation and always ship using that because of security updates, so we'd need a specification of how to write a simple, unchanging byte stream.
I am also not aware of any such algorithms, because typically with compression one wants very good compression, so most implementations are focused on improving performance rather than sacrificing performance for unchanging output.
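To make the "pick one implementation and ship it" idea concrete: with Go's compress/gzip, pinning the header fields removes the timestamp and OS byte, so the output depends only on the input, the compression level, and the Go standard-library version. That last dependency is exactly the problem described above, since the DEFLATE encoder is not specified to produce byte-identical output across releases. A minimal sketch, for illustration only:

```go
// Sketch: gzip output with all header variability pinned. This is
// reproducible for a fixed Go toolchain version, but nothing guarantees
// the DEFLATE bitstream stays byte-identical across Go releases.
package sketch

import (
	"bytes"
	"compress/gzip"
	"time"
)

func pinnedGzip(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw, err := gzip.NewWriterLevel(&buf, gzip.BestCompression)
	if err != nil {
		return nil, err
	}
	zw.ModTime = time.Time{} // zero timestamp in the gzip header
	zw.Name = ""             // no embedded file name
	zw.OS = 255              // "unknown" OS byte, independent of the platform

	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}
```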
I don't understand why a deterministic compression algorithm is necessary. If the pointer files remain transparent to compression and contain the SHA-256s of the blobs' uncompressed contents (similar to how git blob objects are specified in git tree objects), then it shouldn't matter whether the compression is deterministic, only whether it's compatible. The actual object files in the LFS directory would also be named according to the hashes of the blobs' uncompressed contents, even though the object files themselves would be compressed (just like how loose git objects are stored). Then, as long as the file can be uncompressed, everything is fine: the object files can be identified from the pointer file, and the contents can be retrieved.
Most compression algorithms may not be deterministic, but most certainly are backwards compatible. Git itself uses zlib for this exact purpose (see the Object Storage section of Git Internals - Git Objects).
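A sketch of the write side of that scheme: hash the raw content and compress it in a single pass, then file the compressed bytes under the hash of the uncompressed content, much like Git's loose objects. The directory layout and the .z suffix are made up for illustration:

```go
// Sketch: store an object zlib-compressed but addressed by the SHA-256 of
// its *uncompressed* contents, so the compressor's determinism doesn't
// matter -- only that the data can be decompressed again.
package sketch

import (
	"compress/zlib"
	"crypto/sha256"
	"encoding/hex"
	"io"
	"os"
	"path/filepath"
)

func storeCompressed(storageDir string, content io.Reader) (oid string, err error) {
	tmp, err := os.CreateTemp(storageDir, "incoming-*")
	if err != nil {
		return "", err
	}
	defer os.Remove(tmp.Name()) // no-op after a successful rename

	h := sha256.New()
	zw := zlib.NewWriter(tmp)
	// Hash the raw bytes while writing the compressed bytes.
	if _, err := io.Copy(io.MultiWriter(h, zw), content); err != nil {
		return "", err
	}
	if err := zw.Close(); err != nil {
		return "", err
	}
	if err := tmp.Close(); err != nil {
		return "", err
	}

	oid = hex.EncodeToString(h.Sum(nil))
	// Hypothetical layout: <dir>/<first 2>/<next 2>/<oid>.z
	dst := filepath.Join(storageDir, oid[:2], oid[2:4], oid+".z")
	if err := os.MkdirAll(filepath.Dir(dst), 0o755); err != nil {
		return "", err
	}
	return oid, os.Rename(tmp.Name(), dst)
}
```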
Yes, there are two possibilities here. One is that we can store the compressed blob on the server and identify it by the hash of the compressed object, and the other is that we can store it compressed on the client but using the hash of the uncompressed object.
We've considered doing the latter, but it will require that we store data in an incompatible way in the repository, and we've only recently introduced the repository format version option in the config in 3.0. We need to wait a little longer for that to percolate out into various distros before we can introduce an incompatible change into the repository. Git doesn't have that problem because the repository format version flag for it has been around for substantially longer.
One is that we can store the compressed blob on the server and identify it by the hash of the compressed object, and the other is that we can store it compressed on the client but using the hash of the uncompressed object.
Why can't we store the compressed blob on the server and identify it by the hash of the uncompressed blob? What am I missing?
(Your second paragraph makes sense though)
Why can't we store the compressed blob on the server and identify it by the hash of the uncompressed blob? What am I missing?
There's no intrinsic reason that we can't. However, that's a server-side decision, since it will need to be decompressed on the server side to verify its hash. Whether a server decides to do that is up to it, and that's already possible. The only decision we need to make here is whether the client side should store the data compressed or not.
Also, I do want to point out that in many cases, people have Git LFS assets that don't compress well (e.g., images, audio, or other already-compressed data). Therefore, such a feature would need to take that into account and provide some functionality to control the compression functionality. Git doesn't currently do that, and that's one reason it's a pain to work with those kinds of files without Git LFS.
Thank you for taking the time to respond to my questions and clarify. Given what's been said, it looks to me like the way forward for this feature is:
Note that, since compression decisions are only made locally, the only case where a git LFS instance will not be able to read object files is if the instance was downgraded from a later version.
Would a PR implementing this (with some benchmarking) be welcome?
As an aside, although many binary files do not compress well, particularly those which are already compressed, the overhead of DEFLATE compression is low even for completely incompressible data. I expect the space and time overhead would be practically negligible in almost all cases except on copy on write file systems.
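On the DEFLATE-overhead point: stored (uncompressed) blocks add only a few bytes of framing per block plus the gzip header and trailer, which is easy to check empirically. A quick sketch:

```go
// Sketch: measure gzip overhead on incompressible (random) input. Expect
// the compressed size to exceed the raw size by only a tiny amount
// (header/trailer plus per-block framing).
package sketch

import (
	"bytes"
	"compress/gzip"
	"crypto/rand"
	"fmt"
)

func overheadDemo() error {
	raw := make([]byte, 1<<20) // 1 MiB of incompressible data
	if _, err := rand.Read(raw); err != nil {
		return err
	}

	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(raw); err != nil {
		return err
	}
	if err := zw.Close(); err != nil {
		return err
	}

	fmt.Printf("raw=%d compressed=%d overhead=%d bytes\n",
		len(raw), buf.Len(), buf.Len()-len(raw))
	return nil
}
```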
Hey,
It's actually the case with Git that compression has a substantial impact on performance. So I think what we'd want to see here is this:
1. Support for a noop extension, just like Git has.
2. A way to specify which files are compressed (e.g. a filter entry, although a different attribute or other techniques are possible).
3. Storing objects compressed (with a .gz extension) on download, marking the repository as using the specified extension (as anticipated in 1) to prevent older repos from using it. Disables git lfs dedup.
With that, I think we'd accept a PR or PRs to implement that feature. I'd recommend that 1 be done in a separate PR first, and then the other features be in one or more PRs, as appropriate.
Have tests been run to show that explicit compression support provides broadly better results in compression and reduced compute overhead, vs. just putting the LFS store on a filesystem with good compression support?
Have tests been run to show that explicit compression support provides broadly better results in compression and reduced compute overhead, vs. just putting the LFS store on a filesystem with good compression support?
I don't think so. However, in my experience, most modern file systems don't implement transparent compression support in everyday usage, so I'm not sure that it's very useful to do such tests, since most users won't have easy access to such a file system.
A general counter-argument to compressing on the client side is that modern filesystems like Btrfs support copy-on-write, meaning the same file on the drive will use the data sectors only once (while being different inodes; after you change a file, the files will be different, albeit possibly still sharing common sectors).
I've done this back in the day with notebooks where hard-drive space was scarce and I had huge SVN repositories; with Btrfs, the total size almost halved.
If you compress the file, it will actually result in more drive usage when Btrfs is used and copy-on-write is turned on (which at least the last time I checked was off by default, but that was years ago).
Yes, that's true. In the Git LFS case, Git doesn't use copy-on-write because it invokes Git LFS and the file contents are streamed out to a new file. However, this can be fixed by using git lfs dedup to replace the files in the working tree with copy-on-write versions of the files in the LFS storage. Obviously, if compression in the LFS storage is used, then that functionality won't work, which is why we'd need to disable git lfs dedup in that case.
Compression can still be useful because while Btrfs and APFS have copy-on-write behaviour, NTFS and Ext4 don't, and that means that almost all Windows users and many Linux users won't have de-duplication.
I am wondering, what exactly is needed to have compression implemented soon? Money? How much?
Transfer compression is crucial for many people in modern work-from-home times. Shouldn't be that complicated to finance through crowdfunding. The actual storage compression would also be nice but is probably less relevant for commercial users (NTFS is good enough).
Transfer compression is crucial for many people in modern work-from-home times. Shouldn't be that complicated to finance through crowdfunding. The actual storage compression would also be nice but is probably less relevant for commercial users (NTFS is good enough).
There are 4 "pillars" of compression here:
And to be honest, in the case of "transfer compression", downloads happen far more often than uploads anyway, no?
PS: It depends on the data/application used. Also, to be honest, many binary formats are already compressed, like images, videos and OpenOffice files, so a second compression won't help anything.
This feature would make GitHub LFS more usable, since its current limitations and upgrade plans make it quite unusable (see https://github.com/orgs/community/discussions/128695).
We have several .dds/.tga texture files that compress very well. Can we add a field to the pointer file that specifies a compression scheme?
So basically we'd get:
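For illustration only, a Go sketch of what such an extended pointer could look like; the scheme key is hypothetical and not part of the LFS pointer spec, and the oid stays the SHA-256 of the uncompressed content so the pointer doesn't change when the compressor does:

```go
// Sketch: a pointer file extended with a hypothetical "scheme" line on
// top of the standard version/oid/size fields.
package sketch

import "fmt"

func formatPointer(oid string, size int64, scheme string) string {
	return fmt.Sprintf(
		"version https://git-lfs.github.com/spec/v1\n"+
			"oid sha256:%s\n"+
			"size %d\n"+
			"scheme %s\n", // hypothetical extra key
		oid, size, scheme)
}
```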