git-lfs / git-lfs

Git extension for versioning large files
https://git-lfs.com

remove git lfs history #1101

Open limor-gs opened 8 years ago

limor-gs commented 8 years ago

I have a backup.tar file (175.94 MB) which is pushed every night to Git LFS. At some point I got the following error:

Git LFS: (0 of 1 files) 0 B / 175.94 MB
This repository is over its data quota. Purchase more data packs to restore access. Docs: https://help.github.com/articles/purchasing-additional-storage-and-bandwidth-for-an-organization/
error: failed to push some refs to 'git@github.com:cloudify-cosmo/cloudify-build-system.git'

Is it possible to remove all LFS file history older than 30 days to avoid this issue? Practically, I don't need the tar copies that are older than 30 days.

limor-gs commented 8 years ago

Can someone answer my question?

javabrett commented 8 years ago

Somewhat related to the asks in #1043. I don't expect that there is an easy way to do this at the moment. You might consider some other inexpensive block store for your tar backups. It doesn't sound like something you'd normally want to do with LFS: backing up files (presumably from the very same repo?).

technoweenie commented 8 years ago

@limor-gs Apologies for the silence. I was on vacation last week.

Git LFS has no way to prune old objects from the server that are still referenced by git commits, but it is something I'd like to do, eventually. I wanted to start a proposal for it based on some conversations I had at GDC a couple weeks ago. This definitely isn't a quick fix though, or my highest priority.

In the short term, I'd suggest the following:

  1. Git isn't designed for backups like this. However, bup looks interesting, and is at least based on the git packfile format. Of course, there are a ton of other backup options too.
  2. Move your LFS data to another service. It is possible to keep your Git data on GitHub (or any Git host, really), and use a separate LFS service. Check out https://github.com/github/git-lfs/wiki/Implementations.
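
For anyone taking option 2, pointing a repository at a separate LFS server is a one-line config change. A minimal sketch, assuming a hypothetical server URL (lfs.url is the config key git-lfs reads from .lfsconfig):

# Keep Git data on GitHub, but store LFS objects on another server
# (the URL is a placeholder for your chosen LFS service)
git config -f .lfsconfig lfs.url "https://lfs.example.com/org/repo"
git add .lfsconfig
git commit -m "Use an external LFS server"
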
technoweenie commented 8 years ago

@javabrett Doh, thanks for the pointer to #1043. Reading this again, it looks like I misunderstood what was being asked, and never followed up.

technoweenie commented 8 years ago

@limor-gs Here are some questions:

  1. How would you, as a user, want to purge LFS objects?
    • Would you purge a specific version of a file?
    • Would you purge ALL versions of a file?
    • Would you want to purge a range of versions? Maybe keep weekly checkpoints of active files or something.
  2. Do you want the ability to set policies on files, file types, or directories?
    • Keep n versions of a file.
    • Keep periodic (daily/weekly/monthly) checkpoints of files
  3. What would you expect Git LFS to do if it can't download a file that's been purged?
    • Pointer file
    • Nothing
  4. What would you expect Git LFS to do if it can't download a file due to some server issue?
    • Halt current checkout/clone/pull command
    • Pointer file
    • Nothing
rBatt commented 8 years ago

My LFS repos are getting bloated in the .git/lfs folder. git lfs prune doesn't seem to cut it (I tried changing git config to induce more aggressive pruning, but to no avail). Maybe this is what I'd be looking for?

It's not clear if this purge would remove things from the online storage, or just from local lfs history. I'm running out of space on both, but I can't buy a new hard drive for my laptop. At least I can upgrade online storage :)
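
For what it's worth, git lfs prune only ever frees local disk; nothing in the client deletes server-side objects. A sketch of the config keys that govern how aggressively it prunes locally, with deliberately aggressive example values (key names are from the git-lfs-config documentation):

# Shrink the "recent" window so prune keeps less in .git/lfs
git config lfs.fetchrecentrefsdays 0      # don't retain objects for recent branches
git config lfs.fetchrecentcommitsdays 0   # don't retain objects for recent commits
git config lfs.pruneoffsetdays 1          # prune anything older than recent + 1 day
git lfs prune --verify-remote             # only drop copies the remote still has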

Generally, I'd want to purge all but the last N versions of a file. If I could flag versions of a file to be exceptions to this rule (git lfs purgeProtect --file --commit; git lfs --showVersionsStatusFlags --file; git lfs purge --3 --exceptVersionsWithFlag), that would be great.

I would want a default policy for all files tracked by LFS, once purging is turned on (via some command, config, or attribute, I don't know). So maybe the default would be similar to enacting an automatic prune, once enabled?

Another approach would be similar to what Time Machine on my Mac does. Bear with me here. If I could set a size limit for the repo, that'd be like the size of my backup disk. Given that size limit, rules like this could be implemented:

[Screenshot: Time Machine's retention rules: hourly backups are kept for the past 24 hours, daily backups for the past month, and weekly backups for all previous months, with the oldest deleted when the disk fills up.]

Replace "backup" with "version"/"commit", and maybe that's the start of a framework to avoid bloating?

If Git LFS can't get a file that I ask for because I'm doing a checkout or something, I think it should just give me a pointer file if it can (I'm assuming that the pointer files for all versions can be stored indefinitely). That way, if I go to run some old version of code from 6 months ago, I get an error (in my R code, e.g.) when the now-erased binary file is instead just a pointer. It might be nice to be able to visualize which commits (especially the oldest) still have all file versions available. But really, just give me the pointer if I asked for a deleted version ... I could always figure it out myself.

ostrokach commented 8 years ago

I would love to have an option to purge old git-lfs files as well!

I use git-lfs for versioning MySQL database files, and the cumulative file size can get out of hand very quickly. Something like git lfs --purge 3 would be great.

ciwolsey commented 7 years ago

Same, I'm developing a game with large assets and would like to use Git LFS, but the cumulative effect, coupled with storage costs, makes it pretty impractical.

joncursi commented 7 years ago

Likewise. I would like to have rolling storage, meaning the last 1 GB (configurable) is stored with LFS, and when I push new files, old ones are dropped from storage, so I'm only keeping the last 1 GB worth of large files.

This way my storage will not grow indefinitely, which is causing costs to grow as well.

technoweenie commented 7 years ago

Thanks for all the feedback. I can see how it'd be useful for you all. How would you feel about something like this:

  1. Add an LFS API endpoint for deleting objects. Deleted objects return 410 Gone instead of 404 Not Found, implying they were deleted on purpose.
  2. Store some kind of purge policy in .lfsconfig (sketched below):
    • Apply settings per git remote. Maybe you want to keep everything in cheaper cold storage somewhere?
    • Apply settings per dir
  3. Check for any files to purge after git push or git lfs push have finished.
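
A hypothetical sketch of what such a policy might look like in .lfsconfig (every key below is invented for illustration; none of them exist):

# .lfsconfig -- hypothetical purge policy; these keys are invented
[lfs]
	purgekeepversions = 3        # keep only the 3 newest versions of each file
[lfs "https://cold-storage.example.com/org/repo"]
	purgekeepversions = 0        # 0 = never purge from this cheaper remote
[lfs "purge:assets/raw/"]
	purgekeepdays = 30           # per-directory policy: drop anything older than 30 days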

My main concern here is that giving people the power to delete their own data will lead to some really terrible scenarios. Maybe the onus should be on the lfs hosts to track this properly in case someone does something silly...

joncursi commented 7 years ago

The config mentioned above sounds like it would handle my use case perfectly!

technoweenie commented 7 years ago

Cool. We have Git LFS work planned through the rest of the year probably (https://github.com/git-lfs/git-lfs/issues/1632), so we won't even be able to start planning for this until next year.

joncursi commented 7 years ago

Hey @technoweenie - happy 2017! Any updates on this issue?

My Git LFS costs are growing needlessly. Had I known Git LFS cost money from the outset, I would have stuck with normal storage techniques. Now, I unfortunately have a bunch of old junk stored in Git LFS that I have to pay for monthly.

I really need rolling storage (only recent files are valuable to me), or a way to purge out these files from Git LFS so I can reduce my data pack usage. Thanks!

technoweenie commented 7 years ago

No update yet. Any update will be visible on this repository. This is going on the roadmap for this year, but I'm not sure when we'll get to it. You're welcome to send us OIDs to delete from your repository via https://github.com/contact. I know it's not ideal, but it'll keep costs down.
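
For anyone gathering OIDs to send, the client can list them. A sketch (the --all flag is only in newer releases, so treat its availability as version-dependent):

# Full OIDs for the LFS files in the current checkout
git lfs ls-files --long

# Newer clients can also walk every ref, catching objects that are
# only reachable from old commits
git lfs ls-files --all --long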

ollieparanoid commented 6 years ago

@technoweenie, looks like you're collecting use cases in this issue, so maybe it helps if I write down ours: storing binary packages for a Linux distribution.

A few ideas from my user's perspective:

technoweenie commented 6 years ago

Thanks for the feedback. It sounds like these packages are used to distribute your project. Have you thought about using the GH Releases feature instead? That doesn't necessarily negate your use case, of course; I'm just curious if it'd work better for you.

Thanks for the feedback on the GH interface, regarding LFS file status and undo support. :+1:

ollieparanoid commented 6 years ago

GitHub Releases is a fine feature, but as I see it, it's not suitable for a binary package repository for a Linux distribution, because we need a specific directory structure with subfolders. We also have lots of small binary files, and as I understand it, we would need to tag a new release and upload all files again whenever we release a new version of one of them.

svperfecta commented 6 years ago

It's tough to rewrite history. I think a potential solution would be to support a retention period on files for Git LFS that removes old files only if a newer version exists. When such a file is requested, GitHub should just return a 0-byte file or something in its place.

Amazon S3 has similar settings for versioning. It allows you to define the number of previous versions it will keep, and at what point those are pruned. The 0-byte file trick would allow GitHub to preserve all history without rewriting it.
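
For comparison, the S3 behaviour described above is configured with a bucket lifecycle rule roughly like the following sketch (field names from the AWS lifecycle documentation; values illustrative):

{
  "Rules": [{
    "ID": "expire-old-noncurrent-versions",
    "Status": "Enabled",
    "Filter": { "Prefix": "" },
    "NoncurrentVersionExpiration": {
      "NewerNoncurrentVersions": 3,
      "NoncurrentDays": 30
    }
  }]
}

This keeps the three newest non-current versions of every object and expires the rest 30 days after they stop being current.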

svperfecta commented 6 years ago

Hey All - I noticed this isn't actually on the roadmap, although some of the earlier comments mentioned this year.

@technoweenie Just wondering if this should be on the project board? If not, all cool, but I want to plan accordingly :)

ttaylorr commented 6 years ago

> @technoweenie Just wondering if this should be on the project board? If not, all cool, but I want to plan accordingly :)

Not currently, but this is still something that I'd like to explore in the future.

endiliey commented 6 years ago

I had a similar issue. Hopefully this will be implemented in the future. For now, I'll just purchase extra storage from Github :)

isedwards commented 6 years ago

The absence of this feature is actually preventing us from adopting LFS. Our use case is that we have all our code under source control, but it would be extremely useful to store some large/binary data alongside the code that is essentially unversioned (all pointers to the filename should just reference the most recent version which can overwrite previous versions). Removing the file would result in all versions no longer existing/taking up space.

I guess it's not a use case that git (and LFS) were ever really meant to solve, but in our case, this would allow anyone using the latest release to setup/build/test really easily with everything available in the repository. We anticipate that old releases would also mostly just work with the latest LFS data.

IAXES commented 6 years ago

@isedwards Same here: the inability to purge old revisions of files is what's preventing us from using LFS as a means to distribute pre-compiled SDKs. For my use case, I have multiple branches of an in-house SDK (10 GB+) that I share with several other engineers, with the current "beta" branch being rebuilt every night. Once an individual branch is stable, we don't need the 200+ outdated versions anymore, and would ideally purge those versions, keep the current gold/release build, tag it, and create a new branch for the next release.

Really hoping to see this feature at some point.

idavydov commented 6 years ago

Hi. Since you're collecting potential use cases, here's one I think might be very common in data analysis.

Some relevant discussion from the bioinformatics perspective, without a satisfying answer: https://bioinformatics.stackexchange.com/questions/112/how-to-version-the-code-and-the-data-during-the-analysis

luerhard commented 5 years ago

Any update on this? I am really hoping for the feature to prune old versions.

bk2204 commented 5 years ago

We don't currently have any news on this front. If there's a plan to implement it, we'll update this issue, and someone will assign themselves to it when they pick it up.

dpanzer commented 5 years ago

Reading through the thread here, it seems that some means to address this issue was supposed to be scheduled for 2017, but we are here at the end of 2018 and apparently haven't seen any movement. I'd like to propose that perhaps you could stand up something quick and dirty that covers what seems to be the most plain-vanilla use case here: deleting versions of a file older than X for the purpose of keeping storage costs down. Lots of other use cases have accumulated, but I think just having this quick fix would benefit the community tremendously. https://github.com/git-lfs/git-lfs/issues/1101#issuecomment-260769632 seems like it would do the trick.

timzatko commented 5 years ago

Hi, any update on this issue?

bk2204 commented 5 years ago

Nope, as mentioned in https://github.com/git-lfs/git-lfs/issues/1101#issuecomment-439998328, we'll update the issue if there's news.

isedwards commented 5 years ago

@bk2204 Excellent, looking forward to it

jonas-eschle commented 5 years ago

We are also eagerly awaiting any news. But to add to the discussion:

> My main concern here is that giving people the power to delete their own data will lead to some really terrible scenarios. Maybe the onus should be on the lfs hosts to track this properly in case someone does something silly...

I really wouldn't be concerned. Add a clear warning in the config, of course, that older versions won't be restorable, and let people screw up what they want. The default, of course, is to store everything. But I cannot imagine a single user being surprised that old versions are deleted; at the least, they're coders and should have a minimal understanding of the possible effects. "We're all adults here" ;)

On another note: what is actually the use case of LFS? I can think of two kinds of data that are stored alongside code:

For the first case, external storage is sufficient. For the second case, this will blow up at some point, sooner rather than later if the package is in active development (as most on git are).

Without the possibility to delete files, what is the typical use case then? (I am very happy about the introduction of LFS, but I am also keen to see it improve.)

groue commented 4 years ago

Hello, I'll add my concern to the concerns of the crowd. Git LFS is very handy, but there are some use cases where old versions of files lose their value over time. With the current policy, quota exhaustion forces us to rewrite our git history so that those unused old file versions are removed and disk space is reclaimed. History rewriting is the most violent of all git operations. This is why some people are complaining.

bitsurgeon commented 4 years ago

Hi, I have a question regarding LFS object deletion:

I pushed some files to LFS on a branch of my repo to test a feature, but eventually decided not to use the feature, so I deleted the branch without merging. After deleting the branch, I didn't see my LFS storage quota being restored. Is this expected?

Or how should I reclaim the quota in my case without deleting the whole repo?

Thanks a lot.

bk2204 commented 4 years ago

If you have a question about pruning old LFS contents on GitHub specifically, you should contact GitHub support and they'll be able to help you. If you're seeing problems with another host, like GitLab or Bitbucket, you should reach out to their support folks.

bitsurgeon commented 4 years ago

@bk2204 thanks, I thought this was lfs related. I will check with GitHub.

Christilut commented 4 years ago

This seems like a basic need when using LFS to me. I've added some binaries to my repo and a few versions later I'm surprised by the disk quota notification. But there is no way to delete my old, now unused, binaries? That seems kind of ridiculous.

  1. Commit file that's too large
  2. Github tells me to use LFS
  3. I do what Github tells me
  4. Github tells me to upgrade storage or blocks uploading and provides no way to reduce usage

There's something wrong here or is it me?

bk2204 commented 4 years ago

Hey, folks,

This repository is for the open source Git LFS client project. If you have an issue with how GitHub works or the functionality it provides you to manage your data usage, please address that to the support team via the appropriate channels, as mentioned in the contributing documentation. Similarly, if you have issues with a different hosting provider or the functionality it provides you, please address it with their support folks.

I know this is a very desired feature. We currently have no plans to implement it because our goal in this project is to preserve and maintain your data, not destroy it, much like Git does. There are a whole bunch of thorny issues when it's possible to delete data from history, including access control and malicious actors, and any solution in the client would need to address all of those issues plus server-side issues as well.

It is possible an implementation-specific API could be used to do this outside of this project, such as an API to prune any objects that are not in use by Git history. Requests for such an API are best sent to your server implementation's support department, not here.

We will update you in this issue if our plans on this change. There's no need to continue to post comments in this issue indicating your support. Please use the reactions for that purpose.

isedwards commented 4 years ago

> our goal in this project is to preserve and maintain your data, not destroy it, much like Git does.

As long as you're aware that the opposite happens - we periodically delete our entire git history because of this problem with LFS.

This workaround is accepted because it allows the developers to use their preferred toolchain.

isedwards commented 4 years ago

@bk2204 - I believe this request is for the ability to temporarily store (not permanently commit) large files within a git repository without forever consuming lots of space in the history when these files are no longer needed. These files shouldn't need to be committed as a revision.

Could this be achieved by using a design similar to stash? Would this solve the technical difficulties and also fit with git and git-lfs design philosophy?

git store -p
git store list
git store apply store@{1}
git store drop store@{1}
...

Once the capability exists, the individual server implementations could be responsible for controlling access to creating and deleting stores if this is required by some projects. In our case, anyone with write access to the repository would be permitted to modify the stores.

bk2204 commented 4 years ago

If this is for local changes only that aren't checked in (or are overwritten and not kept by any reference), you can actually use the standard Git commands and run git lfs prune to prune old objects you don't need. Those won't be uploaded to the server unless someone pushes them.
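
A minimal sketch of that local-only flow (branch name hypothetical):

# Abandon a local-only branch whose LFS objects were never pushed
git checkout main
git branch -D big-assets-experiment   # hypothetical branch name

# Drop local LFS objects that no remaining ref needs
git lfs prune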

As far as storing arbitrary objects in the repository without any references, it's likely those will get garbage collected. They'd need to have some reference associated with them in order to not be pruned, in which case you might as well use a stash. The thing is that most server-side implementations don't allow shared stashes, and probably wouldn't support a new type of reference such as your proposed store.

It is possible (but expensive) for a server-side implementation to prune LFS objects that aren't referenced by any Git object, in which case you can use a temporary branch, but I don't know whether there are any server-side implementations which do so. That's why I suggested that as a proposed solution on the server side, because it's a relatively easy win.

Sairony commented 4 years ago

We also want something similar to keep costs down; we make games, and PSDs, sound files, etc. are iterated on and quickly grow out of hand.

I'm no expert on git, but my limited understanding is that it's possible to squash a range of commits together. If so, wouldn't it be possible to have a cron job on the server which takes all commits older than a user-specified age (say, 3 months) and squashes together all the changes that happened on the same day, per branch? This could be extended further back, say squashing entire weeks once they're a year old. One could then tag specific commits as stable to keep the squashing off them. Preferably, the squashing would concatenate the commit messages of all the commits it contains. Wouldn't such an approach essentially orphan the now-unreferenced files in LFS storage, making them easy to clean up? I like this approach because it keeps the commits atomic. Otherwise, I can't think of an approach that cleans up LFS files which change independently while still letting you check out a specific commit and be sure you got exactly the state it represents.

Christilut commented 4 years ago

This problem is up to Github to fix. I contacted them a while ago about this and they said they have no plans to address this because as a company that stores historical data they prefer not to delete any history.

So if you run into this issue, I suggest changing your approach to where you store your large files or pay up. I'd suggest you only pay if you actually need your historical large files.

RoyiAvital commented 4 years ago

Maybe a simple solution, though only for the future, would be adding an attribute: Save Only Last Version.

Let's say I have a file named MyArchive.zip. From time to time I update its content, which is needed by the project, yet versioning isn't important for this file.

Could I have a .gitattributes line for it so that only a single copy of this file is saved: the last one pushed?
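
A hypothetical sketch of how that could read in .gitattributes (the lfs-keep attribute is invented; nothing like it exists today):

# "lfs-keep" is an invented, purely illustrative attribute
MyArchive.zip filter=lfs diff=lfs merge=lfs -text lfs-keep=latest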

brybalicious commented 3 years ago

I am experiencing this problem when making a template for Unreal Engine. Some content files which take up space are no longer relevant, and have been removed, pruned, removed from cache, and untracked locally. However, on the remote, they are lingering and chewing up storage space, despite not being present in the repository (neither locally nor remotely). I've even reset the head and force-pushed from a commit before these useless files were committed. These files are apparently called 'orphaned' files, and the approach to [not] dealing with them is surprisingly obtuse.

As far as I understand, this is a fairly common issue, and the solutions vary across storage providers (GitLab, GitHub, Bitbucket). Mostly, they are non-starters.

As far as Github is concerned - "After you remove files from Git LFS, the Git LFS objects still exist on the remote storage and will continue to count toward your Git LFS storage quota. To remove Git LFS objects from a repository, delete and recreate the repository. When you delete a repository, any associated issues, stars, and forks are also deleted."

brybalicious commented 3 years ago

@bk2204 - if neither the git-lfs project nor github is willing to devote attention to a fairly standard feature request, then as mentioned by @isedwards, users are forced to either adopt new tools, or to play around with the remote's git history on a file-by-file basis which can get very hairy and result in lost data.

Part of what makes git so powerful is that the tool fulfils the functionality required by its users, and is a standard shared language - you don't need to change up your git commands once you start using a new provider. Wouldn't it make sense that the tool itself can perform this functionality which will at some point be needed by almost any project (that doesn't like the idea of exporting a project repository and wiping its commit history)? Perhaps it would be in the shared interest to adopt a standard as suggested by @isedwards and then let storage providers implement it as they wish?


zhaoming029 commented 3 years ago

Has anyone tried git-filter-repo? https://www.mankier.com/1/git-filter-repo

> Rapidly rewrite entire repository history using user-specified filters. This is a destructive operation which should not be used lightly; it writes new commits, trees, tags, and blobs corresponding to (but filtered from) the original objects in the repository, then deletes the original history and leaves only the new.
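
A sketch of the kind of rewrite it performs (paths and size threshold are illustrative; note that the host may still hold the orphaned LFS objects until it runs its own garbage collection):

# one-time install: pip install git-filter-repo

# Remove every historical version of one path
git filter-repo --path backups/backup.tar --invert-paths

# Or strip all blobs over a size threshold from history
git filter-repo --strip-blobs-bigger-than 100M

# filter-repo removes the 'origin' remote on purpose; re-add it and
# force-push deliberately, and collaborators must re-clone afterwards.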

marxangels commented 3 years ago

I don't believe this is technically difficult for the developers; it seems to come down to fear of the design choices involved.

win845 commented 3 years ago

IMO, good design is when software makes easy things easy and complex things possible. It will be very hard to come up with default policies per remote as stated in https://github.com/git-lfs/git-lfs/issues/1101#issuecomment-260769632. The easiest use case is to delete everything but the last n versions of every file. Let's start with that.

benjaminrich commented 1 year ago

@RoyiAvital: I agree that this is the solution. For files that can be regenerated from source when necessary, I don't see why it would be difficult to say to git: "keep this file as part of the repository, but only the current version and don't track its history." That would be so useful.

yrHeTaTeJlb commented 1 year ago

I wonder how many Git fans had to choose a different VCS because of this particular issue? Right now I'm trying to find a way to deal with huge UE binaries in our repo, and I'm choosing between Git+LFS, Plastic, and Perforce. I like Git, but the lack of this feature is the only reason to go for something else.