neuropoly / gitea

https://gitea.io fork with https://git-annex.branchable.com support
MIT License

`git-annex`: ensure exported archives include git-annex data #42

Open kousu opened 1 year ago

kousu commented 1 year ago

The "Releases" page lets you download an archive (from git-archive?). It would be handy if this worked with git-annex datasets too, since then we could share a link like https://data.neuropoly.org/neuropoly/some-project/archive/1.0.0.zip with people and they could get a versioned copy of the data without having to install or learn git-annex. (With gitea, GET /neuropoly/some-project/archive/1.0.0.zip downloads as some-project-1.0.0.zip, preserving the version number, so long as the receiver uses a standard browser or curl -LJO.)

It's good if people do want to learn and use git-annex, but some applications -- imagine deploying training sets to clusters, simple exploratory work, etc. -- don't need the extra headache.

I'll note that we currently can't do this with e.g. https://github.com/spine-generic/data-multi-subject/releases/tag/r20230223; there, the download links give a small <1MB .zip containing mostly annex pointers.

Plan

  1. [ ] Look into how GitHub handles git-lfs files when archiving
  2. [ ] Look into how Gitea handles git-lfs files when archiving
  3. [ ] Patch gitea to replace annex pointers with their content, if available, when exporting

matrss commented 1 year ago

The status quo seems to be that gitea includes the symlinks in the tar.gz output, but not the annexed files they point to, while the zip output does not even contain the symlinks (AFAIK the zip format supports them, but git archive seems to create zip files without them).

The latter point about symlinks in zip files is true for GitHub as well. I didn't look into how GitHub or Gitea handle git-lfs on its own though; the git-annex repo I checked on GitHub uses git-lfs as a special remote.

The git archive command does not seem to be flexible enough to automatically resolve symlinks or include the whole annex in the output or similar things.

In the case of tar we could just add the whole annex to the archive in a second step, but that would include files that are not present in the exported branch/tree-ish. The resulting archive also provides a suboptimal experience in my archive viewer, since I needed to unpack the archive first before the symlinks became usable. This would also not work for the current zip output, because of the missing symlinks.

Therefore, the most straightforward approach I can think of is creating the zip or tar archive using git archive as it's currently done and then in a second step adding all the annexed files directly to that, replacing the (possibly) existing symlinks. We can get a list of all the annexed files in a branch/tree-ish, even those not present on the server, using git annex find --branch <ref> --anything and resolve them to files in the annex using the symlinks tracked in git.
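The two-step idea above can be illustrated for the tar case with a small Python sketch. It is not Gitea's Go code and not a proposed implementation: `inline_annexed_files` and the `lookup` callable are hypothetical names, and the sketch assumes every annex symlink target contains `.git/annex/objects/` and ends in the key. It copies an archive such as the one `git archive` produces, swaps each annex symlink for the content it points to, and reports the paths whose content is not available.

```python
import io
import posixpath
import tarfile

ANNEX_MARKER = ".git/annex/objects/"


def annex_key_from_symlink(target):
    """Return the annex key if a symlink target points into the annex, else None."""
    if ANNEX_MARKER not in target:
        return None
    return posixpath.basename(target)


def inline_annexed_files(src_tar, lookup):
    """Copy a tar archive, replacing annex symlinks with the files they point to.

    `lookup` is a hypothetical callable mapping an annex key to its content
    (bytes), or None when the content is not present on the server.
    Returns (new_archive_bytes, missing_paths).
    """
    out = io.BytesIO()
    missing = []
    with tarfile.open(fileobj=io.BytesIO(src_tar), mode="r:*") as src, \
            tarfile.open(fileobj=out, mode="w") as dst:
        for member in src:
            key = annex_key_from_symlink(member.linkname) if member.issym() else None
            content = lookup(key) if key else None
            if content is not None:
                # Replace the symlink with a regular file of the same name.
                info = tarfile.TarInfo(member.name)
                info.size = len(content)
                info.mtime = member.mtime
                info.mode = 0o644
                dst.addfile(info, io.BytesIO(content))
                continue
            if key:
                # Content unavailable: keep the dangling symlink, but report it.
                missing.append(member.name)
            fileobj = src.extractfile(member) if member.isreg() else None
            dst.addfile(member, fileobj)
    return out.getvalue(), missing
```

Returning the list of missing paths leaves the policy decision (refuse, warn, or stay silent) to the caller. In a real setting the set of annexed paths could come from the `git annex find --branch <ref> --anything` call mentioned above rather than from inspecting symlink targets.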

An alternative would be the approach taken by DataLad's datalad export-archive, which does not use git archive and instead reimplements the entire export routine. I think there is some value in keeping the git archive usage though, since it would mean that nothing changes for repos that do not use git-annex.

With this I still have two open questions:

  1. How would we handle missing annexed files? Should gitea refuse to export, show a warning but provide an archive with what it has, or not show a warning at all?
  2. Currently, gitea can export to a "bundle" format as well. This seems to be some git-specific way of exporting the entire repo including history as a file. It looks like it would be more complicated to get all the data into that. Should we even keep that option? I would be inclined to remove it altogether, since I do not see a use case for it that wouldn't be handled by a normal clone anyway.

What do you think about this? I could try implementing the "create archive with git archive, then add annexed files" approach, if you are open to a PR for that. I am new to Go though, so I cannot guarantee perfectly idiomatic code.

kousu commented 1 year ago

Oh hi! That's very cool. I would love to have some help on this "neurogitea" project!

> An alternative would be the approach taken by DataLad's datalad export-archive, which does not use git archive and instead reimplements the entire export routine. I think there is some value in keeping the git archive usage though, since it would mean that nothing changes for repos that do not use git-annex.

That's very interesting. I haven't examined datalad's approach. But I am leaning towards using fewer components and less code if possible. If I had to ask people to add datalad on top of git-annex + git + gitea, it's just one more thing that can get out of sync and break down.

But I also am not even sure how gitea handles exports. I haven't even checked yet if it uses git archive. Maybe it has its own implementation.

> using git archive as it's currently done and then in a second step adding all the annexed files directly to that, replacing the (possibly) existing symlinks

When you do this, you need to remember that there are two kinds of annexed files: symlinks and pointer files. Pointer files are the default in repos made with git-annex v8. I have code that handles both cases in

https://github.com/neuropoly/gitea/blob/5149ad0fb20167a89b217e2e94fe9cc8da908fb9/modules/annex/annex.go#L139-L153

(though it could probably be tightened)
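To make the two cases concrete, here is a minimal Python sketch. It is independent of the linked annex.go code and does not reproduce it. The function names are hypothetical, and it assumes the usual git-annex conventions as I understand them: a locked file is a symlink whose target contains `.git/annex/objects/` and ends in the key, and an unlocked file is stored in git as a small pointer blob starting with `/annex/objects/` followed by the key.

```python
import posixpath

SYMLINK_MARKER = ".git/annex/objects/"
POINTER_PREFIX = b"/annex/objects/"


def key_from_symlink_target(target):
    """Return the annex key if a symlink target points into the annex, else None."""
    if SYMLINK_MARKER not in target:
        return None
    return posixpath.basename(target) or None


def key_from_pointer_blob(blob):
    """Return the annex key if `blob` looks like a pointer file, else None."""
    if not blob.startswith(POINTER_PREFIX):
        return None
    first_line = blob.split(b"\n", 1)[0]
    key = first_line[len(POINTER_PREFIX):].decode("utf-8", "replace")
    return key or None


def annex_key(mode, blob):
    """Map a git tree entry (mode string, blob bytes) to an annex key, or None."""
    if mode == "120000":
        # Symlink entry: git stores the link target as the blob content.
        return key_from_symlink_target(blob.decode("utf-8", "replace"))
    return key_from_pointer_blob(blob)
```

Because git stores a symlink's target as the blob content, one function taking the tree entry's mode and blob can cover both kinds of annexed file.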

> What do you think about this? I could try implementing the "create archive with git archive, then add annexed files" approach, if you are open to a PR for that. I am new to Go though, so I cannot guarantee perfectly idiomatic code.

I'm also pretty new to Go! Working on gitea has been the most experience I've had with it. But it's not meant to be a difficult language to pick up so I think you'll probably be okay. So yes, please :)

matrss commented 1 year ago

> Oh hi! That's very cool. I would love to have some help on this "neurogitea" project!

Hi there as well! My experience with using this project has been pretty great so far, so thanks for starting it.

> But I also am not even sure how gitea handles exports. I haven't even checked yet if it uses git archive. Maybe it has its own implementation.

Looks like it is done in https://github.com/neuropoly/gitea/blob/5149ad0fb20167a89b217e2e94fe9cc8da908fb9/modules/git/repo_archive.go#L52-L75, which does use git archive for .tar.gz and .zip. It shouldn't be too hard to extend that to also add annexed files to the archive after the fact.

In the case of .bundle files I think it happens in https://github.com/neuropoly/gitea/blob/5149ad0fb20167a89b217e2e94fe9cc8da908fb9/modules/git/repo.go#L271-L309. This looks a bit more complicated and we would need to investigate if there even is a sensible way to include annexed content there.

> When you do this, you need to remember that there are two kinds of annexed files: symlinks and pointer files. Pointer files are the default in repos made with git-annex v8.

Thanks for the heads-up, I wasn't aware of that. Funnily enough, this explains a recent confusion I had when trying to get a file out of git-annex. I tried unlocking the file and committing the unlocked version, but the resulting non-symlink file was still shown as part of the annex. It turns out that this made it into a pointer file, and that unannex, which I found later, was indeed the correct approach.

I'll try to come up with a PR and report back, it might take me a while though.

kousu commented 1 year ago

> My experience with using this project has been pretty great so far,

Interesting! I'll be curious to compare notes.

By the way I have some deployment scripts, in ansible, but I haven't published them to galaxy.ansible.com yet. I just need a nudge from knowing I have users to kick me into gear and actually put it out there.

> I'll try to come up with a PR and report back, it might take me a while though.

Wonderful! Good luck and let me know if you need any help!

matrss commented 1 year ago

> the zip output does not even contain the symlinks

Looks like this is not true, it's just that the gnome archive manager I was looking at didn't show them. Unzipping on the CLI with unzip shows that symlinks are preserved.
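A quick way to check this without relying on a GUI viewer is to read the Unix mode stored with each zip entry. The Python sketch below is only an illustration (`zip_symlinks` is a hypothetical helper name) and assumes the archive was written by a tool that stores the Unix mode in the upper 16 bits of the entry's external attributes, which is the common convention for symlinks in zip files.

```python
import io
import stat
import zipfile


def zip_symlinks(zip_bytes):
    """Return {entry name: link target} for symlink entries in a zip archive."""
    links = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            unix_mode = info.external_attr >> 16  # upper 16 bits: Unix st_mode
            if stat.S_ISLNK(unix_mode):
                # For a symlink entry, the stored "file content" is the link target.
                links[info.filename] = zf.read(info).decode("utf-8", "replace")
    return links
```

Run against an archive downloaded from the server, a non-empty result would confirm that the symlinks are in the zip even when a viewer hides them.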

> Interesting! I'll be curious to compare notes.
>
> By the way I have some deployment scripts, in ansible, but I haven't published them to galaxy.ansible.com yet. I just need a nudge from knowing I have users to kick me into gear and actually put it out there.

We are running a publicly reachable instance of this project (https://atris.fz-juelich.de/), as well as an internal one, at my place of work as part of an initiative to establish DataLad in our institute and area of research. That is very much early work-in-progress though. I am also maintaining a "fork" of this repo at https://jugit.fz-juelich.de/m.risse/gitea which contains a number of FZJ-specific changes (mainly things like theming and configuration for external renderers for netCDF and grib files, which are common file types for us). I am using docker-compose to manage these sites, so I have no immediate need for your ansible deployment scripts.