openSUSE / obs-scm-bridge

git scmsync sourced .src.rpm and submit requests do not contain meta data or history #5

Closed JanZerebecki closed 1 year ago

JanZerebecki commented 2 years ago

Is your feature request related to a problem? Please describe.

Currently, when putting the source via scmsync from git into OBS, the built .src.rpm files and the submit requests of the package do not contain history or meta data.

While having the history present doesn't hinder classical package maintenance that adds a patch in the source package, for some people the preferred form for modifications is the git repository. Also, from time to time we have tasks that require the history, sometimes decades later. Some of this is also a legal requirement in the GPL. To verify signatures on the package source, one also needs the .git, as the signatures are in the commits.

https://github.com/openSUSE/obs-service-tar_scm supports the argument package-meta to include the .git in the tar.

One problem with this is that the history may grow too large for the rpm size limit; maybe in this case we have enough time to fix that. In other cases we could use a shallow clone that excludes all branches that are currently not used, as well as everything reachable from the 3rd tag on the current branch. In effect it only includes the commits from HEAD up to, but excluding, the 3rd tag.
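A rough sketch of how such a clone could be requested, assuming the branch in use and the 3rd tag counting back from HEAD are already known. `--single-branch`, `--branch` and `--shallow-exclude` are existing `git clone` options; the function name and the example URL, branch and tag are hypothetical and not taken from the bridge.

```python
from typing import List


def shallow_clone_command(url: str, branch: str, exclude_tag: str, dest: str) -> List[str]:
    """Build a git clone command limited to one branch and cut off at a tag.

    Illustrative helper only; it is not part of obs-scm-bridge.
    """
    return [
        "git", "clone",
        "--single-branch",                   # skip branches that are not in use
        "--branch", branch,
        f"--shallow-exclude={exclude_tag}",  # omit commits reachable from this tag
        url, dest,
    ]


if __name__ == "__main__":
    # Hypothetical repository, branch and tag names.
    print(" ".join(shallow_clone_command(
        "https://example.org/pkg.git", "main", "v1.0", "pkg")))
```

The command is only built here, not executed, so the cut-off tag still has to be determined separately.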

Describe the solution you'd like

Not precisely set on a solution, but maybe: Use a build time source service to tar the .git and add .git.tar.xz as a Source in the spec. Transparently add this service without it being present in _service, and fail the build when the tar is not in the spec file. A submit request would then need to transparently generate the tar. Is this possible, or is there a better way?
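To make the idea of a .git.tar.xz more concrete, here is a minimal sketch of archiving the .git directory with Python's tarfile module. It assumes that a fixed entry order and zeroed timestamps and ownership are wanted so that repeated runs over the same directory give the same bytes; the function names are hypothetical, and this is not how the bridge or any source service currently does it.

```python
import os
import tarfile


def _normalize(info: tarfile.TarInfo) -> tarfile.TarInfo:
    """Drop metadata that differs between machines and runs."""
    info.mtime = 0
    info.uid = info.gid = 0
    info.uname = info.gname = ""
    return info


def archive_git_dir(repo_dir: str, out_path: str) -> None:
    """Write repo_dir/.git to out_path as an xz-compressed tar."""
    git_dir = os.path.join(repo_dir, ".git")
    with tarfile.open(out_path, "w:xz") as tar:
        for root, dirs, files in os.walk(git_dir):
            dirs.sort()  # fixed traversal order
            # Add the directory itself (keeps empty directories), then its files.
            entries = [root] + [os.path.join(root, name) for name in sorted(files)]
            for full in entries:
                tar.add(full, arcname=os.path.relpath(full, repo_dir),
                        recursive=False, filter=_normalize)
```

Calling `archive_git_dir("pkg", ".git.tar.xz")` on a checkout in `pkg` would produce the archive that the spec file lists as a Source. Symlinked directories inside .git are not handled by this sketch.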

Describe alternatives you've considered

We could leave this unfixed in submit requests and only validate that the .git.tar is included in https://github.com/openSUSE/obs-service-source_validator which is run before inclusion in Factory, once Factory has been changed to be Git only. But this delays a lot of improvements instead of allowing them incrementally.

Additional context

JanZerebecki commented 2 years ago

It seems the code that is responsible is https://github.com/openSUSE/obs-scm-bridge/blob/main/obs_scm_bridge#L166

It also seems like .git is not available at build time, so instead of a build time service the bridge can be made to always create an archive of it, thus also fixing it for submit requests.

adrianschroeter commented 2 years ago

This can become an option to include the history, but mls pointed out that we should normalize the on-disk git object store first, as it is not reproducible. Otherwise we would store way too much data, as it breaks any delta mechanism.

JanZerebecki commented 2 years ago

We need to do something similar for the scm service: https://github.com/openSUSE/obs-service-tar_scm/issues/452

However, a naive normalization that unpacks all pack files and stores individual git objects will likely achieve the opposite of the goal, as those pack files are quite efficient and most git clones with shared history will share the exact same pack files. So maybe a not-perfect solution is better.

JanZerebecki commented 2 years ago

The scm service has another source of non-reproducibility: it keeps a repo around which the user may change locally. This can be fixed by locally recloning from that repo with the correct arguments.

This bridge should already create mostly reproducible archives of git repos, because nobody changed the repo and fresh bare git clones are, in my experience, reproducible. If we only include the refs we want to keep as part of the long term history, then the way git creates pack files should also already be space efficient for the OBS delta storage optimization.

It seems the git clone is recreated each time, so we are good on that side: https://github.com/openSUSE/obs-scm-bridge/blob/67f17ebde3d22312d81a79c440337a75b704d360/obs_scm_bridge#L124

As we do a non-bare clone, we also need to take care of things that record the date: We can delete or omit .git/index (as we don't have any uncommitted changes it can be recreated with git reset --mixed HEAD). The reflogs in .git/logs can be deleted or omitted (or not created in the first place with git clone --config 'core.logAllRefUpdates=false' URL, and restored with git config --replace-all core.logAllRefUpdates true && git reset --soft HEAD).
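The delete-or-omit step could look roughly like the following sketch, which removes the two date-recording paths from an existing non-bare clone before it is archived. The function name is hypothetical; only the paths .git/index and .git/logs come from the comment above.

```python
import os
import shutil


def strip_volatile_git_files(repo_dir: str) -> None:
    """Remove .git/index and .git/logs from a non-bare clone.

    Both can be recreated later, so nothing from the history is lost.
    """
    git_dir = os.path.join(repo_dir, ".git")
    index = os.path.join(git_dir, "index")
    if os.path.isfile(index):
        os.remove(index)
    # The reflogs live below .git/logs; a missing directory is not an error.
    shutil.rmtree(os.path.join(git_dir, "logs"), ignore_errors=True)
```

After unpacking such an archive, the index would be recreated with git reset --mixed HEAD as described above.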

JanZerebecki commented 2 years ago

Maybe we should also omit all refs except the currently used one, that is only include the current branch.

To then reproduce an older archive when your git repo has newer commits and refs, you would need to work backwards from the object id that head points to. So for https://github.com/openSUSE/openSUSE-release-tools/blob/205e07a9d442993b842f0d5dcf1dc49d1093b8c5/check_source.py#L536 we need to have a script to do that.
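The core of such a script is a walk over parent links, starting at the recorded object id, so that newer commits and refs are left out. A toy sketch over an in-memory commit graph follows; the mapping and the commit ids are made up for illustration, and a real script would read the parents from the repository instead.

```python
from typing import Dict, List, Set


def commits_reachable_from(head: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return every commit id reachable from head by following parent links."""
    seen: Set[str] = set()
    stack = [head]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return seen


if __name__ == "__main__":
    # Hypothetical history: c4 was committed after the archive recorded c3.
    history = {"c4": ["c3"], "c3": ["c2"], "c2": ["c1"], "c1": []}
    print(sorted(commits_reachable_from("c3", history)))
```

Anything outside the returned set, here c4, would have to be dropped to get back to the state the older archive was made from.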

This then leaves us without tags and git notes. Tags can normally be deleted, and one cannot rely on the tag date to infer whether a tag is newer, unless the dates are verified or enforced. One option is to have the git server reject tags that are older than, say, a minute and refuse to delete any. The proposal from https://gitlab.com/JanZerebecki/git-verify is to checkpoint the tags in a file that is committed. For projects like Factory that are a git repo with submodules, we could instead only checkpoint the refs of the submodules in the project repo. Another option is a transparency log like https://korg.docs.kernel.org/gitolite/transparency-log.html .

adrianschroeter commented 1 year ago

There is now the keepmeta=1 cgi option, with which you can opt out of the removal of git meta data.

The reproducible storing mechanism is still to be done, but it is tracked in the README.
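As an illustration, and assuming the option is passed as a query parameter on the git URL used for scmsync (the README is the authority on the exact form), a small helper could add it like this. The helper name and the example URL are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_keepmeta(scm_url: str) -> str:
    """Return scm_url with keepmeta=1 set in its query string."""
    parts = urlsplit(scm_url)
    # Keep other parameters, replace any existing keepmeta value.
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "keepmeta"]
    query.append(("keepmeta", "1"))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    print(with_keepmeta("https://example.org/pkg.git#main"))
```

A fragment such as a branch name after # stays at the end of the URL, since the query string is placed before it.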