google / git-appraise

Distributed code review system for Git repos
Apache License 2.0
5.13k stars 146 forks source link

Support pulling the git-notes for reviews from an untrusted repository. #71

Open ojarjur opened 7 years ago

ojarjur commented 7 years ago

This is meant for situations where an outside contributor requests a pull from their repository to an upstream repository.

In that scenario, the outside contributor can pull reviews from the upstream repository, and can push their review metadata to their repository. However, the maintainers of the upstream repository probably do not want to pull review metadata for all reviews from that outside contributor's repository, but do want to pull review metadata for that one request.

I imagine a scenario like the following:

  1. A repository (that we'll call upstream) is hosted somewhere.
  2. Outside contributor creates their own fork of that repository (that we'll call fork).
  3. Outside contributor makes some changes in their fork.
  4. The contributor requests a pull (either via git request-pull or something like a GitHub pull request) from their fork into the upstream repository.
  5. The maintainers of the upstream repository fetch the changes from the fork into a branch and create a review request to merge that branch into the master branch.
  6. The maintainers add some comments to the review.
  7. The outside contributor runs git appraise pull <upstream> to fetch the review request and comments from the upstream repository. They then respond to the comments and run git appraise push <fork> to push their comments to their fork.
  8. The upstream maintainers run something like git appraise pull --only-comments <review> <fork> to pull just the comments for that review from the fork, and then they run git appraise push <upstream> to push the combined review metadata to the upstream repository.
  9. Repeat steps 6-8 until the maintainers are happy with the review and submit it.

Since the upstream only really cares about the comments from the contributor (the review requests will be different because the review refs will be different, and the maintainers may want to abandon the review), I'm thinking this can all be supported by adding a flag to the git appraise pull subcommand that tells it to only pull the notes from the refs/notes/devtools/discuss ref, and to only merge in the notes for a specified review.

ojarjur commented 7 years ago

Some things that would be important here:

  1. Make sure that for each comment copied over, the line of text is copied verbatim. This is necessary to prevent the comments from accidentally being duplicated when git-notes are merged.
  2. Since the comments are from an untrusted source, we should have an interactive mode where the maintainer pulling comments can filter out inappropriate comments before they are merged.
  3. Even if the pull is not being done in an interactive mode, we should support some way to filter out comments that should not be pulled from the remote. For example, comments whose author is not the owner of the fork should not be pulled from the fork.

I think one way of achieving this would be to have a flag specifying the author whose comments should be pulled. If no such author is given, then the user is prompted for each new comment to decide if it should be included or excluded.

ojarjur commented 7 years ago

It might be worthwhile to make the work flow be something like "Pull new comments from the user for all reviews", but that may be more tricky than pulling the comments for only a single review, since we have to first figure out which reviews have new comments in the fork.

ojarjur commented 6 years ago

My thinking on this issue has evolved recently. My previous suggestion has always felt too cumbersome and limited to me, and now I think I may know a better way to handle this.

The core idea is that there should be a graph defining how separate remote repositories are linked together. The git appraise pull command would use this graph to pull review metadata not just from the remote repository, but also from all of the linked fork repositories.

This graph should be stored in the repository itself so that third-parties can pull review metadata from multiple forks at once. That allows the network of developers to continue collaborating on changes even if the "authoritative" remote repository stops being updated.

A repository owner could add a fork by running a command like git appraise add-fork <URL>. By doing that they would in effect be giving the owner of that other repository permission to request reviews or add comments. If someone else thinks that a repository owner is being too restrictive in adding forks, then they could add those forks to their own repository and encourage others to use their repository as a remote.

To make this work we would need to make sure the following properties hold:

  1. Pulls from a fork are append-only. I.E. an abusive forker could not hide review requests and/or comments by pushing git-notes commits that delete them.
  2. Review metadata pulled from a fork is somehow authenticated. I.E. an abusive forker cannot add comments posing as someone else. We might do this by listing the allowed "author" values for each fork (e.g. only pull requests and comments from "https://omars-repo.example.com" if the "author" field is "omar"). Alternatively, we might introduce some concept of users based on cryptographic keys, have the tool cryptographically sign each request/comment, and then display the author as being that key.

I do not think we need a way to audit incoming requests/comments before adding them, since we now have a way to edit comments without losing the original comment.

However, we do need to gracefully handle multiple, unrelated review requests for the same commit. Right now we track all of the requests, but treat the last one as the "current" request with all previous ones being obsolete. In this new model we need to be able to distinguish between multiple, semantically different reviews on the same commit. That is needed so that one forker cannot hide another forker's review request by shadowing it with their own.

We might do this by adding a parent field to review requests similar to how we have a parent field for comments. With that we could distinguish between editing an existing review request versus requesting a separate review based on the same commit.

We should probably also add something like a reviewRemote field to review requests, to specify which remote holds the updates to the reviewRef. That field should probably not be changeable by edits to the request.

There are a number of things about this approach that I like:

  1. This gives repository owners control over who they collaborate with...
  2. But, this does not require granting anyone else permission to push to your repository.
  3. This breaks the concept of a single, remote repository; giving us instead a network of linked repositories.
  4. The linked repositories can be hosted anywhere.
  5. This does not require an interactive mode where the user has to audit each incoming piece of metadata.
  6. We can add metadata about allowed authors, etc to the graph itself, so...
  7. This does not require any additional args to the git appraise pull command.
  8. Only the owner of a given remote repository has to run the git appraise add-fork command.
  9. We can also include a git appraise remove-fork command to remove a fork if its owner is being abusive.
ojarjur commented 6 years ago

Regarding authenticating metadata; I think we should try to imitate the model git uses for GPG signing commits.

This would mean adding a -S flag to all of the git appraise subcommands that generate review metadata (e.g. review requests, comments, etc). That metadata would then be signed prior to being written. We would use the user's GPG key defined in the user.signingkey config setting to generate this signature. Additionally, the generated git-notes commits would be signed with that same key.

When displaying reviews/comments/etc, we would include information about whether or not the metadata was signed and whether or not the signature could be verified.

We would also want to add a --verify-signatures flag to the git appraise pull command which would cause the merge to fail if any of the incoming git-notes commits or review metadata were not verifiably signed. This would correspond to the same, existing flag in the git merge command. The git appraise pull command should also take a -S flag to cause the git-notes merge commit to be signed. If the --verify-signatures flag is provided then the -S flag should be implied.

To sign a metadata entry, we could do something like the following:

  1. Add a signature field to the various JSON schemas.
  2. Initially fill in that field with a placeholder value like gpgsignature.
  3. Serialize the JSON object as a single line string.
  4. Generate a detached GPG signature of this string.
  5. Encode the generated signature as a single-line JSON string.
  6. Replace the placeholder value with the encoded signature.
  7. Write out this final value as the actual git-notes entry.

Verifying the signature would then involve running that process in reverse.

This would mean that signing changes and checking signatures would be an optional feature that each individual community could decide for itself whether or not to adopt.

It would also mean that this feature would be orthogonal to the rest of the work described in this issue. I will probably split off a separate issue to track it.

ojarjur commented 6 years ago

For storing the graph of forks, I am now leaning towards a design based on storing the remotes section from the git config in the repository itself. This would let remote repositories define their respective remotes (which I will call forks to avoid confusion). Cloners could then use that information to reconstruct the graph as it would be if the remote repository was updated by pulling from all of the forks.

The list of forks would be stored as a git commit. The files in that commit's trees would be named after the names of the forks. The contents of each file would list the fork's URLs and fetch specs. By using a git commit, we can track the history of the list of forks, which seems like a valuable property to have.

When someone runs git appraise pull against a remote, the tool will first fetch this forks-commit. It will then read the fetch-specs for each fork and convert them into a corresponding fetch spec for the local repository.

If the fork fetch spec ends in something starting with refs/heads/, then that prefix will be replaced with refs/remotes/<REMOTE>/. Otherwise, we will add a prefix of the form refs/forks/<REMOTE>/.

Thus, if your remote is named <REMOTE> and has a fork named <EXAMPLE> with a fetch spec of +refs/heads/*:refs/remotes/<EXAMPLE>/*, then your local repository will fetch from <EXAMPLE>'s URL with a fetch spec of +refs/heads/*:refs/forks/<REMOTE>/refs/remotes/<EXAMPLE>/*.

We'll then need to make the tool append git notes from all refs of the form refs/remotes/<REMOTE>/refs/notes/*/devtools/... into the corresponding refs/notes/<REMOTE>/devtools/... ref prior to merging that ref into the refs/notes/devtools/... ref.

dolanor commented 6 years ago

I really like where it's going. Thanks! And clearly, using PGP as the way to auth the comment is the way to go.

I have a question regarding importing comments from a fork. I'm not sure I agree with the "only import comments from the owner of the fork". I think that external contributors might have interesting point of view. The manually validating each comment for import could be painful. Though, in the CLI, showing a list of commenters on the fork, with a number of comments by contributors could give some idea if you would authorize comments from contributors X or Y when they are not the fork owner.

ojarjur commented 6 years ago

@dolanor Thanks for chiming in. I appreciate the feedback.

I do want external contributors to be able to comment, and I want them to be able to comment on any review (not just the ones they request).

My thinking about how to do this was that each external contributor should have their own, personal fork for which they are listed as an owner. The tool would then fetch and merge comments from each external contributor, but only from their fork.

For example, consider the scenario where there are two forks, one named "alice" with an owner email of "alice@example.com" and another named "bob" with an owner email of "bob@example.com".

In this scenario, anyone pulling from the forks would fetch all comments from both the fork "alice" and the fork "bob". However, they would only merge comments from the "alice" fork if the author was "alice@example.com", and they would only merge comments from the "bob" fork if the author was "bob@example.com".

This way, we validate that comments are coming from the fork that they are supposed to come from, and we can do that validation automatically (without manual user intervention).

Would that address your concern about letting external contributors comment, or is there another scenario that is missing?

dolanor commented 6 years ago

Yes it would. One question though: if alice forked your repo and bob commented on alice repo, could it be possible to automatically discover and fetch bob's comments, even if you only knew alice's repo in the first place?

ojarjur commented 6 years ago

@dolanor The design (for which I have an initial implementation in this branch) does not rely on the user knowing all the forks.

Instead, the forks are stored in a special ref in the repository and can be pushed/pulled with automatic merging.

So, for example, if I have a repository at https://github.com/ojarjur/example, and Alice creates a fork of that repository at https://git.example.com/alice/example, then I would first have to run:

git appraise fork add -o alice@example.com alice https://git.example.com/alice/example
git appraise push

Then any third party could clone my repo at https://github.com/ojarjur/example, and run git appraise pull to see reviews and comments pushed by Alice to her repository. This happens without that third-party needing to know anything about Alice's fork; the tool automatically pulls down everything it needs to know from the main repo.

If Alice gives Bob permission to push to her repository, then any reviews or comments he pushes as "bob@example.com" will be ignored because he was not listed as one of the owners of that fork.

Instead, Bob would need to set up his own repository (let's say he uses https://git.example.com/bob/example), and then convince me to add his fork to the list stored in the main repo.

However, once I add his fork, anyone will then start picking up his reviews and comments without needing to do any manual steps aside from running git appraise pull against my repo.

dolanor commented 6 years ago

Ok, I see. So the right to comment would not be transitive from one repo to another. Each owner of the repo needs to explicitely add a fork. But once that fork has been added, everybody fetching from the original repo would also get the fork's comments. Is that right?

In the case of

If ojarjur adds alice and bob's repos as forks, then anybody fetching reviews from upstream would get both alice's and bob's comments, right?

What if alice adds bob's repo to her own. Somebody fetching reviews from alice's repo would also get bob's comments, right?

But now, if ojarjur adds alice's fork, and alice adds bob's fork, does ojarjur gets bob's comment since alice added bob's fork? And in that case, would users of ojarjur project would also get bob's comments by fetching the review comments?

ojarjur commented 6 years ago

Ok, I see. So the right to comment would not be transitive from one repo to another. Each owner of the repo needs to explicitely add a fork. But once that fork has been added, everybody fetching from the original repo would also get the fork's comments. Is that right?

Yes.

In the case of

  • ojarjur has github.com/ojarjur/example (let's call it upstream)
  • alice has a fork of upstream in git.example.com/alice/example
  • bob has a fork of upstream in git.example.com/bob/example

If ojarjur adds alice and bob's repos as forks, then anybody fetching reviews from upstream would get both alice's and bob's comments, right?

Yes

What if alice adds bob's repo to her own. Somebody fetching reviews from alice's repo would also get bob's comments, right?

Yes, but to clarify, in order for this to happen that person would have to:

  1. Add a remote (let's say it is named "aliceexample") with a URL of https://git.example.com/alice/example
  2. Explicitly pull the review metadata from "aliceexample", i.e. git appraise pull aliceexample.

Assuming they did both steps, they will see bob's comments that he pushed to https://git.example.com/bob/example

But now, if ojarjur adds alice's fork, and alice adds bob's fork, does ojarjur gets bob's comment since alice added bob's fork?

No. The fetching from forks is also not transitive. So, if I pull from https://github.com/ojarjur/example, then I fetch from all of the forks listed there, but I do not pull forks of those forks.

And in that case, would users of ojarjur project would also get bob's comments by fetching the review comments?

No. The tree of forks is basically flat.

dolanor commented 6 years ago

I wonder if the extra step required (because of no transitivity) would affect the amount of users to the repository. In a way, it looks like the OpenPGP web of trust to me, which I kind of like. But also, the PGP WoT never got so big. And I think UX and usability has been the biggest hinderance to the project. And in another way, most of use of git nowadays are centralized in 1 repo. As long as this repo adds all the forks, we're good. But then, it would make git-appraise centralized again.

Apart from that, I think the rest is good.

ojarjur commented 6 years ago

In a way, it looks like the OpenPGP web of trust to me, which I kind of like.

I agree, there are similarities.

And in another way, most of use of git nowadays are centralized in 1 repo. As long as this repo adds all the forks, we're good.

Yes. I actually hope there will eventually be some sort of a web UI that offers self-service support for adding forks.

But then, it would make git-appraise centralized again.

I don't think this would go all the way back to being centralized.

I think, at the most extreme, we would just wind up with a bunch of federated services. A repository may have just one host, but its forks could be on different hosts as long as the original host does not try to lock people in.

Apart from that, I think the rest is good.

Thanks for chiming in and offering your thoughts and feedback; I really appreciate that.

pittma commented 5 years ago

:wave: Hi all!

Is this

To sign a metadata entry, we could do something like the following:

  1. Add a signature field to the various JSON schemas.
  2. Initially fill in that field with a placeholder value like gpgsignature.
  3. Serialize the JSON object as a single line string.
  4. Generate a detached GPG signature of this string.
  5. Encode the generated signature as a single-line JSON string.
  6. Replace the placeholder value with the encoded signature.
  7. Write out this final value as the actual git-notes entry.

Still a thing which y'all are interested in adding to git-appraise?

ojarjur commented 5 years ago

@pittma Yes, it is, and contributions for it would definitely be welcome.

It is orthogonal to the rest of this work, so I should really split out a separate issue to track it.

I will do that right now.

ojarjur commented 5 years ago

@pittma I created https://github.com/google/git-appraise/issues/89 to track GPG signing separately.