google / git-appraise

Distributed code review system for Git repos
Apache License 2.0
5.14k stars 146 forks source link

Add optional CI output to note #50

Open lollipopman opened 8 years ago

lollipopman commented 8 years ago

Instead of pulling CI output from a URL, add the ability to directly include the output in the note. This ensures the historial output is always available.

googlebot commented 8 years ago

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

:memo: Please visit https://cla.developers.google.com/ to sign.

Once you've signed, please reply here (e.g. I signed it!) and we'll verify. Thanks.


ojarjur commented 8 years ago

First off, thanks for contributing. I really appreciate you taking the time and putting in the effort to make git-appraise better.

The goal of this change is very similar to the discussion we've had in this issue regarding the long-term storage of static analysis reports.

There is definitely value in storing the raw data inside of the git repository, but there is also risk of hitting scaling issues for very large projects (whether they be large in terms of codebase, change history, or both). For example, the build and test log for this commit is 273 lines long, and that's for a very simple build and test configuration. It's possible that this might still scale well if the build logs are largely similar to each other (and thus compress well), but I suspect that a lot of projects will have build logs that don't compress well (e.g. if they include a lot of data that is unique to each run, like timestamps).

To be fair, this change does make the output field optional, so each project can decide for itself whether or not to populate it, and how much data to put in there. However, I worry that it's very easy for someone to accidentally hit a scaling issue and that it will then be difficult for them to back out of said issue.

Alternatively, I wonder if we can enable the following scenario:

* Some project sets up their test runners to write logs to the git repository.
* Said project runs in this configuration for a while (perhaps years) without issue
* One day their repository history with the build logs gets large enough that they hit an issue and decide they want to clear out the existing history of build logs.
* They do said clearing out, but in such a way that their high-level build status (e.g. passed vs. failed) is still maintained going back to the beginning of the repository.

E.G. I'd like it if a user could easily drop their build-and-test detailed history while still maintaining the summaries.

What do you think about the idea of storing the logs under a new git ref (for instance, something under the namespace "refs/notes/devtoolsdetailed/"), and then making the existing CI object include a reference to that?

I will concede that this would be a much larger change than the current proposal, so I understand if you don't have time to get to that, but I do think it would be a safer change in the long run.

There's also some ambiguity here about how the detailed logs should be stored. For instance, it could be a single ref holding all logs, or separate refs. I suspect our best bet would be to let the build runner define that, and make the CI object specify enough information to support multiple scenarios. If we go with this route, then I think the minimum amount of information the CI object would have to store is:

  1. The SHA1 hash of the git object holding the build-and-test logs.
  2. The name of a git ref that can be fetched in order to pull down said object.

What do you think?

lollipopman commented 8 years ago

What do you think about the idea of storing the logs under a new git ref (for instance, something under the namespace "refs/notes/devtoolsdetailed/"), and then making the existing CI object include a reference to that?

That makes sense, would you be okay with storing those detailed notes as plain text? I was intending to have my static analysis tool, in this case puppet, create a puppet noop output for each commit. I would then like to be able to view each individual commit along with the ci output, using just git log --notes=refs/notes/devtoolsdetailed.

I will concede that this would be a much larger change than the current proposal, so I understand if you don't have time to get to that, but I do think it would be a safer change in the long run.

I am willing to give it a go.

There's also some ambiguity here about how the detailed logs should be stored. For instance, it could be a single ref holding all logs, or separate refs. I suspect our best bet would be to let the build runner define that, and make the CI object specify enough information to support multiple scenarios. If we go with this route, then I think the minimum amount of information the CI object would have to store is:

The SHA1 hash of the git object holding the build-and-test logs. The name of a git ref that can be fetched in order to pull down said object. What do you think?

What about an array of refs & an array of SHA1 hashes?

googlebot commented 8 years ago

CLAs look good, thanks!

ojarjur commented 8 years ago

That makes sense, would you be okay with storing those detailed notes as plain text?

That sounds reasonable; the log is just a blob, and git already has an object type for blobs, so maybe we can just use it as is. There are complications, though, for distributing the blobs (see below).

I was intending to have my static analysis tool, in this case puppet, create a puppet noop output for each commit. I would then like to be able to view each individual commit along with the ci output, using just git log --notes=refs/notes/devtoolsdetailed.

That seems like a perfectly reasonable goal. I'd actually go further and just use the ref "refs/notes/commits", which is the default notes ref. The reasons I say that are:

  1. Once you decide to store plain text that should be human readable via the git log --notes command, you're back to the original purpose of git-notes, so I think it's reasonable to use the default ref.
  2. The overriding purpose of the refs/notes/devtools/ prefix is to indicate that the data is not human readable. I assume that if we do eventually add the refs/notes/devtoolsdetailed/ prefix, that it should come with the same disclaimer. In your case the data is human readable, so there's no need for such a disclaimer.

That brings us back to the question of distributing the blob with the log contents to the various clones of the repository. What we need is a ref that someone can fetch from a remote (e.g. refs/notes/commits), map to a remote-specific ref in their local repository (e.g. refs/remotenotes/origin/commits), and from which the blob object is reachable.

The bit of difficulty is when we have two different writers for a note on the same object under the same notes ref. We need to have some way of combining the contents from both writers. The current formats do this by using the cat_sort_uniq strategy, which is why they are all written to a single line. Putting the raw log directly in the note would break that, but it might be doable using the union strategy. Would you want the logs from every CI runner when you run git log --notes?

Alternatively, we could use a different notes ref for each CI runner. That might still have issues, though, if a single CI runner tests the same commit twice.

However, the more I think about it, the more that I suspect that when you run git log --notes the most useful output is to just see the log of the most recent CI run. Since notes are versioned, if you always overwrote the current note, then historical blobs would still be reachable. As such, that would satisfy our requirements for distributing the blobs without having to solve the problem of merging them.

If we went with this overwrite approach, then you'd want your CI runner to:

  1. Fetch the refs/notes/commits ref from origin and merge it using the theirs strategy.
  2. Overwrite the current note for the commit in the refs/notes/commits ref.
  3. Push the new value of the refs/notes/commits ref back to the origin.
  4. If step 3 fails try again from step 1.

On the client side, we'd just have to do a fetch of the notes ref from the origin to some non-conflicting location. If the user wanted to show these blobs in the output of git log, then they would have to either merge the notes on their machine, or specify the local copy of the ref from the remote (e.g. refs/remotenotes/origin/commits). Either way, we could leave that to the user and not have git-appraise do anything.

Sorry that this response was long and rambling, but there are subtle complexities here and I want to make sure that however we address those complexities still works for your use case.

I am willing to give it a go.

Thank you, I really do appreciate you volunteering your time and effort to make the tool better.

What about an array of refs & an array of SHA1 hashes?

The CI notes are already an array for each commit (e.g. each commit can be tested multiple times with the summary of each run reported), so I don't think that is necessary. That being said, if I'm overlooking something please let me know.