opensearch-project / opensearch-migrations

All things migrations and upgrades for OpenSearch
Apache License 2.0
32 stars 27 forks source link

[RFC] Change repo settings to only allow 'squash' commits #594

Closed peternied closed 4 months ago

peternied commented 5 months ago

I don't see much value in seeing individual commits at the repo level. With the number of unique tools and developers, it makes it harder to look into history to get a sense of what is happening in the code base (to me).

I suggest that we make a change in this repo's pull request settings to the following:

image

Please vote 👍 or 👎 to help gauge what should be done

AndreKurait commented 5 months ago

@gregschohn, I know you had opinions against this when I joined. Do those still hold with the increase in contributions?

chelma commented 5 months ago

I usually do squash commits myself, so this would not change my workflow. I also prefer them as a general rule, as they package all the work for a feature/bug fix into a single, easily visible unit in the mainline history. I'm open to arguments against the approach though.

peternied commented 5 months ago

For your consideration git lolb origin/main | head -n 20. [1]

OpenSearch-Migrations

Here is what todays git log looks like, which looks like merge is used (but I'm not sure of the setting)

*   4fe2e9e4 (HEAD -> main, origin/main, origin/HEAD) Merge pull request #564 from gregschohn/PrintLongRunningWorkForDebuggingPart2
|\
| * 0e2ce10b PR Feedback, adding comments and simplifying a bit of code.
| *   ad7cd402 Merge branch 'main' into PrintLongRunningWorkForDebuggingPart2
| |\
| |/
|/|
* |   27e25e14 Merge pull request #574 from rishabh6788/main
|\ \
| * | c81c7dcf Remove unused 'arguments' parameter
|/ /
* |   49355f9e (tag: 1.0.3-rc.6) Merge pull request #553 from AndreKurait/FixChunkedHeaders
|\ \
| * \   b04d2a1c Merge branch 'main' into FixChunkedHeaders
| |\ \
| * | | 15ad146b Address PR Comments
| * | | 1feffd25 Fix memory leak in NettyDecodedHttpRequestPreliminaryConvertHandler
| * | | 677b6b45 Fix CRLF in ResultsToLogsConsumerTest
| * | | c6713179 Fix parsing body in AddCompressionEncodingTest
| * | | c37cad21 Correct ByteBuf Formatter with CRLF line endings

OpenSearch repository

OpenSearch has been using this setting for considerable time, here is what todays git log looks like:

* cdb57fa475c (origin/main, origin/HEAD) Update supported version in yaml test file for the primary_only parameter in force-merge API (#13279)
* 61ff5f8d53b Fix flakiness in testDontAllowSwitchingCompatibilityModeForClusterWithMultipleVersions (#13281)
* 84679dea01c Snapshot _status API to return correct status for partial snapshots (update version) (#13262)
* c1d5d76006c Update google dependencies in repository-gcs and discovery-gce (#13213)
* 51009b76d48 enabled mockTelemetryPlugin for IT and fixed OOM (#13054)
* 9c35a848a4a Add remote path settings to RemoteStoreSettings (#13225)
* 1c208d581aa [Remote Store] Cleanup local-only translog files if no metadata in remote (#12691)
* 02f9d74ec77 Add faster scaling composite hash value encoding for remote path (#13251)
* b899a27acc5 Snapshot _status API to return correct status for partial snapshots (#12812)
* 1fcb79de074 Fix flaky test SegmentReplicationTargetServiceTests#testShardAlreadyReplicating (#13248)
* 8d9c3895c3c Cleanup CHANGELOG-3.0 (#13216)
* 6e0ed651269 Revert "Add faster scaling composite hash value encoding for remote path (#13…" (#13244)
* 5375970ad32 Bump bouncycastle from 1.77 to 1.78 (#13243)
* 07d447bfd81 Bump netty from 4.1.108.Final to 4.1.109.Final (#13233)
* 8bd0ad94cbf Add warn logs for remote backpressure rejection (#13218)
* a2f07ed39e0 Remove compatibility checker (#12971)
* 3e8e116aca6 Suppress sys out checks as trace logs enabled during tests can cause failures (#13188)
* 695fbde56fa Add validation while updating CompatibilityMode setting (#13080)
* 5e72e1df6d1 Bump joda from 2.12.2 to 2.12.7 (#13193)
* 3d1d5e7a9e7 Add faster scaling composite hash value encoding for remote path (#13155)

In comparison, I see much better information density out of OpenSearch, no 'addressed pr comments' titled commits, and each commit has a PR number associated with it where the nuance is available if I want to follow up in detail.

BTW I mean no shade for any of the commit authors shown above, I am lazy about commit details because I expect them to be squashed.

gregschohn commented 5 months ago

What problem is this trying to solve? From the link, it seems that there's a hope that a git log can form a narrative. I see the git log as one of the best forensic tools that I have at my disposal. I personally don't expect it to read like release notes, but instead, of developer intent.

Unfortunately, I cannot recover that stream of intent when branches from other forks are squashed or rebased into the main repo, especially if you consider that the source repo that the changes came from could go away. Since github is managed by a single-entity and the data within github isn't in the single block-chained source of truth that is replicated in every cloned repo, I chafe at the idea of having different durability guarantees for my source control metadata and history (my gitlog in my clone will always be in my control).

On the tightness of histories, previously, I've used ffwd-merges and it is unfortunate that github doesn't provide that natively and that pushes are disabled for our repos. That deficiency also contorts people into having discussions like this one. With the ffwd-merge, it eliminates the superfluous "extra merge" commits that pop up on every github "merge-commit" and a lot of the arrowing/lines that you see. So, I'll flip the question around and ask why are we worrying about how to make the git logs pretty if we don't even have the right tools to do that.

Given the workflow where devs might be working on long-lived branches, CI on that development branch may be the norm, merging mainline into the dev branch repeatedly (maybe over months). When it's time to merge that feature, does it make sense to destroy the granular history and checkpoints that shows the progression of an idea through its maturation?

Here are a couple PRs that if squashed would erase critical information. https://github.com/opensearch-project/opensearch-migrations/pull/564/commits https://github.com/opensearch-project/opensearch-migrations/pull/475

The first is a medium sized PR that purposefully separates commits so that file refactoring (renames, moving functions, but not changing business logic) are separate from the rest of the changes. Should those have been separate PRs? Probably - but they were. Should they have been immediately merged/closed? Maybe - but they weren't because development time outpaced review time and other changes were right behind the refactoring PRs. They also weren't merged because they may have, at the time, provided negative value. If the subsequent changes needed to make further modifications to the newly refactored abstractions, the commit logs would have been even worse since effectively experimental commits that the author didn't think were truly ready would be in the git log at the same level as more polished commits that had more confidence.

With file changes, it is MUCH harder to review both renames, content moves, and content updates in a flattened commit than in separate commits. Merge commits always look innocuous upon first pass and can be terrifying when poring over more carefully. When dealing with squashes, we lose the ability to easily cherry-pick cross-file commits and run tests/analysis.

Lastly, I can be lazy about my commit logs too. I see that as a valuable signal later on. Here's a snippet of some of my commits (git log --oneline -- TrafficCapture/trafficReplayer)

c58661db Add logging requests that are outstanding alongside the contexts. Beefed up unit tests for the ActiveContextMonitor too.
69f2c6df Get some reasonable policies together to setup the ActiveContextMonitor to output the deepest items in the context tree.
9856daa1 WIP - wire up some repeating log statements to print out the active contexts, etc, tying their log levels to their ages.

Notice the "WIP" entry. That code ended up working better than I had originally suspected it to work. It took a couple more generations to get it cleaned up. Weeks from now, it could be useful to know if some errant code was thrown in within the first day of the project or at the very end - especially if something was broken and the wrong fix seemed to put things right. Granular histories help us get deeper into the 'whys'.

Github and git use similar, partially overlapping, workflows, but one is completely open-source and in control of everybody holding a copy of the repo and the other requires major leaps of faith. The more open-source safe way to keep this information around would seem to be to stick with the one that can work without an internet connection.

In summary, 1) I'd love to see linear histories when possible w/ ffwd-merges, or even with rebases when the changes are very small. 2) However, to me, removing the forensic evidence of how code evolved is a big loss if the only gain is to engineer a prettier history. 3) There are better ways to natively manage narratives like release notes (which can be automatically created via PR descriptions) and releases.

gregschohn commented 5 months ago

Have you considered running git log --first-parent --oneline to reduce the branch details effectively to the branch name?

48c43ea2 Make RFS runnable as Container (#550)
605057c6 RFS can now take the source snapshot (#551)
9f6f01c1 Merge pull request #549 from AndreKurait/addBuildJarTask
d264a8cc Merge pull request #547 from gregschohn/FixHangingRequests
89e3f651 Adding RFS to the Migration Assistant (#545)
b7231e90 Merge pull request #543 from AndreKurait/SupportLimitlessCodedOutputStream
792c7254 E2E Script very minor fixes (#539)
51384844 Merge pull request #538 from gregschohn/FixConnectionLeaks
0cbcdbce Merge pull request #534 from AndreKurait/LoggingFix
5d165d70 Merge pull request #537 from AndreKurait/HandleLargerFiles
13f802e9 Allow proper cleanup of all CDK resources with options (#535)

If we wanted to enforce something for better details, we could require descriptive one-line merge commits or that branch names are descriptive. Even if squashes are required, the messages could still be of diminished utility, so I'm not sure why it's worth the penalty/one-way door of reducing the information in the logs.

peternied commented 5 months ago

Have you considered running git log --first-parent --oneline to reduce the branch details effectively to the branch name?

@gregschohn Not really, I'd like git log and git lol to be easy to use without extra parameters. Reviewing the output, its the fork/branch name, that might not be helpful. IMO PR titles are far better indicators of what is in a change.


Its nearly been a week, project maintainers what happens next? It looks like votes are in

gregschohn commented 5 months ago

Taking the option for merge commits away is a one-way door. The repository becomes less expressive than it was before and we will end up with 10K line commits and no way within git to untangle the commits. If we go that way, I'd like to understand how we should be proceeding for those. Likewise, for PRs that are being completed in waves (feature work), what should the SOP be for squashing those and dealing with any inadvertent conflicts?

Before solutionizing, what problem are you trying to solve? Is there a rubric that you'd like to see the team adopt for what the history should look like? For some commits, I'll interactively rebase to squash some commits together before pushing. I'm happy to do that more often - but we do lose something there when correlating to the original PR review. Should branch names be more descriptive/should we make sure that they're accurate (personally, I use the branch name to set the PR commit line).

It's a dumb question, but having used git for over a decade, but github a relatively short time - why do people like "clean histories" for a project doing CICD with multiple branches and merge points? Aren't there other ways to make a history clean later without needing to lose information? How do people use git histories? Personally, I use it when doing detective work to find out why specific changes did or didn't happen. I can't recall ever wanting less context or granularity in those situations. Full-disclosure, I use Sourcetree which does a reasonable job for me to grok the history.

gregschohn commented 4 months ago

This decision isn't mine to make, but I'd like it to be made carefully and to consider alternatives and implication.

If the goal is a cleaner history, we could keep this & determine when we should be squashing commits or merges. In other words, when is it "acceptable" to end up with non-linear histories? What commits should be look like in the history log, etc. Should we be maintaining a separate CHANGELOG with every merge?

Of the people that voted, can we get more details in either how we'd deal with the concerns above and if there are any other solutions that would satisfy those. As the question is phrased in binary terms, it's leading & doesn't naturally lead itself to improving processes for the project or the artifacts for the repo.

peternied commented 4 months ago

Taking the option for merge commits away is a one-way door.

@gregschohn you are reaching here - your comment is hyperbolic. 1) Maintainers already squash their history 2) Commit history in forks and pull requests is not erased by enabling this setting.

peternied commented 4 months ago

Its nearly been a week, project maintainers what happens next? It looks like votes are in

This decision isn't mine to make, but I'd like it to be made carefully and to consider alternatives and implication.

@gregschohn is it? There are a number of ways different projects run, what does this project use? The ones that I'm familiar with are:

peternied commented 4 months ago

@gregschohn reflecting on your concerns. I don't think my desire for clean history in my use case should force the point for merge commits not to present in the branch history. As I haven't really used merged commits maybe this is a good chance for me to try it out and see if I come away with any specific concerns.

With that said, I don't see a need to push for this change and suggest that the issue be closed out without any action taken. As we get more data we can always revisit if we feel its worth revisiting.

peternied commented 4 months ago

Its been a couple weeks, closing this out thanks!