Open xaviershay opened 6 years ago
@rspec/rspec feedback please :)
Background reading on monorepos:
We don't have enough code (nor should we ever) to start hitting some of the bigger downsides of a monorepo.
And also my own relevant background: I was responsible for the CI teams at Square for many years, dealing with a large Java monorepo and many disparate ruby repositories.
I support this idea.
I very much support this idea. The amount of overhead we deal with due to having separate repos is demoralizing. Elixir (the main language I dealt with the pass couple years) has first-class support for monorepos with many sub-applications via its umbrella feature and I'm a big fan of the approach when you're tooling supports it.
There's also another potential benefit we get out of a monorepo: it can greatly lessen the effort involved in extracting a feature out of the core gems and into an extension gem. For example, if we ever want to extract any_instance
into a separate gem so it doesn't ship with rspec-mocks
(something I'd advocate we consider for RSpec 4), with our current approach, extraction would be difficult. We'd have to refactor rspec-mocks to promote some new things to first-class extension APIs so the any_instance
code could use it. That would be a lot of effort. With a monorepo, we could potentially move the any_instance
code into a sub-directory of the monorepo, w/o needing to define new extension APIs. Instead, we can just include the any_instance
spec suite in our travis build, so we can keep ourselves from breaking the any_instance
logic.
possibly rspec-rails into a single repository.
I'm in favor of not merging rspec-rails. It's pretty different, and not part of core RSpec. The CI build is way different with all the rails versions it supports. And we've talked about potentially de-coupling it's versioning from the versioning of the rest of RSpec.
Given a motivation is to simplify issue reporting, we should migrate them all to the main rspec project (which current has issues disabled) using something like github-issues-import. We can use tags (mocks, etc...) to track what came from where, but for new issues they would be optional.
Great idea. A few additional thoughts:
This will include closing many stale PRs, that will need to be re-opened against the new monorepo.
Honestly, I think we can just close stale PRs, with a note saying the person can re-open a new PR on the monorepo if they want to see it get merged. I don't think we need to go through the effort of manually importing PRs (and automating it sounds hard). Cleaning up open PRs would be really helpful as a preliminary step.
Run all the specs for each project on every CI. Conceptually the simplest, potential downside is runtime. Given a lot of build time is build setup (installing ruby, gems, etc...) I suspect this will work out fine.
I would favor this approach over the more complicated analysis approach you mentioned. I think we can optimize this using a well-designed build matrix. The cukes are generally the slow part of our build. I'd favor doing something like:
So that would be 4x matrix entires per supported Ruby version...which sounds bad, but isn't actually any worse than our current situation, where we've got 4 builds (core, expectations, mocks, support) per supported version. And it should run simpler and faster.
We can consider not running the cukes on all supported ruby versions if there's a concern about the number of matrix entries.
This is where rspec-rails would be an interesting addition, because it has a test matrix with rails versions that isn't applicable to the other projects and would probably force us to option 2. I'm leaning towards keeping it out, at least to begin with. We can always incorporate it later!
I don't think I can even wrap my head around how we'd incorporate rspec-rails in a monorepo build. I very much advocate for keeping it separate. It's a special beast.
We need to audit rspec-dev to see what things it does are still appropriate and how to best incorporate them into the monorepo. Given it's primarily used to "manage" multiple repos, my hope is it can be mostly deprecated.
There's a couple different things rspec-dev provides:
Rakefile
of the monorepo.I don't think we need to stringently audit rspec-dev and try to bring everything over--instead, we can just bring things over as we have the need for a task.
Create a new project to test this idea (rspec-monorepo), including setting up CI and incorporating rspec-dev.
This is a great idea. Would rspec-monorepo
be a prototype, and then we'd re-do the merge on rspec/rspec
for real? A couple thoughts/ideas:
rspec-monorepo
should probably start off as a fork of rspec/rspec
so you can test out merging into an existing repo with history.rspec
repo. Automating the process with some scripts is probably good, although as a one-time process we shouldn't try to go overboard.One thing you didn't mention: how are we going to actually merge the source history of the repos? Is there a good tool for that? Obviously, all the SHAs are going to change. If possible, it would be really cool if whatever tool we used to import commits amended the message of each commit with a note like [Imported from rspec/rspec-core@abc1239237e8234bedf]
so there's a clear link from each commit to the original. And we could potentially create a link the other way, too: since github lets you leave comments directly on commits, our import tool could leave an automated comment on each commit linking to its new location in the monorepo. Although, given the number of commits, I can imagine that would be super slow....so probably not worth it.
While I'm 100% on board with this idea, I can't say I'm excited (or particularly available) to do much of the work here. Is there someone who is willing to lead this effort? (Looking at you, @xaviershay :)...)
I'm on board with the mono repo idea, would certainly simplify keeping everything in sync!
On the CI side I'd recommend we not split out the various builds into matrices, as it will (ironically) slow the build down. Travis seems to limit the number of in progress builds at a time (one of the reasons why rspec-rails
builds are so slow), which is fair given its a free service, so having one build per Ruby version will probably work out faster in practise.
π for the monorepo idea.
Some ideas for the import of commits:
Some ideas for the import of commits:
Oh interesting, I was just planning to start a new repo and copy the code over, with initial commit being "see the old repos". I can play around with various ways of importing history ... guess it should be pretty straight forward since there's no way they could conflict with one another.
Oh interesting, I was just planning to start a new repo and copy the code over, with initial commit being "see the old repos". I can play around with various ways of importing history ... guess it should be pretty straight forward since there's no way they could conflict with one another.
I think for the prototype repo, copying in all the code is OK so you can work on getting CI setup, etc., but for the final merge I think the extra effort to preserve the source history as much as possible is valuable. I like @yujinakayama's ideas.
Sorry I have made basically no progress on this. I'm going to de-commit from it rather than let it go on in limbo.
I made this prototype repo to start playing around with ideas if anyone wants to pick it up: https://github.com/rspec/rspec-monorepo-prototype
We could merge the other repos while keeping all the history. I had some success with the steps described here (https://thoughts.t37.net/merging-2-different-git-repositories-without-losing-your-history-de7a06bba804):
git clone git@github.com:rspec/rspec-core.git
git clone git@github.com:rspec/rspec-expectations.git
cd rspec-core
git remote add rspec-expectations ../rspec-expectations
git fetch rspec-expectations
git co -b merge-rspec-expectations
git merge -S --allow-unrelated-histories rspec-expectations/master
Recently I've been working on my project RepositoryMerger which satisfies the requirements for the repository migration, and finally it's almost complete.
Here's a prototype monorepo:
https://github.com/yujinakayama/rspec-monorepo
The largest benefit of the monorepo generated by RepositoryMerger is that we can handle the past codebase and commits in the same way as we do after the migration.
git checkout 3-10-maintenance
and just run cd rspec-core; bundle exec rspec
as always. This is equivalent to rake git:checkout[3-10-maintenance]
and cd repos/rspec-core; bundle exec rspec
in rspec-dev
.git bisect
to find the commit that introduced a bug.Though I think the monorepo generated by RepositoryMerger is basically intuitive and easy to handle, below is a brief explanation of the behavior for review from the core team members.
First we need to choose target branches, which we consider important as history and want to import into the monorepo. In other words, commits unreachable from the target branches in the original repos (e.g. stale pull requests) won't be imported. In our case, the target branches should be main
and the maintenance branches (2-2-maintenance
... 3-10-maintenance
).
Then we can import commit history for each branch one by one in the following way:
configuration = RepositoryMerger::Configuration.new(
original_repo_paths: %w[rspec rspec-core rspec-expectations rspec-mocks rspec-support],
monorepo_path: 'monorepo'
)
repo_merger = RepositoryMerger.new(configuration)
repo_merger.merge_commit_history_of_branches_named('main')
repo_merger.merge_commit_history_of_branches_named('2-2-maintenance')
repo_merger.merge_commit_history_of_branches_named('2-3-maintenance')
# ...
repo_merger.merge_commit_history_of_branches_named('3-10-maintenance')
The imported files are placed under subdirectories for each original repo in the same manner as rspec-dev
.
For simplicity, let's consider a case of importing only a single set of branches from two repos: creating main
branch in the monorepo by importing the two main
branches from rspec-core
and rspec-expectations
:
repo_merger.merge_commit_history_of_branches_named('main')
RepositoryMerger imports all the commits from the original branches with the following rules to preserve the original commit graph structure as much as possible:
Now let's consider a case of importing two sets of branches from two repos: creating main
and 3-0-maintenance
branches in the monorepo by importing the two sets of branches from rspec-core
and rspec-expectations
:
repo_merger.merge_commit_history_of_branches_named('main')
repo_merger.merge_commit_history_of_branches_named('3-0-maintenance')
The first import phase for main
does exactly the same processing as the case above. In the second phase for 3-0-maintenance
, it reuses already imported commits if possible instead of creating new ones so that main
and 3-0-maintenance
branches in the monorepo will share some of the commits until they diverge:
This result comes as no surprise.
However, it may produce a surprising result if original repos have complex commit graph structures. For example, let's consider a case where the main
branch in rspec-core
has been merged into the 3-8-maintenance
through a bug fix branch at a point after they diverged:
In this case, RepositoryMerger intentionally creates duplicate commits, instead of reusing commits that would cause commit contamination. As a result, the commit graph seems reassembled. At first glance this may look strange, but if you look at each branch this is reasonable to avoid commit contamination between target branches.
I've implemented the following commit message conversions for convenience and clarity:
Fix a bug
in rspec-core -> [core] Fix a bug
Closes #123
in rspec-core -> Closes rspec/rspec-core#123
This commit was imported from https://github.com/rspec/REPO/commit/SHA1.
to every commit.Examples:
Tags are suffixed with their original component name: v3.10.0
in rspec-core is imported as v3.10.0-core
.
Through trial and error in the development of RepositoryMerger, I've found a strange fact that some old tags in our current repos point orphan commits that have different root commits from the main
branches.
These are the tags with orphan commits:
v2.11.2
v2.11.3
v2.2.1
v2.3.1
v2.5.1
v2.5.2
v2.6.1
v2.6.2.rc
v2.6.2
v2.6.3.beta1
v2.6.3
v2.6.4
v2.7.1
Visualized commit graphs generated with the following command (note that you cannot distinguish the orphan commits with --oneline
format):
git log --graph --format='%ci %C(yellow)%h%Creset%Cgreen%C(bold)%d%Creset %s%n' --simplify-by-decoration '--remotes=*-maintenance' origin/main --tags
rspec-core:
rspec-mocks:
I'm not sure how we should handle these tags. Technically importing them into monorepo is possible but it should cause confusion since the imported branch will also be an orphan one and contain a lot of duplicate commits.
@myronmarston Do you know how these tags have been created?
@rspec/rspec What do you think?
@yujinakayama: amazing work! There's a ton of nuance and complexity here that I didn't realize.
On a side note, I've gotten a bunch of emails from github during this process due to imported commits that @mention
me:
I was wondering what was causing this; now I know :). In theory it would be nice to prevent any additional test runs that you do from sending those emails (I imagine I'm not the only one getting them...). One idea would be to change @mention
to @-mention
in commit messages in any test runs (while keeping a proper @mention
in the final import run).
There's one key invariant that seems really important to verify: for each imported branch, the file system contents of a gem subdirectory must match the file system contents of the same branch in the original repository. For example, on the 3-10-maintenance
branch of the rspec-core
repository, the file system contents should match the contents of the rspec-core
directory of the 3-10-maintenance
branch. I'm not sure of the git command but I believe there's a way to get the SHA hash of the git tree object that represents the file system contents of a directory--what do you think about adding a check to your script that the tree hash is the same for a gem directory on the imported repo as it was in the source repo?
I've implemented the following commit message conversions for convenience and clarity:
These all sound good, but I think there's one more that would be helpful: when a commit SHA is mentioned in a commit message (e.g. Fixes regression introduce in 2114c9b25ffd1d01434c18c131ab9015ad3dcf18
), it would be nice if the SHA was updated to the SHA of the corresponding imported commit in the imported repo. Is that doable? I suspect SHA references would show up in two forms (just the SHA, and a URL to the SHA on github).
@myronmarston Do you know how these tags have been created?
Not off the top of my head. I have no memory of it. Let me dig into it and get back to you.
Also, one more thing to consider: what github repo are we going to use for this? In theory rspec/rspec
would be ideal but it's a repository that already has a history, and that may not play well with the import process. Then again maybe there's a way to at least preserve its master
/main
branch history in a named branch (old-rspec-main
or something) while importing into other branches that don't treat any of the existing commits in that repo as parents. Or maybe it's best to do it in a new repository.
Thoughts?
@myronmarston Do you know how these tags have been created?
I spent some time looking into this and it's super weird. I have no specific memories of how that happened. I took a look at rspec-core v2.11.3 (and v2.11.2) and they have a 100% alternate history, all the way back to the first commit, so that they actually have no shared commits with the main branch. I was thinking maybe someone rebased and force pushed but I don't see how that could even cause this state. Maybe it's possible it got corrupted some how? Although with git's hash-based model I don't really see how that could happen either.
In theory, there might be a way to "repair" it. If you check out the v2.11.2 and v2.11.3 tag I can see there's a commit in the history that claims to be the 2.11.1 release, so you could perhaps branch off of the v2.11.1 tag, and cherry-pick each commit after the alternate 2.11.1 release commit up through the v2.11.3 tag on the new branch. If the cherry-picks applied cleanly (a big if...) you'd have a "clean" 2-11 branch that you could use during the import process.
....but with all that said, I don't think it's worth the effort to try to fix these orphaned branches. I can't imagine anyone is going to use the old 2.x maintenance branches for anything ever again. And it's worth remembering that the import process is not destructive: the original rspec-core
, rspec-mocks
, etc repositories will be left in place as-is, right? These odd orphaned branches will remain available in those repositories.
@myronmarston Thanks for the feedback!
One idea would be to change
@mention
to@-mention
in commit messages in any test runs (while keeping a proper@mention
in the final import run).
Oh, sorry. I didn't imagine that the mentions were included in the commit messages. I'll modify them as you suggested next time I rebuild the monorepo.
There's one key invariant that seems really important to verify: for each imported branch, the file system contents of a gem subdirectory must match the file system contents of the same branch in the original repository.
Yeah, it's tested in this spec by comparing lists of files with digests. I didn't think of your approach with git tree objects, though, I'll add spec examples with that approach for more confidence. I just confirmed that it worked by hand ππ»:
$ cd original_repos/rspec-core
$ git cat-file -p '3-10-maintenance' | grep tree
tree 20b8a7de9ca320fec376e559c30e5548655753a2
$ cd ../rspec-expectations
$ git cat-file -p '3-10-maintenance' | grep tree
tree 89fac6ca71bccfef0bea2f5ff0bd650376e17d81
$ cd ../rspec-mocks
$ git cat-file -p '3-10-maintenance' | grep tree
tree f49bb412ec577ec313cfae97e157d8a900d01b8e
$ cd ../rspec-support
$ git cat-file -p '3-10-maintenance' | grep tree
tree a351b573892f9328c78a96de9479d75058060d53
$ cd rspec
$ git cat-file -p '3-10-maintenance' | grep tree
tree 1c5701ea1b875c9b76ac2c05af76abded8f904f4
$ cd ../../monorepo
$ git cat-file -p '3-10-maintenance^{tree}'
040000 tree 20b8a7de9ca320fec376e559c30e5548655753a2 rspec-core
040000 tree 89fac6ca71bccfef0bea2f5ff0bd650376e17d81 rspec-expectations
040000 tree f49bb412ec577ec313cfae97e157d8a900d01b8e rspec-mocks
040000 tree a351b573892f9328c78a96de9479d75058060d53 rspec-support
040000 tree 1c5701ea1b875c9b76ac2c05af76abded8f904f4 rspec
when a commit SHA is mentioned in a commit message (e.g.
Fixes regression introduce in 2114c9b25ffd1d01434c18c131ab9015ad3dcf18
), it would be nice if the SHA was updated to the SHA of the corresponding imported commit in the imported repo. Is that doable? I suspect SHA references would show up in two forms (just the SHA, and a URL to the SHA on github).
It's possible, but I noticed a concern as I extracted commit references in the URL format, there're some references that point commit comments on GitHub. If we replace them with URLs pointing the imported commits, we cannot see the comments unless follow the link This commit was imported from ...
.
I think we have two options for this:
user/repo@SHA
format that point old commits (i.e. we don't touch SHA).Also, one more thing to consider: what github repo are we going to use for this? In theory
rspec/rspec
would be ideal but it's a repository that already has a history, and that may not play well with the import process. Then again maybe there's a way to at least preserve itsmaster
/main
branch history in a named branch (old-rspec-main
or something) while importing into other branches that don't treat any of the existing commits in that repo as parents. Or maybe it's best to do it in a new repository.
I'd prefer the new repository approach since keeping the merged contents and the old metagem contents together should confuse the future maintainers, like our current situation about the orphaned branch.
Currently I'm thinking the following plan:
rspec
repository as rspec-metagem
or something (without renaming the gem itself). This should be done a few months before the actual migration (maybe now). This is a relatively safe operation since GitHub automatically redirects accesses until we create a new repository named rspec
.rspec-monorepo
as a preparation.rspec
.In theory, there might be a way to "repair" it. If you check out the v2.11.2 and v2.11.3 tag I can see there's a commit in the history that claims to be the 2.11.1 release, so you could perhaps branch off of the v2.11.1 tag, and cherry-pick each commit after the alternate 2.11.1 release commit up through the v2.11.3 tag on the new branch. If the cherry-picks applied cleanly (a big if...) you'd have a "clean" 2-11 branch that you could use during the import process.
....but with all that said, I don't think it's worth the effort to try to fix these orphaned branches. I can't imagine anyone is going to use the old 2.x maintenance branches for anything ever again. And it's worth remembering that the import process is not destructive: the original
rspec-core
,rspec-mocks
, etc repositories will be left in place as-is, right? These odd orphaned branches will remain available in those repositories.
I was thinking exactly the same thing. Repairing them might be possible (I think another issue should arise in the process, though π), but it's a sort of "history rewrite" which is something I'm reluctant to do if possible.
By the way, there're some missing maintenance branches where no patch version has been released for the minor version. Since they are needed to merge branches properly (I've been using my forked repos in the deveploment), I'll run the following commands and push the branches:
cd rspec
git branch 2-2-maintenance v2.2.0
git branch 2-3-maintenance v2.3.0
git branch 2-5-maintenance v2.5.0
git branch 2-6-maintenance v2.6.0
git branch 2-7-maintenance v2.7.0
git branch 2-9-maintenance v2.9.0
git branch 2-10-maintenance v2.10.0
git branch 2-11-maintenance v2.11.0
git branch 2-13-maintenance v2.13.0
cd rspec-core
# v2.13.1 is included in the main branch
git branch 2-13-maintenance v2.13.1
cd rspec-expectations
git branch 2-2-maintenance v2.2.0
git branch 2-3-maintenance v2.3.0
git branch 2-5-maintenance v2.5.0
git branch 2-6-maintenance v2.6.0
git branch 2-10-maintenance v2.10.0
git branch 2-13-maintenance v2.13.0
cd rspec-mocks
git branch 2-2-maintenance v2.2.0
git branch 2-3-maintenance v2.3.0
git branch 2-5-maintenance v2.5.0
git branch 2-6-maintenance v2.6.0
git branch 2-7-maintenance v2.7.0
git branch 2-9-maintenance v2.9.0
git branch 2-10-maintenance v2.10.1
For the issue of the orphaned branches, I found new facts, which are good news:
v2.11.2
and v2.11.3
in rspec-core is actually a partial duplication of rspec-mocks repo.v2.2.1
... v2.7.1
in rspec-mocks is actually a partial duplication of rspec-core repo.We can say that they have exactly the same history and contents because they have the same commit sha.
Also, versions with the problematic tags haven't been released as gems:
Though I still cannot imagine how these branches have accidentally been created, we no longer need to concern about them.
I tried to transfer https://github.com/yujinakayama/rspec-monorepo-migration to rspec organization but failed since I don't have the permission to create public repos on rspec organization.
@JonRowe Could you give me the permission?
@yujinakayama I think you have permission now
@JonRowe Thanks, it worked.
I'm very much in favour of running forward with a monorepo, perhaps as part of the RSpec 4 release process, there will be a lot of massaging PRs and issues running forward and I think we will have to mark the old repos as "archived" with large notices in the read me's.
I'm going to go ahead and start this by renaming the current meta gem repo.
Thank you so much for working on this @yujinakayama
@yujinakayama Fantastic job, thank you!
there will be a lot of massaging PRs and issues
@JonRowe I have some time to spare in November/December.
Well, this thread is quite old, and I know @JonRowe is actively mono-repoing now, but I wanted to check if you've considered the incremental approach - instead of trying to import everything to one place all at once, you could start by shifting two of the repositories together. It'd reduce the time-frame and friction (and the amount of in-flight juggling), and let you iron out the kinks with a smaller working set. Then you could import the other gems to it one at a time afterward as time allows.
The payoff isn't as immediate, but I'd find it a lot less stressful, were it me :-)
The main blocker at the moment is the build matrix for the monorepo, I've made good progress on unifying the build I just need to find time to finisht that. By using a Github action to run the build its no more complicated for all the gems than it is for 2, so doing it incrementally would increase the amount of work as you'd have to tie into the existing builds
Summary
Combine
rspec-core
,rspec-mocks
,rspec-expectations
,rspec-support
,rspec-dev
(which would be mostly deprecated with this proposal), therspec
meta-gem, and possiblerspec-rails
into a single repository.Motivation
Making cross-cutting changes to RSpec involves a complicated dance of PRs. Selected examples:
rspec-dev
sync. As well as been a pain at the time, this has caused confusion for a subsequent contribution.rspec-support
to fix a bug in (say)rspec-mocks
requires a PR to support, a PR to mocks pointing at that branch to demonstrate that the support change works, merging the support PR, then redoing the mocks PR to point back at master. (e.g. https://github.com/rspec/rspec-mocks/pull/726)rspec-dev
tasks to "sync" configuration to all repos requires a PR for each repo.It's also not always clear to issue reporters which project is appropriate for an issue. We shouldn't expect newcomers to know which sub-project an issue could be related to!
Approach
Issues
Given a motivation is to simplify issue reporting, we should migrate them all to the main
rspec
project (which current has issues disabled) using something likegithub-issues-import
. We can use tags (mocks
, etc...) to track what came from where, but for new issues they would be optional.Inflight PRs
We should explicitly close or merge every open PR. This will include closing many stale PRs, that will need to be re-opened against the new monorepo. We can mitigate this with a notice period (i.e. "one month from now we're merging repos").
CI
I see two ways to set up CI:
This is where
rspec-rails
would be an interesting addition, because it has a test matrix with rails versions that isn't applicable to the other projects and would probably force us to option 2. I'm leaning towards keeping it out, at least to begin with. We can always incorporate it later!For now we should benchmark option 1.
rspec-dev
We need to audit
rspec-dev
to see what things it does are still appropriate and how to best incorporate them into the monorepo. Given it's primarily used to "manage" multiple repos, my hope is it can be mostly deprecated. This will be investigated in the prototype suggested below.Drawbacks and mitigations
Next steps
rspec-monorepo
), including setting up CI and incorporatingrspec-dev
.