josh-project / josh

Just One Single History
https://josh-project.github.io/josh/
MIT License

Is there any way to sync fewer merge commits when extracting a subfolder from a large repo? #1328

Open RalfJung opened 5 months ago

RalfJung commented 5 months ago

I've asked this before in https://github.com/josh-project/josh/issues/952 and was told "no", but maybe it's worth asking again. As other projects that have a presence in the Rust compiler monorepo are considering josh, we're seeing initial josh syncs that add more than 10k commits from the parent repo, meaning that about 1/3 of the commits in that history are not actually from the subproject. In Miri we have accumulated at least around 3500 of these commits (it's hard to reliably find them all so this is a lower bound), which is more than a quarter of the commits in Miri. rust-analyzer seems to be doing better, "only" getting around 1500 commits in the initial sync, which is around 5% of the total commits in that history.

> When adding an external repo as a subdirectory into a monorepo (what you want to do) Josh guarantees that splitting that subdirectory back out will yield the exact same sha1's like the original repo. (Most, if not all, other filtering tools do not make that guarantee)

These commits do not originate from the subrepo, so we don't need them to be preserved. But of course josh has no way of knowing that... maybe if we could tell it which part of the history is originally from the subrepo ("this is the subrepo HEAD, everything above this, if it exists in the parent repo, must be extracted perfectly"), it could be more "sloppy" on the remaining history? But I can see how that could be anything between tricky and complete nonsense...

Cc @flip1995 @lnicola

christian-schilling commented 5 months ago

Interesting 🤔

"It's hard to reliably find them" also means: "it's hard to implement a rule to skip them".

But I have been looking again at the history you posted in #952. A lot of the merges look degenerate: the second parent is an ancestor of the first parent. I wonder if that would be a rule to use to filter them out. Or maybe: if the second parent is an ancestor of the first and the diff with the first parent is empty, skip the commit.

The current rule is to keep a merge if the diff with either parent is non-empty, and to never change the number of parents a commit has.
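To make the two rules concrete, here is a minimal sketch on a toy commit graph. It is not josh's implementation: the `Commit` type, the function names, and the use of a plain string as a stand-in for the filtered tree are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    tree: str            # stand-in for the hash of the filtered tree (assumption)
    parents: tuple = ()  # ids of the parent commits, first parent first


def is_ancestor(graph, ancestor, descendant):
    """True if `ancestor` is reachable from `descendant` (or is the same commit)."""
    stack, seen = [descendant], set()
    while stack:
        cid = stack.pop()
        if cid == ancestor:
            return True
        if cid in seen:
            continue
        seen.add(cid)
        stack.extend(graph[cid].parents)
    return False


def keep_merge_current(graph, cid):
    """Current rule as described above: keep if the diff with either parent is non-empty."""
    commit = graph[cid]
    return any(graph[p].tree != commit.tree for p in commit.parents)


def skip_merge_proposed(graph, cid):
    """Proposed rule: skip a two-parent merge whose second parent is an ancestor
    of the first and whose diff with the first parent is empty."""
    commit = graph[cid]
    if len(commit.parents) != 2:
        return False
    first, second = commit.parents
    return graph[first].tree == commit.tree and is_ancestor(graph, second, first)


# Hypothetical history: "m" is a degenerate merge, "n" is a real one.
graph = {
    "a": Commit("t1"),
    "b": Commit("t2", ("a",)),
    "c": Commit("t3", ("a",)),
    "m": Commit("t2", ("b", "a")),  # second parent "a" is an ancestor of "b"
    "n": Commit("t4", ("b", "c")),  # neither parent is an ancestor of the other
}

print(keep_merge_current(graph, "m"), skip_merge_proposed(graph, "m"))
print(keep_merge_current(graph, "n"), skip_merge_proposed(graph, "n"))
```

In this toy graph the current rule keeps `m` because its tree differs from that of its second parent, while the proposed rule would skip it; the ordinary merge `n` is kept under both.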

RalfJung commented 5 months ago

> "It's hard to reliably find them" also means: "it's hard to implement a rule to skip them".

Yeah, I understand. And obviously keeping the round-trips working is very important.

Even if you found a better heuristic now, I have no idea how we'd migrate to that given that we've already integrated all these merges into Miri's history. Though maybe it would be worth a single force-push to Miri...

christian-schilling commented 5 months ago

> Even if you found a better heuristic now, I have no idea how we'd migrate to that given that we've already integrated all these merges into Miri's history. Though maybe it would be worth a single force-push to Miri...

This could also be implemented in a new filter, or as an option for the :linear filter.

flip1995 commented 2 months ago

I'm still a bit hesitant with moving Clippy over to Josh, because of this comment:

> I have no idea how we'd migrate to that given that we've already integrated all these merges into Miri's history. Though maybe it would be worth a single force-push to Miri...

If this does get implemented, is there a way to move from the current :rev filter we're using to the new implementation without force pushing? If not, are you planning to work on this issue? Is there anything I can help with here?

If this is not planned to be implemented soon, I'll probably move to Josh anyway with the :rev filter, as I really want to get rid of git subtree :sweat_smile:

christian-schilling commented 2 months ago

Sorry for the late response... So to answer the first question: If the behaviour is changed and you want to migrate, a force push will be necessary. As the current behaviour is already relied upon, the new one (if my idea even works, we don't know that for sure yet) will be opt-in and supported alongside the current one. Regarding planned work on it: I think I could look into it sometime next week.
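The reason a force push would be needed can be sketched with a simplified model of commit hashing. A Git commit id covers the ids of its parents, so dropping a merge changes the id of every commit that descends from it. The `commit_id` and `build` helpers below are hypothetical and deliberately do not reproduce Git's real object format (which also hashes author, committer, and dates).

```python
import hashlib


def commit_id(tree, parent_ids, message):
    """Simplified stand-in for a Git commit hash: covers tree, parents, message."""
    payload = "\n".join([tree, *parent_ids, message]).encode()
    return hashlib.sha1(payload).hexdigest()


def build(history):
    """history: list of (name, tree, parent_names, message) in topological order."""
    ids = {}
    for name, tree, parents, message in history:
        ids[name] = commit_id(tree, [ids[p] for p in parents], message)
    return ids


# Extracted history under the current behaviour, with a degenerate merge "m".
with_merge = build([
    ("a", "t1", [], "init"),
    ("b", "t2", ["a"], "change"),
    ("m", "t2", ["b", "a"], "degenerate merge"),
    ("c", "t3", ["m"], "later work"),
])

# The same history if the merge were skipped by a new rule.
without_merge = build([
    ("a", "t1", [], "init"),
    ("b", "t2", ["a"], "change"),
    ("c", "t3", ["b"], "later work"),
])

print(with_merge["b"] == without_merge["b"])  # commits before the merge keep their ids
print(with_merge["c"] == without_merge["c"])  # commits after it do not
```

Under this model, everything up to the skipped merge is unchanged, but `c` and anything built on it gets a new id, so a branch that already contains the old `c` cannot be fast-forwarded to the new one.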

flip1995 commented 2 months ago

Thanks for looking into this! I can definitely still wait a week or two or longer, if your idea works out but you need more time implementing it.