thaljef / Pinto

Curate your own repository of Perl modules
https://metacpan.org/module/Pinto::Manual
66 stars 49 forks source link

What are the plans for merge? #52

Open schwern opened 11 years ago

schwern commented 11 years ago

I just noticed that merging stacks disappeared. This is an important part of the value of Pinto to me. Delete + copy is enough to do a dev -> staging -> production workflow, but if there's any hot patches they'll have to be moved back manually.

thaljef commented 11 years ago

merge is another feature that fell out when the VCS was redesigned. I always thought it could be pretty useful too, but when I asked around, no one claimed to use it or even understand the use case for it. I'd be happy to bring it back if we can both agree on what it ought to do.

So in your mind, what does it mean to "merge" two stacks?

schwern commented 11 years ago

In all cases B is being merged into A. * is the repository root. A and B are branch heads.

First is the simple fast-forward case. No change has been made to B since A branched from it.

*---B----A

becomes

*--------A
         B

The opposite is either a no-op or an error because B is not an ancestor of A.

*---A----B

becomes

*---A----B
or error

Now for the interesting part, when both branches have diverged. A use case is where you have multiple project branches (we'll call them P1 and P2) branched from a baseline branch (A). For example, if there's a bug fix in File::Spec or similarly universally important module that might go into A and be pushed out to the projects.

       _____P1
      /
*-------A
     \____P2

You want to merge changes to the baseline set of modules (A) into the project branches. What happens? To figure that out we have to break down the concept of what can be automatically merged and what cannot be.

Merging conflicts are caused by two changes trying to change the same line or nearby lines. This has two goals. First, with a direct conflict the computer has no way to figure out which change takes precedence, it needs a human to make that decision.

Second, checking the surrounding context hasn't changed is an attempt to try to make sure the change is still relevant. That the context hasn't changed to make this change make no sense.

To sum, a merge can be done automatically unless...

  1. Both sides are changing the same thing in different ways.
  2. One side has changed something in a way to make the other side's change not make sense.

What Pinto has going for it is most merge systems have no understanding of the thing they're merging. It's just lines of text, not code. Pinto does understand.

Final clarifying point. All "changes" referred to are since the last time A was merged into P1. This is so that if A and P1 both touch Foo::Bar, you merge and resolve the conflict, that the next merge doesn't repeat the Foo::Bar conflict. This is where Pinto must do merge tracking, to remember what was merged into who and when. This is where Subversion fell flat on its face. It's not easy unless you have a tree structure for your repository like Git. The Subversion merge tracking model is crap, but it is relatively simple to implement.

For the first point, if both branches change the same module it is a conflict. Pretty straight forward. This includes pins.

A merge cannot violate any pins, that is a conflict.

For the second point, there's a few ways to interpret an out of context change. First, for each changed module Pinto must check that it doesn't break the dependency requirements of anything else. Pinto can already do this.

A first order merge algorithm for merging A into P1 might be...

Conflict resolution must be interactive and the user must be able to continue the merge after resolving the conflict. This is to say it's not OK for Pinto to just error out of the merge on conflict.

Also note that the dependency check must take into account ALL of @A_Changes at once. That is to say, if A upgrades Foo::Bar which requires an upgrade to Some::Module, Pinto should wait to check if @A_Changes is going to upgrade Some::Module, too. I've run into this problem with "pinto add".

How's that?

thaljef commented 11 years ago

How's that?

I think we're getting close. I've gotta run, but I wanted to jot down some thoughts before I forget...

schwern commented 11 years ago

Merging packages vs distributions... if I understand you correctly "merge by package" would treat a change like these separate actions...

"Merge by distribution" treats the same thing as one change.

This again hits at that core architectural problem of CPAN, it has no concept of a distribution. Ideally, CPAN and Pinto should never have two releases of the same distribution in the index. In that sense, merging by distribution is what you want. From another perspective, the Pinto user is always working with releases of distributions. It doesn't make sense to say "I want to upgrade JUST Test::More but leave Test::Builder as is even though they're part of the same distribution". Since Pinto presents the user with releases of distributions, merging should be by releases of distributions.

The analogy to adjacency and ordering in Pinto is the dependency graph.

Good to hear Pinto is a graph rather than a stack of pancakes. :) That will make this so much easier.

I'm not sure what a "hypothetical index" is. Example?

thaljef commented 11 years ago

I'm not sure what a "hypothetical index" is. Example?

Using a couple of packages, describe what the index looks like at each point (the common ancestor, tip A, and tip B) and show what the new tip should look like after the merge. For example:

ancestor                               B   --------merges into------>  A  --------- yielding --------> A'
_________________________________________________________________________________________________________________
Author/Dist-1/PkgA~1        Author/Dist-1/PkgA~1              Author/Dist-1/PkgA~1           Author/Dist~1/PkgA~1
Author/Dist-2/PkgB~2        Author/Dist-2/PkgB~2              Author/Dist-2/PkgB~2           Author/Dist~2/PkgB~2
<none>                      Author/Dist-3/PkgC~3              <none>                         <conflict ???>
Author/Dist-4/PkgD~2        Author/Dist-5/PkgD~3              Author/Dist-6/PkgD~4           <conflict>

Something like that :)

thaljef commented 11 years ago

Merging packages vs distributions...

We're going to have to sort that part out first. I'll get back to you on that tomorrow.

thaljef commented 11 years ago

Here's my feeling at the moment...

Distributions are purely notional. They can't be reliably ordered or even related to each other. Pretending that they can goes against the grain of the toolchain and smells like a recipe for trouble. I don't believe it is an "architectural problem" of CPAN. Rather, it is just a natural side-effect from having module-level dependencies.

I don't agree that limiting Pinto to one release of a distribution per stack is necessarily "ideal." In your use case it might make perfect sense. But for others, it may not. Some may wish to leave old distributions in the index just like PAUSE does. Fortunately, Pinto lets you go either way. But I don't feel that I can say either strategy is "right."

A Pinto user is not "always working with releases of distributions" as you say. I think you may feel that way because you want to make a repository full of old versions of modules, and the only way you can do that with Pinto right now is to ask for a specific release of a distribution.

But if Pinto can precisely locate (module => version) pairs for you, then releases become largely invisible to you. In fact, that very same issue is discussed in #37 and is the focus of our crowd-funding campaign at http://tinyurl.com/gopinto.

So in terms of merging, I think it means that merges are done at the package level. However, I can probably still preserve the all-or-nothing behavior that you want for distributions. We'll just have to sketch out some test cases for that and make sure we both understand.

schwern commented 11 years ago

On 4/10/13 5:34 AM, Jeffrey Ryan Thalhammer wrote:

Distributions are purely notional. They can't be reliably ordered or even related to each other. Pretending that they can goes against the grain of the toolchain and smells like a recipe for trouble. I don't believe it is an "architectural problem" of CPAN. Rather, it is just a natural side-effect from having module-level dependencies.

Returning from the Perl QA Hackathon, CPAN feels this is an architectural problem and are moving to revise it. An installed distributions database is in the works, distribution level CPAN meta data, and new standards for Foo-Bar-1.23 to contain Foo::Bar.

For Pinto, if pull is going to treat things as releases then why not merging?

In addition, Pinto is in the position where it knows what is currently installed and what came with each release. It has an installed releases database. It can determine if two releases intersect with each other simply by parsing the archive name or by checking if they change the same modules. For the purposes of merging (and many other things) it's important to know if two tarballs are going to change the same package.

Let me go through an example of a merge to show you what I mean and how it works out better by distribution.

Foo-1.22 contains...
    Foo @ 1.22
    Foo::Bar @ 1.22
    Foo::Baz @ 1.22
    Foo::Buf @ 1.22

Foo-1.23 contains...
    Foo @ 1.23
    Foo::Bar @ 1.23
    Foo::Baz @ 1.23

Foo-1.24 contains...
    Foo @ 1.24
    Foo::Bar @ 1.24
    Foo::Buf @ 1.24

What happens? Here's the state just before merging begins.

          master          feature        fix

Foo 1.22 1.23 1.24 Foo::Bar 1.22 1.23 1.24 Foo::Baz 1.22 1.23 1.22? Foo::Buf 1.22 1.22? 1.24

The question marks indicate places where if we change pull to first uninstall there will be no version installed. Pull would have uninstalled it.

If we do it by package, then the user is presented with this...

+Foo @ 1.24 -Foo @ 1.23 +Foo::Bar @ 1.24 -Foo::Bar @ 1.23 +Foo::Baz @ 1.22 -Foo::Baz @ 1.23 +Foo::Buf @ 1.24 -Foo::Baf @ 1.22

Note the bogus downgrade of Foo::Baz. In this case it doesn't matter, but I'm pretty sure I can contrive cases where it will.

In addition, there is really only two valid outcomes...

Equivalent to Foo-1.24

Foo @ 1.24 Foo::Bar @ 1.24 Foo::Baz @ 1.22 Foo::Buf @ 1.24

OR

Equivalent to Foo-1.23

Foo @ 1.23 Foo::Bar @ 1.23 Foo::Baz @ 1.23 Foo::Baf @ 1.22

I can think of no use case where you'd want to mix and match versions of modules from distributions. Allowing the user to do so just begs their making a mistake when merging.

In addition, it's a lot for the user to parse. And that's for a small module. When merging, here's what I'd rather see...

+Foo-1.24 -Foo-1.23 +Foo @ 1.24 -Foo @ 1.23 +Foo::Bar @ 1.24 -Foo::Bar @ 1.23 +Foo::Baz @ 1.22 -Foo::Baz @ 1.23 +Foo::Buf @ 1.24 -Foo::Baf @ 1.22

What that clearly says is "Foo was upgraded from 1.23 to 1.24".

Conflicts should be determined by two branches changing the same module, but this should cause a conflict at the archive level because it is the archives which are in conflict.

Looking at it another way, the user isn't going to cherry pick Foo from Foo-1.23 and Foo::Bar from Foo-1.24 to resolve this conflict. They're going to pick Foo-1.23 or Foo-1.24. In fact, conflict resolution could be presented like this...

Merging fix into feature...
Conflict: Both branches update the Foo distribution

    parent:  Foo-1.22
    fix:     Foo-1.24
    feature: Foo-1.23
                      parent          feature        fix
        Foo           1.22            1.23           1.24
        Foo::Bar      1.22            1.23           1.24
        Foo::Baz      1.22            1.23           1.22
        Foo::Buf      1.22            1.22           1.24

Which version would you like?

Nice, clean presentation and a clear choice for the user which they cannot screw up.

(I'm not sure how any of the above would interact with prerequisites)

I don't agree that limiting Pinto to one release of a distribution per stack is necessarily "ideal." In your use case it might make perfect sense. But for others, it may not. Some may wish to leave "old" distributions in the index just like PAUSE does. Fortunately, Pinto let's you go either way. But I don't feel that I can say either strategy is "right."

A) What's the use case? B) Would you agree it's not a common case? C) Can we make that a configurable option defaulting to no?

A Pinto user is not "always working with releases of distributions" as you say. I think you may feel that way because you want to make a repository full of old versions of modules, and the only way you can do that with Pinto right now is to ask for a specific release of a distribution.

But if Pinto can precisely locate (module => version) pairs /for you/, then releases become largely invisible to you. In fact, that very same issue is discussed in #37 https://github.com/thaljef/Pinto/issues/37 and is the focus of our crowd-funding campaign at http://tinyurl.com/gopinto.

As I understand it, even if a user types in "pinto pull Foo::Bar" on the command line, there is no difference between that and "pinto pull <the release containing the latest Foo::Bar>". That is, it's just syntax sugar for pulling a release.

Is there a commonly used command which allows the user to manipulate Pinto at the individual package level which is not syntax sugar for working with an archive? For example, can I change Test::More without changing Test::Simple and Test::Builder and is this a commonly done thing?

So in terms of merging, I think it means that merges are done at the package level. However, I can probably still preserve the all-or-nothing behavior that you want for distributions. We'll just have to sketch out some test cases for that and make sure we both understand.

Maybe I'm not sure what "merging at the package level" means. Because merging happens between stacks, and everything else Pinto does appears to internally happen at the archive level.

thaljef commented 11 years ago

@schwern, I have come around to your point of view. Pinto manages distributions, not modules. So yes, I agree that a Pinto index should never contain just part of a distribution. When you add or pull a distribution to a stack, then any incumbent distributions which contain a package that exists in the incoming distribution should be removed from the index entirely.