Please preserve subdirectories in distribution paths

rafl commented 9 years ago

PAUSE gives authors the possibility to organise the files in their authors/N/NA/NAME/ directory by using subdirectories. http://cpan.metacpan.org/authors/id/S/SC/SCHWIGON/ is an example of an author directory making use of that feature for most distributions.

Currently, when wanting to import a distribution like that by its path you'd run pinto pull SCHWIGON/acme/Acme-Rautavistic-Sort-0.01.tar.gz. Once imported into pinto, however, you'll have to refer to the distribution as SCHWIGON/Acme-Rautavistic-Sort-0.01.tar.gz, as the subdirectory isn't preserved.

I think it'd be good to try to preserve those subdirectories upon pulling so that other tools relying on fully qualified distribution paths can work either against an actual CPAN mirror or against a pinto-maintained one without having to care about whether or not authors are making use of PAUSE's subdirectory feature.

thaljef commented 9 years ago

Can you give me a concrete example of those tools and how Pinto breaks them?

rafl commented 9 years ago

Telling a cpan client to install a distribution from an absolute PAUSE path is pretty common:

cpanm SCHWIGON/acme/Acme-Rautavistic-Sort-0.01.tar.gz

I expect the above command to always work the same way, no matter if I'm pointing the cpan client at an official cpan mirror or at a pinto mirror I maintain myself, which had the modules in question imported into it.

With pinto discarding subdirectories, that's not currently true.

thaljef commented 9 years ago

I'll have to think about this more. Pinto has never promised to be a proper CPAN mirror, only to be compatible with the CPAN toolchain. That subtle difference gives me the flexibility I need to add the features that make Pinto useful. If you really want a CPAN mirror, then Pinto may not be the right tool.

rafl commented 9 years ago

I think it's fair that Pinto mirrors want to be somewhat different from a real CPAN mirror.

The practical problem I was trying to solve by suggesting to preserve subdirectories is this: In my daily work with Pinto, I often deal with textual descriptions of pinto stacks. Those descriptions contain all the information needed to create a Pinto stack with all the distributions I expect to be there.

Currently I'm just using a list of full CPAN distribution paths to describe the contents of my stack. This makes it very easy to create the Pinto stack: pinto pull <stack-description.

However, I'd also like to take an existing stack in Pinto and export it into a stack description that I can easily share with other developers, so that they can re-create the stack I'm working on. Up until now, I've implemented this "stack export" by merely listing the distributions imported into that stack by their full dist path within the Pinto mirror. This works great, but as soon as you add CPAN distributions using subdirectories to the mix, things break. With Pinto essentially stripping out the subdirectory, I can no longer round-trip from a stack description to a reified Pinto stack and back to a stack description.

It occurred to me that preserving the subdirectories would be a good way to allow me to import and export my stacks. I understand your concerns with that approach, though.

If you decide you don't want to be preserving subdirectories in distribution paths, could you perhaps suggest other ways in which Pinto could safely be modified to enable the workflow I've described above? One idea would be to store information about the original distribution path that was used to pull a dist. Pinto could still expose the imported dists without the subdirectories, as it currently does, but it'd allow me to get access to the data I need for exporting a stack to a more compact "stack description" I can use to re-import again later.

thaljef commented 9 years ago

Pinto does store the URL that each distribution was pulled from. So you could get the URLs for all distributions in a stack like this:

pinto list --format %S | sort | uniq

Pinto doesn't understand URLs as targets[1] so you'll have to parse that a bit before you can reify a stack from it. But assuming all the distributions came from a CPAN, that should have the full paths for you. If there are any locally added distributions, the URL will just be LOCAL.

However, @tartansandal recently experimented with a "clone" command which might be more convenient and more correct. So perhaps we should continue this discussion over there at #189.

[1] I could certainly fix that.

rafl commented 9 years ago

Pinto storing this information is fantastic news. Thanks for pointing that out!

I'm completely happy to parse full CPAN distribution URLs back to relative paths within a CPAN mirror - CPAN::DistnameInfo makes that easy enough.
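As a rough illustration of that parsing step, here is a minimal Python sketch that turns a source URL into a mirror-relative distribution path. It is a stand-in for what CPAN::DistnameInfo would do in Perl, not a port of it. The function name `cpan_path_from_url` is hypothetical, and the sketch assumes the standard `authors/id/X/XY/AUTHOR/...` layout and that locally added distributions report the literal string LOCAL, as described above.

```python
from urllib.parse import urlparse


def cpan_path_from_url(url):
    """Return 'AUTHOR/optional/subdirs/Dist-1.0.tar.gz' for a CPAN URL.

    Returns None for locally added distributions (reported as 'LOCAL').
    Raises ValueError if the URL does not follow the authors/id layout.
    """
    if url == "LOCAL":
        return None
    segments = [s for s in urlparse(url).path.split("/") if s]
    for i in range(len(segments) - 1):
        if segments[i:i + 2] == ["authors", "id"]:
            # Skip 'authors', 'id' and the two index directories (e.g. S, SC).
            rest = segments[i + 4:]
            if len(rest) >= 2:  # need at least AUTHOR and a file name
                return "/".join(rest)
    raise ValueError(f"not a CPAN distribution URL: {url}")


url = ("http://cpan.metacpan.org/authors/id/S/SC/SCHWIGON/"
       "acme/Acme-Rautavistic-Sort-0.01.tar.gz")
print(cpan_path_from_url(url))
```

Because everything after the two index directories is kept, the subdirectory (`acme/` here) survives the round trip.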

Unfortunately, the list command seems to be the only one currently supporting custom formats to extract the information I'm after here. Technically that'd be enough to implement just what I want, but I'd find it a whole lot more convenient for my particular use case if the diff command also supported custom formats.

I've created http://github.com/thaljef/Pinto/pull/200 to add that feature.

thaljef commented 9 years ago

I like #200 a lot. So are you planning to send out "patches" so that other folks can synchronize their stacks?

rafl commented 9 years ago

I'm glad you like that pull request. It would give me a way of solving the specific problem I had when creating this issue, and also add some general utility to Pinto.

The idea of sending out patches sounds really interesting, but it hadn't occurred to me. I think my use of Pinto might be kind of exotic, so I'll try to describe it a bit:

First off, I use Pinto in "local mode". That is, I don't run a webserver anywhere hosting my stacks and their indices. Instead, all developers have their own pinto repository in their development environment. When installing modules, I point cpanm directly at the directory containing the stack contents (--mirror file:///.../stacks/foo --mirror-only).

Stacks in that personal pinto repository get created from a file maintained in a project's repository. It's somewhat similar to cpanfile.snapshot, and contains a list of CPAN distribution paths. Those can be imported into Pinto as a certain stack using pull --no-recurse. The stack name is determined by hashing that list of distribution paths: pinto pull --stack $(myhash dependencies) --no-recurse <dependencies.
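The `myhash` tool itself isn't shown in this thread, so the following Python sketch is only a guess at the general idea: derive a stack name from the contents of the dependency manifest. The function name `stack_name_for`, the choice of SHA-1, and the 12-character truncation are all assumptions made for illustration.

```python
import hashlib


def stack_name_for(manifest_text):
    """Derive a short, stable stack name from a dependency manifest.

    Blank lines and surrounding whitespace are ignored so that cosmetic
    edits do not produce a new stack. Line order is kept, since the
    manifest's ordering may be meaningful.
    """
    paths = [line.strip() for line in manifest_text.splitlines()]
    paths = [p for p in paths if p]
    digest = hashlib.sha1("\n".join(paths).encode("utf-8")).hexdigest()
    return digest[:12]  # hypothetical length; long enough to avoid clashes


manifest = "SCHWIGON/acme/Acme-Rautavistic-Sort-0.01.tar.gz\n"
print(stack_name_for(manifest))
```

Since the distribution paths include version numbers, any upgrade changes the digest and therefore the stack name.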

There's also a tool to build a new lib directory with all the dependencies listed in our dependency manifest that ensures all packages get installed in the right order to always produce repeatable lib directories even in the face of optional prerequisites.

Sometimes a developer needs to upgrade a module or install a new dependency. There exists some tooling around Pinto to help with that. It will pinto copy the stack describing the current dependency tree to an upgrade stack, and then perform whatever operation the developer requested on it, such as pulling in new distributions (and following their dependencies) or upgrading and removing existing ones. Once a developer is happy with their dependency changes, they can "commit" to that upgrade.

Committing an upgrade will do several things. Most importantly, the list of CPAN distribution paths (the ones we seeded our Pinto stacks to begin with) needs to be updated. This updated list is what we distribute between each other using the application's git repository. When developers check out an updated version of the dependency list, tools will notice that they don't yet have a local stack with that dependency tree in it, import the updated dependency tree into Pinto, and produce a new lib directory containing the new dependencies.

This update of the main dependency manifest when developers modify it is something I'm currently implementing using pinto diff. I ask Pinto for the changes between the current stack and the upgrade stack, and apply those to the existing dependency manifest file (after some post-processing and sanity-checking, for example to make sure that we're not accidentally depending on two different versions of the same distribution or on two distributions that provide the same module).
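The post-processing isn't shown here, but one of the sanity checks mentioned (not depending on two versions of the same distribution) can be sketched roughly as follows in Python. The sketch takes the added and removed distribution paths as plain lists rather than parsing real `pinto diff` output, whose format is not reproduced in this thread. The names `dist_name` and `apply_diff` are hypothetical, and the regular expression is a simplified stand-in for CPAN::DistnameInfo that will not cover every naming scheme on CPAN. The check for two distributions providing the same module is not covered.

```python
import re

# Simplified stand-in for CPAN::DistnameInfo: "Name-1.23.tar.gz" -> "Name".
_DIST_RE = re.compile(
    r"^(?P<name>.+?)-v?(?P<version>\d[\w.]*)\.(?:tar\.gz|tgz|tar\.bz2|zip)$"
)


def dist_name(path):
    """Extract the distribution name from an AUTHOR/.../Name-1.0.tar.gz path."""
    filename = path.rsplit("/", 1)[-1]
    match = _DIST_RE.match(filename)
    if match is None:
        raise ValueError(f"cannot parse distribution file name: {filename}")
    return match.group("name")


def apply_diff(manifest, added, removed):
    """Apply additions and removals to a manifest, then sanity-check it."""
    removed = set(removed)
    updated = [p for p in manifest if p not in removed]
    updated.extend(p for p in added if p not in updated)

    seen = {}
    for path in updated:
        name = dist_name(path)
        if name in seen:
            raise ValueError(
                f"{name} listed twice: {seen[name]} and {path}")
        seen[name] = path
    return updated


# AUTHOR/Foo-Bar is a made-up distribution used only for this example.
manifest = ["SCHWIGON/acme/Acme-Rautavistic-Sort-0.01.tar.gz",
            "AUTHOR/Foo-Bar-1.0.tar.gz"]
print(apply_diff(manifest,
                 added=["AUTHOR/Foo-Bar-1.1.tar.gz"],
                 removed=["AUTHOR/Foo-Bar-1.0.tar.gz"]))
```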

It could have been implemented in terms of pinto list rather than pinto diff, but using diff seemed somewhat easier at the time. If I had used pinto list as the basis of this, I could've used its --format option and the %S format to extract all the data I needed to handle CPAN subdirectories in this setup. This is the reason I've added --format to pinto diff as well, to avoid having to re-implement the diffing and patching.

I hope this gives you a good idea of how I'm using Pinto.

I did omit a fair number of details of our setup, but most of those are really just optimisations and don't interact with Pinto itself much. For example, in addition to recording which CPAN distribution paths we depend on, we also maintain additional metadata regarding the "safety" of an upgrade. With that metadata (which we generate by comparing a test-install at the time a developer "commits" an upgrade to the previous installed lib directory), we can decide if it's safe to just copy over the old lib tree and just install the new or updated modules on top of it, or if we have to build the new lib directory from scratch.

Let me know if there's anything else I could tell you to understand this unorthodox use case of Pinto :-)

tartansandal commented 9 years ago

Hi Florian,

Your use case is not so unorthodox ;-)

I too use Pinto in "local only" mode and manage shared stacks in a similar fashion. Your use of hashes and diffs is very interesting.

One difference in my usage is that I regard stacks as much more ephemeral beasts. If I have an 'upgrade' stack, that is a descendent of a 'current' stack, and I want to promote 'upgrade' to 'current', I simply delete the 'current' stack and copy 'upgrade' to 'current'.

You might be interested in a "thought piece" that I wrote to clarify some ideas while developing a 'verification' process:

https://github.com/tartansandal/Pinto/blob/audit-notes/lib/Pinto/Manual/Audit.pod

The last section describes a similar 'upgrade' process.

Cheers,

Kal

rafl commented 9 years ago

Hah.. I'm glad to see I'm not entirely alone with this :-)

One reason I went for naming my stacks after a hash of their contents is that it makes switching forth and back between different ones quite easy. The first time you check out a branch with a dependency manifest whose hash you don't yet have a stack for, it gets built (and the deps get installed to their own lib directory, and all that).

However, once you go back to the original branch with the previous dependency manifest, nothing much at all happens, as the stack and lib directory for the hash of that manifest already exist. This is really useful for folks jumping forth and back between topic branches all day.

thaljef commented 9 years ago

Pinto 0.09997 shipped today, with the diff formatting features you added. So I'm going to close this issue.

But I have some thoughts on your use case, and I will write them here as soon as I can.

tartansandal commented 9 years ago

On 24 March 2015 at 00:41, Florian Ragwitz notifications@github.com wrote:

One reason I went for naming my stacks after a hash of their contents is that it makes switching forth and back between different ones quite easy. The first time you check out a branch with a dependency manifest whose hash you don't yet have a stack for, it gets built (and the deps get installed to their own lib directory, and all that).

However, once you go back to the original branch with the previous dependency manifest, nothing much at all happens, as the stack and lib directory for the hash of that manifest already exist. This is really useful for folks jumping forth and back between topic branches all day.

We're still relatively new to Pinto, so our workflows are not quite as sophisticated as yours yet. Where appropriate, we name stacks after a particular branch, which helps with jumping back and forth between topic branches. However, we don't change our stacks very often, and changes are much more of a manual process. We do want to introduce some more automation, once we've clarified the workflows some more, so I'm finding this discussion very helpful.

I'm assuming from your description that your dependency manifests must contain version information as well? (Otherwise topic branches that are used to test version upgrades would have the same hash as their parent). If this is so, it occurs to me that there is some overlap with Pinto's management of sets of versioned packages.

This is an interesting idea: using a dependency manifest as a "bridge" between your project's SCM and the SCM built into Pinto. I use something similar, but it's stripped back to an ordered set of unversioned "roots" which can be applied to any stack. As you noted, the ordering is important so that building the corresponding lib is predictable.

thaljef commented 9 years ago

@rafl it sounds like your basic problem is synchronizing stacks across multiple repositories (which happen to correspond to multiple developers). I think a shared remote repository would help with that a lot. But then the problem becomes stack management. In particular, aligning git branches with pinto stacks. So I have an idea about that...

What if the default stack used by pinto could be set dynamically with arbitrary code? For example, you could shell out and get the name of the current git branch. So that might allow you to create a Pinto workflow that basically mirrors whatever you do with git. I can think of a couple different ways to implement that. But you get the idea.

Another suggestion is to put the Pinto repository inside Git. I know it seems awkward to put binaries in your repo like that, but I've found that it works pretty well. Merges might seem ugly at first, but most of the time you just end up accepting one side or the other. And it sure is convenient to be able to build an entire app (including dependencies) with nothing more than a git checkout.

tartansandal commented 9 years ago

On 25 March 2015 at 11:04, Jeffrey Ryan Thalhammer <notifications@github.com> wrote:

What if the default stack used by pinto could be set dynamically with arbitrary code? For example, you could shell out and get the name of the current git branch. So that might allow you to create a Pinto workflow that basically mirrors whatever you do with git. I can think of a couple different ways to implement that. But you get the idea.

That is a very cool idea. I suppose the natural extension would be a 'hooks' configuration directory like git, where shell scripts could be called at appropriate stages to set/override defaults, or perform arbitrary actions.

Another suggestion is to put the Pinto repository inside Git. I know it seems awkward to put binaries in your repo like that, but I've found that it works pretty well. Merges might seem ugly at first, but most of the time you just end up accepting one side or the other. And it sure is convenient to be able to build an entire app (including dependencies) with nothing more than a git checkout.

I've been gravitating towards that approach. The only thing holding me back is the idea of having an SCM inside an SCM which seems a little messy.

thaljef commented 9 years ago

I suppose the natural extension would be a 'hooks' configuration directory like git.

Yes, I've pondered that sort of thing too. But for this situation, the simplest thing might be an environment variable that just gets eval-uated.

tartansandal commented 9 years ago

On 25 March 2015 at 11:46, Jeffrey Ryan Thalhammer <notifications@github.com> wrote:

I suppose the natural extension would be a 'hooks' configuration directory like git.

Yes, I've pondered that sort of thing too. But for this situation, the simplest thing might be an environment variable that just gets eval-uated.

Either way, we still have to consider the conditions under which this would be set. Is it global in my environment? Is it overridden by the command line? What if I have two projects, both using Pinto, but using different SCMs? What if I also have projects with an SCM without a notion of branches (RCS)? This all suggests the use of case-specific wrapper scripts, but then, they could just call pinto with the '--stack FOO' option.

Perhaps this does not need to actually be directly integrated into Pinto. How about a post-checkout hook that simply runs:

pinto default $CURRENT_BRANCH

as appropriate for the particular SCM?
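For git, the branch-to-stack mapping such a hook needs could look roughly like the Python sketch below. Branch names often contain slashes, so the sketch normalises them first. The assumption that stack names should stick to letters, digits, periods, underscores and hyphens is mine and worth checking against Pinto's documentation. Both function names are hypothetical.

```python
import re
import subprocess


def current_git_branch():
    """Ask git for the name of the currently checked-out branch."""
    result = subprocess.run(
        ["git", "rev-parse", "--abbrev-ref", "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def stack_name_for_branch(branch):
    """Map a branch name onto a conservative stack name.

    Any run of characters outside [A-Za-z0-9._-] becomes a single hyphen.
    """
    name = re.sub(r"[^A-Za-z0-9._-]+", "-", branch).strip("-")
    if not name:
        raise ValueError(f"cannot derive a stack name from {branch!r}")
    return name


print(stack_name_for_branch("feature/upgrade-moose"))
```

A post-checkout hook could then pass `stack_name_for_branch(current_git_branch())` to `pinto default`. Note that two different branch names can collapse to the same stack name under this scheme.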

tartansandal commented 9 years ago

Thinking a bit more about the workflows that @rafl and I are using. I'm wondering if there is any merit in a "build" command that would build all the packages in a stack into a target directory. The signature would be something like:

pinto build [--stack NAME] [--self-contained] DIR

This would be similar to running

pinto list [--stack NAME] --format "%a/%f" | pinto install [-s NAME ] --local [--self-contained] DIR

except the packages would be built in a predictable/fixed reverse dependency order (leaves first). The ordering idea is a bit like the "roots" command turned upside down. From experience, this would not work all the time: sometimes you have to force some packages to be built before others to get things to work (e.g. specific dual-life packages). To cover that we might want some way to mark specific packages as having a higher priority in the build order. I'm thinking one level of priority may be enough, so this would operate similarly to the '!important' flag in CSS.
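To make the leaves-first idea concrete, here is a minimal Python sketch of a deterministic topological sort (Kahn's algorithm with a priority queue) in which a single "important" flag breaks ties. It is not how Pinto does anything. The function name `build_order`, the shape of the `deps` mapping, and the dependency data in the example are assumptions for illustration.

```python
import heapq


def build_order(deps, important=()):
    """Return a leaves-first build order.

    deps maps each distribution to the set of distributions it depends on.
    Among distributions whose dependencies are all built, important ones
    go first, then the rest by name, so the result is reproducible.
    """
    important = set(important)
    nodes = set(deps)
    for required in deps.values():
        nodes |= set(required)

    remaining = {n: set(deps.get(n, ())) for n in nodes}
    dependents = {n: set() for n in nodes}
    for n, required in remaining.items():
        for r in required:
            dependents[r].add(n)

    def key(n):
        return (n not in important, n)  # False sorts before True

    ready = [key(n) for n in nodes if not remaining[n]]
    heapq.heapify(ready)
    order = []
    while ready:
        _, n = heapq.heappop(ready)
        order.append(n)
        for m in dependents[n]:
            remaining[m].discard(n)
            if not remaining[m]:
                heapq.heappush(ready, key(m))

    if len(order) != len(nodes):
        raise ValueError("dependency cycle detected")
    return order


# Made-up dependency data, for illustration only.
deps = {
    "App": {"Moose", "Try-Tiny"},
    "Moose": {"Try-Tiny"},
    "Try-Tiny": set(),
    "Zeta": set(),
}
print(build_order(deps))
print(build_order(deps, important={"Zeta"}))
```

In this version the flag only reorders distributions that are already buildable. It cannot push a package ahead of its own dependencies, which may or may not be enough for the dual-life cases mentioned above.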

thaljef commented 9 years ago

The roots command is flawed and destined for the trash can. Leaves-first might work much better.

But I'm still not sure you can get a good ordering without just having an ordered list.

thaljef commented 9 years ago

I'm wondering if there is any merit in a "build" command that would build all the packages in a stack into a target directory.

People have frequently asked for that, but I still think it is always prudent to have a canonical list of top-level dependencies. Unless you're fairly meticulous, stacks tend to accumulate cruft. And stacks may contain dependencies that are only intended for certain platforms. So attempting to install everything in a stack isn't always advisable.

But if you don't have those problems and you're willing to make the stack itself into the canonical list, then I'm sure a build command would be of use.

tartansandal commented 9 years ago

On 26 March 2015 at 16:23, Jeffrey Ryan Thalhammer <notifications@github.com> wrote:

But I'm still not sure you can get a good ordering without just having an ordered list

Yeah, I'm not entirely sure either. Leaves-first reduces the likelihood of small random changes in ordering having a large impact. Throwing in an "important" flag would allow you to tweak for specific issues. The fixed ordering I was thinking of would be something like order by important DESC, num_dependencies ASC, name ASC.
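That three-part ordering translates directly into a sort key. The Python sketch below is only an illustration: the `Dist` record and the dependency counts in the example are made up, and the thread does not say whether num_dependencies counts direct or transitive dependencies.

```python
from typing import NamedTuple


class Dist(NamedTuple):
    name: str
    num_dependencies: int
    important: bool = False


def fixed_order(dists):
    """Order by important DESC, num_dependencies ASC, name ASC."""
    return sorted(
        dists,
        key=lambda d: (not d.important, d.num_dependencies, d.name),
    )


# Hypothetical counts, for illustration only.
dists = [
    Dist("Moose", 5),
    Dist("Try-Tiny", 0),
    Dist("App", 7),
    Dist("version", 1, important=True),
]
print([d.name for d in fixed_order(dists)])
```

If the count is transitive, every distribution sorts after all of its own dependencies (leaving the important flag aside), because a dependent always has strictly more transitive dependencies than anything it depends on. With direct counts that guarantee does not hold.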

thaljef / Pinto

Please preserve subdirectories in distribution paths #194