More input/output options from the condaverse (and beyond)

Proposed change

What would the interest level be in some PRs for:

input files
- conda environment.yaml
- this is supported, while not as widely used
- probably a one-liner
- conda requirements.txt
- conda also supports this file, but it is generally inferior to environment.yml... so I wouldn't propose supporting it. EXCEPT for a huge reproducibility/performance corner case: the output of conda list --explicit includes the canary line @EXPLICIT and lists the exact URLs (but no hashes :sob:) of the packages to be installed. An environment created from this file won't even invoke the solver, so you don't pay that tax at all. It cannot, however, include pip packages.
- anaconda-project.yml and anaconda-project-lock.yml (and .yaml variants)
- basically it's a meta-environment.yml that can create multiple environments, support environment inheritance, set environment variables (sadly, only globally) and run commands in particular environments
- when coupled with an anaconda-project-lock.yml it achieves about the best level of reproducibility I've been able to get, and lets you do things like maintain an env on linux-64, but also perform the last-known-good solution for win-64 and osx-64 with one command.
- i've had a stormy relationship with this tool, but it's still kind of the best thing out there in conda-land
output formats
- constructor installers
- i've been hacking together something that works basically like repo2docker but uses platform-specific, offline-capable installers as the delivery mechanism
  - gets around the noarch: python thing by stuffing all those packages into a channel-in-a-package and installing them at post-install... could be expanded to support wheels, yarn (the JS one) offline mirrors, or whatever else doesn't need the internet or a reasonable facsimile thereof to do its installation thing
- conda-pack archives
- kind of the same idea as constructor, but you get a relocatable, platform-specific environment tarball out at the end
- packer ???
  - packer is a super boss Go executable that can make almost anything (OVA, ISO, docker, AMI, VMware, etc)
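
For the conda list --explicit corner case under input files, here is a minimal sketch of how a tool might spot such a file by its @EXPLICIT canary line and pull out the package URLs. The function name `parse_explicit_spec` is made up for illustration, and this is not how repo2docker or conda actually parse the file.

```python
from typing import List, Optional


def parse_explicit_spec(text: str) -> Optional[List[str]]:
    """Return the package URLs if `text` looks like an explicit spec, else None.

    Hypothetical helper: only the @EXPLICIT marker and one-URL-per-line layout
    come from the discussion; everything else is an assumption.
    """
    lines = [line.strip() for line in text.splitlines()]
    # Drop blank lines and comment lines (such as a "# platform: ..." header).
    lines = [line for line in lines if line and not line.startswith("#")]
    if "@EXPLICIT" not in lines:
        return None  # an ordinary requirements-style file, which needs the solver
    marker = lines.index("@EXPLICIT")
    # Everything after the canary line is treated as an exact package URL.
    return lines[marker + 1:]
```

A caller could treat a non-None result as "skip the solver, just fetch these URLs" and fall back to the regular conda path otherwise.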

Alternative options

don't do them
make some entry_points or namespace packages so they could be done, but not do them in this repo
prototype fork

Who would use this feature?

Dunno, conda people. People that don't run docker. People that want stuff that installs at the end of the day, given bzip, rather than just runs and goes poof. People on windows. People on ARM. People at conferences that want to install large, usable software environments.

How much effort will adding it take?

Varies based on which ones we're interested in.

Who can do this work?

I would be capable of doing all the work above, and it would trickle in really slowly, or potentially my $DAY_JOB alter ego might chip in if i can make the right case. Of the repo maintainers, I would just be asking for bounding of stuff that would be way out of scope, and review as the technical challenges are overcome.

Not quite sure I understand the end goal. Is it to create something that can read some/any of the input files and produce an output in the format of one of the output files?

Would this work while supporting all the different config files that repo2docker currently supports? I don't think we want to go down the route where only some of the files do something and others are ignored. For example, today you can have an environment.yaml, apt.txt and postBuild work together to build the environment you need. So maybe the way to go is to produce the docker image and then from that produce an alternative output?

Right now we already support some of the mentioned input files but the only thing repo2docker can produce is a docker image. There is no infrastructure yet to help with producing one of many output formats (-> needs a lot of work).

Thanks for the thoughtful reply! I was remiss in not including the usual, _Thanks for r2d, bh, jh, d-s and all the other parts of this fantastical contraption that makes it so easy to use advanced technology_ :heartdecoration:

A lot of these ideas came out of much earlier discussions around whether these things could be done by a KernelManager, but that has continued to elude us. Since this is all about envs, it seemed like a reasonable place to chime in.

read some/any of the input files and produce an output in the format of one of the output files?

Yes. Basically, the thought questions, irrespective of the specific proposals, are:

Are there things other than a many-layered container, docker built from a Dockerfile, that can implement the REES?
Should such things be done in this repo?
If not done in this repo, could they be implemented as plugins?

Would this work while supporting all the different config files that repo2docker currently supports?

Yes, though some would be limited in their support. Things on the roadmap would encounter these limits anyway: I love debian, but apt.txt won't work with a CentOS base image, no matter how hard you look at it. Nix would work on a CentOS container, but not a Windows container. postBuild could be made to work on Windows, just conda install bash, if you're feeling saucy.

I don't think we want to go down the route where only some of the files do something and others are ignored.

Dockerfile currently trumps all, right? If a Dockerfile is present, all other configuration files will be ignored. It seems perfectly reasonable, if documented, that different files can interact in different ways.

So maybe the way to go is produce the docker image and then from that produce an alternative output?

I also love docker, for certain things, but it's just not appropriate, even as an intermediate, for other things. For example, my clear and present need is for a binder-like capability for bare metal. Conda (or other userspace solutions) are well understood and supported, so conda-pack would just solve my problem, while running docker (much less docker build and its ability to escalate privileges) is a non-starter.

There is no infrastructure yet to help with producing one of many output formats (-> needs a lot of work).

Right. We'd end up with some kind of well-known file intermediate representation (JSON schema or networkx graph or whatever), which could be realized with the tool of choice. Might need a solver.
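
To make the "well-known file intermediate representation" idea a bit more concrete, here is a rough sketch of what a JSON-serializable spec could look like. Every name in it (`EnvironmentSpec`, its fields, `to_json`, `from_json`) is hypothetical; nothing like this exists in repo2docker, and a real schema would need far more thought.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class EnvironmentSpec:
    """Hypothetical intermediate representation; all field names are illustrative."""

    source_ref: str  # e.g. the git sha the spec was derived from
    conda_packages: List[str] = field(default_factory=list)
    pip_packages: List[str] = field(default_factory=list)
    system_packages: List[str] = field(default_factory=list)  # e.g. from apt.txt
    env_vars: Dict[str, str] = field(default_factory=dict)
    post_build: List[str] = field(default_factory=list)  # commands to run afterwards


def to_json(spec: EnvironmentSpec) -> str:
    """Serialize the spec so any downstream builder could consume it."""
    return json.dumps(asdict(spec), indent=2, sort_keys=True)


def from_json(text: str) -> EnvironmentSpec:
    """Rebuild the spec from its JSON form."""
    return EnvironmentSpec(**json.loads(text))
```

A docker builder, a conda-pack builder, or a constructor builder could then each read the same file and realize whatever subset of it they are able to support.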

There are some easy and some hard to answer questions here. I'll answer the easy ones mostly because my reply to the others would be "good questions, no idea what the answer is" :-/

Are there things other than a many-layered container, docker built from a Dockerfile, that can implement the REES?

I think the answer should be yes. In practice it might be no (if we end up wanting an easy ride by saying REES v1 is what repo2docker does today). IMHO the whole point of trying to write down the REES is to not be in the situation where the behaviour is determined by the first implementation.

Should such things be done in this repo?

Figuring out that things like apt.txt will only work with a container with the correct base image and what to do about it regarding REES should happen here. As well as the work of hooking up the extension mechanism. After that the idea is that plugins can and should be built elsewhere.

I love debian, but apt.txt won't work with a CentOS base image, no matter how hard you look at it. (and related)

Nods. I think of repo2docker and its build packs as "a stack of build packs": they work together (co-operate) and work together (compatible). An extension mechanism for build packs would allow someone else to define their own stack of build packs, which could declare "I promise I am compatible with the repo2docker core stack, please add me on" but it wouldn't have to (if you wanted to use a centos/windows/gentoo/etc base image you shouldn't indicate that you are an extension). My feeling is that "a separate stack" would be outside the REES, you are using repo2docker for some of the infrastructure it provides but doing your own thing otherwise. Or maybe the REES should have the idea of these stacks from day one?!

How to deal with Dockerfiles that indeed trump everything I don't know. Maybe they are already their own stack (that is "incompatible" with the rest of the repo2docker build pack stack?).

while running docker (much less docker build and its ability to escalate privileges) is a non-starter.

At the top of our dream-list is having an alternative to using docker. The name of this package is unfortunate, it should be repo2container. This means we reserve the right to stop using docker and instead use some other container builder. The fact that repo2docker generates an intermediate Dockerfile is very much an implementation detail (hence the comment in the docs/FAQ on "Can I use repo2docker to generate a Dockerfile for me?"). Super cool would be the ability to use one of the container builders that is "root-less". But this isn't even on the wish-list yet, it is somewhere on the dreams and aspirations list.

I have no good idea if we can extend repo2container to also be repo2conda-pack. My hunch/opinion is no, because I think containers are the best compromise between virtual machines and "virtual environments" in terms of weight, functionality, etc. This doesn't mean that I don't wish for something else some days, just that I don't (yet) know how to build it.

Closing thought: repo2docker has always aimed at serving "the majority" (say 80% of users) by making it easy for them to do what they want to do and telling everyone else that "we hear you, but you are in a minority with your specialist thing there, we have some escape hatches for you to use maybe that helps, but yeah sorry life is hard." This means we only had to write 20% of the code (that is how these 80-20 comparisons go right? 😀). -> maybe the way forward is to explore this and related ideas via sibling projects like repo2conda-pack and repo2runc?

(The older repo2docker (and presumably I) gets the more I find myself wanting to reply to new things: "there are already too many damn CLI flags for repo2docker, can we solve this via documentation instead of a new flag/code?" The reason repo2docker is useful is because there is so much it can't do.)

Thanks for continuing to humor me!

Or maybe the REES should have the idea of these stacks from day one?!

Perhaps the lifecycle could be broken into several phases, say, identify, discover, satisfy, build, run and verify. A formalized spec (schema) would then live at the spots in between these activities, with repo2docker as a reference end-to-end implementation to which other implementations could aspire, rather than making it do more itself.
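
As a very rough sketch of that phased lifecycle: the phase names are the ones listed above, while `run_lifecycle`, the `State` dict and the step signature are all made up for illustration. The dict handed from one step to the next stands in for the spec that would live between phases.

```python
from typing import Any, Callable, Dict, List, Tuple

# Phase names taken from the proposal; the fixed ordering is an assumption.
PHASES = ("identify", "discover", "satisfy", "build", "run", "verify")

State = Dict[str, Any]
Step = Callable[[State], State]


def run_lifecycle(steps: Dict[str, Step], state: State) -> Tuple[State, List[str]]:
    """Run whichever phases have a step registered, always in PHASES order."""
    executed: List[str] = []
    for phase in PHASES:
        step = steps.get(phase)
        if step is None:
            continue  # no implementation registered for this phase, skip it
        # Pass a shallow copy so a step cannot add or rebind keys on the caller's dict.
        state = step(dict(state))
        executed.append(phase)
    return state, executed
```

An alternative implementation would swap in its own step for, say, build, while reusing the reference steps for everything else.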

How to deal with Dockerfiles that indeed trump everything

Spitballing: if unbundled, r2d could be something like, taking a little poetic license:

r2e-run-notebook, r2e-run-r-studio, etc.
r2e-build-container-docker
r2e-discover-dockerfile (or, alternately)
- r2e-build-dockerfile
- ...
- r2e-satisfy-pip-conda
- r2e-discover-conda, r2e-discover-pip, ...
r2e-identify-git

They'd all want to be able to levy constraints (starting from git claiming a particular sha, up to r-studio demanding runtime parameters) but eventually the whole thing has to cool down...
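
One way to picture components "levying constraints" is a merge step that fails loudly when two components disagree. This is only a sketch under assumptions: the `(component, key, value)` tuple shape, `ConstraintConflict` and `merge_constraints` are invented here, and real constraints would surely be richer than string equality.

```python
from typing import Dict, Iterable, Tuple

# (component name, constraint key, required value); a hypothetical shape.
Constraint = Tuple[str, str, str]


class ConstraintConflict(ValueError):
    """Raised when two components demand different values for the same key."""


def merge_constraints(constraints: Iterable[Constraint]) -> Dict[str, str]:
    """Fold all constraints into one mapping, refusing contradictory demands."""
    resolved: Dict[str, str] = {}
    owners: Dict[str, str] = {}  # which component first claimed each key
    for component, key, value in constraints:
        if key in resolved and resolved[key] != value:
            raise ConstraintConflict(
                f"{component} wants {key}={value!r}, "
                f"but {owners[key]} already set {resolved[key]!r}"
            )
        if key not in resolved:
            resolved[key] = value
            owners[key] = component
    return resolved
```

Here git claiming a particular sha and r-studio demanding a runtime parameter would simply be two entries in the input list.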

I've opened https://github.com/jupyter/repo2docker/issues/682 to talk very specifically about cases where docker daemon isn't useful.

Following on from https://github.com/jupyterhub/repo2docker/issues/682 I opened https://github.com/jupyterhub/repo2docker/pull/848. And, if you're interested, I experimented with https://github.com/manics/repo2shellscript that generates a packer template :smile:
