spine-generic / data-multi-subject

Multi-subject data for the Spine Generic project
Creative Commons Attribution 4.0 International

ComputeCanada compatibility #5

Open kousu opened 4 years ago

kousu commented 4 years ago

@kousu should we also add docs for maintainers of this repo? e.g., I was trying to clone this repo on Compute Canada (to start doing some processing), and ran into:

git clone https://github.com/spine-generic/data-multi-subject.git
cd data-multi-subject
git annex init
git-annex: getUserEntryForID: does not exist (no such user)

Originally posted by @jcohenadad in https://github.com/spine-generic/data-multi-subject/pull/4#issuecomment-674176954

kousu commented 4 years ago

This looks like a bug in git-annex. It's probably trying to call https://linux.die.net/man/3/getpwent and failing because you don't have a real local Unix user on that cluster. It shouldn't matter but git-annex makes a lot of assumptions. Do you think it would be possible to try datalad for comparison? Try, say,
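The failing lookup can be reproduced from Python's pwd module, which wraps the same passwd-database call. This is a sketch to illustrate the failure mode (the uid below is just an arbitrary, unlikely-to-exist example), not git-annex's actual code:

```python
import pwd

def lookup_user(uid):
    """Mimic git-annex's getUserEntryForID: resolve a uid via the passwd database."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        # No passwd entry for this uid -- the situation on the cluster,
        # where git-annex errors out instead of falling back gracefully.
        return None

print(lookup_user(999999999))  # prints None on a system without that uid
```

On the cluster, the user exists in LDAP but has no local passwd entry, so the lookup raises and git-annex aborts.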

datalad install https://github.com/CONP-PCNO/conp-dataset.git

I would be surprised if one worked and the other didn't.

kousu commented 4 years ago

It's also possible that Compute Canada has an ancient version of git-annex installed. Debian is still on, I think, 6.x but datalad insists you need, I think, 8.x. If they're running centos it might be an even older version.

jcohenadad commented 4 years ago

ha!

(csa) [jcohen@gra-login1 ~]$ datalad install https://github.com/CONP-PCNO/conp-dataset.git
[INFO   ] Cloning https://github.com/CONP-PCNO/conp-dataset.git to '/home/jcohen/conp-dataset' 
[WARNING] It is highly recommended to configure git first (set both user.name and user.email) before using DataLad. Failed to verify that git is configured: CommandError: command '['git', 'config', 'user.name']' failed with exitcode 1
| Failed to run ['git', 'config', 'user.name'] under None. Exit code=1. out= err= [cmd.py:run:530]CommandError: command '['git', 'config', 'user.email']' failed with exitcode 1
| Failed to run ['git', 'config', 'user.email'] under None. Exit code=1. out= err= [cmd.py:run:530].  Some operations might fail or not perform correctly. 
[ERROR  ] git-annex of version >= 6.20170220 is missing. Visit http://git-annex.branchable.com/install/. You have version 6.20170101 [annexrepo.py:_check_git_annex_version:734] (OutdatedExternalDependency) 

kernel:

uname -a
Linux gra-login1 3.10.0-1127.8.2.el7.x86_64 #1 SMP Tue May 12 16:57:42 UTC 2020 x86_64 GNU/Linux
git-annex version
git-annex version: 6.20170101
jcohenadad commented 4 years ago

Hm, I only see the 6.2 version of git-annex on the Compute Canada list of software

jcohenadad commented 4 years ago

in general (not specific to Compute Canada), we should probably document the requirements (minimum git-annex version) so that external people are able to download our datasets
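For instance, the minimum datalad enforces (6.20170220, per the error above) could be checked with a few lines. This parsing is a simplified sketch, not datalad's actual code; git-annex version strings look like "major.yyyymmdd":

```python
MIN_ANNEX = "6.20170220"  # the minimum datalad demanded in the error above

def annex_version_ok(installed, minimum=MIN_ANNEX):
    """Compare 'major.yyyymmdd' git-annex version strings numerically."""
    parse = lambda v: tuple(int(x) for x in v.split("-")[0].split(".")[:2])
    return parse(installed) >= parse(minimum)

assert not annex_version_ok("6.20170101")  # Compute Canada's build: too old
assert annex_version_ok("8.20200810")      # a recent build: fine
```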

kousu commented 4 years ago

@jcohenadad if Compute Canada doesn't get back to you / says no, you can install a local copy (instructions):

git clone --depth 1 git://git-annex.branchable.com/ git-annex
cd git-annex
make install-home

echo $SHELL # double-check you're using bash!
echo 'export PATH=~/.local/bin:$PATH' >> ~/.bash_profile
. ~/.bash_profile

This will install into ~/.local/bin (as I've advocated for), which is a good way to get software installed without having to nag your sysadmins. Hopefully the cluster isn't set up too unusually and this will just work.

jcohenadad commented 4 years ago

i've tried the solution above, but ran into issues.

Full details

~~~
[jcohen@gra-login1 ~]$ git clone --depth 1 git://git-annex.branchable.com/ git-annex
Cloning into 'git-annex'...
remote: Enumerating objects: 15675, done.
remote: Counting objects: 100% (15675/15675), done.
remote: Compressing objects: 100% (15575/15575), done.
remote: Total 15675 (delta 64), reused 13906 (delta 35)
Receiving objects: 100% (15675/15675), 10.26 MiB | 23.29 MiB/s, done.
Resolving deltas: 100% (64/64), done.
Updating files: 100% (13091/13091), done.
[jcohen@gra-login1 ~]$ cd git-annex/
[jcohen@gra-login1 git-annex]$ make install-home
make install-bins PREFIX=/home/jcohen/.local
make[1]: Entering directory '/home/jcohen/git-annex'
if [ "cabal" = ./Setup ]; then ghc --make Setup; fi
if [ "cabal" != stack ]; then \
        cabal configure --ghc-options=""; \
else \
        cabal setup ; \
fi
Config file path source is default config file.
Config file /home/jcohen/.cabal/config not found.
Writing default configuration to /home/jcohen/.cabal/config
Warning: The package list for 'hackage.haskell.org' does not exist. Run 'cabal update' to download it.
cabal update
Resolving dependencies...
Warning: solver failed to find a solution:
Could not resolve dependencies:
trying: git-annex-8.20200810 (user goal)
next goal: base (dependency of git-annex-8.20200810)
rejecting: base-4.9.1.0/installed-4.9... (conflict: git-annex => base(>=4.11.1.0 && <5.0))
Dependency tree exhaustively searched.
Trying configure anyway.

Utility/Process.hs:210:2: warning:
    #warning building with process-1.6.3; some timeout features may not work well [-Wcpp]
210 | #warning building with process-1.6.3; some timeout features may not work well
    | ^~~~~~~
[ 1 of 34] Compiling Utility.SystemDirectory ( Utility/SystemDirectory.hs, dist/setup/Utility/SystemDirectory.o )
[ 2 of 34] Compiling Utility.Split ( Utility/Split.hs, dist/setup/Utility/Split.o )
[ 3 of 34] Compiling Utility.Process.Shim ( Utility/Process/Shim.hs, dist/setup/Utility/Process/Shim.o )
[ 4 of 34] Compiling Utility.PartialPrelude ( Utility/PartialPrelude.hs, dist/setup/Utility/PartialPrelude.o )
[ 5 of 34] Compiling Utility.Monad ( Utility/Monad.hs, dist/setup/Utility/Monad.o )
[ 6 of 34] Compiling Utility.Misc ( Utility/Misc.hs, dist/setup/Utility/Misc.o )
[ 7 of 34] Compiling Utility.FileSize ( Utility/FileSize.hs, dist/setup/Utility/FileSize.o )
[ 8 of 34] Compiling Utility.DebugLocks ( Utility/DebugLocks.hs, dist/setup/Utility/DebugLocks.o )
[ 9 of 34] Compiling Utility.Data ( Utility/Data.hs, dist/setup/Utility/Data.o )
[10 of 34] Compiling Utility.Exception ( Utility/Exception.hs, dist/setup/Utility/Exception.o )
[11 of 34] Compiling Utility.Env.Basic ( Utility/Env/Basic.hs, dist/setup/Utility/Env/Basic.o )
[12 of 34] Compiling Utility.FileSystemEncoding ( Utility/FileSystemEncoding.hs, dist/setup/Utility/FileSystemEncoding.o )

Utility/FileSystemEncoding.hs:46:1: error:
    Failed to load interface for ‘System.FilePath.ByteString’
    Use -v to see a list of the files searched for.
make[1]: *** [Makefile:37: tmp/configure-stamp] Error 1
make[1]: Leaving directory '/home/jcohen/git-annex'
make: *** [Makefile:31: install-home] Error 2
~~~
jcohenadad commented 4 years ago

I'm trying now to install git-annex v8 via conda:

wget -O/tmp/Miniconda3-latest-Linux-x86_64.sh https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash /tmp/Miniconda3-latest-Linux-x86_64.sh -b -f
. ~/miniconda3/bin/activate
conda install -y -c conda-forge git-annex

worked like a charm 🎉 maybe we should document this solution (for non-experienced users)

mboisson commented 4 years ago

Just a note here that Compute Canada discourages the use of Anaconda for reasons explained here: https://docs.computecanada.ca/wiki/Anaconda/en

But if it works for this specific case, then good.

kousu commented 4 years ago

Just a note here that Compute Canada discourages the use of Anaconda for reasons explained here: https://docs.computecanada.ca/wiki/Anaconda/en

But if it works for this specific case, then good.

Interesting. I'm not surprised they take this position. Conda is, essentially, an entire distro unto itself running on top of your pre-existing distro. For Windows and macOS, which don't have good package managers, it is very helpful to provide something that works. (I kind of wish they'd just decided to contribute to brew instead, though; brew runs on Linux, and they could've gotten it running on Windows too, instead of making an entirely separate ecosystem.) Thanks for pointing this out.

Anyway, the issue here isn't about numerical computing or isolation; it's just that no one else provides up-to-date binaries for git-annex, and compiling Haskell is (as Julien discovered) not straightforward. I see conda as a workaround in this situation.

We have a ticket open with ComputeCanada to upgrade git-annex. It will probably take a long time though, and I imagine there is going to be resistance (rightfully) from other users who are purposely using an old version.

conda lets us not step on each other's toes. I don't understand how it's any different from using Singularity: that also installs an isolated distro to ~/.

wget -O/tmp/Miniconda3-latest-Linux-x86_64.sh https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash /tmp/Miniconda3-latest-Linux-x86_64.sh -b -f
. ~/miniconda3/bin/activate
conda install -y -c conda-forge git-annex

worked like a charm 🎉 maybe we should document this solution (for non-experienced users)

It's in https://github.com/spine-generic/data-multi-subject/blob/master/CONTRIBUTING.md#any-git-annex-error! It could be more prominent though. You wouldn't want to blindly tell people to install miniconda, either, since they might already have it, or might need to choose a different version, or might be better off using brew.

Or we could just decide that conda is our distro. But then we're cutting out people who want to use NeuroDebian or people who want to use Compute Canada the "Right Way". That choice would let us develop faster, but might hinder sharing with the broader world. I don't know what the right balance is.

mboisson commented 4 years ago

(Full disclosure: I am Compute Canada staff.) Anaconda is the source of a lot of problems on supercomputers/clusters. It tries to do way too many things, not the right way, including reinstalling the world. It will never work well on a cluster.

If you go the conda route, we will just have to figure out how to reverse engineer it to avoid conda, which means it will take longer for us to install your software for our users.

We have very good mechanisms to handle multiple versions : we handle about 5000 builds of different software packages/versions/toolchains/CPU architectures, so we know what we're talking about when it comes to installing packages.

For git-annex, it would be best to actually provide straight binaries that we install as-is. If I recall correctly, the binaries are pretty much self-contained.

kousu commented 4 years ago

Anaconda is the source of a lot of problems on supercomputers/clusters. It tries to do way too many things not the right way, including reinstalling the world. It will never work well on a cluster. If you go the conda route, we will just have to figure out how to reverse engineer it to avoid conda, which means it will take longer for us to install your software for our users.

^ @jcohenadad this is what I've been saying about conda!

/cc https://github.com/neuropoly/spinalcordtoolbox/issues/1526, https://github.com/shimming-toolbox/shimming-toolbox-py/pull/28#issuecomment-661921548, https://github.com/shimming-toolbox/shimming-toolbox-py/pull/15#issuecomment-658965963, https://github.com/shimming-toolbox/shimming-toolbox-py/pull/7#discussion_r451178106

This repo is just a dataset so there's not much to reverse engineer, we just need updated software deployed; but Maxime's words apply very well to our software projects.


But I still don't understand how recommending containers is any different from recommending conda. The software on Docker Hub isn't tuned for HPC any more than conda's is. What's the thinking there?


I would love to pick your brain about containerization, packaging, apks, .apps, Windows DLLs and all the rest. You must have loads of experience. There's a cold war brewing between corporations with containers and Linux with integrated systems.

The unresolved tension is that containerization gives devs a reliable way to get their software to their users, giving faster feedback and more value, but that same speed forces homogenization (e.g. "you're not using a Nexus 5? Sorry." "You're not using Android 10x SE? Sorry." or issues with Signal or snapd), lack of oversight (left-pad, event-stream, The Backstabber's Knife Collection: A Review of Open Source Software Supply Chain Attacks, Security Issues in Language-based Software Ecosystems, An Empirical Analysis of Vulnerabilities in Python Packages for Web Applications, simple typosquatting), and poor platform-specific optimization like you say. My thinking right now is that both sides have their merits; I'm sad that the two sides can't negotiate something that works, and I don't think they recognize that it is a war -- they just don't understand why the other side is doing things the "wrong way". Maybe https://guix.gnu.org/ or https://distr1.org/ is that negotiation.

This situation is a great example of this tension: your backlog porting software to ComputeCanada is long; conda's is short, so that's driving us towards conda, even if it's less optimized.

..but I'm meandering. A sysadmin's life is busy -- I know, I've been one. Thanks for taking the time to advise us. :tiger2:

We have very good mechanisms to handle multiple versions : we handle about 5000 builds of different software packages/versions/toolchains/CPU architectures, so we know what we're talking about when it comes to installing packages.

oo are you effectively running your own distro then? I guess so: https://docs.computecanada.ca/wiki/Installing_software_in_your_home_directory

For git-annex, it would be best to actually provide straight binaries that we install as-is. If I recall correctly, the binaries are pretty much self-contained.

Yeah! That would be great!

I know most distros have problems keeping up with git-annex because it changes on a faster cycle than distros do: https://git-annex.branchable.com/install/#comment-c146d2b76b24fae110f07c8babfae5e5.

mboisson commented 4 years ago

Lots of things here...

But I still don't understand how recommending containers is any different from recommending conda. The software on Docker Hub isn't tuned for HPC any more than conda's is. What's the thinking there?

Indeed. Containers are the lesser of two evils. They aren't optimized for HPC, but at least 1) they don't install zillions of files like conda does. That creates havoc on a parallel filesystem like those used on HPC clusters: these filesystems are very fast, but they are meant to handle small numbers of large files, not large numbers of small files. 2) They are actually containerized. Conda is not containerized. It interacts with the rest of the OS, with the system configuration, it messes up the user's .bashrc, etc.

We have very good mechanisms to handle multiple versions : we handle about 5000 builds of different software packages/versions/toolchains/CPU architectures, so we know what we're talking about when it comes to installing packages.

oo are you effectively running your own distro then? I guess so:

We do. Nearly all supercomputing centers use module systems such as Lmod (https://lmod.readthedocs.io/en/latest/) to give users access to a wide variety of software packages. We go one step further: our software stack is actually portable. You can mount our filesystem on any Linux computer in the world, and it will "just work", optimized for your CPU architecture. https://docs.computecanada.ca/wiki/Accessing_CVMFS

As far as recommendations go, make sure your source code uses a sane configure tool, like Autoconf or CMake. Make sure it does not vendor dependencies, or if it does, provide a way to use pre-installed versions. Make sure that it can be compiled rather easily from source using modern compilers.

And if you want a good laugh, and further recommendations (and counter-examples), please watch this talk: https://www.youtube.com/watch?v=NSemlYagjIU. It is from the main author of EasyBuild (https://easybuild.readthedocs.io/en/latest/), which is the tool we use to install nearly all scientific software packages on our clusters. He has years of experience building software, and has seen it all... from the best... to the worst.

mboisson commented 4 years ago

Oh, and by the way, once you get the basics right and your code installs well with a tool like Autoconf or CMake, it is trivial to make it into a container, ship it in conda, or document its installation from source, so it supports every target.

kousu commented 4 years ago

Indeed. Containers are the lesser of two evils. They aren't optimized for HPC, but at least

1. they don't install zillions of files like conda does. This creates havock on a parallel filesystem like what is used on HPC clusters. They have very fast filesystems, but they are meant to handle small numbers of large files, not large number of small files.

https://distr1.org/ should interest you. It's by someone who spent a decade of his life loving and hating apt, and now he's out and doing something else. One of its main features is image-based installation; instead of unpacking images they are mounted, and so there's no way to accidentally corrupt an install of a package, but unlike docker/singularity, images can depend on each other instead of being entirely self-contained, so you get deduplication.

2. They are actually containerized. Conda is not containerized. It interacts with the rest of the OS, with the system configuration, it messes up the user's .bashrc, etc.

This may have been true in the past but now all it does is add:

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/opt/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/opt/miniconda3/etc/profile.d/conda.sh" ]; then
        . "/opt/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="/opt/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<

and that can be removed with conda init --reverse.

It's a little bit longer but effectively the same as adding echo 'export PATH=~/.local/bin:$PATH' >> ~/.bashrc. It's also the exact same idea as lmod.

conda keeps its own $PREFIX and isolates everything there. And the install script takes -p $PREFIX as an argument so you can direct it if it's making the wrong decision. In Julien's example, everything is under ~/miniconda3/. Did it used to try to write to /usr/? That would be very bad! But I don't believe that's what it's doing; give it another shot, it might have improved.

Incidentally conda exists because PyPA didn't have the bandwidth or desire to support numerical computing in python: https://jakevdp.github.io/blog/2016/08/25/conda-myths-and-misconceptions/#Myth-#4:-Creating-conda-in-the-first-place-was-irresponsible-&-divisive

The genesis of Conda came after Guido van Rossum was invited to speak at the inaugural PyData meetup in 2012; in a Q&A on the subject of packaging difficulties, he told us that when it comes to packaging, "it really sounds like your needs are so unusual compared to the larger Python community that you're just better off building your own" (See video of this discussion).

If there are performance problems with conda on HPC I am sure they would love to take them up.

And if you want a good laugh, and further recommendations (and counter examples), please watch this talk : https://www.youtube.com/watch?v=NSemlYagjIU It is from the main author of EasyBuild (https://easybuild.readthedocs.io/en/latest/), which is the tool we use to install nearly all scientific software packages on our clusters. He has years of experience building software, and has seen it all... from the best ... to the worst.

Cool, that's a great reference for my notes :)

As far as recommendations go, make sure your source code uses a sane configure tool, like Autoconf or CMake. Make sure it does not vendor dependencies, or if it does, provide a way to use pre-installed versions. Make sure that it can be compiled rather easily from source using modern compilers.

Ah but we're not writing C. All of our projects are in python, though we have lots of compiled python extensions as dependencies. We also use pytorch which is a pain (as in: a complete non-starter for anyone without a build farm or at least a box in a data centre) to install so we pinned it to these wheels instead, which is going to be a complication for packaging.

I see that you've thought through python in https://docs.computecanada.ca/wiki/Python#Installing_packages and https://docs.computecanada.ca/wiki/Available_Python_wheels by making an alternative-to-PyPI distro, which includes most of our dependencies. But in order to use them with our software you'd have to do something like:

pip install --no-index git+https://github.com/neuropoly/spinalcordtoolbox/

but that probably won't work because some of our dependencies are only on PyPI. You'd have to go through them one by one.

But I wonder if doing this would neatly sidestep the issue:

pip install --index-url=/cvmfs/soft.computecanada.ca/custom/python/wheelhouse/avx2 --extra-index-url=/cvmfs/soft.computecanada.ca/custom/python/wheelhouse/generic --extra-index-url=https://pypi.python.org/simple git+https://github.com/neuropoly/axondeepseg.git

that would let you transparently override PyPI without disabling it.

You could make it permanent with

# /etc/pip.conf
[global]
index-url = /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/avx2
extra-index-url =
    /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/generic
    https://pypi.python.org/simple

--

I think there is some anxiety out in the aether about distros, though. This is part of that cold war I'm talking about. For example, https://github.com/scikit-image/scikit-image/issues/4261#issuecomment-557322318 is concerned about controlling code quality by controlling distribution; they want the final say over what platforms are supported and don't value effort towards making source distributions that aren't built by their CI machines; it's interesting to see the parallel with moxie0's concern about control, which has caused such flak. But your cluster is never going to be something they can commit to supporting officially, so you're always going to be adapting their code to your system but not the other way around.

We have a similar case: in one project we distribute requirements-freeze.txt in order to get reproducible numbers, so I am a bit anxious about equating your wheels with the versions we're testing in our CI and on our laptops, because you've made subtle changes to them without always changing the version numbers, and that might affect the results in subtle ways.

mboisson commented 4 years ago
2. They are actually containerized. Conda is not containerized. It interacts with the rest of the OS, with the system configuration, it messes up the user's .bashrc, etc.

This may have been true in the past but now all it does is add:

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/opt/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/opt/miniconda3/etc/profile.d/conda.sh" ]; then
        . "/opt/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="/opt/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<

That still counts as "messing with the user's .bashrc". We can't debug conda, so the first step when a user has a problem and they have some conda stuff is: remove everything conda from your session. And 9 times out of 10, it fixes the issue.

conda keeps its own $PREFIX and isolates everything there.

Except it doesn't. It still depends on system libraries, the Linux loader, and this is just not portable. Your mileage may vary. Our stack does not depend on your libstdc++, nor on your glibc, not even on your Linux loader. We provide it all. The only things it depends on are the Linux kernel and the hardware drivers. Conda makes assumptions about a Linux system that simply aren't true all the time.

As far as recommendations go, make sure your source code uses a sane configure tool, like Autoconf or CMake. Make sure it does not vendor dependencies, or if it does, provide a way to use pre-installed versions. Make sure that it can be compiled rather easily from source using modern compilers.

Ah but we're not writing C. All of our projects are in python, though we have lots of compiled python extensions as dependencies. We also use pytorch which is a pain (as in: a complete non-starter for anyone without a build farm or at least a box in a data centre) to install so we pinned it to these wheels instead, which is going to be a complication for packaging.

I see that you've thought through python in https://docs.computecanada.ca/wiki/Python#Installing_packages and https://docs.computecanada.ca/wiki/Available_Python_wheels by making an alternative-to-PyPI distro, which includes most of our dependencies. But in order to use them with our software you'd have to do something like:

pip install --no-index git+https://github.com/neuropoly/spinalcordtoolbox/

If you are doing python, then just make sure that pip wheel <your package> builds easily, and it's no issue for us. We will build the wheel from source and put it in our wheelhouse. We would not run the command above; we would build a wheel from your package, which is hopefully published on PyPI, though we can handle GitHub, so long as releases are properly tagged and versioned. We won't install from "master": installing directly from the master branch of a repository is bad practice and a big no-no. Provide proper releases, with proper version numbers, so that two builds at two different times are reproducible.

but that probably won't work because some of our dependencies are only on PyPI. You'd have to go through them one by one.

That's not an issue, we already provide thousands of those.

But I wonder if doing this would neatly sidestep the issue:

pip install --index-url=/cvmfs/soft.computecanada.ca/custom/python/wheelhouse/avx2 --extra-index-url=/cvmfs/soft.computecanada.ca/custom/python/wheelhouse/generic --extra-index-url=https://pypi.python.org/simple git+https://github.com/neuropoly/axondeepseg.git

that would let you transparently override PyPI without disabling it.

We wouldn't do that. Our python is already configured to look into our wheelhouse by default, and to exclude any manylinux wheel from the web (because those, too, are not really self-contained). It falls back on PyPI as a last resort if Internet is available, but this is also something that should not happen. Internet is not available on the compute nodes, and virtual environments have somewhat the same flaw as Conda (although to a much lesser extent, since it's just python and not the world) when it comes to hammering the filesystem with tons of small files. So our recommendation is to create the environment on the compute node's local disk, from inside the job. This means that the creation of the virtual environment can't depend on the Internet being available, but only on what we provide in the wheelhouse.
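That recommendation can be sketched as a minimal job script. This is an illustrative fragment, assuming Compute Canada's Slurm setup ($SLURM_TMPDIR, the python module, and the wheelhouse-backed pip configuration); the module version is hypothetical:

```shell
#!/bin/bash
#SBATCH --time=00:10:00
# Build the virtualenv on the compute node's local disk, fully offline:
module load python/3.8                        # hypothetical module/version
virtualenv --no-download "$SLURM_TMPDIR/env"  # no Internet needed
source "$SLURM_TMPDIR/env/bin/activate"
pip install --no-index -r requirements.txt    # resolves only from the wheelhouse
```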

I think there is some anxiety out in the aether about distros, though. This is part of that cold war I'm talking about. For example, scikit-image/scikit-image#4261 (comment) is concerned about controlling code quality by controlling distribution; they want the final say over what platforms are supported and don't value effort towards making source distributions that aren't built by their CI machines; it's interesting to see the parallel with moxie0's concern about control, which has caused such flak. But your cluster is never going to be something they can commit to supporting officially, so you're always going to be adapting their code to your system but not the other way around.

We have a similar case; in one project we distribute requirements-freeze.txt in order to get reproducible numbers; so I am a bit anxious about equating your wheels with the versions we're testing in our CI and on our laptops, because you've made subtle changes to them without always changing the version numbers, and that might affect the results in subtle ways.

We don't change anything in the code unless it's broken. We merely build wheels from the source code. So assuming you do a good job of providing code that is easy to build, it should not be an issue.

mboisson commented 4 years ago

To be honest, I don't know if Conda is that bad in itself, but it enables so many bad practices... You can install stuff like R, CUDA, GCC, OpenMPI... there is absolutely no reason to install those through a python package manager. And it does a bad job at installing them, because it doesn't know about the system's subtleties.

With Conda, we also see users messing up their PYTHONPATH (which they shouldn't change), or their LD_LIBRARY_PATH (which they should never use), and it wreaks havoc on the rest of the environment. Using those environment variables is like using a jackhammer to kill a fly.

kousu commented 4 years ago

Hey, I appreciate the brainstorming session. This is really really cool :) :) :)

I'm splitting up my responses to be nicer to my laptop and to future people referring to this thread.

That still counts as "messing with the user's .bashrc". We can't debug conda, so the first step when a user has a problem and they have some conda stuff is : remove everything conda from your session. And 9 times out of 10, it fixes the issue.

I want to understand the problems you've had with conda. I share your instinct about it ("this thing is doing everything differently, it wants to take over the whole system, it's going to make a mess of stuff"), yet I worry that that's just an innate conservatism in myself and not based in facts. I want to know if I should advocate for or against conda being our default platform, which is what we are currently doing in several projects like https://github.com/neuropoly/axondeepseg.

Its basic implementation seems to be identical to lmod's: lmod changes your PATH and LD_LIBRARY_PATH and adds a bash (/zsh/csh/...) function, module, that can further edit them, just like conda activate or conda deactivate.
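The shared mechanism can be sketched in a few lines (a deliberate simplification; neither tool is literally this small):

```python
def prepend_path(env, var, newdir):
    """What 'module load' and 'conda activate' both do at bottom:
    prepend a directory to a colon-separated search-path variable."""
    old = env.get(var, "")
    env[var] = newdir if not old else newdir + ":" + old

# Example: activating an environment just shifts lookup priority.
env = {"PATH": "/usr/bin:/bin"}
prepend_path(env, "PATH", "/opt/miniconda3/bin")
assert env["PATH"] == "/opt/miniconda3/bin:/usr/bin:/bin"
```

Deactivating is the inverse: restore the saved value, as both tools do via their bookkeeping variables.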

This is the same thing conda does. In fact, at least for its base env, it doesn't even touch LD_LIBRARY_PATH:

[kousu@requiem ~]$ set > /tmp/a.envs
[kousu@requiem ~]$ conda activate 
(base) [kousu@requiem ~]$ set > /tmp/b.envs
(base) [kousu@requiem ~]$ diff -u /tmp/{a,b}.envs
--- /tmp/a.envs 2020-08-20 17:15:17.381378507 -0400
+++ /tmp/b.envs 2020-08-20 17:15:26.811460189 -0400
@@ -10,10 +10,13 @@
 BASH_VERSION='5.0.17(1)-release'
 CLICOLOR=1
 COLUMNS=150
+CONDA_DEFAULT_ENV=base
 CONDA_EXE=/opt/miniconda3/bin/conda
-CONDA_INTERNAL_OLDPATH=/opt/miniconda3/bin:/home/kousu/go/bin:/home/kousu/.local/bin:/opt/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/home/kousu/.local/MATLAB/bin:/usr/local/stata:/home/kousu/.rvm/bin
+CONDA_INTERNAL_OLDPATH=/home/kousu/go/bin:/home/kousu/.local/bin:/opt/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/home/kousu/.local/MATLAB/bin:/usr/local/stata:/home/kousu/.rvm/bin
+CONDA_PREFIX=/opt/miniconda3
+CONDA_PROMPT_MODIFIER='(base) '
 CONDA_PYTHON_EXE=/opt/miniconda3/bin/python
-CONDA_SHLVL=0
+CONDA_SHLVL=1
 DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1001/bus
 DIRSTACK=()
 EDITOR=vis
@@ -35,11 +38,11 @@
 OPTERR=1
 OPTIND=1
 OSTYPE=linux-gnu
-PATH=/home/kousu/go/bin:/home/kousu/.local/bin:/opt/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/home/kousu/.local/MATLAB/bin:/usr/local/stata:/home/kousu/.rvm/bin
+PATH=/opt/miniconda3/bin:/home/kousu/go/bin:/home/kousu/.local/bin:/opt/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/home/kousu/.local/MATLAB/bin:/usr/local/stata:/home/kousu/.rvm/bin
 PIPESTATUS=([0]="0")
 PPID=1115466
 PROMPT_COMMAND='printf "\033]0;%s@%s:%s\007" "${USER}" "${HOSTNAME%%.*}" "${PWD/#$HOME/\~}"'
-PS1='[\u@\h \W]\$ '
+PS1='(base) [\u@\h \W]\$ '
 PS2='> '
 PS4='+ '
 PWD=/home/kousu
@@ -60,7 +63,7 @@
 XDG_SESSION_CLASS=user
 XDG_SESSION_ID=189
 XDG_SESSION_TYPE=tty
-_=deactivate
+_=set
 _CE_CONDA=
 _CE_M=
 new_dirs=/home/kousu/.local/share/flatpak/exports/share:/var/lib/flatpak/exports/share
(base) [kousu@requiem ~]$ 

And I bet for lmod you put something in /etc/profile.d/ to install it globally? conda does the same, if installed globally.

So the problem isn't that it touches .bashrc; it must be something else.

virtual environments have somewhat the same flaw as Conda (although to a much lesser extent, since it's just python and not the world)

Compiled C code is not "the world". There are lots of perfectly serviceable libraries and apps written in non-C. Lots of people spend a lot of their lives maintaining those ecosystems.

And it's possible to ship and load arbitrary compiled DLLs with python. They can override arbitrary C functions for your app.

But what they can't do -- and neither can conda -- is wreck the underlying OS, because they live in their own PREFIX! So what's the difference?

there is absolutely no reason to install those through a python package manager. And it does a bad job at installing them because it doesn't know about the system subtleties.

It helped me to read and digest https://jakevdp.github.io/blog/2016/08/25/conda-myths-and-misconceptions/. In short, conda began life as a python package manager to deal with numpy, back before wheels existed, but only because the PyPA refused to support its goals, and it has since become a general package manager whose associated distro(s) focus primarily on scientific software. It's much more like brew than pip.

Being a complete packaging system+distro is necessary because lots of scientific computing crosses languages: e.g. we use dcm2niix, ants and ctrDetect; a previous scientific codebase I worked on uses libsvm; datalad needs git-annex. In the latter two cases, we both settled on instructing users to get the dependency on their own -- with tips towards brew, because that covers probably 80% of scientists -- except on Windows, where I vendored libsvm because that's the Windows tradition (and anyway there's no package manager I know of that's reasonable to point Windows users at -- choco and pip don't cut it).

The broader geologic force that's pushing people to conda (and brew!) -- and nixos, npm, yarn, pip, CPAN, maven, docker hub, and the AUR -- is not a technical feature, it's a social one: most distros are slow to get software out, but those ones (and maybe nix?) let their users update packages.

I see resistance from the distro side (e.g.) about "you foolish javascript (or whatever) kids" designing software without forethought and needing to change everything rapidly &c, and that's surely an issue, but bugs are a fact of life, and the more people using something, and the sooner, the faster bugs are caught. There's value in following that strategy. Letting users contribute to updating software shares the load and provides a better platform for everyone. For example, dcm2niix is out of date in debian but up to date in anaconda and in brew. The same is true of git-annex!

But they are really shooting themselves -- and the scientific computing community that they're selling to -- in the foot by not being clearer about this; "conda" is clearly a pun on "python". :face_with_thermometer: . pip is having the same marketing problem right now, together leading to:

xkcd's take on python environment woes

but that probably won't work because some of our dependencies are only on PyPI. You'd have to go through them one by one.

That's not an issue, we already provide thousands of those.

You really really do. I just looked over the list in detail and it's really impressive. It has things I was surprised you had covered.

But it's not a complete mirror though. It's missing these wheels:

(also, you've renamed a bunch of things by replacing - with _; won't those changes require you to edit our setup.pys?)

Without fully automated pypi mirroring, or at least the ability to build wheels as needed, we can't self-manage our builds. We have to bug you to get stuff updated, or turn to workarounds like conda.

I guess we /could/ make our own wheelhouse. But who knows how compatible it would be; it sounds like you have python-specific compiler flags too.

conda keeps its own $PREFIX and isolates everything there. Except it doesn't. It still depends on system libraries and the linux loader, and this is just not portable. Your mileage may vary. Our stack does not depend on your libstdc++, nor on your glibc, not even on your linux loader. We provide it all. The only things it depends on are the linux kernel and the hardware drivers. Conda makes assumptions about a Linux system that simply aren't true all the time.

It sounds like that's your real issue with conda: it's not the package manager, it's the distro attached to it. Those binaries are superficially compatible with ComputeCanada, but not really, giving people strange and confusing bugs that you're stuck debugging for them. Deploying binaries compiled on a personal computer to a high-performance cluster computer is always going to be a source of problems.

I see two ways to negotiate out of this:

  1. Like you do with pip, provide your own conda channel where you run all their build scripts on your compilers; install conda globally and configure it to use that channel by default. Keep your warning about using the Anaconda or conda-forge binaries on ComputeCanada.

  2. Provide an open source repo where people can contribute Modulefiles for ComputeCanada. Make sure people can build and test (and use! and share the build scripts without vetting!) these in their homedirs so they don't need to wait for our distro team to get through your backlog. And the same for your python wheelhouse: I want to be able to contribute to it so I can help maintain what packages are available.

Maybe there's another?

I'd like to reiterate: people choosing conda aren't doing it to be nefarious, they're doing it because it fills a need that isn't being well served by any other tool. Particularly, it makes Windows and macOS look a lot more similar than different, and that is a real concern for people trying to get science done, as much as I can dream of the Future Linux Hegemony doing away with platform compatibility issues (not really, that would also be terrible; and anyway then it would just become the same issue but as PC linux vs HPC linux vs Cloud linux).

kousu commented 4 years ago

I am worried about conda becoming the platform as much as I am worried about docker being the platform -- I predict that it will work great for a while and then collapse like a house of cards when hardware trends shift again. I think the only way to insulate against that is to build good cross-platform systems from the start, and to encourage complete builds from source.

In fact, don't the same points apply to docker and singularity? Despite being /more/ isolated, they still depend on the syscalls the kernel provides, and the hardware available underneath. It's rare that you can run a docker image from two years ago on this year's machines, and the software in them is generally going to be stuff pulled out of debian or alpine -- stuff compiled for the PC, not HPC. Singularity must have the same problem. How could it not?

I think if you don't want people to use unoptimized virtualization, you need to look into teaching people how to use guix or nix to declare their computing environment as code and be able to deploy it seamlessly from their home machines to your machines. Otherwise they're going to keep reaching for these platforms because they provide a stable base to build on.

kousu commented 4 years ago

I don't understand how lmod has better performance characteristics than conda. It, presumably, also requires reading lots of small files, depending on the software loaded.

Maybe you can adapt distr1's idea of distributing packages as mountable squashfses?

kousu commented 4 years ago

We won't install from "master", that's just bad practice and non-reproducible.

Yes, of course :confounded: , add @v1.0.3 or whatever to that. I was being lazy and just showing an example. Sorry.

kousu commented 4 years ago

Okay now big thought #2:

We don't change anything in the code unless it's broken. We merely build wheels from the source code. So assuming you do a good job of providing code that is easy to build, it should not be an issue.

If you are doing python, then just make sure that pip wheel <your package> builds easily, and it's no issue for us. We will build the wheel from source and put it in our wheelhouse.

I don't think we can. It's more complicated than that. I've been having a discussion with scipy and scikit-image over their build-time dependencies and they taught me that

https://github.com/scikit-image/scikit-image/issues/4261#issue-509476547

building scikit-image with a newer version of numpy than the user had installed in their current environment [...] would result in strange and confusing bugs.

https://github.com/neuropoly/spinalcordtoolbox/issues/2841#issuecomment-674467049

The problem with allowing users to build on unsupported platforms is that while they may be able to install things, it is totally unclear if it will work as intended, or what bugs might occur.

They're even considering removing sdists entirely (https://github.com/scikit-image/scikit-image/issues/4261) (something decried by this distro maintainer). I don't think they're going to go through with it, but their argument for it is pretty strong: they only vet compatibility with a particular version of numpy, and they only want to support people who are using their manylinux/macos/win wheels. Anyone building from source is on their own, and is expected to manually compile against the vetted version of numpy.

So running pip wheel . isn't enough when there are build dependencies like numpy or cython involved; there's this pyproject.toml thing now, whose main purpose is to declare build dependencies, but it's buggy.
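For what it's worth, the declaration itself is short; a hypothetical pyproject.toml for a package that compiles against numpy might look like this (PEP 518 layout; the package list and pins are illustrative, not any project's real file):

```toml
# pyproject.toml (PEP 518 build-system table) -- illustrative only
[build-system]
requires = [
    "setuptools",
    "wheel",
    "cython",
    "numpy==1.17.*",  # build against a vetted numpy, per scikit-image's advice
]
build-backend = "setuptools.build_meta"
```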

So we're at a dead end.


Similarly, building with different compilers or compiler options can change behaviour subtly. It shouldn't, but there's a deep stack and I don't think it's numerically stable enough yet. My understanding is that that's why people are migrating to containers and virtual environments: they've decided they can't really trust dependencies to stay stable, because a minor change in some library can have unintended consequences downstream.

Maybe one can see virtualenvs as a workaround for bad code, but it's equally valid -- and more and more popular -- to see them as a reliable platform. It rubs me the wrong way, but I think a hundred thousand devs probably have some insight to share, y'know?

kousu commented 4 years ago

I'm not saying distros are old school and need to get with the times and switch over to a move-fast-and-break-things, everyone-can-upload-anything situation. Instead I'm saying there's a conflict in perspective here, and on both sides it comes from legitimate concerns. I hope that instead of writing off tools like conda and docker and go mod vendor, distro-makers can appreciate the end-user frustrations that motivated them, and that instead of writing off customized unix systems, devs can appreciate why people might need -- or even just want; want should be legitimate too -- to customize their systems. We need to talk to each other instead of assuming the other side is just following a fad.

mboisson commented 4 years ago

I probably won't be addressing all of this, because it is rather long, but here is a couple of answers.

Hey, I appreciate the brainstorming session. This is really really cool :) :) :)

I'm splitting up my responses to be nicer to my laptop and to future people referring to this thread.

That still counts as "messing with the user's .bashrc". We can't debug conda, so the first step when a user has a problem and they have some conda stuff is: remove everything conda from your session. And 9 times out of 10, it fixes the issue.

I want to understand the problems you've had with conda. I share your instinct about it: "this thing is doing everything differently, it wants to take over the whole system, it's going to make a mess of stuff", yet I worry that that's just an innate conservatism in myself and not based in facts. I want to know whether I should advocate for or against conda being our default platform, which is what it currently is in several projects like https://github.com/neuropoly/axondeepseg.

Its basic implementation seems to be identical to lmod's. lmod changes your PATH and LD_LIBRARY_PATH and adds a bash (/zsh/csh/...) function, module, that can further edit them, just like conda activate or conda deactivate do.

Yes and no. The difference is in who is installing things, and how well they know the system. First off, Lmod does not install anything. It is merely a tool to make software installed through other means (we use EasyBuild) available to the users. By design, modules are made to be reversible, i.e. what you load, you can unload, always. Shell scripts that add things to your "PATH" or "LD_LIBRARY_PATH" (which, by the way, we don't define for our stack) are not reversible.

When system administrators install packages, they can be used by all users. We have about 15 000 users on our infrastructures. Can you imagine the mess if every single one of them installed R through conda? First, it would create a hell of a lot of files; second, those installations would not be optimized for our infrastructure (something we, as system administrators, are uniquely qualified to do).

And I bet for lmod you put something in /etc/profile.d/ to install it globally? conda does the same, if installed globally.

So the problem isn't that is touches .bashrc, it must be something else.

The global configuration is one configuration. This is vastly different from having hundreds or thousands of users doing a different configuration.

virtual environments have somewhat the same flaw as Conda (although to a much lesser extent, since it's just python and not the world)

Compiled C code is not "the world". There are lots of perfectly serviceable libraries and apps written in non-C. Lots of people spend a lot of their lives maintaining those ecosystems.

And it's possible to ship and load arbitrary compiled DLLs with python. They can override arbitrary C functions for your app.

But what they can't do -- and neither can conda -- is wreck the underlying OS, because they live in their own PREFIX! So what's the difference?

Conda of course can't wreck the underlying OS... but it sure can (and does) wreck the end users' session by adding a lot of incompatible tools.

That's not an issue, we already provide thousands of those.

You really really do. I just looked over the list in detail and it's really impressive. It has things I was surprised you had covered.

But it's not a complete mirror though. It's missing these wheels:

  • futures
  • ivadomed
  • pyqt5
  • requirements-parser
  • transforms3d
  • mpld3
  • albumentations
  • phantominator

and we add those as needed by our users.

(also you've renamed a bunch of things by replacing - with _; won't those changes require you editing our setup.pys??)

No, we have not. That is just standard python packaging: "-" and "_" are equivalent for pip. When building a wheel, pip writes the name down in the canonical form.
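A quick sketch of the normalization involved (PEP 503 governs how pip compares project names; the wheel filename convention collapses the same runs to underscores):

```python
import re

def normalize(name):
    """PEP 503 canonical form: runs of -, _ and . become a single -,
    and comparison is case-insensitive."""
    return re.sub(r"[-_.]+", "-", name).lower()

def wheel_escape(name):
    """Wheel *filenames* instead collapse those runs to a single _."""
    return re.sub(r"[-_.]+", "_", name)

# pip treats these as the same project; no setup.py edits needed:
assert normalize("requirements_parser") == normalize("requirements-parser")
print(wheel_escape("requirements-parser"))  # -> requirements_parser
```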

Without fully automated pypi mirroring, or at least the ability to build wheels as needed, we can't self-manage our builds. We have to bug you to get stuff updated, or turn to workarounds like conda.

I guess we /could/ make our own wheelhouse. But who knows how compatible it would be; it sounds like you have python-specific compiler flags too.

Alternatives will fail, precisely because you can't test on every end user's computer.

conda keeps its own $PREFIX and isolates everything there. Except it doesn't. It still depends on system libraries and the linux loader, and this is just not portable. Your mileage may vary. Our stack does not depend on your libstdc++, nor on your glibc, not even on your linux loader. We provide it all. The only things it depends on are the linux kernel and the hardware drivers. Conda makes assumptions about a Linux system that simply aren't true all the time.

It sounds like that's your real issue with conda: it's not the package manager, it's the distro attached to it. Those binaries are superficially compatible with ComputeCanada, but not really, giving people strange and confusing bugs that you're stuck debugging for them. Deploying binaries compiled on a personal computer to a high-performance cluster computer is always going to be a source of problems.

It is very hard to make binary packages run everywhere, and conda is no different. The difference is that it tries to do something it can't, contrary to other package managers. Package managers that work well either build stuff from source (like Gentoo Prefix, EasyBuild, Nix, Spack), or restrict themselves to a given distribution (like apt-get, yum, etc). Conda is trying, and failing, to ship binary packages that run everywhere.

I see two ways to negotiate out of this:

  1. Like you do with pip, provide your own conda channel where you run all their build scripts on your compilers; install conda globally and configure it to use that channel by default. Keep your warning about using the Anaconda or conda-forge binaries on ComputeCanada.
  2. Provide an open source repo where people can contribute Modulefiles for ComputeCanada. Make sure people can build and test (and use! and share the build scripts without vetting!) these in their homedirs so they don't need to wait for our distro team to get through your backlog. And the same for your python wheelhouse: I want to be able to contribute to it so I can help maintain what packages are available.

Maybe there's another?

Yes: make the code easily compilable from source. We have literally installed over 5000 packages in 3 years... that's an average of ~5 per working day... the backlog only gets really long when application developers don't make their stuff easy to compile from source.

By the way, git-annex/8.something was installed two days ago...

mboisson commented 4 years ago

I am worried about conda becoming the platform as much as I am worried about docker being the platform -- I prevu that it will work great for a while and then collapse like a house of cards when hardware trends shift again. I think the only way to insulate against that is to build good cross-platform systems from the start, and to encourage complete builds from source.

Yes. There is code out there that has been around for two decades, and it still runs fine... because we can build it from source.

In fact, don't the same points apply to docker and singularity? Despite being /more/ isolated, they still depend on the syscalls the kernel provides, and the hardware available underneath. It's rare that you can run a docker image from two years ago on this year's machines, and the software in them is generally going to be stuff pulled out of debian or alpine -- stuff compiled for the PC, not HPC. Singularity must have the same problem. How could it not?

Docker and Singularity ship a complete OS. They only depend on a small number of kernel interfaces, and those tend to be backward compatible. And if they were not backward compatible, the Docker and Singularity developers could support them by releasing a new version, while still running the same images.

But yes, containers are not a cure, they are a symptom.

mboisson commented 4 years ago

I don't understand how lmod has better performance characteristics than conda. It, presumably, also requires reading lots of small files, depending on the software loaded.

Maybe you can adapt distr1's idea of distributing packages as mountable squashfses?

As I said before, Lmod does not install anything. It merely provides an interface to what's been installed through other means. But those other means are under the control of system administrators, which means we know where and how to install things so that they work well on our infrastructure. In the case of the software Compute Canada provides, we install software on CVMFS, which is a geographically distributed and redundant system with multiple layers of cache. It basically offers performance similar to a local disk (because there is a cache on the local disk) when it comes to small IO. This can only work because we have central installations. The average user doesn't have access to such infrastructure. Some very advanced groups (like CERN, SNO+, and some large bioinformatics groups) do maintain such infrastructure, but not the average user.

mboisson commented 4 years ago

Okay now big thought #2:

We don't change anything in the code unless it's broken. We merely build wheels from the source code. So assuming you do a good job of providing code that is easy to build, it should not be an issue.

If you are doing python, then just make sure that pip wheel <your package> builds easily, and it's no issue for us. We will build the wheel from source and put it in our wheelhouse.

I don't think we can. It's more complicated than that. I've been having a discussion with scipy and scikit-image over their build-time dependencies and they taught me that

scikit-image/scikit-image#4261 (comment)

building scikit-image with a newer version of numpy than the user had installed in their current environment [...] would result in strange and confusing bugs.

neuropoly/spinalcordtoolbox#2841 (comment)

This just means that building a wheel for the scikit-image package needs to declare a strict requirement on the version of numpy.

The problem with allowing users to build on unsupported platforms is that while they may be able to install things, it is totally unclear if it will work as intended, or what bugs might occur.

Similarly, the problem with installing unsupported binaries is that while they may be able to be copied, it is totally unclear if they will work as intended, or what bugs might occur.

So running pip wheel . isn't enough when there are build dependencies like numpy or cython involved; there's this pyproject.toml thing now whose main purpose is to declare build depends, but it's buggy:

Oh, come on... it may or may not be buggy, but it is used by hundreds of other packages... scikit-image isn't that special a snowflake. Build dependencies are a thing, and they are used by many other python packages.

Similarly, building with different compilers or compiler options can change behaviour subtly. It shouldn't but there's a steep stack and I don't think it's numerically stable enough yet. My understanding is that that's why people are migrating to containers and virtual environments, because they've decided they can't really trust dependencies to stay stable: a minor change in some library can have unintended consequences downstream.

Again, this has been done for multiple decades by thousands of software packages. If specific options are needed to compile, there are ways to specify that in your Makefile, Autoconf scripts, setup.py, or whatever is used to build. If specific compilers are needed, it is also possible to specify that. But to be honest, if your C or C++ code requires a very specific version of a specific compiler, you are probably relying on a compiler bug that you should not be relying on.
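For instance (a generic sketch, not taken from any particular project), a Makefile can bake in the flags the code genuinely needs while leaving the compiler itself overridable:

```makefile
# make's built-in CC defaults to cc; a packager can override it
# with e.g. `make CC=icc` without patching the Makefile.
CFLAGS += -std=c99 -O2   # flags the code actually needs; builders may append

# Hypothetical target/objects, for illustration only.
myapp: main.o util.o
	$(CC) $(CFLAGS) -o $@ $^   # links with whatever CC the builder chose
```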

kousu commented 4 years ago

Thanks for this discussion, Maxime. It has been enlightening to see what issues come up from the perspective of a distro/sysadmin and to just clarify my thoughts on the whole thing, and will help me inform the team as we go forward with packaging our software for use outside the lab.