samtools / htslib

C library for high-throughput sequencing data formats

Contributing to bioconda? #630

Open lh3 opened 6 years ago

lh3 commented 6 years ago

There have been several GitHub issues related to the installation of samtools/htslib. This is mainly because samtools requires a few non-trivial dependencies for advanced features such as https/htsget and bzip2/xz compression. In fact, even for me, it is challenging to compile all these features on old Linux distributions (e.g. CentOS 6, used here) without root privileges.

I wonder if we could take over the responsibility for the htslib, samtools, bcftools and tabix recipes and recommend bioconda when users have similar issues again. I understand changing these recipes to the exact way we prefer may take some efforts, but I think it is worth doing and will help the deployment of samtools in the future. There have been >200k samtools downloads from bioconda, comparable to the total downloads from the GitHub release page. Bioconda is an important avenue for distributing samtools.

I am copying a few past maintainers of the samtools recipe in case they want to chime in. cc: @chapmanb @dpryan79 @notestaff @johanneskoester

kyleabeauchamp commented 6 years ago

FWIW, I've generally updated the HTSLib bioconda recipes once things have been pulled into pysam (by @andreasheger) and stabilized. I also find that using miniconda+bioconda helps folks get their CI systems (e.g. Travis) working with much less environment customization.

Also: we have a preprint on the Bioconda ecosystem (https://www.biorxiv.org/content/early/2017/10/27/207092) in case anyone wants to read the big picture without thinking about technical details.

notestaff commented 6 years ago

"changing these recipes to the exact way we prefer may take some efforts" -- what changes are needed? Is there a danger that they'll impact other packages negatively?

In bioconda you can include multiple versions of a recipe, so you can add custom versions without impacting the main version, if necessary.

lh3 commented 6 years ago

There might be a few potential issues with the bioconda recipes:

  1. bcftools and tabix are not updated.
  2. samtools doesn't depend on htslib.
  3. "make test" is not used, if I am right.
  4. pysam depends on samtools executable. I am not sure why.
  5. samtools tview not working, due to a known issue. Don't know if it is fixable.
  6. samtools is not compiled with google cloud and S3 support, if I am right.

I could be wrong about some of these, and I don't know what the samtools team thinks about them.

kyleabeauchamp commented 6 years ago

Good list of issues. One more I would add is ensuring version consistency across the tree of HTSlib-dependent packages, and clarifying when tools are static builds versus relying on exported shared libraries.
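
The version-consistency concern can be sketched as a small check. This is a hypothetical illustration in Python, not part of any bioconda tooling: the function names are invented here, and the `example` mapping holds made-up version strings rather than real package metadata.

```python
from collections import defaultdict


def group_by_htslib(built_against):
    """Group package names by the htslib version each was built against."""
    groups = defaultdict(list)
    for package, htslib_version in built_against.items():
        groups[htslib_version].append(package)
    return {version: sorted(packages) for version, packages in groups.items()}


def is_consistent(built_against):
    """Return True when every package reports the same htslib version."""
    return len(set(built_against.values())) <= 1


# Made-up example data: package name -> htslib version it was built against.
example = {"samtools": "1.6", "bcftools": "1.6", "pysam": "1.5"}

if __name__ == "__main__":
    if not is_consistent(example):
        for version, packages in sorted(group_by_htslib(example).items()):
            print(f"htslib {version}: {', '.join(packages)}")
```

A check of this kind only reports a mismatch. Whether a mismatch matters depends on whether the tool bundles its own htslib copy or links against a shared one, which is the second point above.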

dpryan79 commented 6 years ago

No one is going to complain if you want to update the bioconda recipes :)

Regarding your issue list:

  1. Easy enough to fix
  2. The tarball on github comes with htslib. This could be changed easily enough (honestly, it'd make sense to change this).
  3. There are minimal tests done in most recipes simply for the sake of time. The recipes used to run into TravisCI timeout issues, so tests had to be quite minimal. Having said that, builds are a lot faster now.
  4. There at least used to be a few pysam functions (e.g., idxstats) that I think used samtools directly (they were capturing stdout).
  5. There have been a number of issues with packages in the default channel (libcurl didn't even work for a while, which is why bioconda used to host a version). ncurses is also available from conda-forge, so that's easy enough to update.
  6. Presumably this could be changed, though it might require adding the dependencies to conda-forge.

AndreasHeger commented 6 years ago

Thanks. More info about point 4: pysam depends on the samtools code, which is distributed and compiled alongside pysam. The samtools executable itself is not built. There is no dependency on an external samtools executable other than for regression testing.

jkbonfield commented 6 years ago

I may be in the minority, but I couldn't actually work out what bioconda is from their fluffy (but trendy style) home page. It claims to be cloud package management, but when I install a cloud based VM I install an OS too, be it debian, ubuntu, centos, etc, all of which have package managers.

Where does conda fit into all of this?

Edit: I see now - anaconda is the fluffy worthless website. Bioconda one is actually informative. Thanks :-)

kyleabeauchamp commented 6 years ago

From the paper abstract: "We present Bioconda (https://bioconda.github.io), a distribution of bioinformatics software for the lightweight, multi-platform and language-agnostic package manager Conda. Currently, Bioconda offers a collection of over 3000 software packages, which is continuously maintained, updated, and extended by a growing global community of more than 200 contributors. Bioconda improves analysis reproducibility by allowing users to define isolated environments with defined software versions, all of which are easily installed and managed without administrative privileges."

kyleabeauchamp commented 6 years ago

Conda (https://conda.io/docs/): "Conda is an open source package management system and environment management system that runs on Windows, macOS and Linux. Conda quickly installs, runs and updates packages and their dependencies. Conda easily creates, saves, loads and switches between environments on your local computer. It was created for Python programs, but it can package and distribute software for any language."

lh3 commented 6 years ago

Conda is a package manager for precompiled binary packages, but it is Linux-distribution agnostic and does not require root privileges. Bioconda is built on top of conda and provides bioinformatics tools through the conda tool chain.

End users are likely to use bioconda as follows:

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh  # you will be asked about the path to install conda
. ~/.bashrc  # or whatever, to update PATH
conda install -c bioconda samtools
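
After an install like the one above, a quick sanity check is to confirm that the tools are on PATH and report a version. The following is a minimal Python sketch, assuming each tool accepts a `--version` flag; the helper name `tool_version` is invented for illustration.

```python
import shutil
import subprocess


def tool_version(name):
    """Return the first line of `<name> --version`, or None if unavailable."""
    path = shutil.which(name)
    if path is None:
        return None  # not on PATH, e.g. the conda environment is not active
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0 or not result.stdout:
        return None
    return result.stdout.splitlines()[0]


if __name__ == "__main__":
    # Tool names taken from the recipes discussed in this thread.
    for tool in ("samtools", "bcftools", "tabix"):
        print(tool, "->", tool_version(tool))
```

A `None` result usually means PATH was not updated after installing conda, or the tool was not installed in the active environment.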

Updating an existing conda recipe is easy IMO. You fork and clone the bioconda-recipes repo, edit the recipe's YAML to update the version number, download link and hash, and send a pull request to the repo. Travis will then build the code. If it passes, someone from bioconda will merge and end users will be able to see the updated version.
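
The hash step can be sketched in Python as follows. This is an illustration under assumptions: it supposes the recipe pins the source tarball by SHA-256 under a `source:` section with `url` and `sha256` keys, and both function names are invented here. A recipe that uses a different checksum type would need the matching hash instead.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large tarballs are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def source_stanza(url, sha256):
    """Format a conda-recipe style `source:` section (assumed layout)."""
    return f"source:\n  url: {url}\n  sha256: {sha256}\n"
```

To use it, download the new release tarball, pass its local path to `sha256_of_file`, and paste the resulting digest and the download URL into the recipe alongside the new version number.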

I have not gone through the local building route, which seems more complex and apparently needs docker.

jkbonfield commented 6 years ago

Ah ok, it's the "does not require root privs" bit which makes it popular then. I understood it was a package manager, but not really what distinguished it from all the other package managers out there and why it is so popular for samtools. I find it rather confusing they push the cloud aspect so heavily given cloud environments are the one scenario where most users do have root access. However if it works everywhere and is system agnostic then I can give it a whirl on our servers to see what this entails.

So what really is the crux of this request, long term? Is it to update the version number in recipes when we do a release, so we can get bioconda updated rapidly rather than waiting for some third party to notice? I see some specific points which we can look at fixing.

daviesrob commented 6 years ago

This sounds like a good idea. We'll have a play and see if we can send bioconda a pull request.

johanneskoester commented 6 years ago

@lh3 that sounds like a great idea. We are happy to support you in any way. Any changes to the recipes that improve the current situation are very welcome.

Regarding local builds: usually it is enough to test a recipe with conda build and then let our CI do the (more complicated) Docker-based building and testing, especially if you only modify an already existing recipe.

jkbonfield commented 6 years ago

What's the conda philosophy when it comes to supporting different OSes? Is it to go with lowest common denominator, or to build whatever each system supports?

E.g. the curses part (samtools tview) appears to be disabled (https://github.com/bioconda/bioconda-recipes/blob/master/recipes/samtools/build.sh) for all OSes on the grounds that ancient CentOS distributions don't work properly. To me this feels like the wrong strategy.

The right approach is to fix the configure script so that it correctly works and copes on all the various OS outliers, rather than hacking the recipe to work around the "upstream" deficiencies. I suspect this issue has already been fixed, as has the need for rdynamic configure sed hackery.

Put simply, I'd rather be supporting conda by making the samtools build environment work for all, rather than supporting it by editing the recipes to work around build problems.

johanneskoester commented 6 years ago

The ncurses issue is temporary as far as I know. We disabled it as an immediate fix, but people are working to solve it on conda-forge.

We build on CentOS 6 because conda packages rely on the system libc, which is upward compatible: binaries linked against an older libc also run on newer ones. This means that we only need to build on one Linux OS and the result will work on all others with the same or a newer libc. Indeed, it is true that this causes problems sometimes. On the other hand, there are a lot of HPC clusters around that use CentOS 6. So in order to support them, one would have to solve the problems on CentOS 6 anyway.
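
The compatibility rule can be illustrated with a small Python sketch. It assumes the build baseline is glibc 2.12 (the version CentOS 6 ships); the constant `BUILD_GLIBC` and the helper names are invented for illustration and are not part of conda.

```python
import platform

# Assumption: CentOS 6 ships glibc 2.12, so packages built there need >= 2.12.
BUILD_GLIBC = (2, 12)


def parse_version(text):
    """Turn a version string such as '2.17' into a tuple of ints (2, 17)."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def can_run(runtime_glibc, build_glibc=BUILD_GLIBC):
    """A binary built against build_glibc runs if the runtime glibc is not older."""
    return parse_version(runtime_glibc) >= build_glibc


if __name__ == "__main__":
    # platform.libc_ver() returns a (library, version) pair of strings.
    lib, version = platform.libc_ver()
    if lib == "glibc" and version:
        print(f"glibc {version}: compatible = {can_run(version)}")
    else:
        print("Not a glibc system (or version unknown); the check does not apply.")
```

This is why building on the oldest supported distribution covers newer ones, while a package built on a newer system may fail to load on CentOS 6.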

When CentOS 6 runs out of support, we will move to CentOS 7.