grimbough / Rhdf5lib

Distribution of the HDF5 library in an R package
https://bioconductor.org/packages/Rhdf5lib/
6 stars 14 forks source link

Rhdf5lib fails silently if zlib compression not supported (required for 1 use-case) #15

Closed warrenmcg closed 5 years ago

warrenmcg commented 5 years ago

Hi @grimbough,

I just wanted to bring your attention to an issue thread on sleuth's repository (pachterlab/sleuth#120). The root cause of this seems to originate with an edge case involving Rhdf5lib.

Users were getting a cryptic error message:

Error in H5Fopen(file, "H5F_ACC_RDONLY") : 
HDF5. File accessability. Unable to open file.

Because of some excellent bug-hunting and troubleshooting by @johanneskoester, we found the problem. Here is his explanation:

one issue that certainly occurs also when installing directly is that, if zlib headers are not found, rhdf5lib will silently compile without zlib compression support. Then, upon using it, you get these not very descriptive error messages posted here whenever reading a dataset with zlib compressed tables.

Sleuth reads kallisto-created HDF5 files, and kallisto uses gz compression to keep the files small, thus zlib compression is required. @johanneskoester did some excellent work creating a bioconda recipe (bioconda/bioconda-recipes#12780) to include a test case and make sure zlib compression is available.

I just wanted to bring this to your attention so that a similar check can be included in the bioconductor release of these packages.

grimbough commented 5 years ago

Thanks for the report. I assumed R would require zlib, and so it would be available, but my knowledge of bioconda is pretty small. I'll checking with that issue if I need more details.

I've also created a rhdf5 branch (https://github.com/grimbough/rhdf5/tree/H5Z) to try and provide a more helpful error message if it happens in the future. It seems a shame HDF5 doesn't report the problem more specificially itself, but there appears to be functionality for listing and checking filters.

warrenmcg commented 5 years ago

Great, thanks for checking in and for your work in improving rhdf5!

I'm pinging @johanneskoester, just so he's kept in the loop about further rhdf5 development. He may have some insight about why zlib might not be available with bioconda.

I'm not sure if the only users who have reported an error are installing R through bioconda, or if any are doing the native install and installing rhdf5 through bioconductor. I personally have never reproduced the error from my end (using a native-R/bioconductor build).

In any case, I think having a helpful error message will make it easier for sleuth to provide an informative error message for our users.

johanneskoester commented 5 years ago

With bioconda, we ensure that packages are built in an environment that has nothing available. This way, software usually complains when a dependency is missing and we can add it to the package (as a dependency in the metadata). If now, like here, the software does not complain, such things can happen. That's why I appreciate a lot that this behavior is now being changed.

This is e.g. the code of the rhdf5lib recipe:

{% set version = "1.4.2" %}
{% set name = "Rhdf5lib" %}
{% set bioc = "3.8" %}

package:
  name: 'bioconductor-{{ name|lower }}'
  version: '{{ version }}'
source:
  url:
    - 'https://bioconductor.org/packages/{{ bioc }}/bioc/src/contrib/{{ name }}_{{ version }}.tar.gz'
    - 'https://bioarchive.galaxyproject.org/{{ name }}_{{ version }}.tar.gz'
    - 'https://depot.galaxyproject.org/software/bioconductor-{{ name|lower }}/bioconductor-{{ name|lower }}_{{ version }}_src_all.tar.gz'
  md5: 05dbaf6e2c8cebfe41e4e5a98f8a75ed
  patches:
    - configure.patch
build:
  number: 3
  rpaths:
    - lib/R/lib/
    - lib/

requirements:
  host:
    - r-base
    - zlib
  run:
    - r-base
    - zlib
  build:
    - {{ compiler('c') }}
    - {{ compiler('fortran') }}
    - automake
    - make
test:
  commands:
    - '$R -e "library(''{{ name }}'')"'
about:
  home: 'https://bioconductor.org/packages/{{ bioc }}/bioc/html/{{ name }}.html'
  license: Artistic-2.0
  summary: 'Provides C and C++ hdf5 libraries.'

As you can see, all dependencies are listed explicitly, which allows conda to take care of installing them along with the rhdf5lib package.

grimbough commented 5 years ago

Thanks for the extra info. So from a bioconda team perspective, it would be sufficient for the configure script to check for a valid zlib and give a useful error message if one can't be found? I assume that for now the issue won't be encountered because your recipe now includes zlib as requirement?

johanneskoester commented 5 years ago

Yes to both questions :-).

johanneskoester commented 5 years ago

As we are talking about it. For successful compilation, we also had to patch configure like this.

grimbough commented 5 years ago

The zlib branch now contains an updated configure script that should check for the presence of the zlib library and fail early if it can't be found. This seems to work on Linux, OSX, and Solaris, but then so did the old version. There's also some modification to the flags for compiling SZIP, which will break your patch, so I'm trying to get a bioconda build working too.

However, I'm struggling to figure out how to test the bioconda implementation. I've been trying to follow the instructions at https://bioconda.github.io/contribute-a-recipe.html and everything seems to work ok, but since we know it will happily build the package without locating zlib or szip this doesn't seem sufficient.

Ideally I'd be able to view the output of R CMD INSTALL --build ., which I tried in the docker container produced during the recipe test, but things like the gcc toolchain aren't exported and it fails immediately. If you have any pointers on what I should run in the docker container prior to trying to build the R package it would be much appreciated.


I've also updated rhdf5 to try and help diagnose this more easily for other filters. There's now an internal function that will check when reading a dataset and output this if a filter is missing:

> rhdf5:::h5checkFilters(did)
Unable to read dataset. Not all required filters available.
Missing filters: blosc
johanneskoester commented 5 years ago

The recipe for bioconductor-rhdf5 that we have on bioconda-recipes, contains a test case that would fail if something goes wrong. So, you could build both packages, and if the second works as well, you are fine. Also, you can create a pull request, then, all that is done automatically for you in the CI.

roryk commented 5 years ago

This still misses the bioconda installed zlib when I try to install it with R installed via conda. zlib is there, but the configure script just checks /usr/local and bails out.

johanneskoester commented 5 years ago

The bioconda package for rhdf5lib should be correctly linked. But manual building via R on top of a conda installation will likely not see the dependencies (because they are not in the default system wide paths). When building the conda package, we have to explicitly tell the compiler where to look: https://github.com/bioconda/bioconda-recipes/tree/master/recipes/bioconductor-rhdf5lib/build.sh In addition, a patch is needed for the configure script: https://github.com/bioconda/bioconda-recipes/blob/master/recipes/bioconductor-rhdf5lib/configure.patch

roryk commented 5 years ago

@johanneskoester thanks so much, somehow I missed that there was a bioconda package for rhdf5lib.

johanneskoester commented 5 years ago

Rule of thumb: there is a Bioconda package for everything :-).