joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/

import_biom: Error in H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem, : HDF5. Dataset. Read failed. #1143

Closed: dswan closed this issue 5 years ago

dswan commented 5 years ago

This has been confusing me today. I suspect the issue doesn't lie directly on phyloseq's doorstep, but I'm having a little trouble pinning it down. I have BIOM files that were generated by QIIME2. While regression testing the QIIME2 2019.4 release, I found that importing these previously fine BIOM files no longer works.

Sample code:

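# Load libraries and record the environment
print("Load libraries")
library(rhdf5)
library(phyloseq)
sessionInfo()

# Open the BIOM file directly with rhdf5 to confirm it is a readable HDF5 file
print("open BIOM file via H5Fopen")
openTheBIOM <- H5Fopen("merged-feature-table.biom")
print("Results:")
openTheBIOM

# Close all open HDF5 handles: h5closeAll() for rhdf5 >= 2.23.0, H5close() otherwise
print("closing BIOM file")
if (utils::packageVersion("rhdf5") < "2.23.0") {
    H5close()
} else {
    h5closeAll()
}

# Import the same file through phyloseq
print("import the BIOM file via import_biom")
importTheBIOM <- import_biom("merged-feature-table.biom")
print("Results:")
importTheBIOM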

Output from conda environment constructed around release 2019.1:

R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS: /home/ubuntu/miniconda3/envs/qiime2-2019.1/lib/R/lib/libRblas.so
LAPACK: /home/ubuntu/miniconda3/envs/qiime2-2019.1/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] methods   stats     graphics  grDevices utils     datasets  base

other attached packages:
[1] phyloseq_1.22.3 rhdf5_2.22.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0          compiler_3.4.1      pillar_1.3.1
 [4] plyr_1.8.4          XVector_0.18.0      iterators_1.0.10
 [7] tools_3.4.1         zlibbioc_1.24.0     jsonlite_1.6
[10] tibble_2.0.1        nlme_3.1-137        gtable_0.2.0
[13] lattice_0.20-38     mgcv_1.8-26         pkgconfig_2.0.2
[16] rlang_0.3.1         igraph_1.2.2        Matrix_1.2-15
[19] foreach_1.4.4       parallel_3.4.1      stringr_1.3.1
[22] cluster_2.0.7-1     Biostrings_2.46.0   S4Vectors_0.16.0
[25] IRanges_2.12.0      multtest_2.34.0     stats4_3.4.1
[28] ade4_1.7-13         grid_3.4.1          Biobase_2.38.0
[31] data.table_1.12.0   survival_2.43-3     reshape2_1.4.3
[34] ggplot2_3.1.0       magrittr_1.5        splines_3.4.1
[37] scales_1.0.0        codetools_0.2-16    MASS_7.3-51.1
[40] BiocGenerics_0.24.0 biomformat_1.6.0    permute_0.9-4
[43] ape_5.2             colorspace_1.4-0    stringi_1.2.4
[46] lazyeval_0.2.1      munsell_0.5.0       vegan_2.5-3
[49] crayon_1.3.4
[1] "open BIOM file via H5Fopen"
[1] "Results:"
HDF5 FILE
        name /
    filename

         name     otype dclass dim
0 observation H5I_GROUP
1 sample      H5I_GROUP
[1] "closing BIOM file"
[1] "import the BIOM file via import_biom"
Warning message:
In strsplit(msg, "\n") : input string 1 is invalid in this locale
[1] "Results:"
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 461 taxa and 12 samples ]
tax_table()   Taxonomy Table:    [ 461 taxa by 7 taxonomic ranks ]

Output from conda environment constructed around 2019.4:

R version 3.5.1 (2018-07-02)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS/LAPACK: /home/ubuntu/miniconda3/envs/qiime2-2019.4/lib/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] phyloseq_1.24.2 rhdf5_2.24.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1          compiler_3.5.1      pillar_1.3.1
 [4] plyr_1.8.4          XVector_0.22.0      iterators_1.0.10
 [7] tools_3.5.1         zlibbioc_1.28.0     jsonlite_1.6
[10] tibble_2.1.1        nlme_3.1-139        gtable_0.3.0
[13] lattice_0.20-38     mgcv_1.8-28         pkgconfig_2.0.2
[16] rlang_0.3.4         igraph_1.2.4.1      Matrix_1.2-17
[19] foreach_1.4.4       parallel_3.5.1      stringr_1.4.0
[22] cluster_2.0.9       Biostrings_2.50.2   S4Vectors_0.20.1
[25] IRanges_2.16.0      multtest_2.36.0     stats4_3.5.1
[28] ade4_1.7-13         grid_3.5.1          Biobase_2.42.0
[31] data.table_1.12.2   survival_2.44-1.1   reshape2_1.4.3
[34] Rhdf5lib_1.2.1      ggplot2_3.1.1       magrittr_1.5
[37] splines_3.5.1       scales_1.0.0        codetools_0.2-16
[40] MASS_7.3-51.4       BiocGenerics_0.28.0 biomformat_1.8.0
[43] permute_0.9-5       ape_5.3             colorspace_1.4-1
[46] stringi_1.4.3       lazyeval_0.2.2      munsell_0.5.0
[49] vegan_2.5-4         crayon_1.3.4
[1] "open BIOM file via H5Fopen"
[1] "Results:"
HDF5 FILE
        name /
    filename

         name     otype dclass dim
0 observation H5I_GROUP
1 sample      H5I_GROUP
[1] "closing BIOM file"
[1] "import the BIOM file via import_biom"
Error in H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem,  :
  HDF5. Dataset. Read failed.
In addition: Warning message:
In strsplit(conditionMessage(e), "\n") :
  input string 1 is invalid in this locale
Error in H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem,  :
  HDF5. Dataset. Read failed.
Error in H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem,  :
  HDF5. Dataset. Read failed.
Error in H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem,  :
  HDF5. Dataset. Read failed.
Error in H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem,  :
  HDF5. Dataset. Read failed.
Error in H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem,  :
  HDF5. Dataset. Read failed.
Error in H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem,  :
  HDF5. Dataset. Read failed.
Error in H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem,  :
  HDF5. Dataset. Read failed.
Error in H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem,  :
  HDF5. Dataset. Read failed.
Error in H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem,  :
  HDF5. Dataset. Read failed.
Error in read_biom(biom_file = BIOMfilename) :
  Both attempts to read input file:
merged-feature-table.biom
either as JSON (BIOM-v1) or HDF5 (BIOM-v2).
Check file path, file name, file itself, then try again.
Calls: import_biom -> read_biom
Execution halted
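
The call trace above (import_biom -> read_biom -> H5Dread) suggests the failure happens when a dataset is actually read, not when the file is opened. A rough sketch for narrowing down which layer fails is to run each layer separately and capture the error instead of halting. The helper `try_step` is a hypothetical name introduced here, and the dataset path `observation/ids` assumes the standard BIOM 2.x HDF5 layout; neither comes from the original report.

```r
# Hypothetical helper: run one step and capture any error instead of halting.
try_step <- function(label, fn) {
  tryCatch({
    fn()
    list(step = label, ok = TRUE, message = "")
  }, error = function(e) {
    list(step = label, ok = FALSE, message = conditionMessage(e))
  })
}

biom_path <- "merged-feature-table.biom"  # file name from the sample code

# One entry per layer, from the lowest (rhdf5) to the highest (phyloseq).
# "observation/ids" is assumed from the BIOM 2.x layout.
steps <- list(
  "rhdf5::h5ls"           = function() rhdf5::h5ls(biom_path),
  "rhdf5::h5read"         = function() rhdf5::h5read(biom_path, "observation/ids"),
  "biomformat::read_biom" = function() biomformat::read_biom(biom_path),
  "phyloseq::import_biom" = function() phyloseq::import_biom(biom_path)
)

if (file.exists(biom_path)) {
  for (label in names(steps)) {
    res <- try_step(label, steps[[label]])
    cat(sprintf("%-24s %s %s\n", label,
                if (res$ok) "OK" else "FAILED", res$message))
  }
} else {
  message("BIOM file not found: ", biom_path)
}
```

If `h5ls` succeeds but `h5read` fails, the problem most likely sits in rhdf5 or the HDF5 library it is linked against, not in biomformat or phyloseq.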

So the obvious differences are R (3.4.1 to 3.5.1), phyloseq (1.22.3 to 1.24.2), and rhdf5 (2.22.0 to 2.24.0). I've tried to dig into this a bit more and suspect it's an issue with the various dependencies tied to the QIIME2 conda environment, but I haven't been able to solve it (yet).
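
To compare two environments without reading through the full sessionInfo() dumps, something like the following could be run in each one. It is only a sketch: `pkg_versions` is a hypothetical helper, and the package list is my guess at the relevant ones based on the sessionInfo() output above.

```r
# Hypothetical helper: return the installed version of each package as a
# named character vector, with NA for packages that are not installed.
pkg_versions <- function(pkgs) {
  vapply(pkgs, function(p) {
    tryCatch(as.character(utils::packageVersion(p)),
             error = function(e) NA_character_)
  }, character(1))
}

# Packages that appear to be involved in BIOM import (an assumption).
pkgs <- c("phyloseq", "biomformat", "rhdf5", "Rhdf5lib")

print(R.version.string)
print(pkg_versions(pkgs))
```

Running this in both the 2019.1 and the 2019.4 environments and comparing the two printed vectors should show the version differences at a glance.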

dswan commented 5 years ago

As expected, this was just a dependency issue. I did some in-place upgrades of R and the Bioconductor packages in the conda environment, and installed rhdf5lib and rhdf5 from conda. That seems to have worked:

R version 3.5.1 (2018-07-02)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS/LAPACK: /home/ubuntu/miniconda3/envs/qiime2-2019.4/lib/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] phyloseq_1.26.1   rhdf5_2.28.0      biomformat_1.10.1
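
For the regression test itself, a small check along these lines could confirm that the import still produces the same table as before. This is a sketch under assumptions: `compare_dims` and `check_import` are hypothetical helpers, and the expected counts (461 taxa, 12 samples) are taken from the 2019.1 output earlier in this thread.

```r
# Expected dimensions, taken from the 2019.1 import result in this thread.
expected <- c(taxa = 461L, samples = 12L)

# Hypothetical helper: stop with a readable message if the dimensions differ.
compare_dims <- function(observed, expected) {
  same <- observed[names(expected)] == expected
  if (!isTRUE(all(same))) {
    stop("Dimension mismatch: observed ",
         paste(names(observed), observed, sep = "=", collapse = ", "),
         "; expected ",
         paste(names(expected), expected, sep = "=", collapse = ", "))
  }
  invisible(TRUE)
}

# Hypothetical helper: import the BIOM file and compare its dimensions.
check_import <- function(biom_path, expected) {
  ps <- phyloseq::import_biom(biom_path)
  observed <- c(taxa = phyloseq::ntaxa(ps), samples = phyloseq::nsamples(ps))
  compare_dims(observed, expected)
}

biom_path <- "merged-feature-table.biom"  # file name from the sample code
if (file.exists(biom_path) && requireNamespace("phyloseq", quietly = TRUE)) {
  check_import(biom_path, expected)
  message("import_biom result matches the expected dimensions")
}
```

Checking only the dimensions is a deliberately coarse test. It catches a failed or truncated import but would not notice changed counts or taxonomy strings.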