pachterlab / sleuth

Differential analysis of RNA-Seq
http://pachterlab.github.io/sleuth
GNU General Public License v3.0

HDF5 files not being read properly (hdf5 lib incompatibility, or base R and rhdf5 clash?) #175

Open channsoden opened 6 years ago

channsoden commented 6 years ago

sleuth_prep seems to always error with "subscript out of bounds" during step "normalizing est_counts". I'm trying this on my own data since, as others have pointed out, the link to the sample data is broken.

design <- read.table(file.path(".", "metadata_table.tsv"),
                      header = TRUE,
                      stringsAsFactors=FALSE)
s2c <- dplyr::select(design, sample, condition = carbon, path)
s2c

sample | condition | path
-- | -- | --
MM10_16C_L_wt | L | wt/MM10_16C_L_corrected_kallisto
MM10_16C_R_wt | R | wt/MM10_16C_R_corrected_kallisto
MM10_22C_L_wt | L | wt/MM10_22C_L_corrected_kallisto
MM10_22C_R_wt | R | wt/MM10_22C_R_corrected_kallisto
MM18_16C_L_wt | L | wt/MM18_16C_L_corrected_kallisto
MM18_16C_R_wt | R | wt/MM18_16C_R_corrected_kallisto
MM18_22C_L_wt | L | wt/MM18_22C_L_1_corrected_kallisto
MM18_22C_L_wt | L | wt/MM18_22C_L_2_corrected_kallisto
MM18_22C_L_wt | L | wt/MM18_22C_L_3_corrected_kallisto
MM18_22C_L_wt | L | wt/MM18_22C_L_corrected_kallisto
MM18_22C_R_wt | R | wt/MM18_22C_R_corrected_kallisto
MM20_16C_L_wt | L | wt/MM20_16C_L_corrected_kallisto
MM20_16C_R_wt | R | wt/MM20_16C_R_corrected_kallisto
MM20_22C_L_wt | L | wt/MM20_22C_L_corrected_kallisto
...

so <- sleuth_prep(s2c, extra_bootstrap_summary = TRUE)

reading in kallisto results
dropping unused factor levels
..............................................................................................
normalizing est_counts
Error in result[, which_order, drop = FALSE]: subscript out of bounds
Traceback:

1. sleuth_prep(s2c, extra_bootstrap_summary = TRUE)
2. spread_abundance_by(obs_raw, "est_counts", sample_to_covariates$sample)

Any advice on troubleshooting this?

warrenmcg commented 6 years ago

Hi @channsoden,

Thanks for using the tool, and for reporting your issue. To start identifying the problem, I would suggest running the following code, which uses some of the guts of sleuth_prep (this assumes you have already created s2c with a path column):

## collect the data from your kallisto results
kal_list <- lapply(seq_along(s2c$path), function(i) {
  path <- s2c$path[i]
  kal <- suppressMessages(read_kallisto(path, read_bootstrap = FALSE))
  kal$abundance <- dplyr::mutate(kal$abundance, sample = s2c$sample[i])
  kal
})
## make the 'obs_raw' data frame
obs_raw <- dplyr::bind_rows(lapply(kal_list, function(k) k$abundance))
## FIRST CHECK
stopifnot(all(s2c$sample %in% obs_raw$sample))
## THE ABOVE SHOULD PASS; IF IT FAILS, RUN:
s2c$sample[which(!(s2c$sample %in% obs_raw$sample))]
## to identify which samples are causing the problem
## (those samples are the issue)

## If the above passes, then that isn't the issue
## now run the code below (guts of 'spread_abundance_by'):
abund <- data.table::as.data.table(obs_raw)
var_spread <- data.table::dcast(abund, target_id ~ sample, 
        value.var = "est_counts")
var_spread <- var_spread[order(var_spread$target_id), ]
var_spread <- as.data.frame(var_spread, stringsAsFactors = FALSE)
rownames(var_spread) <- var_spread$target_id
var_spread["target_id"] <- NULL
result <- as.matrix(var_spread)
s2c$sample[which(!(s2c$sample %in% colnames(result)))]
### The above line should identify which samples are causing a problem
### If this doesn't yield anything, then we have something very unusual happening
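
For intuition about why the error message is "subscript out of bounds": spread_abundance_by pivots the long abundance table into one column per sample, then selects the columns named in sample_to_covariates. A minimal Python analogue (an illustrative sketch only, not sleuth's actual code; spread_abundance is a made-up name) shows how a missing sample surfaces as exactly that kind of error:

```python
# Minimal Python analogue of the spread_abundance_by step (hypothetical
# sketch for intuition only; the real code is R/data.table). It pivots a
# long table of (target_id, sample, est_counts) rows into one column per
# sample, then reorders the columns to match the design table.
def spread_abundance(rows, samples):
    columns = {}
    for target_id, sample, est_counts in rows:
        columns.setdefault(sample, {})[target_id] = est_counts
    # If a sample listed in the design table never appears in `rows` (for
    # example, because every kallisto file read back empty), this lookup
    # fails: the Python equivalent of R's "subscript out of bounds".
    return [columns[s] for s in samples]
```

With an empty input the lookup raises KeyError on the first sample, which is the analogue of the subscript error during "normalizing est_counts".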

Let me know if that helps identify the problematic samples.

channsoden commented 6 years ago

Thank you for your prompt reply!

It does indeed fail at the first check - obs_raw is entirely empty.

channsoden commented 6 years ago

The read_kallisto function doesn't seem to be succeeding, e.g.:

read_kallisto("wt/W795_22C_L_corrected_kallisto", read_bootstrap = FALSE)

    kallisto object

transcripts:  0
bootstraps:  0

warrenmcg commented 6 years ago

It seems like the kallisto files might be empty, or otherwise something went really wrong with the input process.

If the files are empty, that pushes the troubleshooting back to the kallisto step for you, but it's still important for us: it means we should add a sanity check that the file actually has something in it.

If the kallisto file is present and an appropriate size (a few MB usually), would you mind sending me an example kallisto file so I can do some additional troubleshooting on my end?

channsoden commented 6 years ago

The files don't look empty to me. The tsvs look sane enough, and the h5s are several MB. The error logs for the kallisto runs all looked normal enough to me too.

Here's the above sample: https://drive.google.com/file/d/1puBadjckcFas0pDHZsf4UOdlYVFdMvP-/view?usp=sharing

warrenmcg commented 6 years ago

Hi @channsoden,

If the kallisto log files do not show any errors, then the abundance.h5 files are likely corrupted. The one you sent me can be read if you run the following:

test_kal <- rhdf5::h5read('wt/W795_22C_L_corrected_kallisto/abundance.h5', '/')
rhdf5::H5close()

When I did this, I got several read errors. I also found at least three things that stood out as troubling:

1. test_kal$est_counts is NULL, which it should not be.
2. test_kal$aux$ids was also empty.
3. test_kal$aux$lengths was also showing very weird results.

As I have seen with other issues, it's probably not an issue with kallisto or sleuth, but with whatever cloud storage you're using to transfer files around. Let me know if you get different results when you try the h5read code above out on your end.

channsoden commented 6 years ago

The fields of test_kal are entirely NULL. Weird. I'll run a checksum on the original files and maybe try running kallisto again to see if I get something different.

warrenmcg commented 6 years ago

The odd thing for me is that the abundance.tsv file you sent me has nothing grossly wrong with it; it's just your abundance.h5 file. My guess is that it's a sync/cloud issue, but feel free to report a bug to the kallisto team if the problem persists and you can confirm it's not related to syncing or cloud storage.

channsoden commented 6 years ago

FYI, the checksums check out, so if the files got corrupted it happened somehow on my server's filesystem.

pimentel commented 6 years ago

@channsoden This is definitely a kallisto issue. I've downloaded the files and played with them a bit and am getting a lot of NULL values.

But before we report a bug there, I noticed you are running version 0.43.0. Any chance you can update to the latest v0.44.0 and rerun?

Thanks,

Harold

channsoden commented 6 years ago

I repeated the run with v0.44.0 to the same effect. Should I open a bug for kallisto? I can provide you with additional sample data from the 0.44.0 run.

pimentel commented 6 years ago

Hi @channsoden,

thanks for verifying. Yes, please open a kallisto bug: https://github.com/pachterlab/kallisto/issues

Things that will be important:

Thanks!

Harold

pimentel commented 6 years ago

Sorry, that's supposed to be:

ldd $(which kallisto)

pimentel commented 6 years ago

@channsoden I looked at your files in the kallisto bug report and I'm not having any issues with them. Is the error the same or different now? Note: I couldn't read your old files, but these I can. Is it possible you didn't update the path in "metadata_table.tsv"? (I only ask because I am guilty of doing this sort of thing myself.)

Here is what I get:

R> Sys.glob('*/abundance.h5')
[1] "MM2_16C_L_A100-A101-A102_L001_corrected_kallisto/abundance.h5"
[2] "MM2_16C_L_A100-A101-A102_L002_corrected_kallisto/abundance.h5"
[3] "MM2_16C_L_A100-A101-A102_L003_corrected_kallisto/abundance.h5"
R> x = Sys.glob('*/abundance.h5')
R> tmp = lapply(x, read_kallisto)
Found 100 bootstrap samples
Found 100 bootstrap samples
Found 100 bootstrap samples
R> tmp[[1]]
    kallisto object

transcripts:  10345
original number of transcripts:  10345
Original or Subset:  Subset
bootstraps:  100
R> tmp[[1]]$abundance %>% head
     target_id est_counts  eff_len  len        tpm
1 NEUDI_100023        199 1849.681 2060   6.390675
2 NEUDI_100042        415 1148.070 1309  21.471897
3 NEUDI_100045        400 3507.601 3708   6.773926
4 NEUDI_100052       4863 1643.335 1866 175.779709
5 NEUDI_100057        857 1519.923 1697  33.492676
6 NEUDI_100075       1695 2522.850 2703  39.908822

Everything looks fairly normal now?

channsoden commented 6 years ago

@pimentel Yes, I am still getting the exact same errors. I put the new files in the same path as before, and when I do

test_kal <- rhdf5::h5read('MM2_16C_L_A100-A101-A102_L001_corrected_kallisto/abundance.h5', '/')  
rhdf5::H5close()

the h5read function doesn't throw any errors but test_kal is NULL in all fields.

pimentel commented 6 years ago

ah, not the case for me. it definitely works and looks correct for me now.

in some ways this is good and bad. Perhaps good for me, since it means there probably isn't an HDF5 bug in kallisto; bad for you, because it means that there's something weird with your rhdf5.

do you have the HDF5 command-line utilities installed on that machine? if so, what do you get when you run

h5dump -d /est_counts abundance.h5

channsoden commented 6 years ago

Hey, thanks for helping me troubleshoot this. Good to narrow down the source of the problem - I may try a different machine to see if I can replicate.

The dump seems to work:

HDF5 "abundance.h5" {
DATASET "/est_counts" {
   DATATYPE  H5T_IEEE_F64LE
   DATASPACE  SIMPLE { ( 10345 ) / ( 10345 ) }
   DATA {
   (0): 202, 310, 306, 2492, 301, 1077, 466, 398, 1242, 214, 1373, 358, 302,
   (13): 203, 766, 3540, 898, 2216, 764, 1467, 261, 358, 1279, 517, 29, 123,
   (26): 138, 451, 118, 150, 565, 6141, 558, 126, 235, 5, 174, 255, 304, 50,
   (40): 31, 287, 204, 436, 59, 1636, 780, 63, 2593, 351, 292.011, 852, 1295,
   (53): 50, 465, 170, 165, 10, 301, 440, 121, 242, 975, 146, 562, 2486, 260,
   ...
   (10340): 0, 0, 0, 0.2, 0
   }
}
}

warrenmcg commented 6 years ago

@channsoden, what version of rhdf5 do you have installed? More generally, it would be helpful to see your sessionInfo() output after the error.

channsoden commented 6 years ago

This affects me with rhdf5 2.22.0 and 2.25.3.

Here's my sessionInfo immediately after the error:

R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /home/christopher/anaconda3/lib/R/lib/libRblas.so
LAPACK: /home/christopher/anaconda3/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] bindrcpp_0.2.2 sleuth_0.29.0 dplyr_0.7.5
[4] ggplot2_2.2.1 RevoUtils_10.0.8 RevoUtilsMath_10.0.1

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17 compiler_3.4.3 pillar_1.2.3
[4] plyr_1.8.4 bindr_0.1.1 base64enc_0.1-3
[7] tools_3.4.3 digest_0.6.15 uuid_0.1-2
[10] rhdf5_2.25.3 jsonlite_1.5 evaluate_0.10.1
[13] tibble_1.4.2 gtable_0.2.0 pkgconfig_2.0.1
[16] rlang_0.2.1 IRdisplay_0.5.0 parallel_3.4.3
[19] IRkernel_0.8.12.9000 repr_0.15.0 stringr_1.3.1
[22] grid_3.4.3 tidyselect_0.2.4 data.table_1.11.4
[25] glue_1.2.0 R6_2.2.2 pbdZMQ_0.3-3
[28] tidyr_0.8.1 Rhdf5lib_1.0.0 purrr_0.2.5
[31] magrittr_1.5 scales_0.5.0 htmltools_0.3.6
[34] assertthat_0.2.0 colorspace_1.3-2 stringi_1.2.2
[37] lazyeval_0.2.1 munsell_0.4.3 crayon_1.3.4

warrenmcg commented 6 years ago

Interesting: I'm running rhdf5 version 2.18.0. If Harold is also running an older version, that would point to the newer versions of rhdf5 as the cause. We still don't know its exact nature, but it seems we're getting closer?

warrenmcg commented 6 years ago

@pimentel and @channsoden, after some thinking, here's an idea:

Chris, if you run those, do you get different versions? If my hypothesis is correct, this bug is occurring because the system hdf5 lib (possibly 1.10?) is newer than the hdf5 lib (which will be 1.8.x) used by rhdf5, so rhdf5 can't recognize the new-format HDF5 files, as discussed on the HDF5 website.

channsoden commented 6 years ago

You're on the nose with the versions.

Server (Kallisto): HDF5 Version: 1.10.1
Local (Sleuth): This is Bioconductor rhdf5 2.25.3 linking to C-library HDF5 1.8.19

So I need to install the new HDF5 library locally? Any guidance there, or shall I just start BASHing away?

warrenmcg commented 6 years ago

rhdf5 ships its own hdf5 lib internally, so we would either have to ask the developer there to switch the package to hdf5 lib 1.10, or you would need to install the older version of hdf5 lib (1.8.x) on your machine. What I don't know is whether having both versions on your machine works, and which one kallisto would use. If you're comfortable experimenting, that would be the next step.

@pimentel, from our end, I think the best thing to do is open an issue in kallisto to ask that the hdf5 lib version be recorded either in the h5 files themselves or in the run info JSON file. It would also be worth including a warning if the hdf5 lib is version 1.10, since such files can't currently be processed by sleuth. Then sleuth could check that the hdf5 lib version bundled with rhdf5 is compatible with the version used by kallisto.

pimentel commented 6 years ago

ah, somehow I missed this issue:

https://github.com/pachterlab/kallisto/issues/171#issuecomment-387706627

I think we should really file something with the rhdf5 people. The way kallisto works, hdf5 is built into the binary.

channsoden commented 6 years ago

@warrenmcg For my purposes, installing hdf5 1.8.x seems untenable. I am working on an institutional cluster and would need to install it alongside 1.10 within my home directory; I foresee many headaches getting the environment configured correctly...

warrenmcg commented 6 years ago

@pimentel, what version of hdf5 is built into the prebuilt binaries that are downloadable? How could a user find this out? In the short term, I think the only way to get @channsoden up and running as quickly as possible is to give him a binary built on hdf5 lib version 1.8.x. Could I send @channsoden my binary of kallisto, for example, which I presume is built on hdf5 lib 1.8.x?

pmelsted commented 6 years ago

The kallisto binary we distribute always uses hdf5 version 1.8.15

warrenmcg commented 6 years ago

Thanks for the info @pmelsted!

So that argues against my idea that it's a clash of hdf5 libs, and leaves us with the weird issue described over at kallisto. @channsoden, could you confirm which version of R you are running? We can collect all of this and open an issue on rhdf5.

channsoden commented 6 years ago

I am indeed running 3.4.3. Here's my sessionInfo():

R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /home/christopher/anaconda3/lib/R/lib/libRblas.so
LAPACK: /home/christopher/anaconda3/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] BiocInstaller_1.28.0 RevoUtils_10.0.8     RevoUtilsMath_10.0.1

loaded via a namespace (and not attached):
 [1] httr_1.3.1      compiler_3.4.3  R6_2.2.2        tools_3.4.3    
 [5] withr_2.1.2     curl_3.2        memoise_1.1.0   git2r_0.21.0   
 [9] digest_0.6.15   devtools_1.13.5

channsoden commented 6 years ago

I've been working with Mike Smith of rhdf5 to try to get it to use the system libraries, but I have so far not been able to get that installed. I have also been trying some other workarounds to no avail. Eventually I got fed up and just installed it within a Docker container. Success!

The image is based on Jupyter's r-notebook and is hosted on Docker Cloud. Use docker run -p 8888:8888 -v /path/to/kallisto/results:/home/jovyan/kallisto channsoden/sleuth to start the container, then copy the token and point your browser to http://localhost:8888/?token=[TOKEN]. The directory mounted to /home/jovyan/kallisto is shared between the host and container filesystems, so changes or output written there are easy to grab.

warrenmcg commented 6 years ago

@channsoden: excellent work. Just to clarify, did you build a new version of rhdf5 or of kallisto? I have never used Docker, and I failed in my attempt to run your command, so I am unable to directly view what you did.

Just to be clear on two things:

channsoden commented 6 years ago

@warrenmcg: I have been taking Mike's directions to rebuild rhdf5, but haven't yet succeeded.

I originally used the kallisto pre-installed on my institution's cluster (probably pre-built). I then tried the latest version of the pre-built binary.

For the Docker image, Jupyter's r-notebook stack already uses conda, so I did the conda install on top of that stack and rebuilt it:

echo "FROM jupyter/r-notebook" > Dockerfile
echo "RUN conda install --channel bioconda r-sleuth" >> Dockerfile
sudo docker build -t sleuth .
sudo docker login
sudo docker tag sleuth channsoden/sleuth:14Jun2018
sudo docker push channsoden/sleuth:14Jun2018

Obviously this requires having set up a Docker Cloud account "channsoden" and created a repository "sleuth". The first three lines could be used to build it yourself, in which case you could run it simply as

sudo docker run -p 8888:8888 -v /path/to/kallisto_results:/home/jovyan/kallisto sleuth

The reason this works may be that the current Jupyter R notebook build uses R 3.4.1.

If this is something you wanted to maintain, it would just need to be rebuilt periodically with new releases of the R notebook or sleuth.

grimbough commented 6 years ago

For reference there's a branch of rhdf5 available at https://github.com/grimbough/rhdf5/tree/system_lib which should allow someone to specify the location of their own HDF5 library, rather than relying on the default version distributed in Rhdf5lib.

egenomics commented 5 years ago

Same issue here; I can't make this work in a freshly installed R with the latest versions of everything, and I am having a lot of trouble getting the different packages to work in earlier versions.

grimbough commented 5 years ago

Can you make the h5 file you're trying to read available somewhere? Maybe also include the output of sessionInfo() so we can see exactly what versions of the various packages you're using.

marcora commented 5 years ago

Same issue here with mro-base=3.5.1 installed using conda (hdf5=1.10.3, kallisto=0.45.0). All R packages were installed within R, not using conda itself.

Error in H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem, : HDF5. Dataset. Read failed.


Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.9 (Final)

Matrix products: default
BLAS: /sc/orga/projects/LOAD/Dado/miniconda3/lib/R/lib/libRblas.so
LAPACK: /sc/orga/projects/LOAD/Dado/miniconda3/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C             
 [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8    
 [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=en_US.utf8   
 [7] LC_PAPER=en_US.utf8       LC_NAME=C                
 [9] LC_ADDRESS=C              LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C      

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] sleuth_0.30.0        magrittr_1.5         forcats_0.4.0       
 [4] stringr_1.4.0        dplyr_0.8.0.1        purrr_0.3.0         
 [7] readr_1.3.1          tidyr_0.8.2          tibble_2.0.1        
[10] ggplot2_3.1.0        tidyverse_1.2.1      RevoUtils_11.0.1    
[13] RevoUtilsMath_11.0.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0        cellranger_1.1.0  pillar_1.3.1      compiler_3.5.1   
 [5] plyr_1.8.4        tools_3.5.1       rhdf5_2.26.2      jsonlite_1.6     
 [9] lubridate_1.7.4   gtable_0.2.0      nlme_3.1-137      lattice_0.20-38  
[13] pkgconfig_2.0.2   rlang_0.3.1       cli_1.0.1         rstudioapi_0.9.0 
[17] parallel_3.5.1    haven_2.1.0       withr_2.1.2       xml2_1.2.0       
[21] httr_1.4.0        generics_0.0.2    hms_0.4.2         grid_3.5.1       
[25] tidyselect_0.2.5  data.table_1.12.0 glue_1.3.0        R6_2.4.0         
[29] fansi_0.4.0       readxl_1.3.0      Rhdf5lib_1.4.2    modelr_0.1.4     
[33] backports_1.1.3   scales_1.0.0      rvest_0.3.2       assertthat_0.2.0 
[37] colorspace_1.4-0  utf8_1.1.4        stringi_1.3.1     lazyeval_0.2.1   
 [41] munsell_0.5.0     broom_0.5.1       crayon_1.3.4

warrenmcg commented 5 years ago

Hi @channsoden,

For you and for any other user who comes here with a similar problem: if you built R and rhdf5/Rhdf5lib using bioconda, your issue may be related to the discussion on issue #120. There was a ton of work to uncover the problem (specifically, this comment proposed the underlying cause), and we think @johanneskoester's latest bioconda recipes for kallisto and rhdf5 are able to handle these issues. If you are still having problems, please install the latest bioconda recipes and try again!

johanneskoester commented 5 years ago

To clarify @warrenmcg, when using the Bioconda packages, we are talking about installation. There is no building involved anymore, because they are precompiled by us and you just install the binaries.

jbedo commented 5 years ago

FWIW, I resolved this problem by backdating rhdf5 to grimbough/rhdf5@a9824faadf2bc9f2e23ba647aa77ed9f25236f9f and linking against HDF5 v1.8.19 (system install, not via Rhdf5lib).