channsoden opened this issue 6 years ago
Hi @channsoden,

Thanks for using the tool, and for reporting your issue. To start identifying the problem, I would suggest running the following code, which uses some of the guts of `sleuth_prep` (this assumes you already have `s2c` created, with a `path` column):
```r
## collect the data from your kallisto results
kal_list <- lapply(seq_along(s2c$path), function(i) {
  path <- s2c$path[i]
  suppressMessages(kal <- read_kallisto(path, read_bootstrap = FALSE))
  kal$abundance <- dplyr::mutate(kal$abundance, sample = s2c$sample[i])
  kal
})
```
```r
## make the 'obs_raw' data frame
obs_raw <- dplyr::bind_rows(lapply(kal_list, function(k) k$abundance))

## FIRST CHECK: this should pass
stopifnot(all(s2c$sample %in% obs_raw$sample))

## If the check fails, run the line below to identify which samples are
## causing the problem; those samples are the issue.
## If the check passes, then this isn't the issue.
s2c$sample[which(!(s2c$sample %in% obs_raw$sample))]

## now run the code below (guts of 'spread_abundance_by'):
abund <- data.table::as.data.table(obs_raw)
var_spread <- data.table::dcast(abund, target_id ~ sample,
                                value.var = "est_counts")
var_spread <- var_spread[order(var_spread$target_id), ]
var_spread <- as.data.frame(var_spread, stringsAsFactors = FALSE)
rownames(var_spread) <- var_spread$target_id
var_spread["target_id"] <- NULL
result <- as.matrix(var_spread)

## The line below should identify which samples are causing a problem;
## if it doesn't yield anything, then we have something very unusual happening.
s2c$sample[which(!(s2c$sample %in% colnames(result)))]
```
Let me know if that helps identify the problematic samples.
Thank you for your prompt reply!
It does indeed fail at the first check: `obs_raw` is entirely empty.
The `read_kallisto` function doesn't look like it's succeeding, e.g.:

```r
> read_kallisto("wt/W795_22C_L_corrected_kallisto", read_bootstrap = FALSE)
kallisto object
transcripts: 0
bootstraps: 0
```
It seems like the kallisto files might be empty, or otherwise something went really wrong with the input process.

If the files are empty, that pushes the troubleshooting back to the kallisto step on your end, but it's still important for us, because it means we should add a sanity check to make sure the file actually has something in it.

If the kallisto file is present and an appropriate size (a few MB, usually), would you mind sending me an example kallisto file so I can do some additional troubleshooting on my end?
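For what it's worth, a minimal sketch of the kind of sanity check described above; the path and the 100 KB threshold here are illustrative assumptions, not sleuth code:

```r
## Illustrative sketch only: check that the kallisto output exists and
## is not suspiciously small before trying to read it. The path and the
## size threshold are assumptions for this example.
h5_path <- file.path("wt/W795_22C_L_corrected_kallisto", "abundance.h5")
stopifnot(file.exists(h5_path))
if (file.size(h5_path) < 1e5) {
  warning("abundance.h5 is smaller than 100 KB; ",
          "the kallisto run may have failed or the file may be truncated")
}
```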
The files don't look empty to me. The tsvs look sane enough, and the h5s are several MB. The error logs for the kallisto runs all looked normal enough to me too.
Here's the above sample: https://drive.google.com/file/d/1puBadjckcFas0pDHZsf4UOdlYVFdMvP-/view?usp=sharing
Hi @channsoden,
If the kallisto log files do not show any errors, then the `abundance.h5` files are corrupted. You can try reading the one you sent me with the following lines:

```r
test_kal <- rhdf5::h5read('wt/W795_22C_L_corrected_kallisto/abundance.h5', '/')
rhdf5::H5close()
```
When I did this, I first of all got several read errors. I also found at least three things that stood out as troubling:

1) `test_kal$est_counts` is `NULL`, which it should not be.
2) `test_kal$aux$ids` was also empty.
3) `test_kal$aux$lengths` was also showing very weird results.
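If it helps, a quick way to reproduce those three checks on your end (a minimal sketch using the `test_kal` object from the `h5read` call above):

```r
## Quick checks corresponding to the three observations above
is.null(test_kal$est_counts)   # should be FALSE for a healthy file
length(test_kal$aux$ids)       # should equal the number of transcripts
summary(test_kal$aux$lengths)  # should show plausible transcript lengths
```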
As I have seen with other issues, it's probably not an issue with kallisto or sleuth, but with whatever cloud storage you're using to transfer files around. Let me know if you get different results when you try out the `h5read` code above on your end.
The fields of `test_kal` are entirely `NULL`. Weird. I'll run a checksum on the original files, and maybe try running kallisto over again to see if I get something different.
The odd thing for me is that the `abundance.tsv` file you sent me does not have anything grossly wrong with it; it's just your `abundance.h5` file. My guess is that it's a sync/cloud issue, but feel free to report an error to the kallisto team if the problem persists and you can confirm it's not related to syncing or cloud storage problems.
FYI, the checksums check out, so if the files got corrupted it was somehow on my server filesystem.
@channsoden This is definitely a kallisto issue. I've downloaded the files and played with them a bit and am getting a lot of NULL values.
But before we open a bug report there, I noticed you are running version 0.43.0. Any chance you can update to the latest v0.44.0 and rerun?
Thanks,
Harold
I repeated the run with v0.44.0 to the same effect. Should I open up a bug for kallisto? I can provide you with additional sample data from the 0.44.0 run.
Hi @channsoden,
thanks for verifying. Yes, please open a kallisto bug: https://github.com/pachterlab/kallisto/issues
Things that will be important:

```
ld $(which kallisto)
```
Thanks!
Harold
Sorry, that's supposed to be:

```
ldd $(which kallisto)
```
@channsoden I looked at your files in the kallisto bug report and I'm not having any issues with these. Is the error the same or different now? Note: your old files I couldn't read, but these ones I can. Is it possible you didn't update the path in `metadata_table.tsv`? (I only ask because I am guilty of doing this sort of thing myself.)
Here is what I get:
```r
R> Sys.glob('*/abundance.h5')
[1] "MM2_16C_L_A100-A101-A102_L001_corrected_kallisto/abundance.h5"
[2] "MM2_16C_L_A100-A101-A102_L002_corrected_kallisto/abundance.h5"
[3] "MM2_16C_L_A100-A101-A102_L003_corrected_kallisto/abundance.h5"
R> x = Sys.glob('*/abundance.h5')
R> tmp = lapply(x, read_kallisto)
Found 100 bootstrap samples
Found 100 bootstrap samples
Found 100 bootstrap samples
R> tmp[[1]]
kallisto object
transcripts: 10345
original number of transcripts: 10345
Original or Subset: Subset
bootstraps: 100
R> tmp[[1]]$abundance %>% head
     target_id est_counts  eff_len  len        tpm
1 NEUDI_100023        199 1849.681 2060   6.390675
2 NEUDI_100042        415 1148.070 1309  21.471897
3 NEUDI_100045        400 3507.601 3708   6.773926
4 NEUDI_100052       4863 1643.335 1866 175.779709
5 NEUDI_100057        857 1519.923 1697  33.492676
6 NEUDI_100075       1695 2522.850 2703  39.908822
```
Everything looks fairly normal now?
@pimentel Yes, I am still getting the exact same errors. I put the new files in the same path as before, and when I do

```r
test_kal <- rhdf5::h5read('MM2_16C_L_A100-A101-A102_L001_corrected_kallisto/abundance.h5', '/')
rhdf5::H5close()
```

the `h5read` function doesn't throw any errors, but `test_kal` is `NULL` in all fields.
ah, not the case for me. It definitely works and looks correct for me now.

In some ways this is good and bad. Perhaps good for me, since it means there probably isn't an hdf5 bug in kallisto; bad for you, because it means there's something weird with your rhdf5.

Do you have the HDF5 utilities installed on that machine? If so, what do you get when you type:

```
h5dump -d /est_counts abundance.h5
```
Hey, thanks for helping me troubleshoot this. It's good to narrow down the source of the problem; I may try a different machine to see if I can replicate.

The dump seems to work:

```
HDF5 "abundance.h5" {
DATASET "/est_counts" {
   DATATYPE  H5T_IEEE_F64LE
   DATASPACE  SIMPLE { ( 10345 ) / ( 10345 ) }
   DATA {
   (0): 202, 310, 306, 2492, 301, 1077, 466, 398, 1242, 214, 1373, 358, 302,
   (13): 203, 766, 3540, 898, 2216, 764, 1467, 261, 358, 1279, 517, 29, 123,
   (26): 138, 451, 118, 150, 565, 6141, 558, 126, 235, 5, 174, 255, 304, 50,
   (40): 31, 287, 204, 436, 59, 1636, 780, 63, 2593, 351, 292.011, 852, 1295,
   (53): 50, 465, 170, 165, 10, 301, 440, 121, 242, 975, 146, 562, 2486, 260,
   ...
   (10340): 0, 0, 0, 0.2, 0
   }
}
}
```
@channsoden, what version of rhdf5 do you have installed? In other words, it will be helpful to see your `sessionInfo()` output after the error.
This affects me with rhdf5 2.22.0 and 2.25.3.
Here's my sessionInfo immediately after the error:
```
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /home/christopher/anaconda3/lib/R/lib/libRblas.so
LAPACK: /home/christopher/anaconda3/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] bindrcpp_0.2.2       sleuth_0.29.0        dplyr_0.7.5
[4] ggplot2_2.2.1        RevoUtils_10.0.8     RevoUtilsMath_10.0.1

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17         compiler_3.4.3       pillar_1.2.3
 [4] plyr_1.8.4           bindr_0.1.1          base64enc_0.1-3
 [7] tools_3.4.3          digest_0.6.15        uuid_0.1-2
[10] rhdf5_2.25.3         jsonlite_1.5         evaluate_0.10.1
[13] tibble_1.4.2         gtable_0.2.0         pkgconfig_2.0.1
[16] rlang_0.2.1          IRdisplay_0.5.0      parallel_3.4.3
[19] IRkernel_0.8.12.9000 repr_0.15.0          stringr_1.3.1
[22] grid_3.4.3           tidyselect_0.2.4     data.table_1.11.4
[25] glue_1.2.0           R6_2.2.2             pbdZMQ_0.3-3
[28] tidyr_0.8.1          Rhdf5lib_1.0.0       purrr_0.2.5
[31] magrittr_1.5         scales_0.5.0         htmltools_0.3.6
[34] assertthat_0.2.0     colorspace_1.3-2     stringi_1.2.2
[37] lazyeval_0.2.1       munsell_0.4.3        crayon_1.3.4
```
Interesting; I'm running rhdf5 version 2.18.0. If Harold is also running an older version, that would point to the newer versions of rhdf5 as the cause. We still don't know the exact nature of the problem, but it seems we're getting closer?
@pimentel and @channsoden, after some thinking, here's an idea. Two things to check:

1) What version of the hdf5 lib is installed on your system? (`which h5cc` would show that path, and `h5cc -showconfig | head -6` would show the version number.)
2) What version of the hdf5 lib is rhdf5 linking to? (`rhdf5::h5version()`)

Chris, if you run those, do you get different versions? If my hypothesis is correct, this bug is occurring because the system hdf5 lib (possibly 1.10?) is newer than the hdf5 lib used by rhdf5 (which will be 1.8.x), so rhdf5 can't recognize the new-format HDF5 files, as discussed on the HDF5 website.
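For convenience, a minimal sketch that runs both checks from one R session (this assumes the HDF5 command-line tools are on your `PATH`):

```r
## system hdf5 (the one kallisto was likely built against)
system("which h5cc")
system("h5cc -showconfig | head -6")

## hdf5 C-library that rhdf5 links against
rhdf5::h5version()
```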
You're on the nose with the versions.

Server (kallisto): HDF5 Version 1.10.1
Local (sleuth): This is Bioconductor rhdf5 2.25.3 linking to C-library HDF5 1.8.19
So I need to install the new HDF5 library locally? Any guidance there, or shall I just start BASHing away?
rhdf5 installs its own internally, so we either would have to ask the developer there to switch the package to hdf5 lib 1.10, or you would need to install the older version of hdf5 lib on your machine (1.8.x). What I don’t know is whether it works to have both versions on your machine and which kallisto will use. If you’re comfortable experimenting, that would be the next step.
@pimentel, from our end, I think the best thing to do is open an issue in kallisto to see if the hdf5 lib version can be included either in the h5 files themselves or in the run info JSON file. It would also be worth it to include a warning if the hdf5 lib is version 1.10, since such files can't currently be processed by sleuth. Then, sleuth could do a check to make sure that the rhdf5 internal hdf5 lib version is compatible with the version used by kallisto; a rough sketch of that idea is below.
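Purely as an illustration of that proposal, here is a hypothetical sketch of what such a sleuth-side check could look like. The `hdf5_version` field in `run_info.json` does not exist today and is an assumption; kallisto would first need to write it:

```r
## Hypothetical sketch of the compatibility check proposed above.
## Assumptions: kallisto would record an "hdf5_version" field in
## run_info.json (it does not today), and rhdf5_version is the HDF5
## C-library version rhdf5 links to (shown by rhdf5::h5version()).
check_hdf5_compat <- function(run_info_path, rhdf5_version = "1.8.19") {
  info <- jsonlite::fromJSON(run_info_path)
  kallisto_version <- info$hdf5_version  # assumed field
  if (package_version(kallisto_version) > package_version(rhdf5_version)) {
    warning("kallisto wrote this file with HDF5 ", kallisto_version,
            " but rhdf5 links against HDF5 ", rhdf5_version,
            "; sleuth may not be able to read it")
  }
  invisible(NULL)
}
```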
ah, somehow I missed this issue:
https://github.com/pachterlab/kallisto/issues/171#issuecomment-387706627
I think we should really file something with the rhdf5 people. The way that kallisto works is that hdf5 is built into the binary.
@warrenmcg For my purposes, installing hdf5 1.8.x seems untenable. I am working on an institutional cluster and would need to install it alongside 1.10 within my home directory. I foresee many headaches getting the environment configured correctly...
@pimentel, what version of hdf5 is built into the prebuilt binaries that are downloadable? How could a user find this out? In the short term, I think the only way we can get @channsoden up and running as quickly as possible is to give him a pre-built binary built on hdf5 lib version 1.8.x. Could I send @channsoden my binary of kallisto, for example, which I presume is built on hdf5 lib 1.8.x?
The kallisto binary we distribute always uses hdf5 version 1.8.15
Thanks for the info @pmelsted!
So, that argues against my idea that it's a clash of hdf5 libs and leaves us with the weird issue described over at kallisto. @channsoden, could you confirm which version of R you are running? We can collect all of this and open an issue on rhdf5.
I am indeed running 3.4.3. Here's my sessionInfo():
```
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /home/christopher/anaconda3/lib/R/lib/libRblas.so
LAPACK: /home/christopher/anaconda3/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] BiocInstaller_1.28.0 RevoUtils_10.0.8     RevoUtilsMath_10.0.1

loaded via a namespace (and not attached):
[1] httr_1.3.1      compiler_3.4.3  R6_2.2.2        tools_3.4.3
[5] withr_2.1.2     curl_3.2        memoise_1.1.0   git2r_0.21.0
[9] digest_0.6.15   devtools_1.13.5
```
I've been working with Mike Smith of rhdf5 to try to get it to use the system libraries, but I have so far not been able to get that installed. I have also been trying some other workarounds, to no avail. Eventually I got fed up and just installed it within a Docker container. Success!

The image is based on Jupyter's r-notebook, and is hosted on Docker Cloud. Use

```
docker run -p 8888:8888 -v /path/to/kallisto/results:/home/jovyan/kallisto channsoden/sleuth
```

to start the container, then copy the token and point your browser to `http://localhost:8888/?token=[TOKEN]`. The directory mounted to `/home/jovyan/kallisto` is shared between the host and container filesystems, so changes or output written there are easy to grab.
@channsoden: excellent work on what you did. Just to clarify, you built a new version of rhdf5 or kallisto? I have never used Docker, and I failed in my attempt to run your command, so I am unable to directly view what you did.
Just to be clear on two things:

1) @warrenmcg: I have been taking Mike's directions to rebuild rhdf5, but haven't yet succeeded.
2) I originally used the kallisto pre-installed on my institution's cluster (probably pre-built). I then tried the latest version of the pre-built binary.

For the Docker image, Jupyter's r-notebook stack already utilizes conda, so I simply did the conda install on top of that stack and rebuilt it. Quite simply:
echo "FROM jupyter/r-notebook" > Dockerfile
echo "RUN conda install --channel bioconda r-sleuth" >> Dockerfile
sudo docker build -t sleuth .
sudo docker login
sudo docker tag sleuth channsoden/sleuth:14Jun2018
sudo docker push channsoden/sleuth:14Jun2018
Obviously this requires having set up a Docker Cloud account "channsoden" and created a repository "sleuth". The first 3 lines could be used to build it yourself, in which case you could run it as simply
sudo docker run -p 8888:8888 -v /path/to/kallisto_results:/home/joyvan/kallisto sleuth
The reason this works may be that the current Jupyter R notebook build uses R 3.4.1.

If this is something you wanted to maintain, it would just need to be rebuilt periodically with new releases of the R notebook or sleuth.
For reference, there's a branch of rhdf5 available at https://github.com/grimbough/rhdf5/tree/system_lib which should allow someone to specify the location of their own HDF5 library, rather than relying on the default version distributed in Rhdf5lib.
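An untested sketch of how one might try that branch; the configure options for pointing it at a system HDF5 are specific to the branch and not shown here:

```r
## Untested sketch: install the system_lib branch of rhdf5 from GitHub.
## Pointing it at a specific HDF5 install requires configure options
## documented on that branch, which are omitted here.
devtools::install_github("grimbough/rhdf5", ref = "system_lib")
```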
Same issue here; I can't make this work in a newly installed R with the latest versions of everything, and I am having a lot of trouble getting the different packages to work in earlier versions.
Can you make the h5 file you're trying to read available somewhere? Maybe also include the output of `sessionInfo()` so we can see exactly what versions of the various packages you're using.
Same issue here with mro-base=3.5.1 installed using conda (hdf5=1.10.3, kallisto=0.45.0). All R packages were installed within R, not using conda itself.

```
Error in H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem, :
  HDF5. Dataset. Read failed.
```
```
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.9 (Final)

Matrix products: default
BLAS: /sc/orga/projects/LOAD/Dado/miniconda3/lib/R/lib/libRblas.so
LAPACK: /sc/orga/projects/LOAD/Dado/miniconda3/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C
 [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8
 [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=en_US.utf8
 [7] LC_PAPER=en_US.utf8       LC_NAME=C
 [9] LC_ADDRESS=C              LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] sleuth_0.30.0        magrittr_1.5         forcats_0.4.0
 [4] stringr_1.4.0        dplyr_0.8.0.1        purrr_0.3.0
 [7] readr_1.3.1          tidyr_0.8.2          tibble_2.0.1
[10] ggplot2_3.1.0        tidyverse_1.2.1      RevoUtils_11.0.1
[13] RevoUtilsMath_11.0.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0        cellranger_1.1.0  pillar_1.3.1      compiler_3.5.1
 [5] plyr_1.8.4        tools_3.5.1       rhdf5_2.26.2      jsonlite_1.6
 [9] lubridate_1.7.4   gtable_0.2.0      nlme_3.1-137      lattice_0.20-38
[13] pkgconfig_2.0.2   rlang_0.3.1       cli_1.0.1         rstudioapi_0.9.0
[17] parallel_3.5.1    haven_2.1.0       withr_2.1.2       xml2_1.2.0
[21] httr_1.4.0        generics_0.0.2    hms_0.4.2         grid_3.5.1
[25] tidyselect_0.2.5  data.table_1.12.0 glue_1.3.0        R6_2.4.0
[29] fansi_0.4.0       readxl_1.3.0      Rhdf5lib_1.4.2    modelr_0.1.4
[33] backports_1.1.3   scales_1.0.0      rvest_0.3.2       assertthat_0.2.0
[37] colorspace_1.4-0  utf8_1.1.4        stringi_1.3.1     lazyeval_0.2.1
[41] munsell_0.5.0     broom_0.5.1       crayon_1.3.4
```
Hi @channsoden,
For you and for any other user who comes here with a similar problem: if you built R and rhdf5/Rhdf5lib using bioconda, your issue may be related to the discussion on issue #120. There was a ton of work to uncover the problem (specifically, this comment proposed the underlying cause), and we think @johanneskoester's latest bioconda recipes for kallisto and rhdf5 are able to handle these issues. If you are still having issues, please try installing the latest bioconda recipes of these and try them out!
To clarify, @warrenmcg: when using the Bioconda packages, we are talking about installation. There is no building involved anymore, because they are precompiled by us and you just install the binaries.
FWIW, I resolved this problem by backdating rhdf5 to grimbough/rhdf5@a9824faadf2bc9f2e23ba647aa77ed9f25236f9f and linking against HDF5 v1.8.19 (system install, not via Rhdf5lib).
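A sketch of that workaround, assuming devtools is available; the steps for linking against a system HDF5 1.8.19 are not shown here:

```r
## Untested sketch: pin rhdf5 to the commit mentioned above.
## Linking it against a system HDF5 1.8.19 (rather than Rhdf5lib)
## requires additional configuration not shown here.
devtools::install_github("grimbough/rhdf5",
                         ref = "a9824faadf2bc9f2e23ba647aa77ed9f25236f9f")
```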
`sleuth_prep` seems to always error with "subscript out of bounds" during the "normalizing est_counts" step. I'm trying this on my own data since, as others have pointed out, the link to the sample data is broken.

Any advice on troubleshooting this?