subsetting kallisto object using subset_kallisto()

agropyron commented 6 years ago

Hi, @warrenmcg's advice in Issue #201 to subset files (below) does not work for me

kal <- sleuth::read_bootstrap(kal_path, read_bootstrap = TRUE) ids <- c("ENSMUST00000000001", "ENSMUST00000000003", "ENSMUST00000000129") subset_kal <- sleuth::subset_kallisto(kal, target_ids = ids) write_kallisto_hdf5(subset_kal, fname = "subset_kal.h5")

For importing I used _readkallisto - I think _readbootstrap is a typo. Worked. _subsetkallisto(kal, target_ids=ids) results in a file with the correct number of rows (616) in kal$bootstrap but with zero rows in kal$abundance. I proceeded through_write_kallistohdf5 and, for a sanity check, reimported the subsetted files using _readkallisto, which fails (understandably) with the error below.

Error in H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem, : Not enough memory to read data! Try to read a subset of data by specifying the index or count parameter. Found 100 bootstrap samples Error in $<-.data.frame(*tmp*, "est_counts", value = c(776, 4054, : replacement has 616 rows, data has 0

When I use a custom function to subset the files, providing additional attributes (original_num_targets, num_targets, excluded_ids) in the same way _subsetkallisto does, I get the correct number of rows in kal$abundance and kal$bootstrap. However, I can't cheat sleuth into accepting the new attributes: _is_kallistosubset does not report the file as subsetted.

Responsible is either _write_kallistohdf5 or _read_kallistoh5. I think part of the problem lies with the below lines in _read_kallistoh5, but I can't figure out why the excluded_ids are not shown when calling summary(kal)

attr(res, 'num_targets') <- nrow(abund) ##ok attr(res, 'original_num_targets') <- nrow(abund) ##importing again the number rows instead of the attribute original_num_targets written while subsetting file

laurabogar commented 5 years ago

I'm running into the same issue with my own analysis right now! I'm hoping to compare just a subset of genes across several treatments, and have used subset_kallisto() to produce kallisto objects that include only my genes of interest. As @agropyron found, however, write_kallisto_hdf5() does NOT capture the "est_counts" information from the original kallisto object. (The hdf5 produced by write_kallisto_hdf5() lacks information in the "est_counts" dataset.)

I would love to avoid having to do all this subsetting manually. Any suggestions would be very welcome!

warrenmcg commented 5 years ago

Hi @agropyron and @laurabogar,

I think I found the offending line: https://github.com/pachterlab/sleuth/blob/24ca2859f13d1eac588d6dd2ac90bf860b4fe07e/R/kallisto.R#L121 This should be the following:

new_obj$abundance <- new_obj$abundance[which(new_obj$abundance$target_id %in% target_ids), ]

I can make a quick patch, but it's not clear when we will be able to get a new stable release out. I would encourage you to make the change yourself, and work with a modified sleuth package. The best way to do it is to clone the repo, make the change, then pull up R and then do the following line:

git clone https://github.com/pachterlab/sleuth.git
## make the change to sleuth/R/kallisto.R using whatever text editor you use
## for example: `emacs sleuth/R/kallisto.R`
## then load up R, and install the package in a separate location from your main library
## my favorite testing library directory is "~/test_R_lib", but it can be any directory
## just make sure the directory exists before running this line of code
R-session:> install.packages("sleuth/", repos=NULL, type="source", lib = "/path/to/alt/library")

Then, you would install the modified sleuth like this:

library(sleuth, lib.loc = "/path/to/alt/library")

It should work then. If you continue to have problems (or if you have problems with doing the above instructions), please report back! Thanks!

pachterlab / sleuth

subsetting kallisto object using subset_kallisto() #204