pachterlab / kallisto

Near-optimal RNA-Seq quantification
https://pachterlab.github.io/kallisto
BSD 2-Clause "Simplified" License
648 stars 171 forks source link

hdf5 bootstrap representation #44

Open mtmorgan opened 9 years ago

mtmorgan commented 9 years ago

Bootstrap names left-padded with zeros bs00001, etc, have a natural sort order, currently these sort as bs0, bs1, bs10, bs11, ..., bs2, bs20, ...

/bootstrap as a 2-dimensional array would be more readily subset, especially by id. This would seem to be a common use case.

ttriche commented 9 years ago

that's cool but if you do

hdf5 <- paste0(path.expand(path), "/", h5file) bootstraps <- h5read(hdf5, "aux/num_bootstrap")

if bootstraps are found, summarize them...

if (bootstraps > 0) { message("Found ", bootstraps, " bootstraps for ", path, ", summarizing...")

grab a column vector for each bootstrap, then bind them all into a

Matrix bsidx <- paste0("bootstrap/bs", seq(0, bootstraps - 1)) boots <- do.call(cbind, lapply(bsidx, function(bs) h5read(hdf5, bs))) res <- Matrix(cbind(rowMedians(boots), rowMads(boots), h5read(hdf5, "aux/eff_lengths")), dimnames=list(transcript=h5read(hdf5, "aux/ids"), c("est_count", "eff_length", "est_count_mad"))) } else { message("No bootstraps found for ", path, ", using recorded est_counts...") res <- Matrix(cbind(h5read(hdf5, "est_counts"), h5read(hdf5, "aux/eff_lengths")), dimnames=list(transcript=h5read(hdf5, "aux/ids"), c("est_count", "eff_length"))) }

it's pretty easy to handle them the way they are

Also if you plot a few of the bootstraps their error typically (not always) ends up looking Gaussian at most detectable thresholds. Sometimes there are obvious multimodal distributions, perhaps transcript-level mclust would usefully inform downstream analysis. Anyways... it is not so hard as-is

Statistics is the grammar of science. Karl Pearson http://en.wikipedia.org/wiki/The_Grammar_of_Science

On Tue, May 19, 2015 at 3:33 PM, Martin Morgan notifications@github.com wrote:

Bootstrap names left-padded with zeros bs00001, etc, have a natural sort order, currently these sort as bs1, bs10, bs11, ..., bs2, bs20, ...

/bootstrap as a 2-dimensional array would be more readily subset, especially by id. This would seem to be a common use case.

— Reply to this email directly or view it on GitHub https://github.com/pachterlab/kallisto/issues/44.

mtmorgan commented 9 years ago

simplify2array(h5read("foo.h5", "/bootstrap")) imports all bootstraps as a matrix so no need to iterate through (see readBootstrap); one is interested in subsets when managing larger data.

pimentel commented 9 years ago

Hi guys,

Thanks for the input.

@mtmorgan I don't see a super compelling reason (other than it being more visually appealing) to use the left padded numbers, and as @ttriche this is why we store num_bootstrap in aux. Is there another reason other than being more visually appealing when sorted?

When we were developing kallisto, I did some non-comprehensive testing and there was little to no difference in storing them in vectors vs storing them in a huge matrix and reading into R. The existing C code is much simpler as currently written.

Which direction do you want to subset the HDF5 in (e.g. by transcript or bootstrap round)?

BTW, I'm actually on vacation right now, but when I return next week I'll post the code we have for reading bootstraps and aggregating kallisto results (this is all part of the sleuth project).

mtmorgan commented 9 years ago

I guess I'm just scared by having my chromosomes sort chr1, chr10, instead of chr1, chr2, ... One can easily imagine thinking that the results of the third bootstrap were interesting, but then confusing bs2 with bs3 with bs10, all of which are 'third' in some sense (if the main consumer of these files is envisioned to be R, then it would also make sense to number from 1 rather than 0).

I think one would often access by bootstrap for some purposes (e.g., normalization-like activities) and by transcript for others (e.g., drilling down on 'interesting' results); this two-dimensional access and the homogeneity of data type makes one think of a matrix, rather than list-of-vectors. It also seems less error-prone to rely on existing software to subset an hdf5 matrix in hdf5, rather than to come up with ad hoc solutions (like here) for sub-setting a list-of-vectors in a client language. A matrix would also make padding bootstrap ids moot.

Obviously these are not earth-shaking features, so please ignore if you like...