hansenlab / minfi

Devel repository for minfi

HDF5: Setup HDF5 testing #204

Open kasperdanielhansen opened 4 years ago

kasperdanielhansen commented 4 years ago

We need to test our HDF5 code. An important part is to assess scalability, which we will do by running the code with different numbers of samples and examining runtime as a function of sample count.

We need a Google Drive directory (with subdirectories) containing

  1. (say) IDAT files for 50 samples
  2. An R script converting the 50 samples into RGChannelSets stored in HDF5. There should be multiple files representing 5, 10, 15, 20, and 50 samples (5 and 10 are important for quick prototyping). The sample size and chunk dimensions should be encoded in the file name.
  3. An R script restoring (loading) the files into an R session

Possible file name convention: mData5_cdim10x100 (5 samples, chunkdim 10x100)

Useful functions: saveHDF5SummarizedExperiment (and loadHDF5SummarizedExperiment) from the HDF5Array package.
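A minimal sketch of what the save/load scripts might look like, assuming the 50 IDATs have already been read into an in-core RGChannelSet `rg` (e.g. via minfi::read.metharray.exp()); the subsetting, chunk dimensions, and directory name here are illustrative:

```r
library(HDF5Array)

n <- 5                 # number of samples for this file
cdim <- c(10, 100)     # chunk dimensions (rows x columns)

# Encode sample size and chunkdim in the directory name, per the
# proposed convention: mData5_cdim10x100
dir <- sprintf("mData%d_cdim%dx%d", n, cdim[1], cdim[2])

# Write the first n samples as an HDF5-backed SummarizedExperiment.
saveHDF5SummarizedExperiment(rg[, seq_len(n)], dir = dir,
                             chunkdim = cdim, replace = TRUE)

# Restore in a fresh session; assays come back as HDF5-backed
# DelayedArray objects rather than in-memory matrices.
rg_h5 <- loadHDF5SummarizedExperiment(dir)
```

For the timing comparison, wrapping the load (or a downstream operation) in system.time() for each of the 5/10/15/20/50-sample directories gives runtime as a function of sample count.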

ttriche commented 4 years ago

I’ll do it (or get someone on my team to)

kasperdanielhansen commented 4 years ago

You're now assigned :)

ttriche commented 4 years ago

see https://github.com/trichelab/h5testR for implementations of the above using minfiData and TARGET pAML IDATs for 450K and EPIC arrays respectively. The latter scale up to 500 arrays.

ttriche commented 4 years ago

Note that the h5testR examples explicitly load in-core and out-of-core RGsets of whichever IDATs are requested, then save the HDF5 version, overwrite its symbol with a version loaded back from that save, and use verifyRGsets to test that a chunk of the values in the per-channel matrices is identical to the corresponding in-core representation. Part of the motivation for this is to extend the test to restfulSE representations that live in Amazon AWS-backed HSDS stores.
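The chunk-verification step described above could be sketched roughly as follows; this is an illustrative stand-in for verifyRGsets() in trichelab/h5testR, assuming an in-core RGChannelSet `rg` and its reloaded HDF5-backed counterpart `rg_h5`:

```r
library(minfi)

# Compare a chunk of the Green and Red channel matrices between the
# in-core and HDF5-backed objects. as.matrix() realizes just the
# requested chunk of the DelayedArray in memory for the comparison.
checkChunk <- function(rg, rg_h5,
                       rows = seq_len(min(100, nrow(rg))),
                       cols = seq_len(ncol(rg))) {
  green_ok <- all(getGreen(rg)[rows, cols] ==
                  as.matrix(getGreen(rg_h5)[rows, cols]))
  red_ok   <- all(getRed(rg)[rows, cols] ==
                  as.matrix(getRed(rg_h5)[rows, cols]))
  green_ok && red_ok
}

stopifnot(checkChunk(rg, rg_h5))
```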

ttriche commented 4 years ago

A link to the appropriate place in the Google Drive for this would be handy. minfi:::read.metharray2() has a quirk where it will fail if the directory for storing the HDF5 files does not exist (the OPPOSITE of saveHDF5SummarizedExperiment, where the save fails if the target directory DOES exist and replace=TRUE is not set).

Hence the wrapper functions read.methd5() and read.methdf5.sheet(). In practice, these should probably be called write.methdf5() and have a counterpart write.methRestful() (or some such).
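A hypothetical wrapper illustrating the workaround for the directory quirk; the argument names passed through to read.metharray2() are assumptions, not its documented interface (the actual wrappers are read.methd5() and read.methdf5.sheet() in h5testR):

```r
## Ensure the HDF5 target directory exists before calling
## minfi:::read.metharray2(), which fails when it does not.
read_metharray_h5 <- function(basenames, hdf5_dir, ...) {
  if (!dir.exists(hdf5_dir)) {
    # Create the directory (and parents) up front; this is the
    # opposite requirement from saveHDF5SummarizedExperiment().
    dir.create(hdf5_dir, recursive = TRUE)
  }
  minfi:::read.metharray2(basenames, dir = hdf5_dir, ...)
}
```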

kasperdanielhansen commented 4 years ago

That's a very useful comment, and it points out what I would call a bug in read.metharray2.

What I would love is a Google Drive folder with 50 IDATs and a set of summarized experiments. I get that the posted code makes this objective easier to accomplish, but it's not completely there.

ttriche commented 4 years ago

Ah, this is much easier. I'll grab the first 5, 10, 25, 50 IDATs from TARGET and drop them in the Google Drive along with their in- and out-of-core RGsets. Easy enough.

As a bonus, all four of those can be linked with "holes" against their RNA-seq data to demonstrate the issue with MultiAssayExperiment objects backed by HDF5.
