Open multimeric opened 2 years ago
The basic tools for working with HSDS are provided in the rhdf5client package. Install it in R 4.2 and then use
library(rhdf5client)
example(HSDSArray)
to see
HSDSAr> if (check_hsds()) {
HSDSAr+ HSDSArray(URL_hsds(),
HSDSAr+ "hsds", "/shared/bioconductor/darmgcls.h5", "/assay001")
HSDSAr+ }
<65218 x 3584> matrix of class HSDSMatrix and type "double":
[,1] [,2] [,3] ... [,3583] [,3584]
[1,] 0.000000 0.000000 112.394374 . 0.00000 0.00000
[2,] 0.000000 0.000000 0.000000 . 0.00000 0.00000
[3,] 0.000000 0.000000 0.000000 . 0.00000 0.00000
[4,] 5.335452 11.685833 0.000000 . 0.00000 14.01612
[5,] 0.000000 0.000000 0.000000 . 0.00000 0.00000
... . . . . . .
[65214,] 0.00000 0.00000 0.00000 . 0.00000 0.00000
[65215,] 480.68946 1228.13851 112.75566 . 0.00000 0.00000
[65216,] 0.00000 0.00000 0.00000 . 0.00000 0.00000
[65217,] 0.00000 610.82997 46.86639 . 0.00000 0.00000
[65218,] 10155.80336 25366.30099 2068.63983 . 4.01555 2531.88862
This works because darmgcls.h5 has a matrix in the HDF5 dataset "/assay001". We have not found many users of this package; the Python h5pyd library could also be interfaced to R via reticulate. If you would like to discuss the rhdf5client and restfulSE packages further, we can arrange a call.
I see, so we can query remote HDF5 files using rhdf5client, that's a helpful starting point. We were hoping to attach remote HDF5 files as assays within a SingleCellExperiment. Is that a goal for this package? How far away from this goal is it?
I think it is not far off at all. Let me see if I can get it going with a public example and I will get back to you.
Here is an approach.
The data from example(SingleCellExperiment) (the object produced by the final coercion in the example) were extracted and placed in HDF5 using HDF5Array::saveHDF5SummarizedExperiment. That HDF5 file was placed in our HSDS on ACCESS Jetstream2 using hsload, and hsacl was set to allow default user read access.
Now
> library(rhdf5client)
> arr = HSDSArray("http://hsdsdev.bioconductor.org",
+ "hsds", "/shared/bioconductor/litsc.h5", "/assay001")
>
> arr
<200 x 100> matrix of class HSDSMatrix and type "double":
[,1] [,2] [,3] ... [,99] [,100]
[1,] 3 6 3 . 4 7
[2,] 6 2 7 . 7 3
[3,] 4 6 7 . 5 4
[4,] 5 7 4 . 3 10
[5,] 6 8 5 . 13 2
... . . . . . .
[196,] 4 5 6 . 7 6
[197,] 3 6 5 . 3 5
[198,] 3 3 2 . 9 5
[199,] 4 5 4 . 1 7
[200,] 3 6 5 . 5 3
should work for you. Send the error message if it does not; you just need a working installation of rhdf5client in R.
Then you place a reference to this as a component of the assays element of a SingleCellExperiment.
> mysce = as(se, "SingleCellExperiment")
> assays(mysce) = SimpleList(counts=arr)
> mysce
class: SingleCellExperiment
dim: 200 100
metadata(0):
assays(1): counts
rownames: NULL
rowData names(0):
colnames: NULL
colData names(0):
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
> assay(mysce)
<200 x 100> matrix of class HSDSMatrix and type "double":
[,1] [,2] [,3] ... [,99] [,100]
[1,] 3 6 3 . 4 7
[2,] 6 2 7 . 7 3
[3,] 4 6 7 . 5 4
[4,] 5 7 4 . 3 10
[5,] 6 8 5 . 13 2
... . . . . . .
[196,] 4 5 6 . 7 6
[197,] 3 6 5 . 3 5
[198,] 3 3 2 . 9 5
[199,] 4 5 4 . 1 7
[200,] 3 6 5 . 5 3
>
@multimeric ^^
Thanks Vince, this looks very promising. Out of interest, can you explain the behaviour of HSDSMatrix in terms of caching data? We were thinking that there might be some risk of requesting the same HDF5 data multiple times from the server unless there is a local caching mechanism.
I think there is caching at the server. I am tagging @jreadey, who is the lead HSDS developer. If you want to have caching on the R side, that could be accomplished in various ways, but we have not addressed this. Have you used HSDS with Python? Is there an example of a caching discipline there, or with other cloud numerical store strategies like tensorstore?
Unfortunately not, I'm just exploring HSDS for the first time as a solution to the problem of having remote HDF5 data. Out of interest, should I be asking this question in the rhdf5client repo? Does restfulSE relate to this use case, in the end?
hi @multimeric - about caching: h5pyd does cache all the file metadata by default. So you'll notice that after opening a file with h5pyd.File("/myfolder/myfile.h5"), all calls for attributes, links, etc. are very fast, since the client doesn't need to go back to the server.
We haven't implemented a cache for dataset data - partly because we haven't gotten around to it, partly because it seemed less useful (many apps will just visit a given selection once).
HSDS has both a metadata and a chunk cache that is shared by all clients. This makes it possible to avoid reading from storage repeatedly. Using AWS S3, I generally see a 2x speed up if the chunk data can be found in the cache.
Curious, will your client be running in the same datacenter as HSDS or remotely? For the former, bandwidth should be high enough that you'll get good performance even without a local data cache.
Thanks, that's very helpful!
h5pyd does cache all the file metadata by default.
Does this also apply to the R library? Or do you mean that R is using h5pyd behind the scenes?
We haven't implemented a cache for dataset data - partly because we haven't gotten around to it, partly because it seemed less useful (many apps will just visit a given selection once).
This might be something I can try to help with, if the library otherwise fits my use case.
Curious, will your client be running in the same datacenter as HSDS or remotely?
We're currently thinking the R client will be remote, while the HSDS server would be in the cloud. So caching would be important.
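For illustration only, here is a minimal sketch of what an R-side cache for dataset blocks could look like. Nothing here is part of rhdf5client: make_cached_fetch is a hypothetical helper, and the "remote" array is a local matrix stand-in so the sketch runs offline. It assumes reads can be expressed as a function of row and column indices.

```r
# Hypothetical helper (not part of rhdf5client): wrap any block-fetching
# function(rows, cols) with an in-memory cache keyed on the selection.
make_cached_fetch <- function(fetch) {
  cache <- new.env(parent = emptyenv())
  hits <- 0L
  misses <- 0L
  cached <- function(rows, cols) {
    # Key built from the requested indices; for very large selections a
    # hash of the indices would be a better key.
    key <- paste(paste(rows, collapse = ","),
                 paste(cols, collapse = ","), sep = "|")
    if (exists(key, envir = cache, inherits = FALSE)) {
      hits <<- hits + 1L
      return(get(key, envir = cache, inherits = FALSE))
    }
    misses <<- misses + 1L
    block <- fetch(rows, cols)
    assign(key, block, envir = cache)
    block
  }
  list(fetch = cached,
       stats = function() c(hits = hits, misses = misses))
}

# Stand-in for a remote array so the sketch runs without a server.
# Assumption: with an HSDSArray `arr`, the fetch function might instead be
#   function(rows, cols) as.matrix(arr[rows, cols])
local_data <- matrix(seq_len(200 * 100), nrow = 200, ncol = 100)
remote_fetch <- function(rows, cols) local_data[rows, cols, drop = FALSE]

cf <- make_cached_fetch(remote_fetch)
b1 <- cf$fetch(1:5, 1:3)  # first request: miss, calls remote_fetch
b2 <- cf$fetch(1:5, 1:3)  # same selection again: served from the cache
cf$stats()
```

This cache is unbounded and only recognises exact repeats of a selection; a real implementation would probably need eviction and some way to reuse overlapping blocks.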
Not sure about R - @vjcitn will need to weigh in.
I'd suggest doing some benchmarking to see what the baseline performance is. I can advise on adding h5pyd caching if that is indicated.
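A baseline measurement of this kind could be sketched in R roughly as follows. The helper time_reads is hypothetical, and fetch_block reads from a local matrix purely so the example runs offline; against a real server it would be replaced by a function that reads the same selection from the remote array.

```r
# Hypothetical helper: time `n` repeated reads of the same selection and
# return the elapsed wall-clock seconds for each read.
time_reads <- function(fetch, rows, cols, n = 5L) {
  vapply(seq_len(n), function(i) {
    system.time(fetch(rows, cols))[["elapsed"]]
  }, numeric(1))
}

# Local stand-in for a remote read (assumption: a real test would fetch
# the block from the HSDS-backed array instead).
local_data <- matrix(runif(200 * 100), nrow = 200, ncol = 100)
fetch_block <- function(rows, cols) local_data[rows, cols, drop = FALSE]

elapsed <- time_reads(fetch_block, 1:50, 1:20, n = 5L)
summary(elapsed)
```

Comparing the first read with later reads of the same selection gives a rough idea of how much any server-side chunk cache helps, and whether a client-side cache would be worth adding.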
At this time we do not interface to h5pyd to use HSDS. We build the GET requests directly in rhdf5client. There have been many improvements to R:python interfacing since we started rhdf5client, so we will take a look at this ASAP. BTW the default mode of transfer with rhdf5client is JSON. It should be binary, but the developer left before that could be corrected. I may update rhdf5client to address this soon and will notify you. And maybe we should continue the discussion at https://github.com/vjcitn/rhdf5client
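To make "build the GET requests directly" concrete, here is a rough sketch of how a value-request URL for a block of a dataset might be assembled. This is not rhdf5client's actual code: hsds_value_url is a hypothetical helper, the dataset id is a placeholder, and the URL layout (/datasets/<id>/value with domain and select query parameters, zero-based half-open ranges) is an assumption based on the HSDS REST API as I understand it.

```r
# Hypothetical helper, not part of rhdf5client. `rows` and `cols` are
# length-2 vectors giving 1-based inclusive start and end indices.
hsds_value_url <- function(endpoint, domain, dataset_id, rows, cols) {
  stopifnot(length(rows) == 2L, length(cols) == 2L)
  rows <- as.integer(rows)
  cols <- as.integer(cols)
  # Assumption: the server expects 0-based, half-open ranges, so the
  # R range 1..10 becomes 0:10.
  select <- sprintf("[%d:%d,%d:%d]",
                    rows[1] - 1L, rows[2], cols[1] - 1L, cols[2])
  paste0(endpoint, "/datasets/", dataset_id, "/value",
         "?domain=", utils::URLencode(domain, reserved = TRUE),
         "&select=", utils::URLencode(select, reserved = TRUE))
}

url <- hsds_value_url("http://hsdsdev.bioconductor.org",
                      "/shared/bioconductor/litsc.h5",
                      "d-PLACEHOLDER",  # placeholder, not a real dataset id
                      rows = c(1L, 10L), cols = c(1L, 5L))
utils::URLdecode(url)
```

As far as I know, whether the response comes back as JSON or binary is decided by the request's Accept header (application/json versus application/octet-stream), so switching the transfer mode would be a change to the headers rather than to the URL.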
And maybe we should continue the discussion at https://github.com/vjcitn/rhdf5client
Can you please enable github issues on that repo so that we can do so?
done
The RESTfulSummarizedExperiment class seems to have a hidden constructor, and neither the class nor its methods are documented in the manual. How then are users supposed to use it with an arbitrary HDF5 server? Is this functionality still incomplete?
My use case is that I want to be able to host a server (using HSDS or otherwise) that will serve HDF5 data, and connect it with an R client.