vjcitn / restfulSE

`RESTfulSummarizedExperiment` usage #1

Open multimeric opened 2 years ago

multimeric commented 2 years ago

The RESTfulSummarizedExperiment class seems to have a hidden constructor, and neither the class nor its methods are documented in the manual. How then are users supposed to use it with an arbitrary HDF5 server? Is this functionality still incomplete?

My use case is that I want to be able to host a server (using HSDS or otherwise) that will serve HDF5 data, and connect it with an R client.

vjcitn commented 2 years ago

The basic tools for working with HSDS are provided in the rhdf5client package. Install it in R 4.2 and then use

library(rhdf5client)
example(HSDSArray)

to see

HSDSAr> if (check_hsds()) {
HSDSAr+  HSDSArray(URL_hsds(), 
HSDSAr+     "hsds", "/shared/bioconductor/darmgcls.h5", "/assay001")
HSDSAr+ }
<65218 x 3584> matrix of class HSDSMatrix and type "double":
                [,1]        [,2]        [,3] ...    [,3583]    [,3584]
    [1,]    0.000000    0.000000  112.394374   .    0.00000    0.00000
    [2,]    0.000000    0.000000    0.000000   .    0.00000    0.00000
    [3,]    0.000000    0.000000    0.000000   .    0.00000    0.00000
    [4,]    5.335452   11.685833    0.000000   .    0.00000   14.01612
    [5,]    0.000000    0.000000    0.000000   .    0.00000    0.00000
     ...           .           .           .   .          .          .
[65214,]     0.00000     0.00000     0.00000   .    0.00000    0.00000
[65215,]   480.68946  1228.13851   112.75566   .    0.00000    0.00000
[65216,]     0.00000     0.00000     0.00000   .    0.00000    0.00000
[65217,]     0.00000   610.82997    46.86639   .    0.00000    0.00000
[65218,] 10155.80336 25366.30099  2068.63983   .    4.01555 2531.88862

This works because darmgcls.h5 contains a matrix in the HDF5 dataset "assay001". We have not found many users of this package; the Python h5pyd library could also be interfaced to R via reticulate. If you would like to discuss the rhdf5client and restfulSE packages further, we can arrange a call.

multimeric commented 2 years ago

I see, so we can query remote HDF5 files using rhdf5client, that's a helpful starting point. We were hoping to attach remote HDF5 files as assays within a SingleCellExperiment. Is that a goal for this package? How far away from this goal is it?

vjcitn commented 2 years ago

I think it is not far off at all. Let me see if I can get it going with a public example and I will get back to you.

vjcitn commented 2 years ago

Here is an approach.

The data from example(SingleCellExperiment) (the object produced by the final coercion in the example) were extracted and written to HDF5 using HDF5Array::saveHDF5SummarizedExperiment. That HDF5 file was loaded into our HSDS instance on ACCESS Jetstream2 using hsload, and hsacl was used to give the default user read access.
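The export step described above might look roughly like the following sketch. It uses a small simulated count matrix rather than the actual example object, and `out_dir` is a hypothetical location. The `hsload` and `hsacl` steps are command-line operations and are not shown.

```r
library(SingleCellExperiment)
library(HDF5Array)

# Simulated stand-in for the example data (200 features x 100 cells).
set.seed(1)
counts <- matrix(rpois(200 * 100, lambda = 5), nrow = 200, ncol = 100)
sce <- SingleCellExperiment(assays = list(counts = counts))

# Write the assays to HDF5; out_dir is a hypothetical location.
out_dir <- file.path(tempdir(), "litsc_h5")
saveHDF5SummarizedExperiment(sce, dir = out_dir, replace = TRUE)

# The directory should now hold assays.h5 (the HDF5 file) and se.rds.
list.files(out_dir)

# Inspect the dataset names; the first assay is expected to be "assay001".
rhdf5::h5ls(file.path(out_dir, "assays.h5"))
```

The file `assays.h5` is the one that would then be uploaded to the server.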

Now

>   library(rhdf5client)
>   arr = HSDSArray("http://hsdsdev.bioconductor.org",
+        "hsds", "/shared/bioconductor/litsc.h5", "/assay001")
> 
> arr
<200 x 100> matrix of class HSDSMatrix and type "double":
         [,1]   [,2]   [,3] ...  [,99] [,100]
  [1,]      3      6      3   .      4      7
  [2,]      6      2      7   .      7      3
  [3,]      4      6      7   .      5      4
  [4,]      5      7      4   .      3     10
  [5,]      6      8      5   .     13      2
   ...      .      .      .   .      .      .
[196,]      4      5      6   .      7      6
[197,]      3      6      5   .      3      5
[198,]      3      3      2   .      9      5
[199,]      4      5      4   .      1      7
[200,]      3      6      5   .      5      3

should work for you. Send the error message if it does not; all you need is a working installation of rhdf5client in R.

Then you place a reference to this as a component of the assays element of a SingleCellExperiment.

> mysce = as(se, "SingleCellExperiment")
> assays(mysce) = SimpleList(counts=arr)
> mysce
class: SingleCellExperiment 
dim: 200 100 
metadata(0):
assays(1): counts
rownames: NULL
rowData names(0):
colnames: NULL
colData names(0):
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
> assay(mysce)
<200 x 100> matrix of class HSDSMatrix and type "double":
         [,1]   [,2]   [,3] ...  [,99] [,100]
  [1,]      3      6      3   .      4      7
  [2,]      6      2      7   .      7      3
  [3,]      4      6      7   .      5      4
  [4,]      5      7      4   .      3     10
  [5,]      6      8      5   .     13      2
   ...      .      .      .   .      .      .
[196,]      4      5      6   .      7      6
[197,]      3      6      5   .      3      5
[198,]      3      3      2   .      9      5
[199,]      4      5      4   .      1      7
[200,]      3      6      5   .      5      3
> 
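The coercion above relies on an existing object `se`, which is not defined in the snippet. As an alternative sketch, a SingleCellExperiment can be built directly from the array-like object. Here `sce_from_assay` is a hypothetical helper, the generated row and column names are placeholders, and a local matrix stands in for the remote HSDSArray so the example runs without a server.

```r
library(SingleCellExperiment)

# Hypothetical helper: wrap any matrix-like object (an in-memory matrix or a
# DelayedArray such as an HSDSArray) as the "counts" assay of a new object.
sce_from_assay <- function(arr) {
  sce <- SingleCellExperiment(assays = list(counts = arr))
  # Placeholder dimnames; real feature and cell identifiers would go here.
  rownames(sce) <- paste0("gene", seq_len(nrow(sce)))
  colnames(sce) <- paste0("cell", seq_len(ncol(sce)))
  sce
}

# Local stand-in with the same shape as the remote array in this thread.
local_counts <- matrix(rpois(200 * 100, lambda = 5), nrow = 200)
mysce2 <- sce_from_assay(local_counts)
dim(mysce2)
```

Passing the `arr` object created earlier instead of `local_counts` should give an object whose assay stays on the server until it is accessed.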

vjcitn commented 2 years ago

@multimeric ^^

multimeric commented 2 years ago

Thanks Vince, this looks very promising. Out of interest, can you explain the behaviour of HSDSMatrix in terms of caching data? We were thinking that there might be some risk of requesting the same HDF5 data multiple times from the server unless there is a local caching mechanism.

vjcitn commented 2 years ago

I think there is caching at the server. I am tagging @jreadey, who is the lead HSDS developer. If you want caching on the R side, that could be accomplished in various ways, but we have not addressed this. Have you used HSDS with Python? Is there an example of a caching discipline there, or with other cloud numerical store strategies like tensorstore?
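One of the "various ways" to cache on the R side could be a small in-memory cache keyed on the requested indices. The sketch below is an illustration only, not something rhdf5client provides; `make_cached_reader`, `slow_fetch`, and `read_block` are hypothetical names, and a local matrix with a fetch counter stands in for the remote array.

```r
# Hypothetical helper: returns a function that reads blocks through a cache.
# `fetch` is any function(rows, cols) that returns a matrix, e.g.
# function(rows, cols) as.matrix(arr[rows, cols]) for a remote array `arr`.
make_cached_reader <- function(fetch) {
  cache <- new.env(parent = emptyenv())
  function(rows, cols) {
    # Key the cache on the exact index vectors requested.
    key <- paste(paste(rows, collapse = ","),
                 paste(cols, collapse = ","), sep = "|")
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, fetch(rows, cols), envir = cache)
    }
    get(key, envir = cache, inherits = FALSE)
  }
}

# Local stand-in for a remote array, with a counter that records how often
# the underlying source is actually read.
m <- matrix(seq_len(20), nrow = 4)
n_fetches <- 0
slow_fetch <- function(rows, cols) {
  n_fetches <<- n_fetches + 1
  m[rows, cols, drop = FALSE]
}

read_block <- make_cached_reader(slow_fetch)
b1 <- read_block(1:2, 1:3)
b2 <- read_block(1:2, 1:3)  # same request again: served from the cache
n_fetches
```

This cache is unbounded and only helps when exactly the same selection is requested again; a real implementation would probably need an eviction policy and a shorter key (for example a hash) for long index vectors.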

multimeric commented 2 years ago

Unfortunately not, I'm just exploring HSDS for the first time as a solution to the problem of having remote HDF5 data. Out of interest, should I be asking this question in the rhdf5client repo? Does restfulSE relate to this use case, in the end?

jreadey commented 2 years ago

Hi @multimeric - about caching: h5pyd caches all the file metadata by default. So you'll notice that after opening a file with h5pyd.File("/myfolder/myfile.h5"), all calls for attributes, links, etc. are very fast, since the client doesn't need to go back to the server.

We haven't implemented a cache for dataset data - partly because we haven't gotten around to it, and partly because it seemed less useful (many apps will just visit a given selection once).

HSDS has both a metadata and a chunk cache that is shared by all clients. This makes it possible to avoid reading from storage repeatedly. Using AWS S3, I generally see a 2x speed up if the chunk data can be found in the cache.

Curious, will your client be running in the same datacenter as HSDS or remotely? For the former, bandwidth should be high enough that you'll get good performance even without a local data cache.

multimeric commented 2 years ago

Thanks, that's very helpful!

> h5pyd does cache all the file metadata by default.

Does this also apply to the R library? Or do you mean that R is using h5pyd behind the scenes?

> We haven't implemented a cache for dataset data - partly because we haven't gotten around to it, partly it seemed less useful (many apps will just visit a given selection once).

This might be something I can try to help with, if the library otherwise fits my use case.

> Curious, will your client be running in the same datacenter as HSDS or remotely?

We're currently thinking the R client will be remote, while the HSDS server would be in the cloud. So caching would be important.

jreadey commented 2 years ago

Not sure about R - @vjcitn will need to weigh in.

I'd suggest doing some benchmarking to see what the baseline performance is. I can advise on adding h5pyd caching if that is indicated.
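A baseline measurement of this kind could be sketched in R as follows. `time_reads` is a hypothetical helper, and a local matrix stands in for the remote array so the example runs without a server; against HSDS, the read function would pull a block from the remote array instead.

```r
# Hypothetical helper: time `n` repeated calls of a zero-argument read
# function and return the elapsed seconds of each call.
time_reads <- function(read_fn, n = 5) {
  vapply(seq_len(n), function(i) {
    unname(system.time(read_fn())["elapsed"])
  }, numeric(1))
}

# Local stand-in; for a remote array `arr` one might instead pass
# function() as.matrix(arr[1:100, 1:50]).
m <- matrix(runif(1e4), nrow = 100)
elapsed <- time_reads(function() m[1:100, 1:50], n = 5)
summary(elapsed)
```

Comparing the first call with later calls for the same selection would give a rough idea of how much server-side caching helps and whether a client-side cache is worth adding.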

vjcitn commented 2 years ago

At this time we do not interface to h5pyd to use HSDS; we build the GET requests directly in rhdf5client. There have been many improvements to R-Python interfacing since we started rhdf5client, so we will take a look at this ASAP. BTW, the default mode of transfer with rhdf5client is JSON. It should be binary, but the developer left before that could be corrected. I may update rhdf5client to address this soon and will let you know. And maybe we should continue the discussion at https://github.com/vjcitn/rhdf5client
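To illustrate why binary transfer would be preferable to JSON for numeric data, the sketch below compares the size of the same values as raw bytes and as JSON-like text. It is a base R illustration only; it does not reproduce the encoding rhdf5client or HSDS actually use, and the 15-significant-digit text format is an assumption.

```r
set.seed(1)
x <- runif(1000)

# Binary: each double occupies 8 bytes.
bin <- writeBin(x, raw())
length(bin)

# JSON-like text: values printed with 15 significant digits, comma-separated.
json_txt <- paste0(
  "[", paste(formatC(x, digits = 15, format = "g"), collapse = ","), "]"
)
nchar(json_txt, type = "bytes")

# The binary payload decodes back to the original values exactly.
identical(readBin(bin, what = "double", n = length(x)), x)
```

The text form is larger and also has to be parsed on the client, which adds to the cost of each request.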

multimeric commented 2 years ago

> And maybe we should continue the discussion at https://github.com/vjcitn/rhdf5client

Can you please enable GitHub Issues on that repo so that we can do so?

vjcitn commented 2 years ago

done