rformassspectrometry / Spectra

Low level infrastructure to handle MS spectra
https://rformassspectrometry.github.io/Spectra/

Evaluate rhdf5client to provide a remote HDF5 backend #84

Open sgibb opened 4 years ago

sgibb commented 4 years ago

https://bioconductor.org/packages/release/bioc/html/rhdf5client.html

lgatto commented 4 years ago

Here's a comment by Vince

the rhdf5client function HSDSArray is the key user-level interface ... there are too many symbols exported there

confirming our discussions earlier today.

There's also this paper, which demonstrates what Vince showed in his talk

https://f1000research.com/articles/8-21

Link to the HSDS Cloud-native, service based access to HDF data: https://github.com/HDFGroup/hsds

jorainer commented 4 years ago

Would be nice if we could put one of our hdf5 files to the s3 - so we could get started playing around with it...

jorainer commented 4 years ago

Could you eventually try to do that @lgatto ?

lgatto commented 4 years ago

@vjcitn - we would like to do some testing with rhdf5client to evaluate whether it could be used as a remote backend for MS data. The first step is to upload an h5 file containing some test MS data to an S3 bucket. Does this quick start guide provide the correct information for us to get started?

vjcitn commented 4 years ago

yes

On Wed, Dec 11, 2019 at 7:21 PM Laurent Gatto notifications@github.com wrote:

@vjcitn https://github.com/vjcitn - we would like to do some testing with rhdf5client to evaluate whether it could be used as a remote backend for MS data. The first step is to upload an h5 file containing some test MS data to an S3 bucket. Does this quick start guide https://docs.aws.amazon.com/quickstarts/latest/s3backup/welcome.html provide the correct information for us to get started?


vjcitn commented 4 years ago

put in us west region if possible


lgatto commented 4 years ago

@jorainer @sgibb - I have uploaded 2 files (the same one twice, actually - thought it would be good for testing).

Etag: e0f9ea269e326e4bdea01528772f52a2
Key: MS3TMT11_1.h5
Object URL: https://spectra-h5.s3-us-west-2.amazonaws.com/MS3TMT11_1.h5 
Etag: cfa531fcd1aff82c1d047cd819c4bb00
Key: MS3TMT11_2.h5
Object URL: https://spectra-h5.s3-us-west-2.amazonaws.com/MS3TMT11_2.h5 

These were created using

> msdata::proteomics(full.names = TRUE)[3]
[1] "/home/lgatto/R/x86_64-pc-linux-gnu-library/3.6/msdata/proteomics/MS3TMT11.mzML"

and

> f <- dir("~/tmp/h5/", full.names = TRUE)
> f
[1] "/home/lgatto/tmp/h5//MS3TMT11_1.mzML"
[2] "/home/lgatto/tmp/h5//MS3TMT11_2.mzML"
> sp <- Spectra(f, backend = MsBackendMzR())
> sp
MSn data (Spectra) with 1988 spectra in a MsBackendMzR backend:
       msLevel            rtime scanIndex
     <integer>        <numeric> <integer>
1            1   2727.070735167         1
2            2 2727.09062595102         2
3            2 2727.17784358302         3
4            3   2727.345348687         4
5            2 2727.35571684702         5
...        ...              ...       ...
1984         3   2825.606527104       990
1985         3 2825.75277308802       991
1986         3 2825.90589132798       992
1987         3     2826.0396672       993
1988         3 2826.17951239998       994
 ... 31 more variables/columns.

file(s):
MS3TMT11_1.mzML
MS3TMT11_2.mzML
Processing:

> sp2 <- setBackend(sp, MsBackendHdf5Peaks())

and I then uploaded the two h5 files.
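For anyone wanting to repeat the test, the object URLs listed above follow a bucket/region/key pattern. Below is a minimal base R sketch that rebuilds those URLs and, optionally, fetches local copies. The helper name `s3_object_url` is made up for illustration, and the download step is switched off by default because it needs network access and assumes the bucket is still publicly readable.

```r
# Hypothetical helper (not part of Spectra or rhdf5client): build the public
# object URL for one or more keys in an S3 bucket, following the pattern of
# the object URLs listed above.
s3_object_url <- function(bucket, region, key) {
    sprintf("https://%s.s3-%s.amazonaws.com/%s", bucket, region, key)
}

keys <- c("MS3TMT11_1.h5", "MS3TMT11_2.h5")
urls <- s3_object_url("spectra-h5", "us-west-2", keys)
print(urls)

# Optional: download local copies. Disabled by default because it requires
# network access and a publicly readable bucket (an assumption).
fetch <- FALSE
if (fetch) {
    dest <- file.path(tempdir(), keys)
    for (i in seq_along(urls))
        download.file(urls[i], destfile = dest[i], mode = "wb")
}
```

Setting `fetch <- TRUE` writes the files to the session's temporary directory in binary mode, which matters for HDF5 files on Windows.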

vjcitn commented 4 years ago

with current rhdf5client, here is an example, but see caveats below

HSDSArray(URL_hsds(), "hsds", "/home/stvjc/MS3TMT11_1.h5", "/spectra/990")

<214 x 2> matrix of class HSDSMatrix and type "double":
             [,1]       [,2]
  [1,]   100.0966  7035.4971
  [2,]   100.1031  7186.1572
  [3,]   101.0709 18605.7852
  [4,]   101.1001  9168.8242
  [5,]   101.1061  4370.3242
   ...          .          .
[210,]   469.2802  3387.1150
[211,]   495.2637  1547.3596
[212,]   495.2878  1652.1604
[213,]   496.2704 54173.2305
[214,]   497.2739  6693.9058

For this to work, I had to load the HDF5 file (which I downloaded from your bucket) into my personal HSDS domain. Be sure "listObjects" permission is public for the S3 bucket that you set up. I will now start a thread with John Readey so that we can all get clear on how to best use HDF Cloud to work with your spectra.
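The same dataset path can be checked against a local HDF5 file. The sketch below assumes the Bioconductor package rhdf5 is installed. It writes a toy file that imitates the `/spectra/<index>` layout seen in the call above (the real files written by MsBackendHdf5Peaks may hold more than this) and reads one peak matrix back. The m/z and intensity values are copied from the first three rows of the output above.

```r
library(rhdf5)

# Toy stand-in for a peaks file: one two-column matrix (m/z, intensity) per
# spectrum, stored under spectra/<index>. This layout is an assumption taken
# from the dataset path in the HSDSArray call, not a format specification.
h5file <- tempfile(fileext = ".h5")
h5createFile(h5file)
h5createGroup(h5file, "spectra")

peaks <- cbind(mz = c(100.0966, 100.1031, 101.0709),
               intensity = c(7035.4971, 7186.1572, 18605.7852))
h5write(peaks, h5file, "spectra/990")

# List the file content, then read back the peak matrix of spectrum 990.
print(h5ls(h5file))
res <- h5read(h5file, "spectra/990")
print(dim(res))
```

With a downloaded copy of MS3TMT11_1.h5, replacing `h5file` with its path and skipping the write steps should allow the same `h5ls()` and `h5read()` calls, provided the file uses the layout assumed here.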

On Thu, Dec 12, 2019 at 2:12 AM Laurent Gatto notifications@github.com wrote:

@jorainer https://github.com/jorainer @sgibb https://github.com/sgibb - I have uploaded 2 files (the same one twice, actually - thought it would be good for testing).

Etag: e0f9ea269e326e4bdea01528772f52a2 Key: MS3TMT11_1.h5 Object URL: https://spectra-h5.s3-us-west-2.amazonaws.com/MS3TMT11_1.h5

Etag: cfa531fcd1aff82c1d047cd819c4bb00 Key: MS3TMT11_2.h5 Object URL: https://spectra-h5.s3-us-west-2.amazonaws.com/MS3TMT11_2.h5


lgatto commented 4 years ago

@vjcitn - thank you for your help and tests. This looks very promising. I have now set up List Objects for everyone.

jorainer commented 4 years ago

Thanks @vjcitn !

I'd suggest we should implement a new MsBackendHdf5client backend for this. We should now just ensure that we're not all working simultaneously on this to avoid redundant implementations. I've no time for this until next Thursday - so I guess one of you @lgatto or @sgibb will start implementing?

sgibb commented 4 years ago

I wouldn't have time till Monday. If I find time to start experimenting with it I would leave a message here.

lgatto commented 4 years ago

This isn't going to be the same type of backend as we've had so far. It requires an HSDS server (that we don't have write access to), the data must be uploaded beforehand, and we need to know the URL of the server and the names of the remote files in advance. I'd rather see this used for some pre-annotated/configured large datasets of interest.

Unless we can upload HDF5 files ourselves directly from the backend. That would be really cool, of course.

So I think we need to hear back from John (from the HDF Group), or consider setting up an HSDS server ourselves (if possible?).
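To make the constraint concrete: such a backend would have to be told, up front, where the server is and which remote files (HSDS domains) to read. Here is a minimal base R sketch of what that description could look like. The function name, the column names and the endpoint URL are placeholders invented for illustration, not an existing Spectra or rhdf5client interface. The first domain path is the one used in the test above; the second is made up.

```r
# Hypothetical description of pre-uploaded data for a remote backend.
# Validates the inputs and returns one row per remote file.
remote_resources <- function(endpoint, domains) {
    stopifnot(is.character(endpoint), length(endpoint) == 1L,
              grepl("^https?://", endpoint))
    stopifnot(is.character(domains), length(domains) > 0L,
              all(grepl("\\.h5$", domains)), !anyDuplicated(domains))
    data.frame(endpoint = endpoint,
               domain = domains,
               file = basename(domains),
               stringsAsFactors = FALSE)
}

res <- remote_resources("https://hsds.example.org",        # placeholder URL
                        c("/home/stvjc/MS3TMT11_1.h5",     # from the test above
                          "/home/stvjc/MS3TMT11_2.h5"))    # made-up second entry
print(res)
```

Keeping the server URL and the file names in a plain table like this would make it easy to ship a curated list of large datasets with a package, which matches the use case described above.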

vjcitn commented 4 years ago

On Fri, Dec 13, 2019 at 3:05 PM Laurent Gatto notifications@github.com wrote:

> This isn't going to be the same type of backend as we've had so far. It requires an HSDS server https://github.com/HDFGroup/hsds (that we don't have write access to), the data must be uploaded beforehand, and we need to know the URL of the server and the names of the remote files in advance. I'd rather see this used for some pre-annotated/configured large datasets of interest.

I am trying to get clarification on the options ... In principle, once the HDF5 data are in AWS S3 buckets, the Python or C APIs could interrogate them without an HSDS server. HSDS gives opportunities for multiplexed reads/writes. I don't know why John has not written back yet; he usually answers quickly. He could be on vacation.

> Unless we can upload HDF5 files ourselves directly from the backend. That would be really cool, of course.
>
> So I think we need to hear back from John (from the HDF Group), or consider setting up an HSDS server ourselves (if possible?).

HSDS is open source and can apparently be deployed on OpenStack. However, the open version will typically be a bit behind the subscription version. Thus far our use of HSDS has been subsidized by the HDF Group, and I think your initial proteomics resources can go into /shared/bioconductor once John is back online.


jorainer commented 4 years ago

For me it would be interesting to use the server locally on our cluster, so that I could access our 7000+ raw files. It seems HSDS needs the data to be in AWS S3 buckets, but there is also h5serv; I did not, however, understand whether it is similar (or the same). @vjcitn, do you know what the difference between HSDS and h5serv is?

jorainer commented 4 years ago

@sgibb and @lgatto, I have set up a MsBackendHSDS repository and will take a first shot at a possible implementation.

vjcitn commented 4 years ago

I wanted to let you know that John Readey has been in touch and is looking at the approaches to recommend.

On Tue, Dec 17, 2019 at 3:19 PM Johannes Rainer notifications@github.com wrote:

@sgibb https://github.com/sgibb and @lgatto https://github.com/lgatto, I setup a MsBackendHSDS https://github.com/rformassspectrometry/MsBackendHSDS repository and will give a first shot at a possible implementation.


vjcitn commented 4 years ago

here is a python gist showing how to manipulate HDF5 in S3

https://gist.github.com/jreadey/b6d0fff8f86e1c2292c729d3d7c8916e

Substitutions to work with the proteomics data should be straightforward ... more later

KilianMaes commented 3 years ago

Hello, I'll work on the package over the coming days/weeks in order to propose a functional implementation for the HSDS backend (@lgatto told me this feature was still relevant).

> For me it would be interesting to use the server locally on our cluster, so that I could access our 7000+ raw files. It seems HSDS needs the data to be in AWS S3 buckets, but there is also h5serv; I did not, however, understand whether it is similar (or the same). @vjcitn, do you know what the difference between HSDS and h5serv is?

I spent some time trying to understand the difference and found the answer in these slides by John Readey. Correct me if I'm wrong, but this is what I understand: h5serv was launched to demonstrate how the RESTful API works, and HSDS is its successor. HSDS is an "improved version" (load balancing, object storage...) but is compatible with the API tested in h5serv.

During my tests, I'll make sure that the backend works both with a local h5serv and with S3 (which supports HSDS).

Do you have any suggestions or remarks?

vjcitn commented 3 years ago

On Tue, Aug 25, 2020 at 9:53 AM Kilian Maes notifications@github.com wrote:

> Hello, I'll work on the package over the coming days/weeks in order to propose a functional implementation for the HSDS backend (@lgatto https://github.com/lgatto told me this feature was still relevant).
>
> > For me it would be interesting to use the server locally on our cluster, so that I could access our 7000+ raw files. It seems HSDS needs the data to be in AWS S3 buckets, but there is also h5serv https://github.com/HDFGroup/h5serv; I did not, however, understand whether it is similar (or the same). @vjcitn https://github.com/vjcitn, do you know what the difference between HSDS and h5serv is?

h5serv defines a locally deployable, Tornado-based RESTful API for queries to HDF5 files. I am not sure that it is actively maintained.

HSDS is a more general RESTful API. It should be deployable on any storage system satisfying the S3 object-store protocol. AWS S3 is just one convenient example. The Ceph object store, which should be available with OpenStack, is another example. You might follow up with John Readey directly.

> I spent some time trying to understand the difference and found the answer in these slides by John Readey https://www.hdfgroup.org/wp-content/uploads/2019/09/ServerBasedHDF.pdf. Correct me if I'm wrong, but this is what I understand: h5serv was launched to demonstrate how the RESTful API works, and HSDS is its successor. HSDS is an "improved version" (load balancing, object storage...) but is compatible with the API tested in h5serv.
>
> During my tests, I'll make sure that the backend works both with a local h5serv and with S3 (which supports HSDS).
>
> Do you have any suggestions or remarks?


KilianMaes commented 3 years ago

Contrary to what I expected, I will not be able to work on a pull request until at least next summer, so if anyone else is able to take this on, please do not hesitate.