HDF5 file backend help - Githubissues

rformassspectrometry / Spectra

Low level infrastructure to handle MS spectra

https://rformassspectrometry.github.io/Spectra/

37 stars 25 forks source link

HDF5 file backend help #198

Closed ococrook closed 2 years ago

ococrook commented 3 years ago

Hi, thanks again for the help. Though I'm still not really getting how to use the HDF5 backend in a meaningful way. I'm clearly misunderstanding something, so any help would be appreciated.

I have my potentially large Spectra object, loaded in as follows:

hoip_apo_1_0_test <- Spectra(fpath, source = MsBackendMzR(), backend = MsBackendHdf5Peaks(), hdf5path = getwd())
hoip_apo_1_0_test

MSn data (Spectra) with 156 spectra in a MsBackendHdf5Peaks backend:
      msLevel     rtime scanIndex
    <integer> <numeric> <integer>
1           2     1.610         1
2           1   120.512         2
3           1   121.027         3
4           1   121.541         4
5           1   122.055         5
...       ...       ...       ...
152         1   197.833       152
153         1   198.347       153
154         1   198.862       154
155         1   199.376       155
156         1   199.890       156

Now I want to look at the stuff in the ms2 scans:

out <- hoip_apo_1_0_test[msLevel(hoip_apo_1_0_test) == 2]
out

MSn data (Spectra) with 2 spectra in a MsBackendHdf5Peaks backend:
    msLevel     rtime scanIndex
  <integer> <numeric> <integer>
1         2      1.61         1
2         2    179.74       117
 ... 33 more variables/columns.

file(s):
 hoip_apo_1_0_test.h5
Processing:
 Switch backend from MsBackendMzR to MsBackendHdf5Peaks [Wed Apr 14 11:42:21 2021]

I realise this is the point where something will go wrong. out is still linked to the .h5 file and so I'm going to mess something up if I try and filter/process object. E.g. I want to look at the first scan and then the second scan do something else

out1 <- filterRt(out, rt = (0,2))
out1 <- filterMzRange(out1, c(700,850))
out2 <- filterRt(out, rt = c(170,180))
out2 <- filterMzRange(out, c(300,400))

All fine an no problem, until:

applyProcessing(out2) #fine
applyProcessing(out1)

Error in .h5_read_peaks(fls, scanIndex(object), object@modCount) : 
  The data in the hdf5 files associated with this object appear to have changed! Please see the Notes section in ?MsBackend for more information.

Also not fine:

filterEmptySpectra(out2)

Error in (function (cond)  : 
  error in evaluating the argument 'i' in selecting a method for function '[': The data in the hdf5 files associated with this object appear to have changed! Please see the Notes section in ?MsBackend for more information.

So currently a pipeline where we search for peptides using known rt and mz ranges are not going to work in this way. My main question is how should this be done? It doesn't seem to make sense to switch backends before filtering because then we just load the object into memory, defeating the purpose.

lgatto commented 3 years ago

I'll have a look tomorrow, but it would be useful if you could provide a small reproducible example.

Small comment - better to use filterMsLevel(hoip_apo_1_0_test, 2) rather than doing it manually with hoip_apo_1_0_test[msLevel(hoip_apo_1_0_test) == 2] because the filter function might be optimised for a dedicated backend.

ococrook commented 3 years ago

Thanks Laurent, good pint about the filterMsLevel - I missed that. I can send you that test file - it's a small .mzML for the purposes of package testing. Though I might have already resolved the issue I'd still be keen to hear your thoughts!

lgatto commented 3 years ago

Yes, send it over.

lgatto commented 3 years ago

I am seeing similar issue (though for out2, not out1), but will need more time to explore/understand.

I assume that the missing c in out1 <- filterRt(out, rt = (0,2)) is a typo, right?

ococrook commented 3 years ago

yes, sorry typo!

lgatto commented 3 years ago

Also, should out2 <- filterMzRange(out, c(300,400)) not be out2 <- filterMzRange(out2, c(300,400))?

ococrook commented 3 years ago

should probably have named these something more sensible to avoid typos

lgatto commented 3 years ago

Here's a more detailed description of the issue:

> library(Spectra)
> library(magrittr)
> rm(list = ls())
> ## make sure the file doesn't already exist
> unlink("hoip_apo_1_0_test.h5")
> ## create data
> out <- Spectra("./hoip_apo_1_0_test.mzML",
+                source = MsBackendMzR(),
+                backend = MsBackendHdf5Peaks(),
+                hdf5path = ".") %>%
+     filterMsLevel(2)
+ > file.size("./hoip_apo_1_0_test.h5") ## 31908861
[1] 31908861
> out@processingQueueVariables ## character(0)
character(0)

Spectra variable filtering

Works as expected - file isn't modified, no processing queue

> out1 <- filterRt(out, rt = c(0,2))  ## first spectrum only
> length(out1)
[1] 1
> file.size("~/tmp/hoip_apo_1_0_test.h5") ## 31908861
[1] 31908861
> out1@processingQueueVariables ## character(0)
character(0)
> out1@backend@modCount 
[1] 0

mz filtering

Things work, the file isn't modified (range is filtered on the fly) but msLevel is added to the processing queue?

But should modCount not be 1?

> range(mz(out1)) 
         [,1]
[1,]  260.019
[2,] 1500.454
> out1 <- filterMzRange(out1, c(700,850))
> range(mz(out1)) ## works
         [,1]
[1,] 700.0046
[2,] 849.7397
> out1@processingQueueVariables ## msLevel - ok
[1] "msLevel"
> out1@backend@modCount ## should this not be 1?
[1] 0
> file.size("~/tmp/hoip_apo_1_0_test.h5") ## 31908861
[1] 31908861

Process filtering

Apparently work, and file is modified, but msLevel still in processing queue. File has changed indeed, and mz can't be accessed anymore.

> applyProcessing(out1) ## apparently works
MSn data (Spectra) with 1 spectra in a MsBackendHdf5Peaks backend:
    msLevel     rtime scanIndex
  <integer> <numeric> <integer>
1         2      1.61         1
 ... 33 more variables/columns.

file(s):
 hoip_apo_1_0_test.h5
Processing:
 Switch backend from MsBackendMzR to MsBackendHdf5Peaks [Fri Apr 16 11:30:53 2021]
 Filter: select MS level(s) 2 [Fri Apr 16 11:30:53 2021]
 Filter: select retention time [0..2] on MS level(s) 2 [Fri Apr 16 11:31:58 2021]
 ...2 more processings. Use 'processingLog' to list all. 
> file.size("~/tmp/hoip_apo_1_0_test.h5") ## 23772271 
[1] 23772271
> out1@backend@modCount
[1] 0
> out1@processingQueueVariables ## still msLevel - ok
[1] "msLevel"
> range(mz(out1)) ## error
Error in .h5_read_peaks(fls, scanIndex(object), object@modCount) : 
  The data in the hdf5 files associated with this object appear to have changed! Please see the Notes section in ?MsBackend for more information.

@jorainer - could you have a look?

Update: the msLevel is correct, I believe, as it indicates on which MS levels to operate.

ococrook commented 3 years ago

Thanks Laurent!