rformassspectrometry / Spectra

Low level infrastructure to handle MS spectra
https://rformassspectrometry.github.io/Spectra/
37 stars 25 forks source link

Bug in filterPrecursorScan() #194

Open lgatto opened 3 years ago

lgatto commented 3 years ago
> library(rpx)
> PXD022816 <- PXDataset("PXD022816")
> (mzmls <- pxget(PXD022816, grep("mzML", pxfiles(PXD022816))[1:2]))
Loading QEP2LC6_HeLa_50ng_251120_01-calib.mzML from cache.
Loading QEP2LC6_HeLa_50ng_251120_02-calib.mzML from cache.
[1] "~/.cache/rpx/207e04bccd3dd_QEP2LC6_HeLa_50ng_251120_01-calib.mzML"
[2] "~/.cache/rpx/207e03973c908_QEP2LC6_HeLa_50ng_251120_02-calib.mzML"
> sp <- Spectra(mzmls)
> sp2 <- filterPrecursorScan(sp, 2490)
> length(sp2)
[1] 12
spectraData(sp2)[, c("msLevel", "acquisitionNum", "precScanNum", "dataOrigin")]
msLevel acquisitionNum precScanNum dataOrigin
1 2479 NA 207e04bccd3dd_QEP2LC6_HeLa_50ng_251120_01-calib.mzML
2 2480 2479 207e04bccd3dd_QEP2LC6_HeLa_50ng_251120_01-calib.mzML
1 2482 NA 207e04bccd3dd_QEP2LC6_HeLa_50ng_251120_01-calib.mzML
2 2485 2482 207e04bccd3dd_QEP2LC6_HeLa_50ng_251120_01-calib.mzML
1 2486 NA 207e04bccd3dd_QEP2LC6_HeLa_50ng_251120_01-calib.mzML
2 2490 2486 207e04bccd3dd_QEP2LC6_HeLa_50ng_251120_01-calib.mzML
1 2479 NA 207e03973c908_QEP2LC6_HeLa_50ng_251120_02-calib.mzML
1 2480 NA 207e03973c908_QEP2LC6_HeLa_50ng_251120_02-calib.mzML
2 2482 2480 207e03973c908_QEP2LC6_HeLa_50ng_251120_02-calib.mzML
1 2485 NA 207e03973c908_QEP2LC6_HeLa_50ng_251120_02-calib.mzML
2 2486 2485 207e03973c908_QEP2LC6_HeLa_50ng_251120_02-calib.mzML
2 2490 2485 207e03973c908_QEP2LC6_HeLa_50ng_251120_02-calib.mzML

Issues

  1. First, the acquisition number for all files are extracted, which is probably not what the user wanted. This could be handled explicitly with a call to filterDataOrigin() if I wanted to extract the set from a single file.
  2. The result is wrong. We get the acquisition numbers 2490 from the two files, their precursor scans 2485 and 2486 (again from both files), but then these recursively extract each other across different files. The correct results should contain 4 scans:
> spectraData(filterPrecursorScan(filterDataOrigin(sp, unique(dataOrigin(sp))[1]), 2490))[, c("msLevel", "acquisitionNum", "precScanNum", "dataOrigin")]
msLevel acquisitionNum precScanNum dataOrigin
1 2486 NA 207e04bccd3dd_QEP2LC6_HeLa_50ng_251120_01-calib.mzML
2 2490 2486 207e04bccd3dd_QEP2LC6_HeLa_50ng_251120_01-calib.mzML

and

> spectraData(filterPrecursorScan(filterDataOrigin(sp, unique(dataOrigin(sp))[2]), 2490))[, c("msLevel", "acquisitionNum", "precScanNum", "dataOrigin")]
msLevel acquisitionNum precScanNum dataOrigin
1 2485 NA 13707d04ce6b_QEP2LC6_HeLa_50ng_251120_02-calib.mzML
2 2490 248513707d04ce6b_QEP2LC6_HeLa_50ng_251120_02-calib.mzML

Suggested workarounds

@sgibb @jorainer - any comments/suggestions?

jorainer commented 3 years ago

I think the cleanest solution would be to add an argument f = dataOrigin(object) to the filterPrecursorScan method. Internally, the method should then split the data by this f and call the internal function to filter on each subset. We already use the f = dataOrigin(object) for other function to either perform calculations in parallel or to apply it on a per-file basis.

If OK I would implement that - and ask you @lgatto to confirm the results to be correct then.

lgatto commented 3 years ago

Fixed by PR #206. Thanks @jorainer

jorainer commented 3 years ago

Then I guess we can close the issue now.

lgatto commented 1 year ago

This bug isn't fixed.

Using the default f, i.e. all files is OK

> library(rpx)
> PXD022816 <- PXDataset("PXD022816")
Loading PXD022816 from cache.
> (mzmls <- pxget(PXD022816, grep("mzML", pxfiles(PXD022816))[1:2]))
[1] "/home/lgatto/.cache/R/rpx/3e55cc6b3df7f9_QEP2LC6_HeLa_50ng_251120_01-calib.mzML"
[2] "/home/lgatto/.cache/R/rpx/3e55cc7a53b04f_QEP2LC6_HeLa_50ng_251120_02-calib.mzML"
> sp <- Spectra(mzmls)
> (sp2 <- filterPrecursorScan(sp, 2490)) 
MSn data (Spectra) with 4 spectra in a MsBackendMzR backend:
    msLevel     rtime scanIndex
  <integer> <numeric> <integer>
1         1   915.757      2486
2         2   916.293      2490
3         1   915.252      2485
4         2   915.876      2490
 ... 33 more variables/columns.

file(s):
3e55cc6b3df7f9_QEP2LC6_HeLa_50ng_251120_01-calib.mzML
3e55cc7a53b04f_QEP2LC6_HeLa_50ng_251120_02-calib.mzML
Processing:
 Filter: select parent/children scans for 2490 [Fri Apr  7 23:57:32 2023] 

But of I want to get the results for one file, the f argument get ignores:

> filterPrecursorScan(sp, 2490, f = "/home/lgatto/.cache/R/rpx/3e55cc6b3df7f9_QEP2LC6_HeLa_50ng_251120_01-calib.mzML")
MSn data (Spectra) with 4 spectra in a MsBackendMzR backend:
    msLevel     rtime scanIndex
  <integer> <numeric> <integer>
1         1   915.757      2486
2         2   916.293      2490
3         1   915.252      2485
4         2   915.876      2490
 ... 33 more variables/columns.

file(s):
3e55cc6b3df7f9_QEP2LC6_HeLa_50ng_251120_01-calib.mzML
3e55cc7a53b04f_QEP2LC6_HeLa_50ng_251120_02-calib.mzML
Processing:
 Filter: select parent/children scans for 2490 [Fri Apr  7 23:59:34 2023] 

Indeed, looking at this line, if does get ignored.

If I change it, passing f on that same line:

> filterPrecursorScan(sp, 2490, f = "/home/lgatto/.cache/R/rpx/3e55cc6b3df7f9_QEP2LC6_HeLa_50ng_251120_01-calib.mzML")
Error in i2index(i, length(x), rownames(x@spectraData)) : 
  if i is logical it has to match the length of the object (58907)
In addition: Warning message:
In x[i] <- value[[j]] :
  number of items to replace is not a multiple of replacement length