lgatto / MSnbase

Base Classes and Functions for Mass Spectrometry and Proteomics
http://lgatto.github.io/MSnbase/

Failed to add identification data #548

Closed: Geoffreylxd closed this issue 2 years ago

Geoffreylxd commented 2 years ago

Hi, I wanted to use addIdentificationData() to combine the mzID file data with the mzML data stored as msexp, following the vignette's example. However, the call fails with the error below and I cannot complete this step:

msexp <- readMSData(rawFile, verbose = T, centroided = F)
Reading 40177 MS2 spectra from file 01CPTAC_GBM_W_PNNL_20190123_B1S1_f01.mzML.gz
  |======================================| 100%
Creating 'MSnExp' object
identFile <- "CPTAC_GBM_W_PNNL_20190123_B1S1_f01.mzid"
msexp <- addIdentificationData(object =  msexp, id = identFile)
Error in (function (cond)  : 
  error in evaluating the argument 'X' in selecting a method for function 'lapply': Could not convert using R function: as.data.frame.

Thanks a lot for your help!

lgatto commented 2 years ago

  1. What is written to the console when you run traceback() right after the error?
  2. What is the output of sessionInfo()?

Geoffreylxd commented 2 years ago

After running the following:

identFile <- "CPTAC_GBM_W_PNNL_20190123_B1S1_f01.mzid"
msexp <- addIdentificationData(object =  msexp, id = identFile)
Error in (function (cond)  : 
  error in evaluating the argument 'X' in selecting a method for function 'lapply': Could not convert using R function: as.data.frame.
traceback()
17: (function (cond) 
    .Internal(C_tryCatchHelper(addr, 1L, cond)))(structure(list(message = "Could not convert using R function: as.data.frame.", 
        call = object@backend$getSpecParams(), cppstack = NULL), class = c("Rcpp::not_compatible", 
    "C++Error", "error", "condition")))
16: stop(structure(list(message = "Could not convert using R function: as.data.frame.", 
        call = object@backend$getSpecParams(), cppstack = NULL), class = c("Rcpp::not_compatible", 
    "C++Error", "error", "condition")))
15: .External(structure(list(name = "CppMethod__invoke_notvoid", 
        address = <pointer: 0x7fec924284c0>, dll = structure(list(
            name = "Rcpp", path = "/Library/Frameworks/R.framework/Versions/4.1/Resources/library/Rcpp/libs/Rcpp.so", 
            dynamicLookup = TRUE, handle = <pointer: 0x7fec9b21fde0>, 
            info = <pointer: 0x7fec944253c0>), class = "DLLInfo"), 
        numParameters = -1L), class = c("ExternalRoutine", "NativeSymbolInfo"
    )), <pointer: 0x7fec89fefce0>, <pointer: 0x7fec9c96b0f0>, .pointer)
14: object@backend$getSpecParams()
13: specParams(object)
12: specParams(object)
11: .local(object, ...)
10: psms(from)
9: psms(from)
8: lapply(x, function(xx) {
       if (is.factor(xx)) 
           as.character(xx)
       else xx
   })
7: factorsAsStrings(psms(from))
6: asMethod(object)
5: as(id, "data.frame")
4: .addCharacterIdentificationData(object, id, fcol, icol, acc, 
       desc, pepseq, key, decoy, rank, accession, verbose, ...)
3: .local(object, id, ...)
2: addIdentificationData(object = msexp, id = identFile)
1: addIdentificationData(object = msexp, id = identFile)
sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics 
[5] grDevices utils     datasets  methods  
[9] base     

other attached packages:
[1] mzID_1.30.0         MSnbase_2.18.0     
[3] ProtGenerics_1.24.0 S4Vectors_0.30.0   
[5] mzR_2.26.1          Rcpp_1.0.7         
[7] Biobase_2.52.0      BiocGenerics_0.38.0

loaded via a namespace (and not attached):
 [1] bitops_1.0-7          
 [2] matrixStats_0.59.0    
 [3] bit64_4.0.5           
 [4] doParallel_1.0.16     
 [5] filelock_1.0.2        
 [6] progress_1.2.2        
 [7] httr_1.4.2            
 [8] GenomeInfoDb_1.28.1   
 [9] tools_4.1.0           
[10] utf8_1.2.1            
[11] R6_2.5.0              
[12] affyio_1.62.0         
[13] DBI_1.1.1             
[14] lazyeval_0.2.2        
[15] colorspace_2.0-2      
[16] tidyselect_1.1.1      
[17] prettyunits_1.1.1     
[18] bit_4.0.4             
[19] curl_4.3.2            
[20] compiler_4.1.0        
[21] preprocessCore_1.54.0 
[22] xml2_1.3.2            
[23] plotly_4.9.4.1        
[24] scales_1.1.1          
[25] affy_1.70.0           
[26] rappdirs_0.3.3        
[27] stringr_1.4.0         
[28] digest_0.6.27         
[29] XVector_0.32.0        
[30] pkgconfig_2.0.3       
[31] htmltools_0.5.1.1     
[32] dbplyr_2.1.1          
[33] fastmap_1.1.0         
[34] limma_3.48.1          
[35] htmlwidgets_1.5.3     
[36] rlang_0.4.11          
[37] RSQLite_2.2.7         
[38] impute_1.66.0         
[39] generics_0.1.0        
[40] jsonlite_1.7.2        
[41] BiocParallel_1.26.1   
[42] dplyr_1.0.7           
[43] RCurl_1.98-1.3        
[44] magrittr_2.0.1        
[45] GenomeInfoDbData_1.2.6
[46] MALDIquant_1.19.3     
[47] munsell_0.5.0         
[48] fansi_0.5.0           
[49] MsCoreUtils_1.4.0     
[50] lifecycle_1.0.0       
[51] vsn_3.60.0            
[52] stringi_1.7.3         
[53] MASS_7.3-54           
[54] zlibbioc_1.38.0       
[55] plyr_1.8.6            
[56] BiocFileCache_2.0.0   
[57] rpx_2.0.0             
[58] grid_4.1.0            
[59] blob_1.2.1            
[60] crayon_1.4.1          
[61] lattice_0.20-44       
[62] Biostrings_2.60.1     
[63] hms_1.1.0             
[64] KEGGREST_1.32.0       
[65] pillar_1.6.1          
[66] reshape2_1.4.4        
[67] codetools_0.2-18      
[68] biomaRt_2.48.2        
[69] XML_3.99-0.6          
[70] glue_1.4.2            
[71] pcaMethods_1.84.0     
[72] data.table_1.14.0     
[73] BiocManager_1.30.16   
[74] foreach_1.5.1         
[75] png_0.1-7             
[76] vctrs_0.3.8           
[77] gtable_0.3.0          
[78] purrr_0.3.4           
[79] tidyr_1.1.3           
[80] clue_0.3-59           
[81] assertthat_0.2.1      
[82] cachem_1.0.5          
[83] ggplot2_3.3.5         
[84] ncdf4_1.17            
[85] viridisLite_0.4.0     
[86] tibble_3.1.2          
[87] iterators_1.0.13      
[88] AnnotationDbi_1.54.1  
[89] memoise_2.0.0         
[90] IRanges_2.26.0        
[91] cluster_2.1.2         
[92] ellipsis_0.3.2 

lgatto commented 2 years ago

There is an issue with the reading/parsing of your mzID file. Could you try the following?

library(mzR)
id <- openIDfile(identFile)
psms(id)

How did you create that mzID file?

An alternative is to use mzID to parse it. It is slower but more robust in some cases. To go down that route, you would first create a dataframe with the identification data, and then add it with addIdentificationData().
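The mzID route described above might look roughly like the following sketch. mzID() and flatten() come from the mzID package, and the argument names passed to addIdentificationData() are the ones visible in the traceback earlier in this thread. The column names are assumptions about what flatten() returns for this file and should be checked against names(iddf) first; add_ids_via_mzID is a hypothetical helper name, not part of any package.

```r
# Hypothetical helper: parse the identification file with mzID instead of
# mzR, flatten it to a data frame, and attach it to an MSnExp object.
# Packages are referenced as pkg::fun, so defining the helper does not
# require them to be attached.
add_ids_via_mzID <- function(msexp, identFile) {
    stopifnot(file.exists(identFile))
    id <- mzID::mzID(identFile)   # the slower parser mentioned above
    iddf <- mzID::flatten(id)     # identification results as a data frame
    ## ASSUMPTION: the column names below are what flatten() produced for
    ## this file. Inspect names(iddf) and adjust them before relying on
    ## the result. fcol is left at the package default.
    MSnbase::addIdentificationData(
        msexp, id = iddf,
        icol = c("spectrumFile", "acquisitionnum"),
        acc = "accession", desc = "description", pepseq = "pepseq",
        key = "spectrumid", decoy = "isdecoy", rank = "rank")
}
```

Usage would be msexp <- add_ids_via_mzID(msexp, identFile), after which fData(msexp) can be inspected to see how many spectra received an identification.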

Geoffreylxd commented 2 years ago

Thanks for the reply. The mzID file was downloaded from the CPTAC database.

Using openIDfile() also did not work:

id <- openIDfile(identFile)
id
Identification file handle.
Filename:  CPTAC_GBM_W_PNNL_20190123_B1S1_f01.mzid 
Error in (function (cond)  : 
  error in evaluating the argument 'x' in selecting a method for function 'nrow': Could not convert using R function: as.data.frame.
traceback()
15: (function (cond) 
    .Internal(C_tryCatchHelper(addr, 1L, cond)))(structure(list(message = "Could not convert using R function: as.data.frame.", 
        call = object@backend$getSpecParams(), cppstack = NULL), class = c("Rcpp::not_compatible", 
    "C++Error", "error", "condition")))
14: stop(structure(list(message = "Could not convert using R function: as.data.frame.", 
        call = object@backend$getSpecParams(), cppstack = NULL), class = c("Rcpp::not_compatible", 
    "C++Error", "error", "condition")))
13: .External(structure(list(name = "CppMethod__invoke_notvoid", 
        address = <pointer: 0x7fe38bf75d60>, dll = structure(list(
            name = "Rcpp", path = "/Library/Frameworks/R.framework/Versions/4.1/Resources/library/Rcpp/libs/Rcpp.so", 
            dynamicLookup = TRUE, handle = <pointer: 0x7fe38bd078f0>, 
            info = <pointer: 0x7fe38d4253c0>), class = "DLLInfo"), 
        numParameters = -1L), class = c("ExternalRoutine", "NativeSymbolInfo"
    )), <pointer: 0x7fe37cda3ff0>, <pointer: 0x7fe37cda5330>, .pointer)
12: object@backend$getSpecParams()
11: specParams(object)
10: specParams(object)
9: .local(object, ...)
8: psms(x)
7: psms(x)
6: nrow(psms(x))
5: length(object)
4: length(object)
3: cat("Number of psms: ", length(object), "\n")
2: (new("standardGeneric", .Data = function (object) 
   standardGeneric("show"), generic = structure("show", package = "methods"), 
       package = "methods", group = list(), valueClass = character(0), 
       signature = structure("object", simpleOnly = TRUE), default = new("derivedDefaultMethod", 
           .Data = function (object) 
           showDefault(object), target = new("signature", .Data = "ANY", 
               names = "object", package = "methods"), defined = new("signature", 
               .Data = "ANY", names = "object", package = "methods"), 
           generic = structure("show", package = "methods")), skeleton = (new("derivedDefaultMethod", 
           .Data = function (object) 
           showDefault(object), target = new("signature", .Data = "ANY", 
               names = "object", package = "methods"), defined = new("signature", 
               .Data = "ANY", names = "object", package = "methods"), 
           generic = structure("show", package = "methods")))(object)))(new("mzRident", 
       backend = new("Rcpp_Ident", .xData = <environment>), fileName = "/Users/geoffreyleung/Desktop/Insilico/Proteomics/CPTAC/Brain-CPTAC_GBM_Discovery_Study_S048/01CPTAC_GBM_Proteome_PNNL_20190123_PSM.CAP.r1_mzIdentML/CPTAC_GBM_W_PNNL_20190123_B1S1_f01.mzid", 
       .__classVersion__ = new("Versions", .Data = list(c(0L, 0L, 
       1L)))))
1: (new("standardGeneric", .Data = function (object) 
   standardGeneric("show"), generic = structure("show", package = "methods"), 
       package = "methods", group = list(), valueClass = character(0), 
       signature = structure("object", simpleOnly = TRUE), default = new("derivedDefaultMethod", 
           .Data = function (object) 
           showDefault(object), target = new("signature", .Data = "ANY", 
               names = "object", package = "methods"), defined = new("signature", 
               .Data = "ANY", names = "object", package = "methods"), 
           generic = structure("show", package = "methods")), skeleton = (new("derivedDefaultMethod", 
           .Data = function (object) 
           showDefault(object), target = new("signature", .Data = "ANY", 
               names = "object", package = "methods"), defined = new("signature", 
               .Data = "ANY", names = "object", package = "methods"), 
           generic = structure("show", package = "methods")))(object)))(new("mzRident", 
       backend = new("Rcpp_Ident", .xData = <environment>), fileName = "/Users/geoffreyleung/Desktop/Insilico/Proteomics/CPTAC/Brain-CPTAC_GBM_Discovery_Study_S048/01CPTAC_GBM_Proteome_PNNL_20190123_PSM.CAP.r1_mzIdentML/CPTAC_GBM_W_PNNL_20190123_B1S1_f01.mzid", 
       .__classVersion__ = new("Versions", .Data = list(c(0L, 0L, 
       1L)))))
psms(id)
Error in object@backend$getSpecParams() : 
  Could not convert using R function: as.data.frame.

With mzID, I could load the mzID file and flatten it to a data frame. Both this approach and MSnID parsed the mzID file successfully. However, addIdentificationData() mapped only about 100 rows (out of ~40000) to the mzID data. Is that normal, or have I made a mistake?

lgatto commented 2 years ago

MSnID uses mzID for parsing, so it is expected to work.

100 out of ~40000 is very low indeed. Have you set the column names correctly?
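One way to check whether the matching columns line up is to compare the keys on both sides before calling addIdentificationData(). The sketch below uses plain data frames with made-up values. key_overlap is a hypothetical helper, and the column names are assumptions standing in for whatever fData(msexp) and the identification data frame actually contain.

```r
# Hypothetical helper: fraction of feature rows whose composite key
# (for example file name plus scan number) also occurs in the
# identification table.
key_overlap <- function(fd, idd, fcol, icol) {
    stopifnot(length(fcol) == length(icol),
              all(fcol %in% names(fd)), all(icol %in% names(idd)))
    # Build one composite key per row on each side.
    fkey <- do.call(paste, c(unname(as.list(fd[fcol])), sep = "\r"))
    ikey <- do.call(paste, c(unname(as.list(idd[icol])), sep = "\r"))
    mean(fkey %in% ikey)
}

# Toy data (made up): the scans overlap, but the identification table
# refers to a differently named raw file.
fd  <- data.frame(spectrum.file = "run01.mzML.gz", acquisition.number = 1:4)
idd <- data.frame(spectrumFile = "run01.mzML", acquisitionnum = c(2L, 3L))

# Matching on file name and scan number: the file names differ.
both_cols <- key_overlap(fd, idd,
                         c("spectrum.file", "acquisition.number"),
                         c("spectrumFile", "acquisitionnum"))
# Matching on scan number only.
scan_only <- key_overlap(fd, idd, "acquisition.number", "acquisitionnum")
print(c(both_cols = both_cols, scan_only = scan_only))
```

A large gap between the two numbers would suggest that one of the matching columns (here the file name) is formatted differently in the two tables, rather than that the identifications are missing.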

What does MSnID report in terms of number of PSMs after filtering?

Otherwise, I would recommend you have a look at the new Spectra package, which improves MSnbase on several fronts. You can find details here. The joinSpectraData() function is the equivalent of addIdentificationData(), and is described in section 4.6 of the manual.
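As a rough sketch of the Spectra route: Spectra() and joinSpectraData() are functions of the Spectra package and DataFrame() is from S4Vectors. The helper name join_ids_to_spectra, the id_key default, and the assumption that the identification table has already been reduced to one row per spectrum are illustrative and not taken from this thread.

```r
# Hypothetical helper: read the raw file as a Spectra object and join an
# identification table onto its spectra variables.
join_ids_to_spectra <- function(rawFile, iddf, id_key = "spectrumid") {
    ## ASSUMPTION: iddf is a data frame with one row per spectrum and a
    ## column (named by id_key) whose values use the same format as the
    ## "spectrumId" spectra variable. Check both before joining.
    stopifnot(is.data.frame(iddf), id_key %in% names(iddf))
    sp <- Spectra::Spectra(rawFile)  # reading mzML requires mzR
    Spectra::joinSpectraData(sp, S4Vectors::DataFrame(iddf),
                             by.x = "spectrumId", by.y = id_key)
}
```

After the join, the identification columns should appear among spectraVariables() of the returned object.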

Geoffreylxd commented 2 years ago

It turned out that the problem was with the mzID files themselves. I generated the mzID files again from the .mzML files and it worked fine. Thanks a lot!