lgatto / MSnbase

Base Classes and Functions for Mass Spectrometry and Proteomics
http://lgatto.github.io/MSnbase/
126 stars 46 forks source link

Error in Data Frame #261

Closed ghost closed 7 years ago

ghost commented 7 years ago

I have followed instructions on the following link on installation and usage of MSnBase. http://bioconductor.org/packages/release/bioc/vignettes/MSnbase/inst/doc/MSnbase-demo.pdf.

This is the code that I have used for the session:

library(MSnbase)
x<-"E:/Data/Kundai/RStudio/XCMS/Eta6.mzXML"
ms<-openMSfile(x)
hd<-header(ms)
y<-"E:/Data/Kundai/RStudio/XCMS/Eta6.mzid"
#creating basic MSnExp object
msexp<-readMSData(x, verbose = FALSE)
msexp<-addIdentificationData(msexp, id=y, verbose = FALSE).

This is the error message I get:

> library(MSnbase)
> x<-"E:/Data/Kundai/RStudio/XCMS/Eta6.mzXML"
> ms<-openMSfile(x)
> hd<-header(ms)
> y<-"E:/Data/Kundai/RStudio/XCMS/Eta6.mzid"
> msexp<-readMSData(x, verbose = FALSE)
> msexp<-addIdentificationData(msexp, id=y, verbose = FALSE)
 **Error in `$<-.data.frame`(`*tmp*`, spectrumFile, value = c("170511_RF_Eta6.mzML",  : 
  replacement has 2 rows, data has 817**

SessionInfo

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] bindrcpp_0.2         randomForest_4.6-12  plotly_4.7.1        
 [4] ggplot2_2.2.1        shiny_1.0.5          data.table_1.10.4   
 [7] dbscan_1.1-1         KPIC_2.4.0           ropls_1.9.0         
[10] devtools_1.13.3      iontree_1.23.0       XML_3.98-1.9        
[13] RSQLite_2.0          rJava_0.9-8          BiocInstaller_1.27.4
[16] msmsEDA_1.15.0       xcms_2.99.7          RColorBrewer_1.1-2  
[19] MSnbase_2.3.10       ProtGenerics_1.9.0   BiocParallel_1.11.6 
[22] mzR_2.11.8           Rcpp_0.12.12         Biobase_2.37.2      
[25] BiocGenerics_0.23.1 
lgatto commented 7 years ago

Thank you for the report. Would you be able to share the mzML and mzid files for me to debug?

ghost commented 7 years ago

Please note that I am using an mzXML file instead of mzML file because Johannes Rainner told me that the readMSData function uses mzR which was designed to accept both mzML and mzXML files. If it is the case that readMSData accepts strictly mzML files, I will try it out with the said format.

Here is a link to dropbox where the two files are stored. They were too big to copy and paste directly. https://www.dropbox.com/s/p0mlpj0acnqr0lv/Eta6.rar?dl=0

lgatto commented 7 years ago

Thanks. I have been travelling and just managed to download the file now - will look at is asap.

lgatto commented 7 years ago

The error that you saw stems from the fact that the mzid files claims two source files, while only one is expected

> id <- openIDfile("~/tmp/Eta6/Eta6.mzid")
> sourceInfo(id)
[1] "C:\\Users\\Student\\Desktop\\170511_RF_Eta\\170511_RF_Eta6.mzML"
[2] "C:\\Users\\Student\\Desktop\\170511_RF_Eta6.raw"

I then ran in a second error due to the lack of any identification scores

> score(id)
No scoring information available
data frame with 0 columns and 0 rows

Both errors are fixed in the latest version, available from github now (and soon also on Bioconductor), but I am wondering whether Proteome Discoverer is doing a good job. I haven't heard of these error before, and I am not very confident in a piece of software that doesn't report its scores - not very good practice when it comes to downstream analysis anyway.

With the latest version, I get

> rw <- readMSData("~/tmp/Eta6/Eta6.mzXML", mode = "onDisk")
> rw <- addIdentificationData(rw, "~/tmp/Eta6/Eta6.mzid")
No scoring information available
lgatto commented 7 years ago

If this now also works for you, could you please close this issue, or report any additional problems you see.

ghost commented 7 years ago

Its still not working for some reason. I do not know why. I downloaded a coat of MSnBase from Github using the following code: library("BiocInstaller") biocLite("MSnbase")

The session info is the exact same as the one shown above. I then ran the following line of code: msexp <- addIdentificationData(msexp, id = identFile,verbose = FALSE) Error in $<-.data.frame(*tmp*, "spectrumFile", value = c("170511_RF_Eta6.mzML", : replacement has 2 rows, data has 817

I suspect that maybe I have the old version without the new improvements. Where can I download the new one ?

lgatto commented 7 years ago

You are right about the old version. You'll need to install the very latest version directly from Github

library("BiocInstaller")
biocLite("lgatto/MSnbase")

Once you confirm that it works, I will push it to Bioconductor.

ghost commented 7 years ago

Here is my session info after installation

sessionInfo() R version 3.4.1 (2017-06-30) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale: [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252 [4] LC_NUMERIC=C LC_TIME=German_Germany.1252

attached base packages: [1] parallel stats graphics grDevices utils datasets methods base

other attached packages: [1] xcms_2.99.7 MSnbase_2.3.11 ProtGenerics_1.9.1 BiocParallel_1.11.6 mzR_2.11.11
[6] Rcpp_0.12.12 Biobase_2.37.2 BiocGenerics_0.23.1 BiocInstaller_1.27.4

loaded via a namespace (and not attached): [1] git2r_0.19.0 compiler_3.4.1 RColorBrewer_1.1-2 plyr_1.8.4
[5] iterators_1.0.8 tools_3.4.1 zlibbioc_1.23.0 MALDIquant_1.16.4
[9] digest_0.6.12 memoise_1.1.0 preprocessCore_1.39.3 tibble_1.3.4
[13] gtable_0.2.0 lattice_0.20-35 rlang_0.1.2 Matrix_1.2-10
[17] foreach_1.4.3 curl_2.8.1 knitr_1.17 httr_1.3.1
[21] withr_2.0.0 devtools_1.13.3 S4Vectors_0.15.8 IRanges_2.11.17
[25] stats4_3.4.1 multtest_2.33.0 grid_3.4.1 impute_1.51.0
[29] R6_2.2.2 XML_3.98-1.9 survival_2.41-3 RANN_2.5.1
[33] limma_3.33.10 ggplot2_2.2.1 MASS_7.3-47 splines_3.4.1
[37] scales_0.5.0 pcaMethods_1.69.0 codetools_0.2-15 MassSpecWavelet_1.43.0 [41] mzID_1.15.0 colorspace_1.3-2 affy_1.55.0 lazyeval_0.2.0
[45] munsell_0.4.3 doParallel_1.0.10 vsn_3.45.2 affyio_1.47.1

The error is still there:

msexp <- addIdentificationData(msexp, id = identFile,verbose = FALSE) Error in $<-.data.frame(*tmp*, "spectrumFile", value = c("170511_RF_Eta6.mzML", : replacement has 2 rows, data has 817

lgatto commented 7 years ago

Could you report the output of getMethod("coerce", c("mzRident", "data.frame")). Here's what is expected:

> getMethod("coerce", c("mzRident", "data.frame"))
Method Definition:

function (from, to = "data.frame", strict = TRUE) 
{
    iddf <- factorsAsStrings(psms(from))
    src <- basename(sourceInfo(from))
    if (length(src) > 1) 
        src <- paste(src, collapse = ";")
    iddf$spectrumFile <- src
    iddf$idFile <- basename(fileName(from))
    scores <- factorsAsStrings(score(from))
    if (nrow(scores)) {
        stopifnot(identical(iddf[, 1], scores[, 1]))
        iddf <- cbind(iddf, scores[, -1])
    }
    mods <- factorsAsStrings(modifications(from))
    names(mods)[-1] <- makeCamelCase(names(mods), prefix = "mod")[-1]
    iddf <- merge(iddf, mods, by.x = c("spectrumID", "sequence"), 
        by.y = c("spectrumID", "modSequence"), suffixes = c("", 
            ".y"), all = TRUE, sort = FALSE)
    iddf[, "spectrumID.y"] <- NULL
    subs <- factorsAsStrings(substitutions(from))
    names(subs)[-1] <- makeCamelCase(names(subs), prefix = "sub")[-1]
    iddf <- merge(iddf, subs, by.x = c(spectrumID = "sequence"), 
        by.y = c(spectrumID = "subSequence"), suffixes = c("", 
            ".y"), all = TRUE, sort = FALSE)
    iddf[, "spectrumID.y"] <- NULL
    iddf
}
<environment: namespace:MSnbase>

Signatures:
        from       to          
target  "mzRident" "data.frame"
defined "mzRident" "data.frame"
ghost commented 7 years ago

Here it is:

>  getMethod("coerce", c("mzRident", "data.frame"))
Method Definition:

function (from, to = "data.frame", strict = TRUE) 
{
    iddf <- factorsAsStrings(psms(from))
    iddf$spectrumFile <- basename(sourceInfo(from))
    iddf$idFile <- basename(fileName(from))
    scores <- factorsAsStrings(score(from))
    stopifnot(identical(iddf[, 1], scores[, 1]))
    iddf <- cbind(iddf, scores[, -1])
    mods <- factorsAsStrings(modifications(from))
    names(mods)[-1] <- makeCamelCase(names(mods), prefix = "mod")[-1]
    iddf <- merge(iddf, mods, by.x = c("spectrumID", "sequence"), 
        by.y = c("spectrumID", "modSequence"), suffixes = c("", 
            ".y"), all = TRUE, sort = FALSE)
    iddf[, "spectrumID.y"] <- NULL
    subs <- factorsAsStrings(substitutions(from))
    names(subs)[-1] <- makeCamelCase(names(subs), prefix = "sub")[-1]
    iddf <- merge(iddf, subs, by.x = c(spectrumID = "sequence"), 
        by.y = c(spectrumID = "subSequence"), suffixes = c("", 
            ".y"), all = TRUE, sort = FALSE)
    iddf[, "spectrumID.y"] <- NULL
    iddf
}
<bytecode: 0x000000002c69a710>
<environment: namespace:MSnbase>

Signatures:
        from       to          
target  "mzRident" "data.frame"
defined "mzRident" "data.frame"
lgatto commented 7 years ago

Ok, so for some reason, you are still running the old code. Could you

  1. close and restart R
  2. run `BiocInstaller::biocLite("lgatto/MSnbase")
  3. library("MSnbase") and try you code again
  4. a.if it works, close the issue b. if it fails, report the output of getMethod again.
ghost commented 7 years ago

Thanks a lot. Now works.

ghost commented 7 years ago

Unfortunately, for some reason, I have run into more errors during the course of the night. This might be classified as a question and not necessarily an issue.

experiment <- removePeaks(msexp, t = 400, verbose = FALSE) There were 50 or more warnings (use warnings() to see the first 50)

qnt.max <- normalise(msexp, "max") plot(qnt.max) Error in if (centroided.) { : missing value where TRUE/FALSE needed

sc <- quantify(msexp, method = "count") Error in sampleNames<-(*tmp*, value = character(0)) : 'value' length (0) must equal sample number in AssayData (1)

siquant <- quantify(msexp, method = "SIn") Error in sampleNames<-(*tmp*, value = sampleNames(phenoData)) : 'value' length (0) must equal sample number in AssayData (1)

compareSpectra(centroided[[2]], centroided[[3]], fun = "common") Error in centroided[[2]] : object of type 'closure' is not subsettable compareSpectra(centroided[[2]], centroided[[3]],

Lastly, do the comparison matrices compare any two different spectra or they compare the same peptide spectra in different samples ? Is the clusterdendogram supposed to cluster different runs/experiments or it clusters just spectra in one run/experiment. Can I use the clusterdendogram to cluster different batches of a recombinant protein drug ?

sgibb commented 7 years ago

I am afraid that we can't help you with this few lines of warnings and errors.

Somehow your msexp is missing sampleNames and the information whether it is centroided or not.

Lastly, do the comparison matrices compare any two different spectra or they compare the same peptide spectra in different samples ? Is the clusterdendogram supposed to cluster different runs/experiments or it clusters just spectra in one run/experiment.

The comparison method compares each spectrum against all the others in the MSnExp object. So there is no specific peptide/run/experiment handling. But you can subset your MSnExp object before, e.g. for the same peptide across different samples/runs.

Can I use the clusterdendogram to cluster different batches of a recombinant protein drug ?

I guess that should be possible.

lgatto commented 7 years ago

experiment <- removePeaks(msexp, t = 400, verbose = FALSE) There were 50 or more warnings (use warnings() to see the first 50)

Have a look at warnings() to see what they say. Surprisingly, I didn't get any warnings, but that was after setting the spectra to centroided.

Re centroided, you can set this with centroided(msexp) <- TRUE or setting readMSData(..., centroided = TRUE). If you have multiple MS level, see, they can be set independently - see ?OnDiskMSnExp.

There are a few things a bit weird in your code. For example, you type

qnt.max <- normalise(msexp, "max")

The name qnt.max indicates that this would be a quantitation result, but it's not; it's still an MSnExp object containing the raw data, only that the intensities have been normalised to 1. Afterwards, when you try to plot, is fails because the centroided status is not set (see above). But you wouldn't want to plot 18783 spectra anyway.

Re quantitation (with both methods), I think there's something else wrong. Somehow, the peptide sequences isn't are set to NA. I'll have too look closer.

lgatto commented 7 years ago

@NebuchadnezzaraHardrada - is it normal that all 1107 PSMs match one database entry, namely "EMBOSS_001"? That could actually be the culprit...

ghost commented 7 years ago

I managed to work on the problem using your suggestions and the following now stands.

  1. How does one create an MSnSet object ? As I currently understand, this is the only way to process more than one MSnexp object.

  2. experiment <- removePeaks(msexp, t = 400, verbose = FALSE) This piece of code does not produce any more error warnings after adding centroided = TRUE to readMSData.

  3. How can one plot figure 10 in the MSnBase vigenette ? When I use the following lines of code, I get a plot with only a single point and index instead of figure 10.

    ionCount(msexp[[1]])
    plot(ionCount(msexp[[1]]))
  4. Are the functions used to plot the figures in section 6.3 (Spectrum processing) available to the public or they were just for demonstration of proof of concept ?.

  5. When I try to do peptide counting as presented on Section 10 (Label Free MS^2 Quantification), I get the following error for the given line of code.

    > sc <- quantify(msexp, method = "count")
    Error in `sampleNames<-`(`*tmp*`, value = sampleNames(phenoData)) : 
    'value' length (0) must equal sample number in AssayData (1)
  6. As @sgibb suggested, I then tried comparing two matrices after centroiding the data and still got the following error:

    > compareSpectra(centroided[[1]], centroided[[2]],fun = "common")
    Error in centroided[[1]] : object of type 'closure' is not subsettable

    After adding labels, the error still persisted but the error type changed.

    > compareSpectra(centroided[[F1.S00001]], centroided[[F1.S00002]],fun = "common")
    Error in centroided[[F1.S00001]] : object 'F1.S00001' not found

    F1.0001 is what you get when you type the following line of code, which I assumed is the data label.

    > head(fData(msexp), n = 50)
          spectrum
    F1.S00001        1
    F1.S00002        2
    > compmat <- compareSpectra(centroided, fun="cor")
    Error in (function (classes, fdef, mtable)  : 
    unable to find an inherited method for function ‘compareSpectra’ for signature ‘"standardGeneric", "missing"’
    centroided <- pickPeaks(msexp, verbose = FALSE) (THIS ONE WORKS FINE), the next line of code does not.
    > (k <- which(fData(centroided)[, "PeptideSequence"] == "ASGVAVSDGVIK"))
    Error in `[.data.frame`(fData(centroided), , "PeptideSequence") : 
    undefined columns selected

    From my limited understanding,

    plot(centroided[[k[1]]], centroided[[k[2]]])

    is used to compare two different spectra from the same MSexp object, can the same be done to compare two different spectra from two different MSExp objects or the same spectrum (e.g X1) from a whole MSnSet ?

    @lgatto I expected all the PSMs to match one database entry because the sample contains a pharmaceutical grade recombinant protein drug searched against its own database. Had we had more hits, that would indicate that the protein drug contains other sequences and is therefore not pure. Since the peptide sequence is known and is essentially the same in all samples, figure 18 is my only hope in proving differences between batches because differences lie only in the intensities of specific peptides.

What is the best way to load more than 60 mzXML files with their mzID files, conduct clean up and removal of spectra and finally, compare the same spectra across all files (or two at a time) and finally cluster these comparisons with a dendogram ? Is MSnBase designed to do this or I am trying to use the wrong tool ?

lgatto commented 7 years ago

How can one plot figure 10 in the MSnBase vigenette ? When I use the following lines of code, I get a plot with only a single point and index instead of figure 10.

Identify your spectrum of interest, for example the 10th, then use

plot(msexp[[10]])

(possibly with additional options)

Are the functions used to plot the figures in section 6.3 (Spectrum processing) available to the public or they were just for demonstration of proof of concept ?.

It's only for demonstration, but there isn't anysthing special about it. You can either use base plotting and then add points or use ggplot2. In the vignette, I used

p3 <- plot(clean(removePeaks(ppsp,t=3)), full = TRUE, plot = FALSE) +
  theme_gray(5) +
  geom_point(size=3,alpha=I(1/3)) +
  geom_hline(yintercept=3,linetype=2) +
  ggtitle("Peaks < 3 removed and cleaned")

This error

> compareSpectra(centroided[[1]], centroided[[2]],fun = "common")
Error in centroided[[1]] : object of type 'closure' is not subsettable

indicates that you don't have a data object called centroided and R uses that function. Same explanation with labels. At the end of your message, it works with the MSnExp object msexp, not with the centroided function.

In the vignette, we also call an object centroided (not ideal), produced with centroided <- pickPeaks(itraqdata). If your data is already centroided, you won't need this step.

Regarding the error when quantifying, the problem stems from the nature of your identification results. I don't understand why, but the data indicates that all results come from the same acquisition number:

> iddf <- readMzIdData("~/Downloads/Eta6.mzid")
> table(iddf$acquisitionNum)

   2 
1107 

This messes up the filtering and the addition of your identification data to the raw data.

To load 60 files, you need the on disk infrastructure: use readMSData(..., mode = "onDisk"). The rest very much depends on exactly what you need to do. First, make sure things work with less files.

The package is MSnbase (not MSnBase) - many people do the mistake.

ghost commented 7 years ago

I tried working with 3 files and for some reason, the readMSData function is now giving an error.

x<-"E:/Data/Kundai/RStudio/XCMS/MZID/Trial2/MZXML" rawdata<-readMSData(x, mode="onDisk",centroided = TRUE) Error in .mzRBackend(x) : Could not determine file type for E:/Data/Kundai/RStudio/XCMS/MZID/Trial2/MZXML.

In the mean time, you can just answer the above mentioned question and close the issue. I will try to use other files with straightforward identification results and will get back to you if the problems still persists.Thanks for the help

lgatto commented 7 years ago

For multiple files, you need to provide a vector of file names, something like

> msdata::proteomics()
[1] "MRM-standmix-5.mzML.gz"                                                
[2] "MS3TMT10_01022016_32917-33481.mzML.gz"                                 
[3] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz"
[4] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzML.gz"         
ghost commented 7 years ago

171004_hela_100ng.zip I tried the whole process again with a file obtained from the analysis of a HeLa standard. I am getting the following errors :

msexp <- addIdentificationData(msexp, id = identFile,verbose = FALSE) No scoring information available head(fData(msexp), n = 2) spectrum acquisition.number sequence chargeState rank passThreshold experimentalMassToCharge F1.S00001 1 52 F1.S00002 2 6443

idSummary(msexp) spectrumFile idFile coverage 1 Hela.mzXML 0

fData(msexp)$pepseq NULL

head(fData(msexp), n = 70) spectrum acquisition.number sequence chargeState rank passThreshold experimentalMassToCharge F1.S00001 1 52 F1.S00002 2 6443 F1.S00003 3 6464 F1.S00004 4 6472 F1.S00005 5 6475

It seems as if the number n, in some instances is limited and cannot go beyond, say 5000. Since both the standard and recombinant protein are "pure", ions elute at least 29 mins into the run and so I expected the first few spectra to be empty but then I cannot access those spectra after 10000 which I suspect have the compounds of interest.

plot(msexp[[10]]) Error in plot_Spectrum2(x, ...) : Please provide reporter ions if you do not want a full spectrum.

I do not understand why I have to provide reporter ions to the above piece of code.

I do not understand why it says there is no scoring information available when, the mzID file has "Proteome Discoverer Delta Scores" which I assume are the scores. Hela.mzid.txt

Here is a link to the raw file on dropbox. The mzID file is has been attached directly to this thread.

https://www.dropbox.com/s/6ho97icbpikoqyp/171004_hela_100ng.raw?dl=0