lgatto / MSnbase

Base Classes and Functions for Mass Spectrometry and Proteomics
http://lgatto.github.io/MSnbase/
123 stars 50 forks source link

MRM Analysis Help #568

Closed jamesrgraham closed 6 months ago

jamesrgraham commented 2 years ago

Hello,

This references an issue I posted a while ago:

https://github.com/lgatto/MSnbase/issues/551

So, I have a set of Agilent .d MRM files that I was able to convert to mzML and read in with readSRMdata() as described in the link, above.

I have a list of transitions for 50 compounds; rt, precursorMz, productMz, etc. and I want to find these in the mzML files and report the intensities.

So:

# load in a single file
raw11 <- readSRMData(files = "20211109_pooleds_08.mzML")

Now, if I do the following:

write.table(raw11, sep = "\t", file = "out_file.txt")

I get a file starting with:

new("Chromatogram", rtime = c(0.00886666666666667, 0.0190333333333333 [...]

This contains the rtimes, intensities, mz (precursor, product), etc.

Is this every mass found in the file? So, to get the intensities of each compound, would it be a matter of "simply" cross-referencing the RTs/MZs and intensities and adding them up? And if a given transition pair is not reported, does that mean it simply wasn't found?

I've also tried findChromPeaks() but the results were similar and since this will process many types of compounds, I can't have custom peak-picking parameters for each run.

From findChromPeak, I can export it:

new("XChromatogram", chromPeaks = c(9.86523333333333, 0.00886666666666667, 26.9917, 3137.98745796791, 136236.712714377, 6163.96044921875, 6658.80879462996, 57.2791738790809), chromPeakData = new("DFrame", rownames = NULL, nrows = 1, listData = list(ms_level = 1, is_filled = FALSE), elementType = "ANY", elementMetadata = NULL, metadata = list()), rtime = c(0.00886666666666667, 0.0190333333333333 [...]

It seems the output here is very similar (rtime, etc.)


Another issue is the comparison of intensities from this process compared to the intensities output by Skyline. Do you have any experience in comparing the output of Skyline to that of R? The numbers can be different.

Thank you for any insight you can provide, james

jorainer commented 2 years ago

I quite don't understand what exactly you want to do, but exporting the data with write.table is not what you should aim for - that function simply calls as.character on the object containing the data and writes the output.

If you called findChromPeaks you can get a matrix with the integrated peak areas using the chromPeaks function. Ideally, have a look at the specific help pages for the objects being able to represent chromatographic data (i.e. ?Chromatogram ?MChromatograms in MSnbase and ?XChromatogram from the xcms package), there you will find some more info on how to actually access and extract data from these objects.

jamesrgraham commented 2 years ago

I have a list of amino acids (but it could be anything) for which I have expected retention time and precursor/product mz information.

For example: Valine: RT: 7.9min; Mzprecursor: 118.08; Mzproduct: 72.1 AminoPentanoic acid: RT: 10.7min; Mzprecursor: 118.08; Mzproduct: 101

If I write out a table (I know it's as.character, but I can process it outside of R) of raw11 from this:

raw11 <- readSRMData(files = "20211109_pooleds_08.mzML")

Each line of the output has a precursor/product pair identifier (excerpt: mz = c(NA, NA), filterMz = c(NA, NA), precursorMz = c(104.07, 104.07), productMz = c(58.196, 58.196), fromFile = 1,[...] (this is the identifier for alpha aminobutyric acid, fyi)) along with paired lists of retention times and intensities. So, I wrote a simple parser in perl to assemble the data; adding the intensities if within certain parameters (RT range, for example) to get the intensity of a given precursor/product pair. I don't know if this is correct.

When I run findChromPeaks, I can access the data via chromPeaks

cp <- findChromPeaks(raw11, param = mfp)
cp_out <- chromPeaks(cp)

Then cp_out looks like you would imagine it:

            rt       rtmin    rtmax      into      intf       maxo      maxf       sn row column
 [1,] 7.950067 0.007966667 21.53313 19986.593 817832.62 164986.953 55225.430 72.79093   6      1
 [2,] 7.837617 0.007516667 27.00053 18008.686 781024.00 152890.172 48975.998 73.43726   8      1
[...]

But I don't know the precursor/product MZs. I, of course, can use filterRt (I noted the default unit is minutes here...is that based on the input data?) and filterMz, but I question that approach.

I hope I'm clearer this time!

So, the output would be a list of the precursor/product pairs and their associated intensities.

jorainer commented 2 years ago

Thanks for the clarification. Ideally, you should be able to everything you want to do in R - without the need for import/export. To explain: the MChromatograms (or then the XChromatograms once you call findChromPeaks) is organized as a matrix, each row should represent a single combination of precursor and product m/z, each column one mzML file (sample). With the functions precursorMz and productMz on your imported data you can get the precursor and product m/z values for each row. In the output of the chromPeaks call above you can also see columns "row" and "column" which indicates the row and sample in which the chromatographic peak was identified (in your case for rows 6 and 8 a peak was found in sample 1). This should give you all needed info to work with your data (even maybe without filterRt and filterMz?).

Regarding the unit of the retention times - that depends on the input files. So far I've seen only RT in seconds, but maybe you have them in minutes?

jamesrgraham commented 2 years ago

Thank you very much!

Just so that I am perfectly clear here because the data set I am working with has been analyzed by someone using Skyline and I have all the intensities for each compound in the files from Skyline, but the numbers I'm getting here are quite different.

So, the output of chromPeaks has 50 rows. This is the same number of compounds in my transition list:

> chromOut <- chromPeaks(cp)
> chromOut
             rt        rtmin    rtmax        into        intf         maxo      maxf       sn row
 [1,]  9.865233 8.866667e-03 26.99170  3137.98746  136236.713   6163.96045  6658.809 57.27917   1
 [2,] 11.199100 8.866667e-03 24.37488 18186.35083  798307.718 126085.32812 50817.838 74.89145   2
 [3,] 26.991700 2.542365e+01 26.99170    74.91288    5662.638     66.68001  7322.346 10.79111   2
 [4,] 10.567583 8.633333e-03 26.99148  1819.18958   79064.907  10531.56055  3820.905 56.69437   3
 [5,] 10.608083 8.416667e-03 26.99125  1595.24131   69285.112   8672.84082  3229.372 54.64408   4
 [6,] 13.346867 8.183333e-03 26.99102  1100.59197   47627.832    150.12001  1799.950 44.14541   5
 [7,]  7.950067 7.966667e-03 21.53313 19986.59339  817832.618 164986.95312 55225.430 72.79093   6
 [8,] 27.000983 2.165532e+01 27.00098   468.56044   70797.294    354.34003 27563.392 36.33046   6
[...]

(I truncated the last column, "column.")

When I run precursorMz (or the product version), I get 45 rows.

> precursorMz(cp)
       mzmin  mzmax
 [1,] 104.07 104.07
 [2,] 104.07 104.07
 [3,] 106.04 106.04
 [4,] 110.05 110.05
 [5,] 114.06 114.06
 [6,] 116.07 116.07
 [7,] 118.08 118.08
[...]

Does the "rows" column in the output of chromPeaks refer to the rows in the precursorMz output? I assume this is the case as the "rows" only go up to 45). So, it would just be a matter of cross-referencing those in order to "link" the MZs with the RTs and intensities?

And when the row numbers in the chromPeaks output are duplicated, it simply found that precursor/product pair at another RT?

So, in this example, if I look at row 2 and 3 from chromPeaks:

[2,] 11.199100 8.866667e-03 24.37488 18186.35083  798307.718 126085.32812 50817.838 74.89145   **2**
[3,] 26.991700 2.542365e+01 26.99170    74.91288    5662.638     66.68001  7322.346 10.79111   **2**

These rows have 2 in the "row" column, which in the precursor/productMz is: 104.07 and 87.096.

This would make it gamma aminobutyric acid with an expected RT of ~11.1 minutes. with the intensity (into: 18186.35083 intf: 798307.718). The intensity from Skyline for this compound in this file is 990324.875. (I don't understand the differences from Skyline...and these differences keep coming up, especially when folks are staunch Skyline users.)

Because row 3 also references row 2 in the precursor output, then it also found the 104.07/87.096 pair at 26.99 minutes? This is the end of the run. And I know the intensities are extremely low to the point of rejecting it, but I want to make sure I'm at least on the right track.

thanks again! james

jorainer commented 2 years ago

Does the "rows" column in the output of chromPeaks refer to the rows in the precursorMz output?

yes. The "row" column should be equivalent to the number of chromatograms/transitions you have. The "column" column refers to the sample (or mzML file). If you read 5 mzML files, you'll have 5 columns.

So, it would just be a matter of cross-referencing those in order to "link" the MZs with the RTs and intensities?

Exactly. You could basically do that with precursorMz(cp)[chromOut[, "row"], ] and combine that with the chromOut.

And when the row numbers in the chromPeaks output are duplicated, it simply found that precursor/product pair at another RT?

That depends on the value in the "column" column. It could be a peak in another sample, or be additional peaks that were detected by the algorithm in the same chromatogram. Maybe best to plot the data (plot(cp[2, ]) to see what it is and how the peak look like.

I don't know how Skyline integrates the peak signal, but if the integration is performed differently you can get different results. I would definitely first check if the detected peaks look OK (by plotting them) and eventually adapt the peak detection settings.

jamesrgraham commented 2 years ago

I have yet another question.

I've followed this procedure for a single file:

mfp <- MatchedFilterParam(binSize = 0.1, impute="lin", snthresh = 2)

raw11 <- readSRMData(files = "20211109_pooleds_08.mzML")

cp <- findChromPeaks(raw11, param = mfp)

chromOut <- chromPeaks(cp)
preTEST <- precursorMz(cp)[chromOut[, "row"], ]
preTEST <- as.vector(preTEST)
prodTEST <- productMz(cp)[chromOut[, "row"], ]
prodTEST <- as.vector(prodTEST)

chromOutDF <- as.data.frame(chromOut)
into <- chromOutDF$into
into <- as.vector(into)
intf <- chromOutDF$intf
intf <- as.vector(intf)
rt   <- chromOutDF$rt
rt   <- as.vector(rt)
fnoutDF <- data.frame(preTEST, prodTEST, into, intf, rt)

When I write out the table (and sort on precursorMz, then productMz, then RT to get all compounds together), I get doubled lines, like this:

preTEST prodTEST    into    intf    rt
76.06   29.996  1167.168707 50545.62772 11.70933333
76.06   29.996  1167.168707 50545.62772 11.70933333
79.07   31.996  1140.473621 49386.54004 11.7702
79.07   31.996  1140.473621 49386.54004 11.7702
90.06   43.996  3178.60995  138230.4354 9.743716667
90.06   43.996  3178.60995  138230.4354 9.743716667
94.07   46.996  2310.862353 100428.8582 9.641666667
94.07   46.996  2310.862353 100428.8582 9.641666667
[...]

I'm not sure if that's coming from what I'm doing or whether it is in the mzML file (this is from a single file).

james

jorainer commented 2 years ago

Please check what data type precursorMz(cp) returns. Could be that the lower and upper m/z are reported for precursor and product m/z (note that even if you provide a single value m/z value, internally it is converted to a lower-upper m/z to support also specifying m/z ranges). If a list or matrix (with two columns) is returned I suggest to calculate the mean m/z for each.

jamesrgraham commented 2 years ago

Thank you. Once I recover from a major computer crash, I will look into it...but I am pretty sure that the upper/lower MZs were identical.

jamesrgraham commented 2 years ago

Meaning, they were identical in the output (removing duplicates worked), not sure if the numbers are truncated in the output (though, I've never seen that before).

jorainer commented 2 years ago

Yes, exactly, they will be identical if a single m/z value was available (i.e. a target precursor m/z and not a m/z range). To solve your problem I assume you would need to reduce this range down again to a single value before you create the various data.frames.