rformassspectrometry / Spectra

Low level infrastructure to handle MS spectra
https://rformassspectrometry.github.io/Spectra/

An attempt to convert a spectra file to a data frame #251

Closed linlennypinawa closed 1 year ago

linlennypinawa commented 1 year ago

I attempted to visualize an EIC chromatogram and a spectrum with ggplot2, because its plots resemble what instrument software provides.

There are 271,155 data points.

I ran the following lines of code:

a_list <- list()
n <- 1
for (i in seq_len(length(raw_spec))) {
  # One iteration per peak of spectrum i
  for (j in seq_len(lengths(raw_spec[i]))) {
    retention_time <- rtime(raw_spec)[[i]]
    mz <- mz(raw_spec)[[i]][j]
    intensity <- intensity(raw_spec)[[i]][j]
    # One single-row data frame per peak
    list_temp <- data.frame(retention_time, mz, intensity)
    list_temp$n <- n
    a_list[[n]] <- list_temp
    n <- n + 1
  }
}
raw_df <- do.call(rbind, a_list)
# Release the temporary objects
a_list <- list()
list_temp <- data.frame()

It has been running for over 35 minutes while I type this message.

I also repeatedly get the error message: Error in x$.self$finalize() : attempt to apply non-function. I understand I can ignore it.

Is there a better way to convert the data to a data frame?

linlennypinawa commented 1 year ago

I figured it out. Here is a better way.

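library(dplyr)  # provides left_join()

# pk_data: peak data extracted from raw_spec beforehand (not shown)
mz_int_df <- as.data.frame(pk_data)
colnames(mz_int_df)[1] <- "n"

# Build a lookup table: spectrum index n -> retention time
a_list <- list()
n <- 1
for (i in seq_len(length(raw_spec))) {
  retention_time <- rtime(raw_spec)[[i]]
  list_temp <- data.frame(retention_time)
  list_temp$n <- n
  a_list[[n]] <- list_temp
  n <- n + 1
}
rt_df <- do.call(rbind, a_list)
# Release the temporary objects
a_list <- list()
list_temp <- data.frame()

# Attach the retention time to every peak via the spectrum index
new_df <- left_join(mz_int_df, rt_df, by = "n")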

The resulting data frame has columns for retention time, mz, and intensity.

Are there any potential problems with this approach?

jorainer commented 1 year ago

Note that, in addition to the base R plotting functions provided in Spectra, there is the possibility to use the SpectraVis package.

I will have a closer look into the function you provide and check if it's OK (or if there is a simpler way).

jorainer commented 1 year ago

A possibly faster and more efficient way to extract the rtime, mz and intensity values from a Spectra is the function below. It returns a data.frame with the requested columns. Ideally, however, the Spectra should contain only values from a single file/sample.

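ms_data_frame <- function(x) {
    ## List of two-column (mz, intensity) matrices, one per spectrum.
    pks <- peaksData(x)
    ## Number of peaks in each spectrum.
    npks <- vapply(pks, nrow, integer(1))
    ## Stack all peak matrices into a single data.frame.
    res <- as.data.frame(do.call(rbind, pks))
    ## Repeat each spectrum's retention time once per peak.
    res$rtime <- rep(rtime(x), npks)
    res
}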
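
To see what the function does without loading any data, its core steps can be reproduced in base R on mock objects. This is only a sketch under stated assumptions: the list pks and the vector rt below are made-up stand-ins for what peaksData(x) and rtime(x) return, and the numbers are invented.

```r
# Mock stand-ins (assumptions, not real data): `pks` mimics the list of
# two-column peak matrices from peaksData(x), `rt` mimics rtime(x).
pks <- list(
    cbind(mz = c(100.1, 200.2), intensity = c(10, 20)),
    cbind(mz = 150.5, intensity = 5),
    cbind(mz = c(110, 120, 130), intensity = c(1, 2, 3))
)
rt <- c(1.5, 2.5, 3.5)

# Number of peaks (rows) per spectrum.
npks <- vapply(pks, nrow, integer(1))

# Stack all peak matrices, then repeat each retention time once per peak
# so that every row knows which spectrum it came from.
res <- as.data.frame(do.call(rbind, pks))
res$rtime <- rep(rt, npks)

print(res)
```

The speed-up over the nested loop comes from calling the accessors once and letting rbind and rep work on whole vectors, instead of building one single-row data frame per peak.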

linlennypinawa commented 1 year ago

Wow! It is amazing: short and powerful.

Although I figured out how to process multiple files/samples with my own logic, it takes many lines of code. I am going to adopt your way of processing the data in my code.

linlennypinawa commented 1 year ago

I will share my work with you for a little while. I am open to feedback.

http://iseq20-lenny-lin.shinyapps.io/MS_data_mining_shiny_rev02b?_ga=2.59275566.2050638645.1669072545-676217538.1669072545

jorainer commented 1 year ago

If you have multiple files you could use:

res <- spectrapply(sps, f = sps$dataOrigin, FUN = ms_data_frame)

I haven't tried it, but this should first split the Spectra sps by original data file (sps$dataOrigin) and then apply the function ms_data_frame to each part. As a result you should get a list of data.frames whose length equals the number of original data files (i.e. each element is the m/z, rt, intensity data.frame for one file). You could even run this in parallel by passing BPPARAM = MulticoreParam(3) as an additional parameter. Note, however, that with the simple function above you will end up with all peaks from all MS levels. If that's not what you want, you might need to use filterMsLevel at some point.
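
The split-then-apply logic described above, combined with an MS level filter, can be mimicked in base R on mock data. This is a sketch only: origin, ms_level, rt and pks are hypothetical stand-ins for sps$dataOrigin, msLevel(sps), rtime(sps) and peaksData(sps), and to_df is an illustrative helper, not part of Spectra.

```r
# Mock per-spectrum data (assumptions, invented values).
origin   <- c("a.mzML", "a.mzML", "b.mzML", "b.mzML")
ms_level <- c(1L, 2L, 1L, 1L)
rt       <- c(1.0, 1.2, 1.0, 2.0)
pks <- list(
    cbind(mz = c(100, 200), intensity = c(10, 20)),
    cbind(mz = c(50, 60, 70), intensity = c(1, 2, 3)),
    cbind(mz = 100, intensity = 7),
    cbind(mz = c(100, 300), intensity = c(3, 4))
)

# Keep MS1 spectra only (the role filterMsLevel would play on a Spectra).
keep <- ms_level == 1L

# Illustrative helper: build the mz/intensity/rtime data.frame for the
# spectra at positions `idx`.
to_df <- function(idx) {
    p <- pks[idx]
    npks <- vapply(p, nrow, integer(1))
    out <- as.data.frame(do.call(rbind, p))
    out$rtime <- rep(rt[idx], npks)
    out
}

# Split the retained spectrum indices by file, then apply the helper to
# each group; the result is a list with one data.frame per file.
res <- lapply(split(which(keep), origin[keep]), to_df)

print(names(res))
print(res[["b.mzML"]])
```

Filtering before splitting keeps the MS2 spectrum out of every per-file table, which is usually what an EIC built from MS1 data needs.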

Also, please consider re-using as much existing functionality as possible (e.g. from Spectra, SpectraVis or others); this avoids repeating the same mistakes, which can easily happen. Also, if you think your code might be good and useful for others, you might consider contributing to one of our packages. We're open to contributions if the code is useful, clean, well documented and tested... just something to keep in mind.

linlennypinawa commented 1 year ago

res <- spectrapply(sps, f = sps$dataOrigin, FUN = ms_data_frame)

Yes, this line of code works. Amazing!

I will try your approach to extract other info, such as TIC, BPC, and file names. The column names of my data frame are: file ID (sample_name), retention time, mz, intensity, tic, and bpc.

I am a lab scientist. Although I am passionate about R and packages for mass spec applications, I am still learning from you. My code is lengthier than yours. Moreover, I see file names in res, but I have no idea how to extract them. I wish I could contribute meaningfully to one of your packages.
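
One possible way to reach the table described above, sketched in base R. It assumes that res is a list named by data file (as the observation about file names in res suggests), that tic means the summed intensity of a spectrum and bpc its maximum intensity, and that retention times are unique per spectrum within a file. The mock res and its values are invented for illustration.

```r
# Mock `res` (assumption): a named list with one data.frame per file,
# shaped like the output of ms_data_frame.
res <- list(
    "a.mzML" = data.frame(mz = c(100, 200, 150),
                          intensity = c(10, 20, 5),
                          rtime = c(1, 1, 2)),
    "b.mzML" = data.frame(mz = c(100, 300),
                          intensity = c(7, 3),
                          rtime = c(1, 1))
)

# The file names are the names of the list elements.
files <- names(res)

# Add file, tic and bpc columns to each data.frame, then stack them.
combined <- do.call(rbind, Map(function(df, f) {
    df$file <- f
    # Group peaks by retention time, i.e. by spectrum (see assumption).
    df$tic <- ave(df$intensity, df$rtime, FUN = sum)
    df$bpc <- ave(df$intensity, df$rtime, FUN = max)
    df
}, res, files))
rownames(combined) <- NULL

print(combined)
```

If retention times can repeat within a file (for example when several MS levels are mixed), a spectrum index column would be a safer grouping key than rtime.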