tvpham / iq

An R package to estimate relative protein abundances from ion quantification in DIA-MS-based proteomics
BSD 3-Clause "New" or "Revised" License

With the default `BGS factory report` from the latest Spectronaut, iq says "Do not know what to do with (Settings)" #8

Closed fcyu closed 1 year ago

fcyu commented 1 year ago

First of all, thank you for developing this wonderful package. It makes calculating MaxLFQ intensities and generating reports much easier and faster.

However, it seems to have issues when column names contain spaces or parentheses. Here is the script I was using:

rm(list=ls())

library(iq)

path <- "G:\\test.tsv"
out_path <- "out.tsv"

df <- fast_read(path,
                sample_id = "R.FileName",
                primary_id = "EG.ModifiedSequence",
                secondary_id = c("EG.ModifiedSequence", "FG.Charge"),
                intensity_col = "EG.TotalQuantity (Settings)",
                annotation_col = NULL,
                filter_string_equal = NULL,
                filter_string_not_equal = NULL,
                filter_double_less = c("PG.Qvalue" = 0.01, "PG.QValue (Run-Wise)" = 0.01, "EG.Qvalue" = 0.01),
                filter_double_greater = NULL,
                intensity_col_sep = NULL,
                intensity_col_id = NULL,
                na_string = c("Filtered", "", "NA", 0))
df_norm <- fast_preprocess(df$quant_table, median_normalization = FALSE, log2_intensity_cutoff = 0, pdf_out = NULL)
df_maxlfq <- fast_MaxLFQ(df_norm, row_names = df$protein[, 1], col_names = df$sample)
maxlfq <- df_maxlfq$estimate
maxlfq[maxlfq <= 0] <- NA
write.table(2^maxlfq, out_path, quote = FALSE, sep = "\t", col.names = NA)

I used the EG.TotalQuantity (Settings) column as the intensity and PG.QValue (Run-Wise) as one of the FDR filters. iq threw an error saying "Do not know what to do with (Settings)". Removing the spaces and parentheses solved the issue.

I know that read.delim converts disallowed characters to `.`, and that read_tsv surrounds a column name with backticks when it contains disallowed characters. I am not sure whether iq takes a similar approach, in which case I would need to pass something different to the parameter.
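For reference, the renaming behavior mentioned above can be reproduced in base R. This is a small sketch on a made-up two-column table (the values are invented for illustration); it only shows how `read.delim` treats such headers and says nothing about how iq parses them.

```r
# Toy tab-separated input with Spectronaut-style headers (values are made up).
tsv <- "R.FileName\tEG.TotalQuantity (Settings)\nrun1\t1234.5\n"

# Default: check.names = TRUE, so headers pass through make.names().
renamed <- read.delim(text = tsv)

# check.names = FALSE keeps the headers exactly as written in the file.
verbatim <- read.delim(text = tsv, check.names = FALSE)

print(names(renamed))
print(names(verbatim))
```

With the default settings, the space and both parentheses are each replaced by a dot, so the name a script refers to no longer matches the name in the file.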

Here is the exported Spectronaut report from a public dataset: test.zip

Thanks,

Fengchao

tvpham commented 1 year ago

Hi Fengchao,

Thank you for your kind words, and for the complete script & data.

Actually, one would need to remove only the spaces in the column names. It is unfortunate that we use spaces internally to pass the variables. I will change that in the future. But it will take some time because such changes require quite a bit of testing.
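For versions that still split on spaces, one possible workaround is to strip spaces from the header line before calling `fast_read`. A minimal base-R sketch follows; the helper name `strip_header_spaces` and the toy file are hypothetical, and for a very large report a streaming tool such as awk or sed may be more practical than `readLines`.

```r
# Hypothetical helper: copy a report, removing spaces from its header line only.
strip_header_spaces <- function(in_path, out_path) {
  lines <- readLines(in_path)
  lines[1] <- gsub(" ", "", lines[1], fixed = TRUE)
  writeLines(lines, out_path)
  invisible(out_path)
}

# Toy report standing in for a Spectronaut export (values are made up).
in_path <- tempfile(fileext = ".tsv")
out_path <- tempfile(fileext = ".tsv")
writeLines(c("R.FileName\tEG.TotalQuantity (Settings)\tPG.QValue (Run-Wise)",
             "run1\t1234.5\t0.001"), in_path)

strip_header_spaces(in_path, out_path)
print(readLines(out_path, n = 1))
```

After this step the column names passed to `fast_read` would be the space-free versions, for example "EG.TotalQuantity(Settings)" and "PG.QValue(Run-Wise)".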

I could not finish your script though. The "EG.ModifiedSequence" is in both primary_id and secondary_id ?

Also, 'na_string' accepts a single string only. I will put it on the todo list to accept multiple values.

Best, Thang

fcyu commented 1 year ago

Hi Thang,

Thank you very much for the prompt response.

I could not finish your script though. The "EG.ModifiedSequence" is in both primary_id and secondary_id ?

To be honest, I'm not entirely sure I understand the primary_id and secondary_id. As far as I know, the primary_id is the "row id" in the final intensity matrix, and the secondary_id indicates the units to calculate the intensity for the primary_id. Is that correct? So, in the above script, I wanted to calculate the intensity of modified sequences using the precursors (modified sequence + charge).

Also, 'na_string' accepts a single string only. I will put it on the todo list to accept multiple values.

Thank you for pointing it out. A more detailed description in the documentation would be much appreciated.

Best,

Fengchao

tvpham commented 1 year ago

Hi Fengchao,

I've updated the package. The new version v1.9.9 supports spaces and most other characters in the column names. So your statement should work:

```r
df <- fast_read(path,
                sample_id = "R.FileName",
                primary_id = "EG.ModifiedSequence",
                secondary_id = c("FG.Charge"),
                intensity_col = "EG.TotalQuantity (Settings)",
                annotation_col = NULL,
                filter_string_equal = NULL,
                filter_string_not_equal = NULL,
                filter_double_less = c("PG.Qvalue" = 0.01, "PG.QValue (Run-Wise)" = 0.01, "EG.Qvalue" = 0.01),
                filter_double_greater = NULL,
                intensity_col_sep = NULL,
                intensity_col_id = NULL,
                na_string = "Filtered")
```

To be honest, I'm not entirely sure I understand the primary_id and secondary_id. As far as I know, the primary_id is the "row id" in the final intensity matrix, and the secondary_id indicates the units to calculate the intensity for the primary_id. Is that correct? So, in the above script, I wanted to calculate the intensity of modified sequences using the precursors (modified sequence + charge).

Yes, primary_id is the output row_id. The secondary_id columns identify the entries contributing to that row_id (I think you got it correct also; it is just a concept that is hard to explain very clearly). So if you want to collapse multiple charge states, you can just say secondary_id = c("FG.Charge").
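To make the two roles concrete, here is a toy sketch in base R. It is not iq's implementation, and the peptide names and intensities are invented. Each distinct value of the `primary_id` column becomes one output row, and the `secondary_id` column distinguishes the entries that feed into that row.

```r
# Toy long-format report: one sample, two peptides, several charge states.
report <- data.frame(
  R.FileName          = "run1",
  EG.ModifiedSequence = c("_PEPTIDEK_", "_PEPTIDEK_", "_ANOTHERR_"),
  FG.Charge           = c(2, 3, 2),
  intensity           = c(1000, 400, 250)
)

# primary_id = EG.ModifiedSequence: one output row per distinct sequence.
# secondary_id = FG.Charge: the entries contributing to each of those rows.
contributors <- split(report$FG.Charge, report$EG.ModifiedSequence)
print(contributors)
```

Here `_PEPTIDEK_` has two contributing entries (charges 2 and 3) that would be combined into a single row, while `_ANOTHERR_` has only one.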

Also, 'na_string' accepts a single string only. I will put it on the todo list to accept multiple values.

You can also use filter_string_not_equal option to filter out entries corresponding to NA values.

Cheers, Thang

fcyu commented 1 year ago

Hi Thang,

Thanks for your explanation.

Yes, primary_id is the output row_id. The secondary_id are entries contributing to the row_id (I think you got it correct also. It is just a concept that hard to explain very clearly). So if you want to collapse multiple charge states, you can just say secondary_id = c("FG.Charge").

I am a little confused. Should I use secondary_id = c("EG.ModifiedSequence", "FG.Charge") rather than secondary_id = c("FG.Charge") because I want to collapse all precursors with the same EG.ModifiedSequence+FG.Charge?

Best

Fengchao

tvpham commented 1 year ago

I am a little confused. Should I use secondary_id = c("EG.ModifiedSequence", "FG.Charge") rather than secondary_id = c("FG.Charge") because I want to collapse all precursors with the same EG.ModifiedSequence+FG.Charge?

As it is now, you have EG.ModifiedSequence as row IDs. If you want row IDs as EG.ModifiedSequence+FG.Charge, then you need to concatenate the two columns into one (using R or awk) and use the concatenated column as primary_id. But then you will still need to specify the secondary_id.

Think about secondary_id as columns that make the primary_id NOT unique. If there is no duplicate after concatenation of EG.ModifiedSequence and FG.Charge, then you do not need to run iq.
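The concatenation and the uniqueness check can be sketched in base R as follows. The table is made up, and `Precursor.Id` is a hypothetical column name chosen only for this illustration.

```r
# Toy long-format report (values are made up).
report <- data.frame(
  R.FileName          = c("run1", "run1", "run2", "run2"),
  EG.ModifiedSequence = c("_PEPTIDEK_", "_PEPTIDEK_", "_PEPTIDEK_", "_PEPTIDEK_"),
  FG.Charge           = c(2, 3, 2, 3),
  stringsAsFactors    = FALSE
)

# Concatenate the two columns into a single precursor ID column.
# "Precursor.Id" is a hypothetical name, not a Spectronaut or iq column.
report$Precursor.Id <- paste(report$EG.ModifiedSequence, report$FG.Charge, sep = "_")

# Does the candidate row ID repeat within a sample?
dup_by_sequence  <- anyDuplicated(report[, c("R.FileName", "EG.ModifiedSequence")]) > 0
dup_by_precursor <- anyDuplicated(report[, c("R.FileName", "Precursor.Id")]) > 0

print(dup_by_sequence)   # TRUE: the sequence alone repeats within a run
print(dup_by_precursor)  # FALSE: sequence plus charge is already unique
```

In this toy table the concatenated ID is unique within each run, which is the situation described above where there is nothing left to aggregate.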

Thang

fcyu commented 1 year ago

Thank you very much for your prompt response.

Best,

Fengchao