tvpham / iq

An R package to estimate relative protein abundances from ion quantification in DIA-MS-based proteomics
BSD 3-Clause "New" or "Revised" License

Minimum number of peptides for LFQ #10

Open pisistrato opened 1 year ago

pisistrato commented 1 year ago

Hi Thang,

In #7 you proposed a way to calculate the number of peptides used for quantification, so that the final MaxLFQ table can be filtered.

I was wondering how this differs from the peptide_extractor option in the process_long_format function. I am getting a slightly different number of proteins with the two approaches (DIA-NN output).

tvpham commented 1 year ago

Hi, I think the difference might be due to the modified peptides. For the data in #7, if I remove all the modifications, I get the same result as the previous script.

process_long_format("DIA-Report-long-format.txt",
                    output_filename = "iq-MaxLFQ.tsv", 
                    sample_id  = "R.FileName",
                    primary_id = "PG.ProteinGroups",
                    secondary_id = c("EG.Library", "FG.Id", "FG.Charge", "F.FrgIon", "F.Charge", "F.FrgLossType"),
                    intensity_col = "F.PeakArea",
                    annotation_col = c("PG.Genes", "PG.ProteinNames", "PG.FastaFiles"),
                    filter_string_equal = c("F.ExcludedFromQuantification" = "False"),
                    filter_double_less = c("PG.Qvalue" = "0.01", "EG.Qvalue" = "0.01"),
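                    # Reduce each entry to its bare peptide sequence: drop the trailing
                    # "_.<charge>" part first, then remove the three modification tags.
                    peptide_extractor = function(x) gsub("\\[Oxidation \\(M\\)\\]", "", 
                                                         gsub("\\[Carbamidomethyl \\(C\\)\\]", "", 
                                                              gsub("\\[Acetyl \\(Protein N-term\\)\\]", "",
                                                                   gsub("_\\.[0-9].*$", "", x)))),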
                    log2_intensity_cutoff = -1000)

The regex in peptide_extractor is a bit messy. Basically, we want entries such as "[Acetyl (Protein N-term)]M[Oxidation (M)]EDMNEYSNIEEFAEGSK_.2" to be reduced to "MEDMNEYSNIEEFAEGSK", so that modified and unmodified forms are counted as one unique peptide.
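To see why modified peptides can change the counts, here is a small self-contained sketch in base R. It is only an illustration: strip_mods is a hypothetical helper (not part of iq), the toy table is invented, and the helper removes any bracketed tag rather than the three named modifications in the call above, so it is a looser variant of the same idea.

```r
# Hypothetical helper (not part of iq): reduce an entry to its bare peptide sequence.
strip_mods <- function(x) {
  x <- sub("_\\.[0-9].*$", "", x)   # drop the trailing "_.<charge>" part
  gsub("\\[[^]]*\\]", "", x)        # drop every bracketed modification tag
}

# Toy data, invented for illustration only.
toy <- data.frame(
  protein = c("P1", "P1", "P1", "P2"),
  entry   = c("[Acetyl (Protein N-term)]M[Oxidation (M)]EDMNEYSNIEEFAEGSK_.2",
              "MEDMNEYSNIEEFAEGSK_.2",
              "MEDMNEYSNIEEFAEGSK_.3",
              "AAC[Carbamidomethyl (C)]DEFK_.2"),
  stringsAsFactors = FALSE
)

# Peptide with modifications kept (only the charge suffix removed) ...
toy$modified <- sub("_\\.[0-9].*$", "", toy$entry)
# ... and with modifications removed as well.
toy$bare <- strip_mods(toy$entry)

count_unique <- function(p) length(unique(p))

# Unique peptides per protein under the two definitions.
n_modified <- tapply(toy$modified, toy$protein, count_unique)
n_bare     <- tapply(toy$bare,     toy$protein, count_unique)

print(n_modified)
print(n_bare)
```

In this toy table P1 has two unique peptides when modified forms are kept apart and one when they are collapsed, so a "minimum number of peptides" filter can keep or drop a protein depending on which definition is used.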