I'll start with appreciation that the proBatch package exists; it solves a lot of the issues I'm facing as I get started with doing batch normalisation properly in a long proteomics experiment.
I am following the example code given in the vignette, using the example data and my own data for comparison.
My own data originates from SWATH data processed with DIA-NN, which generates its own column names. According to the documentation and the vignette, that shouldn't be a problem, with the clear caveat that the naming needs to be consistent between the three files.
PROBLEM
The plot_sample_mean function doesn't specify a sample_id_column value as a parameter, yet it seems to require the sample ID column to be named FullRunName. When the column is named something else, the following error occurs:
> plot_sample_mean(sppa_log_matrix, sppa_annotation)
Error in check_sample_consistency(sample_annotation, sample_id_col, df_ave, : Sample ID column FullRunName is not defined in sample annotation, sample annotation cannot be used for correction/plotting
However, when the check_sample_consistency function is used on the same data (with matching RunID column names, where the name is something other than FullRunName), the two dataframes are merged without issue, as expected.
As far as I can tell, the column name for the RunID/MS filename/some unique ID is set in three places:
1. as a column heading in the annotation table
2. assuming the precursor intensities are imported as a wide-format measurement table, the ID is set manually when converting to long format, using standard R manipulation:
    library(data.table)  # provides setDT() and melt()
    # read in intensity data
    measurement_wide <- read.table(file = 'measurement_table.tsv', sep = '\t', header = TRUE)
    # convert to long format; the sample ID column is named here ("MS_file")
    measurement_long <- melt(setDT(measurement_wide), id.vars = c("Protein.Group", "Precursor.Id"), variable.name = "MS_file")
3. by defining the sample_id_column variable (this one might be optional)
Changing the RunID column to 'FullRunName' resolves the issue.
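For anyone who wants to try the same workaround, here is a minimal sketch of the renaming in base R. The toy tables, the batch column and the helper name rename_id_col are all made up for illustration; only the "MS_file" and "FullRunName" column names come from my setup and the error message above.

```r
# Toy annotation table; "MS_file" stands in for my own RunID column name
annotation <- data.frame(
  MS_file = c("run_01", "run_02", "run_03"),
  batch = c("batch_1", "batch_1", "batch_2"),  # hypothetical batch labels
  stringsAsFactors = FALSE
)

# Toy long-format measurements using the same ID column name
measurement_long <- data.frame(
  Precursor.Id = rep(c("PEPTIDEA2", "PEPTIDEB3"), each = 3),
  MS_file = rep(c("run_01", "run_02", "run_03"), times = 2),
  value = c(10.1, 10.4, 9.8, 12.0, 11.7, 12.3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rename one column, failing loudly if it is absent
rename_id_col <- function(df, old, new = "FullRunName") {
  stopifnot(old %in% names(df))
  names(df)[names(df) == old] <- new
  df
}

annotation <- rename_id_col(annotation, "MS_file")
measurement_long <- rename_id_col(measurement_long, "MS_file")

# Every run in the measurements should have an annotation row, and vice versa
stopifnot(setequal(annotation$FullRunName, measurement_long$FullRunName))
```

The final check is only there to confirm that the two tables still refer to the same set of runs after the rename.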
The function plot_sample_mean gives the same error whether or not colouring by batch is used.
It appears that the issue can be avoided with correct naming in the source data files, which is not specified in the vignette. This is either a documentation issue or an implementation issue.
I hope the description here will help others, as the proBatch package really does look useful.
OK, on further investigation and after trying other functions, it looks to be a simple case of the parameters for these functions being described in a somewhat misleading way in the vignette. My oversight.
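In case it helps others, this is roughly how one could check which arguments a function actually declares before renaming any columns. It is only a sketch: has_arg is a helper name I made up, and the argument name sample_id_col is an assumption taken from the error message above rather than from the package documentation, so please check it against ?plot_sample_mean.

```r
# Hypothetical helper: does function `f` declare an argument called `arg`?
has_arg <- function(f, arg) arg %in% names(formals(f))

# Works for any R closure; base R's read.table() declares `sep`
stopifnot(has_arg(read.table, "sep"))

# The proBatch part only runs if the package is installed
if (requireNamespace("proBatch", quietly = TRUE)) {
  f <- proBatch::plot_sample_mean
  if (has_arg(f, "sample_id_col")) {
    # Show the default used when the argument is omitted
    print(formals(f)$sample_id_col)
    # Assumed call, using the objects from this issue and my own ID column:
    # plot_sample_mean(sppa_log_matrix, sppa_annotation, sample_id_col = "MS_file")
  }
}
```

If the argument is there, passing the column name explicitly should make renaming the source files unnecessary.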