symbioticMe / proBatch

Tools for Batch Effects Diagnostics and Correction

"FullRunName" appears hardcoded into "plot_sample_mean" #14

Closed MayaPetek closed 2 years ago

MayaPetek commented 2 years ago

I'll start by saying that I appreciate that the proBatch package exists; it solves many of the issues I'm facing as I get started with doing batch normalisation properly in a long proteomics experiment.

I am following the example code given in the vignette, using the example data and my own data for comparison.

My own data originates from SWATH data processed with DIA-NN, which obviously generates its own column names. According to documentation and the vignette, that shouldn't be a problem, with the clear caveat that the naming needs to be consistent between the three files.

PROBLEM: plot_sample_mean does not appear to take the sample ID column name as a parameter, yet it seems to require that column to be named FullRunName. When the column has any other name, the following error occurs:

> plot_sample_mean(sppa_log_matrix, sppa_annotation)
Error in check_sample_consistency(sample_annotation, sample_id_col, df_ave, :
  Sample ID column FullRunName is not defined in sample annotation, sample annotation cannot be used for correction/plotting

However, when the check_sample_consistency function is used on the same data (with matching RunID column names, where the name is something other than FullRunName), the two dataframes are merged without issue, as expected.
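
For anyone trying to picture how this could happen, here is a minimal base R sketch of the pattern. It is not proBatch's code: check_ids, sample_means and the toy data are hypothetical stand-ins, and the only assumption is that the plotting function passes a default column name of "FullRunName" down to the check.

```r
# Hypothetical stand-in for a consistency check; NOT proBatch's implementation.
check_ids <- function(sample_annotation, sample_id_col, data_matrix) {
  if (!(sample_id_col %in% names(sample_annotation))) {
    stop(sprintf("Sample ID column %s is not defined in sample annotation",
                 sample_id_col))
  }
  # Sample IDs present both as matrix columns and in the annotation.
  intersect(colnames(data_matrix), sample_annotation[[sample_id_col]])
}

# Hypothetical wrapper whose default column name is "FullRunName" (assumed).
sample_means <- function(data_matrix, sample_annotation,
                         sample_id_col = "FullRunName") {
  shared <- check_ids(sample_annotation, sample_id_col, data_matrix)
  colMeans(data_matrix[, shared, drop = FALSE], na.rm = TRUE)
}

# Toy data: the annotation calls its ID column "RunID", not "FullRunName".
toy_matrix <- matrix(1:6, nrow = 2,
                     dimnames = list(c("pep1", "pep2"),
                                     c("run1", "run2", "run3")))
toy_annotation <- data.frame(RunID = c("run1", "run2", "run3"),
                             MS_batch = c("A", "A", "B"),
                             stringsAsFactors = FALSE)

# Relying on the default fails; capture the message instead of stopping.
default_result <- tryCatch(sample_means(toy_matrix, toy_annotation),
                           error = function(e) conditionMessage(e))

# Naming the column explicitly works on the same data.
explicit_result <- sample_means(toy_matrix, toy_annotation,
                                sample_id_col = "RunID")
```

In this sketch the check itself is fine when given the right column name; the error only comes from the wrapper's default, which would match what I see when calling the check directly.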

As far as I can tell, the column name for the RunID/MS filename/some unique ID is set in three places:

Renaming the RunID column to 'FullRunName' resolves the issue. The function plot_sample_mean gives the same error whether or not colouring by batch is used.

It appears that the issue can be avoided by naming the column correctly in the source data files, which is not specified in the vignette. This is either a documentation issue or an implementation issue.

I hope the description here will help others, as the proBatch package really does look useful.

MayaPetek commented 2 years ago

OK, on further investigation and after trying other functions, this looks like a simple case of function parameters that are described in a somewhat misleading way in the vignette. My oversight.
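
In case it helps others, here is a rough way to check which parameters a function accepts before assuming a column name is hardcoded. has_argument is a hypothetical helper written for this comment, and the argument name "sample_id_col" is taken from the error message above; whether your installed proBatch version exposes it is something to confirm locally rather than take from this sketch.

```r
# Hypothetical helper: does a function expose a given named argument?
has_argument <- function(fun, arg) {
  arg %in% names(formals(fun))
}

# Only inspect proBatch if it is installed in this R library.
if (requireNamespace("proBatch", quietly = TRUE)) {
  # Lists the arguments and their defaults for the installed version.
  print(formals(proBatch::plot_sample_mean))
  print(has_argument(proBatch::plot_sample_mean, "sample_id_col"))
}

# If the argument is present, the call from the report could be written as
# below (not run; "RunID" is a placeholder for the actual column name):
# plot_sample_mean(sppa_log_matrix, sppa_annotation, sample_id_col = "RunID")
```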