Name of blocking variable not correctly recognized

aghr commented 4 months ago

Description of the bug

When having in samplesheet.csv a variable (column header) RNA_extraction_date and when I use this as a blocking variable in the contrasts.csv then RNA_extraction_date from contrasts.csv is recognized as R_extraction_date and I get the error that R_extraction_date is not in samplesheet.csv

To me this looks like an issue with the 'NA' in RNA_extraction_data. NA is also the default in R for specifying 'Not Available' and this might cause the 'NA' in RNA_extraction_date to be replaced by the empty string leading to R_extraction_date. When I change RNA_extraction_date to Rna_extraction_date in samplesheet.csv and contrasts.csv then it seems to work.

Interestingly: the process NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:VALIDATOR (samplesheet.csv) finishes with success but it looks as if this validator checks only samplesheet.csv but not contrats.csv. The error happens in process NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:DESEQ2_DIFFERENTIAL which runs after the validator finished with success.

Command used and terminal output

Header of samplesheet.csv:

sample,fastq_1,fastq_2,strandedness,condition,genotype,sex_Xist,litter_date,dev_time,replicate,flowcell,position_in_wellplate,RNA_extraction_date,genotype_id,NREADS_VST_XIST,SAMPLE_RNA_CONC,LIBPREP_RNA_CONC,row_in_wellplate,col_in_wellplate,SAMPLE_RNA_CONC_2,LIBPREP_RNA_CONC_2,sex_Eif2s3y,sex

contrasts.csv:

id,variable,reference,target,blocking
A_vs_B,condition,A,B,RNA_extraction_date

#########################
Validator finishes with success:

NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:VALIDATOR (samplesheet.csv)

########################
But then:

ERROR ~ Error executing process > 'NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:DESEQ2_DIFFERENTIAL ([id:A_vs_B, variable:
condition, reference:A, target:B, blocking:R_extraction_date])'

Caused by:
  Process `NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:DESEQ2_DIFFERENTIAL ([id:A_vs_B, variable:condition, reference:A, target:B, blocking:R_extraction_date])` terminated with an error exit status (1)

Command executed [/XXX/.nextflow/assets/nf-core/differentialabundance/./workflows/../modules/nf-core/deseq2/differential/templates/
deseq_de.R]:

#######################

  Error: Blocking variables R_extraction_date do not correspond to sample sheet columns.
  Execution halted

Relevant files

No response

System information

version 23.10.0 build 5889 (created 15-10-2023 15:07 UTC (17:07 CEST))
Linux desktop
CentOS Linux release 7.7.1908
local execution
with singularity container
version of nfcore diffabundance: 1.4.0

BEFH commented 3 months ago

This bug is due to this line of code:

https://github.com/nf-core/differentialabundance/blob/a3d664c12c4050bae2acc83b1c636dcc3546b9a5/workflows/differentialabundance.nf#L319

It probably needs to be changed to:

it.blocking = it.blocking.replaceAll('^NA$', '')

aghr commented 3 months ago

The comment directly before this code block reads:

// Split the contrasts up so we can run differential analyses and // downstream plots separately. // Replace NA strings that might have snuck into the blocking column

Why would this code block try to replace NA sub-strings in column headers of the samplesheet.csv file? Maybe you apply it to samplesheet.csv, too?

BEFH commented 3 months ago

The issue is that the current code replaces any "NA" in the string. Mine only replaces it if it's the whole string. However, come to think of it, IDK why it's replacing it in the variable name anyway, and whether that's desired behavior

WackerO commented 3 months ago

I assume (carefully) that .splitCsv might add NAs when it finds an empty column; is that correct, @pinin4fjords?

Edit: No it doesn't, at least not in a little test I ran. Why might NAs sneak in? 🤔

pinin4fjords commented 3 months ago

Thanks for the bug report!

This was definitely done in response to a bug, possibly people using NA in the input contrasts files to indicate missing values. So I don't want to remove this entirely.

@BEFH - could you PR your regex fix please, since it's so concise? Please add a changelog entry in the same style as the others there when you do so.

We should also document the special meaning of 'NA'.

aghr commented 3 months ago

Hey, checking for and dealing with NA entries in contrasts.csv seems reasonable. But, replacing any pattern match of 'NA' with the empty string will destroy meaningful entries like 'RNA_CONCENTRATION' or 'ANALYSIS_OUTCOME' and so on. Would it be possible to replace 'NA' with the empty string only if the complete entry is 'NA' (in perl ^NA$) but otherwise don't replace NAs?

pinin4fjords commented 3 months ago

Yes, that's the fix proposed by @BEFH

nf-core / differentialabundance