nf-core / differentialabundance

Differential abundance analysis for feature/ observation matrices from platforms such as RNA-seq
https://nf-co.re/differentialabundance
MIT License
60 stars 33 forks source link

Sample check failure with names including '-' symbol #69

Closed clstacy closed 1 year ago

clstacy commented 1 year ago

Description of the bug

Pipeline failing with input and matrix sample names including "-" character on CUSTOM_MATRIXFILTER step, error gives all names of samples with "-" ID. When I check the work folder for the process, the names in each of the files have had "-" changed to ".", but I think this might be being compared to the original names and failing?

If I change names with sed in input and matrix file to remove "-", the pipeline continues without error.

Command used and terminal output

nextflow run nf-core/differentialabundance \
    -profile singularity \
    -r dev \
    --input diffabundance/inputs/samplesheet.csv \
    --contrasts diffabundance/inputs/serCan_contrasts.csv \
    --matrix diffabundance/inputs/data/rsem.merged.transcript_counts.tsv \
    --gtf reference/serCan2020/serCan2020_no_genes.gtf \
        -w diffabundance/work \
    --features_id_col 'transcript_id' \
    --outdir diffabundance/results \
    --max_memory '160.GB' \
    --deseq2_cores 8 \
    -resume

# output:
executor >  local (2)
[32/9bd716] process > NFCORE_DIFFERENTIALABUNDANC... [100%] 1 of 1, cached: 1 ✔
[52/70c121] process > NFCORE_DIFFERENTIALABUNDANC... [100%] 1 of 1 ✔
[ed/3225f0] process > NFCORE_DIFFERENTIALABUNDANC... [100%] 1 of 1, failed: 1 ✘
[-        ] process > NFCORE_DIFFERENTIALABUNDANC... -
[-        ] process > NFCORE_DIFFERENTIALABUNDANC... -
[-        ] process > NFCORE_DIFFERENTIALABUNDANC... -
[-        ] process > NFCORE_DIFFERENTIALABUNDANC... -
[-        ] process > NFCORE_DIFFERENTIALABUNDANC... -
Execution cancelled -- Finishing pending tasks before exit
-[nf-core/differentialabundance] Pipeline completed with errors-
Error executing process > 'NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:CUSTOM_MATRIXFILTER ([id:study])'

Caused by:
  Process `NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:CUSTOM_MATRIXFILTER ([id:study])` terminated with an error exit status (1)

Command executed [/.nextflow/assets/nf-core/differentialabundance/./workflows/../modules/nf-core/custom/matrixfilter/templates/matrixfilter.R]:

  #!/usr/bin/env Rscript

  # Filter rows based on the number of columns passing the abundance threshold. By
  # default this will be any row with a value of 1 or more, which would be a
  # permissive threshold for RNA-seq data.
  #
  # In RNA-seq studies it's often not enough to just remove genes not expressed in
  # any sample. We also want to remove anything likely to be part of noise, or
  # which has sufficiently low expression that differential analysis would not be
  # useful. For that reason we might require a higher threshold than 1, and
  # require that more than one sample passes.
  #
  # Often we want to know that a gene is expressed in a substantial enough number
  # of sample that differential analysis worthwhile, so we may pick a threshold
  # sample number related to group size. Note that we do not filter with an
  # awareness of the groups themselves, since this adds bias towards discovery
  # between those groups.

  ################################################
  ################################################
  ## Functions                                  ##
  ################################################
  ################################################

  #' Parse out options from a string without recourse to optparse
  #'
  #' @param x Long-form argument list like --opt1 val1 --opt2 val2
  #' @param opt_defaults A lis with default argument values
  #'
  #' @return named list of options and values similar to optparse

  parse_args <- function(x, opt_defaults){
      args_list <- unlist(strsplit(x, ' ?--')[[1]])[-1]
      args_vals <- unlist(lapply(args_list, function(y) strsplit(y, ' +')), recursive = FALSE)

      # Ensure the option vectors are length 2 (key/ value) to catch empty ones
      args_vals <- lapply(args_vals, function(z){ length(z) <- 2; z})

      parsed_args <- structure(lapply(args_vals, function(x) x[2]), names = lapply(args_vals, function(x) x[1]))
      parsed_args[! is.na(parsed_args)]

      # Now apply CLI options to override defaults

      opt_types <- lapply(opt_defaults, class)

      for ( ao in names(parsed_args)){
          if (! ao %in% names(opt_defaults)){
              stop(paste("Invalid option:", ao))
          }else{

              # Preserve classes from defaults where possible
              if (! is.null(opt_defaults[[ao]])){
                  parsed_args[[ao]] <- as(parsed_args[[ao]], opt_types[[ao]])
              }
              opt_defaults[[ao]] <- parsed_args[[ao]]
          }
      }
      opt_defaults
  }

  #' Flexibly read CSV or TSV files
  #'
  #' @param file Input file
  #' @param header Passed to read.delim()
  #' @param row.names Passed to read.delim()
  #' @param nrows Passed to read.delim()
  #'
  #' @return output Data frame

  read_delim_flexible <- function(file, header = TRUE, row.names = NULL, nrows = -1 ){

      ext <- tolower(tail(strsplit(basename(file), split = "\\.")[[1]], 1))

      if (ext == "tsv" || ext == "txt") {
          separator <- "\t"
      } else if (ext == "csv") {
          separator <- ","
      } else {
          stop(paste("Unknown separator for", ext))
      }

      read.delim(
          file,
          sep = separator,
          header = header,
          row.names = row.names
      )
  }

  # Set up default options

  opt <- list(
      abundance_matrix_file = 'rsem.merged.transcript_counts.assay.tsv',
      sample_file = 'samplesheet.sample_metadata.tsv',
      minimum_abundance = 1,
      minimum_samples = 1,
      minimum_proportion = 0,
      grouping_variable = NULL
  )

  opt <- parse_args('--minimum_samples 1 --minimum_abundance 1', opt)

  abundance_matrix <- read_delim_flexible(opt$abundance_matrix_file, row.names = 1)
  feature_id_name <- colnames( read_delim_flexible(opt$abundance_matrix_file, nrows = 1)[1])

  # If a sample sheet was specified, validate the matrix against it

  if (opt$sample_file != ''){

      # Read the sample sheet and check against matrix

      samplesheet <- read_delim_flexible(opt$sample_file, row.names = 1)
      missing_samples <- setdiff(rownames(samplesheet), colnames(abundance_matrix))

      if (length(missing_samples) > 0){
          stop(
              paste(
                  paste(missing_samples, collapse = ', '),
                  'not represented in supplied abundance matrix'
              )
          )
      }else{
          abundance_matrix <- abundance_matrix[,rownames(samplesheet)]
      }
  }else{

      # If we're not using a sample sheet to select columns, then at least make
      # sure the ones we have are numeric (some upstream things like the RNA-seq
      # workflow have annotation colummns as well)

      numeric_columns <- unlist(lapply(1:ncol(abundance_matrix), function(x) is.numeric(abundance_matrix[,x])))
      abundance_matrix <- abundance_matrix[,numeric_columns]
  }

  # If we want to define n based on the levels of a grouping variable...

  if ((opt$sample_file != '') && ( ! is.null(opt$grouping_variable))){

      # Pick a minimum number of samples to pass threshold based on group size

      if (! opt$grouping_variable %in% colnames(samplesheet)){
          stop(paste(opt$grouping_variable, "not in supplied sample sheet"))
      }else{
          opt$minimum_samples <- min(table(samplesheet[[opt$grouping_variable]]))
          if ( opt$minimum_proportion > 0){
              opt$minimum_samples <- opt$minimum_samples * opt$minimum_proportion
          }
      }
  }else if (opt$minimum_proportion > 0){

      # Or if we want to define it based on a static proportion of the sample number

      opt$minimum_samples <- ncol(abundance_matrix) * opt$minimum_proportion
  }

  # Generate a boolean vector specifying the features to retain

  keep <- apply(abundance_matrix, 1, function(x){
      sum(x > opt$minimum_abundance) >= opt$minimum_samples
  })

  # Write out the matrix retaining the specified rows and re-prepending the
  # column with the feature identifiers

  prefix = ifelse('study' == 'null', '', 'study')

  write.table(
      data.frame(rownames(abundance_matrix)[keep], abundance_matrix[keep,,drop = FALSE]),
      file = paste0(
          prefix,
          '.filtered.tsv'
      ),
      col.names = c(feature_id_name, colnames(abundance_matrix)),
      row.names = FALSE,
      sep = '   ',
      quote = FALSE
  )

  ################################################
  ################################################
  ## R SESSION INFO                             ##
  ################################################
  ################################################

  sink("R_sessionInfo.log")
  print(sessionInfo())
  sink()

  ################################################
  ################################################
  ## VERSIONS FILE                              ##
  ################################################
  ################################################

  r.version <- strsplit(version[['version.string']], ' ')[[1]][3]

  writeLines(
      c(
          '"NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:CUSTOM_MATRIXFILTER":',
          paste('    r-base:', r.version)
      ),
  'versions.yml')

  ################################################
  ################################################
  ################################################
  ################################################

Command exit status:
  1

Command output:
  (empty)

Command error:
  Error: ES1_48_Control-lipid_0, ES2_48_Control-lipid_14, ES3_48_Control-lipid_21, ES4_164_Control-protein_0, ES5_164_Control-protein_14, ES6_164_Control-protein_21, ES7_SEM19-1354_Control-protein_14, ES8_SEM19-1354_Control-protein_21, ES9_172_Control-protein_0, ES10_172_Control-protein_14, ES11_172_Control-protein_21, ES12_209_MG-lipid_0, ES13_209_MG-lipid_14, ES14_209_MG-lipid_21, ES15_601_Control-protein_14, ES16_601_Control-protein_21, ES17_615_Control-protein_0, ES18_615_Control-protein_14, ES19_615_Control-protein_21, ES20_618_Control-lipid_0, ES21_618_Control-lipid_14, ES22_618_Control-lipid_21, ES23_Blue-010_Control-lipid_0, ES24_Blue-010_Control-lipid_21, ES25_619_MG-lipid_0, ES26_619_MG-lipid_14, ES27_619_MG-lipid_21, ES103_625_Control-lipid_0, ES104_625_Control-lipid_14, ES105_625_Control-lipid_21, ES31_Blue-055_MG-protein_0, ES32_Blue-055_MG-protein_14, ES33_101-LS_MG-lipid_0, ES34_101-LS_MG-lipid_14, ES35_101-LS_MG-lipid_21, ES36_AR17-21_MG-lipid_0, ES37_AR17-21_MG-lipid_14,
  Execution halted

Work dir:
  diffabundance/work/ed/3225f0361d750970c0c0439ac3407c

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

Relevant files

command_files.tgz

System information

Nextflow version 22.10.1
Hardware HPC
Executor slurm
Container engine: Singularity
OS Linux
Version of nf-core/differentialabundance: dev
pinin4fjords commented 1 year ago

Thanks for reporting, will probably just need check.names = FALSE somewhere, I'll fix soon.

pinin4fjords commented 1 year ago

I believe https://github.com/nf-core/differentialabundance/pull/60 will fix this for the specific module. It may also come up in later modules though, so more testing will be required.

pinin4fjords commented 1 year ago

I believe this should be addressed by recent PRs. Please reopen if you encounter similar issues with the current dev version.