nf-core / rnaseq

RNA sequencing analysis pipeline using STAR, RSEM, HISAT2 or Salmon with gene/isoform counts and extensive quality control.
https://nf-co.re/rnaseq
MIT License

genome requirement with --pseudo_aligner salmon and --skip_alignment #688

Open didillysquat opened 3 years ago

didillysquat commented 3 years ago

Check Documentation

I have checked the following places for your error: I have checked both of these and looked through the introduction to see which steps might require the genome.

Description of the bug

When I run the pipeline with --pseudo_aligner salmon --skip_alignment and provide a valid --transcript_fasta and --salmon_index, but neither --fasta nor --genome, the pipeline refuses to run and asks for a genome file: "Genome fasta file not specified with e.g. '--fasta genome.fa' or via a detectable config file."

Steps to reproduce

Steps to reproduce the behaviour:

  1. Command line: nextflow run nf-core/rnaseq --input woltering_samplesheet.csv --pseudo_aligner salmon --skip_alignment --transcript_fasta ../athal_transcriptome/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz --salmon_index ../athal_transcriptome/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz.index -profile docker
  2. See error: Genome fasta file not specified with e.g. '--fasta genome.fa' or via a detectable config file.

Expected behaviour

I would expect this specific route of the pipeline to run without access to a genome: when I run Salmon quantification directly on the command line, I only need to provide the transcript fasta and the index. I've asked to skip alignments (which would otherwise require the genome), so which other step in the pipeline needs the genome?

I would hope that the pipeline could run without access to the genome.

Log files

nextflow.log

Nextflow Installation

version 21.04.1

didillysquat commented 3 years ago

I see now that it is required for the DESeq2 QC that is performed downstream of the salmon pseudo quantification.

drpatelh commented 3 years ago

Hi @didillysquat! Apologies for the late response. I am on holiday at the moment.

It's actually required to build the decoy sequences for the Salmon index. If you have a genome fasta available I believe it's advisable to build the index with both the genome fasta and transcriptome fasta. I discussed this with @rob-p whilst adding Salmon support here.
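
For readers unfamiliar with the decoy-aware index: as described in the Salmon documentation, it is built from a "gentrome" (the transcript sequences followed by the genome sequences) plus a list of the genome sequence names to treat as decoys. The Python sketch below shows how those two inputs could be prepared. The function names and toy records are illustrative assumptions, not code taken from the pipeline.

```python
from typing import Iterable, List


def decoy_names(genome_fasta_lines: Iterable[str]) -> List[str]:
    """Collect the sequence name (first token of each header) from a genome FASTA."""
    names = []
    for line in genome_fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:  # ignore malformed, empty headers
                names.append(tokens[0])
    return names


def gentrome_lines(transcript_lines: Iterable[str],
                   genome_lines: Iterable[str]) -> List[str]:
    """Concatenate transcripts and genome; the decoy (genome) records go last."""
    merged = [line.rstrip("\n") for line in transcript_lines]
    merged += [line.rstrip("\n") for line in genome_lines]
    return merged


# Toy records (made up for illustration only).
genome = [">1 dna:chromosome", "ACGTACGT", ">Mt dna:chromosome", "GGCC"]
transcripts = [">AT1G01010.1 cdna", "ATGG"]

decoys = decoy_names(genome)
gentrome = gentrome_lines(transcripts, genome)

# Written to disk, these would be the inputs to a command along the lines of:
#   salmon index -t gentrome.fa -d decoys.txt -i salmon_index
```

Without a genome fasta there are no decoy names to collect, which is why this step cannot run on a transcriptome alone.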

Maybe we should also add support for cases where the genome fasta isn't available, though, as this issue highlights that particular edge case.

didillysquat commented 3 years ago

Hi @drpatelh,

There is no hurry on this at all so please don't disrupt your holidays on my behalf.

For my particular case I'm using your wonderful pipeline as a quick but clean way to get a set of salmon pseudo quantification files from RNA-seq reads that I can then import into DESeq2.

I'm sure you're far more knowledgeable about this than I am, but I was simply following the guidance of the salmon tutorial, which worked with only an indexed transcriptome fasta (i.e. no genome). For this particular use case, it could perhaps be useful for the pipeline to detect that neither --genome nor --fasta has been provided and limit the output accordingly (i.e. no DESeq QC), with a warning saying that it is doing so (e.g. "no genome provided so skipping XXX").
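
The behaviour suggested above could look roughly like the following Python sketch. It is a hypothetical illustration of the decision logic only: the pipeline itself is written in Nextflow, the function name `resolve_run_mode` is invented here, and the list of genome-dependent steps is left as the placeholder "XXX" because the thread does not enumerate them.

```python
import warnings
from typing import Dict, List, Sequence

# Error text quoted from the issue report.
GENOME_ERROR = ("Genome fasta file not specified with e.g. '--fasta genome.fa' "
                "or via a detectable config file.")


def resolve_run_mode(params: Dict[str, object],
                     genome_dependent_steps: Sequence[str]) -> Dict[str, object]:
    """Decide whether to run fully, run transcriptome-only, or fail (hypothetical)."""
    if params.get("fasta") or params.get("genome"):
        return {"mode": "full", "skipped": []}

    # Without a genome, only the Salmon-only route with a prebuilt index can proceed.
    transcriptome_only = bool(
        params.get("pseudo_aligner") == "salmon"
        and params.get("skip_alignment")
        and params.get("transcript_fasta")
        and params.get("salmon_index")
    )
    if not transcriptome_only:
        raise ValueError(GENOME_ERROR)

    skipped: List[str] = list(genome_dependent_steps)
    warnings.warn("no genome provided so skipping " + ", ".join(skipped))
    return {"mode": "transcriptome_only", "skipped": skipped}


# Parameters mirroring the command line in this issue.
issue_params = {
    "pseudo_aligner": "salmon",
    "skip_alignment": True,
    "transcript_fasta": "../athal_transcriptome/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz",
    "salmon_index": "../athal_transcriptome/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz.index",
}
result = resolve_run_mode(issue_params, ["XXX"])  # "XXX" is a placeholder step name
```

The point of the design is that the genome check only becomes a hard error when some requested step actually needs the genome; otherwise it degrades to a warning.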

Having said that, one extremely useful output from your pipeline (after running it with the --genome information provided) is the tx2gene.txt file (called 'salmon_tx2gene.txt' in your pipeline), which maps transcript IDs to genes and allows the salmon counts to be imported into DESeq2 using tximport. If appropriate, it could be useful to provide this in the main salmon output directory.
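
As an aside for anyone who only has a transcript fasta: a transcript-to-gene table of this kind can often be derived from the fasta headers themselves. The Python sketch below assumes Ensembl-style cDNA headers that carry a `gene:<ID>` tag; that header layout and the function name are assumptions made for illustration, and this is not necessarily how the pipeline generates its own file.

```python
import re
from typing import Iterable, List, Tuple

# Matches the "gene:<ID>" tag assumed to be present in Ensembl-style cDNA headers.
GENE_TAG = re.compile(r"\bgene:(\S+)")


def tx2gene_from_headers(fasta_lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (transcript_id, gene_id) pairs parsed from FASTA header lines."""
    pairs = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, nothing to parse
        tokens = line[1:].split()
        match = GENE_TAG.search(line)
        if tokens and match:
            # First header token is the transcript ID.
            pairs.append((tokens[0], match.group(1)))
    return pairs


# Toy header in the assumed layout (sequence shortened for illustration).
example = [
    ">AT1G01010.1 cdna chromosome:TAIR10:1:3631:5899:1 gene:AT1G01010 gene_biotype:protein_coding",
    "ATGGAGGATC",
]
mapping = tx2gene_from_headers(example)

# One tab-separated line per transcript, ready to be read as a two-column table.
table = "\n".join(f"{tx}\t{gene}" for tx, gene in mapping)
```

The transcript IDs in such a table need to match the names in Salmon's quant output exactly for tximport to join them.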

Thanks for your continued efforts!

drpatelh commented 3 years ago

Hi @didillysquat! I was going to have a go at adding this feature for the 3.4 release, but it will take quite a bit of refactoring, so maybe we can add it in 3.5.

I have, however, added the functionality for the pipeline to publish the salmon_tx2gene.txt files in the salmon counts directory here.

didillysquat commented 3 years ago

@drpatelh Super! Many thanks for that.