Issue in NFCORE_RNASEQ:RNASEQ:QUANTIFY_PSEUDO_ALIGNMENT:TX2GENE (genome.filtered.gtf)

alexmascension commented 9 months ago

Description of the bug

I'm running nf-core/rnaseq with the following command:

nextflow run nf-core/rnaseq                                             
-r 3.14.0                                            
-profile docker                                                                                        
--max_cpus 9                                             
--max_memory 24.GB                                            
--max_time 500.h 
--pseudo_aligner salmon 
--outdir results/test_mouse/RNASEQ_PSEUDOALIGNER 
--genome GRCm38 
--input work/test_mouse/samplesheets/RNASEQ_PSEUDOALIGNER.csv 
--skip_alignment 
--fasta database/genomes/GRCm38/genome.fasta 
--gtf database/genomes/GRCm38/genes.gtf 
--gene_bed database/genomes/GRCm38/genes.bed 
--salmon_index database/indexes/GRCm38/SALMON 
--star_index database/indexes/GRCm38/STAR

When I run it I get the following error:

ERROR ~ Error executing process > 'NFCORE_RNASEQ:RNASEQ:QUANTIFY_PSEUDO_ALIGNMENT:TX2GENE (genome.filtered.gtf)'

Caused by:
  Missing output file(s) `*.tsv` expected by process `NFCORE_RNASEQ:RNASEQ:QUANTIFY_PSEUDO_ALIGNMENT:TX2GENE (genome.filtered.gtf)`

Command executed:

  tx2gene.py \
      --quant_type salmon \
      --gtf genome.filtered.gtf \
      --quants quants \
      --id gene_id \
      --extra gene_name \
      -o tx2gene.tsv

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_RNASEQ:RNASEQ:QUANTIFY_PSEUDO_ALIGNMENT:TX2GENE":
      python: $(python --version | sed 's/Python //g')
  END_VERSIONS

Command exit status:
  0

Command output:
  (empty)

Command error:
  __main__ - 2024-01-30 16:37:17,060 WARNING: No attribute in GTF matching transcripts
  __main__ - 2024-01-30 16:37:17,060 ERROR: Failed to map transcripts to genes.

Work dir:
  /data/Proyectos/NGS_pipeline/work/e2/6ad3cf2cec77f4aeb2434722052450

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

 -- Check '.nextflow.log' file for details

At first I thought that something might be wrong with my gtf of fasta files. However, when I run the "same" command using STAR + salmon I don't get any error:

nextflow run nf-core/rnaseq                                             
-r 3.14.0                                             
-profile docker                                                                                         
--max_cpus 9                                             
--max_memory 96.GB                                            
--max_time 500.h --aligner star_salmon 
--save_unaligned 
--save_untrimmed 
--min_mapped_reads 2 
 --input work/test_mouse/samplesheets/RNASEQ.csv 
--outdir results/test_mouse/RNASEQ 
--genome GRCm38 
--star_index database/indexes/GRCm38/STAR 
--salmon_index database/indexes/GRCm38/SALMON 
--gtf database/genomes/GRCm38/genes.gtf 
--gene_bed database/genomes/GRCm38/genes.bed 
--fasta database/genomes/GRCm38/genome.fasta

So I don't really know where it fails.

Command used and terminal output

No response

Relevant files

nexflow.log

System information

Nextflow version: 23.10.0 Hardware: Desktop Executor: local Container engine: docker OS: Linux Version of nf-core/rnaseq: 3.14.0

alexmascension commented 9 months ago

BTW, just in case, genome gtf and fasta files are downloaded according to nf-core/rnaseq guidelines, and STAR/salmon/kalisto indexes are built as in the code from the respective .nf files.

alexmascension commented 9 months ago

Hi! I realised that the problem might be related to the nomenclature of the transcript fasta and the gtf file.

To build kallisto and salmon indexes I used the fasta file from https://ftp.ensembl.org/pub/release-{ENSEMBL_VERSION}/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

In this fasta the transcript id and version are shown together:

>ENST00000631435.1 cdna scaffold:GRCh38:HSCHR7_2_CTG6:809186:809197:1 gene:ENSG00000282253.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]

However, the transcript_id and transcript_version are separated in the gtf:

1   havana  exon    182696  182746  .   +   .   gene_id "ENSG00000279928"; gene_version "2"; transcript_id "ENST00000624431"; transcript_version "2"; exon_number "1"; gene_name "DDX11L17"; gene_source "havana"; gene_biotype "unprocessed_pseudogene"; transcript_name "DDX11L17-201"; transcript_source "havana"; transcript_biotype "unprocessed_pseudogene"; exon_id "ENSE00003759020"; exon_version "2"; tag "basic"; tag "Ensembl_canonical"; transcript_support_level "NA";

Therefore, when running kallisto or salmon, the abundace file in quants folder has the quantification of the transcripts as per the fasta file; so when loading the transcripts by tx2gene.py, the following section of the discover_transcript_ºattribute() fails:

    with open(gtf_file) as inh:
        # Read GTF file, skipping header lines
        for line in filter(lambda x: not x.startswith("#"), inh):
            cols = line.split("\t")

            # Use regular expression to correctly split the attributes string
            attributes_str = cols[8]
            attributes = dict(re.findall(r'(\S+) "(.*?)(?<!\\)";', attributes_str))

            votes.update(key for key, value in attributes.items() if value in transcripts)

Because no value of the gtf follows the structure "ENSTXXXXXXX.Y".

So far, I've patched this problem by creating a new attribute in the gtf file that combines the transcript id and version.

If this problem can be replicated, I think it would be a good idea to make a more lenient discover_transcript_ºattribute() function that allows for transcript ids with or without version.

pinin4fjords commented 5 months ago

Thank you for the report, and sorry for the delayed response.

I'm not 100% sure that a change in code is required here, with the additional complexity that would entail. I've had difficulties handing this before, because some communities use .1 etc in the actual identifiers, not as version suffixes. I think it's reasonable to expect that the transcript ID matches between the FASTA and the GTF.

You could strip the versions from the transcripts before generating your indices manually. But perhaps the easier thing to do would be to allow the pipeline to build the indices for you, and save them using --save_reference. You don't need to supply a transcriptome FASTA, one will be made for you dynamically based on the GTF, and this will ensure that the identifier match.

I'm going to close this for now, we can reopen if further discussion is necessary (or if @drpatelh disagrees with my assessment).

nf-core / rnaseq