Closed alexmascension closed 5 months ago
BTW, just in case, genome gtf and fasta files are downloaded according to nf-core/rnaseq guidelines, and STAR/salmon/kallisto indexes are built as in the code from the respective .nf files.
Hi! I realised that the problem might be related to the nomenclature of the transcript fasta and the gtf file.
To build kallisto and salmon indexes I used the fasta file from https://ftp.ensembl.org/pub/release-{ENSEMBL_VERSION}/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
In this fasta the transcript id and version are shown together:
>ENST00000631435.1 cdna scaffold:GRCh38:HSCHR7_2_CTG6:809186:809197:1 gene:ENSG00000282253.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
However, the transcript_id and transcript_version are separated in the gtf:
1 havana exon 182696 182746 . + . gene_id "ENSG00000279928"; gene_version "2"; transcript_id "ENST00000624431"; transcript_version "2"; exon_number "1"; gene_name "DDX11L17"; gene_source "havana"; gene_biotype "unprocessed_pseudogene"; transcript_name "DDX11L17-201"; transcript_source "havana"; transcript_biotype "unprocessed_pseudogene"; exon_id "ENSE00003759020"; exon_version "2"; tag "basic"; tag "Ensembl_canonical"; transcript_support_level "NA";
Therefore, when running kallisto or salmon, the abundance file in the quants folder reports the transcripts under the IDs from the fasta file, so when the transcripts are loaded by tx2gene.py, the following section of discover_transcript_attribute() fails:
with open(gtf_file) as inh:
    # Read GTF file, skipping header lines
    for line in filter(lambda x: not x.startswith("#"), inh):
        cols = line.split("\t")
        # Use regular expression to correctly split the attributes string
        attributes_str = cols[8]
        attributes = dict(re.findall(r'(\S+) "(.*?)(?<!\\)";', attributes_str))
        votes.update(key for key, value in attributes.items() if value in transcripts)
Because no value of the gtf follows the structure "ENSTXXXXXXX.Y".
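The mismatch can be reproduced in isolation. The sketch below is only an illustration: it applies the same regular expression to a shortened copy of the GTF attributes shown above, and the `transcripts` set is an assumed stand-in for the versioned IDs found in the quantification output.

```python
import re
from collections import Counter

# Attribute column from the GTF example above, shortened for readability.
attributes_str = (
    'gene_id "ENSG00000279928"; gene_version "2"; '
    'transcript_id "ENST00000624431"; transcript_version "2"; '
    'gene_name "DDX11L17";'
)

# Assumed quantification IDs, versioned as in the Ensembl cDNA FASTA headers.
transcripts = {"ENST00000624431.2"}

# Same parsing and voting steps as the excerpt from tx2gene.py.
attributes = dict(re.findall(r'(\S+) "(.*?)(?<!\\)";', attributes_str))
votes = Counter()
votes.update(key for key, value in attributes.items() if value in transcripts)

# No attribute value equals "ENST00000624431.2", so no attribute gets a vote.
print(dict(votes))
```

With the unversioned ID `ENST00000624431` in `transcripts` instead, the same loop would vote for `transcript_id`.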
So far, I've patched this problem by creating a new attribute in the gtf file that combines the transcript id and version.
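A workaround of this kind could look roughly like the sketch below. It is not the exact script used for the patch: the function name and the attribute name `transcript_id_version` are hypothetical, and it assumes every transcript-level line carries both `transcript_id` and `transcript_version`, as in the Ensembl GTF.

```python
import re

def add_versioned_transcript_id(line, new_key="transcript_id_version"):
    """Append new_key "<transcript_id>.<transcript_version>" to one GTF line."""
    if line.startswith("#"):
        return line  # header lines are passed through untouched
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 9:
        return line  # not a regular 9-column GTF record
    attrs = dict(re.findall(r'(\S+) "(.*?)(?<!\\)";', cols[8]))
    if "transcript_id" in attrs and "transcript_version" in attrs:
        combined = f'{attrs["transcript_id"]}.{attrs["transcript_version"]}'
        cols[8] = f'{cols[8].rstrip()} {new_key} "{combined}";'
    return "\t".join(cols) + "\n"
```

Applied line by line to the GTF, this gives the voting loop an attribute whose values have the `ENSTXXXXXXX.Y` form used in the FASTA headers.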
If this problem can be replicated, I think it would be a good idea to make discover_transcript_attribute() more lenient, so that it allows for transcript ids with or without a version.
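One possible shape for such a lenient lookup is sketched below. It is not the pipeline's implementation: the function name, its signature, and its return value are hypothetical, and it assumes a version suffix always looks like a trailing `.<digits>`. Exact matches are tried first, so identifiers in which `.1` is part of the real name still resolve as they are.

```python
import re
from collections import Counter

VERSION_SUFFIX = re.compile(r"\.\d+$")

def discover_transcript_attribute_lenient(gtf_lines, transcripts):
    """Return (attribute_name, versions_were_stripped) for the best match."""
    transcripts = set(transcripts)
    unversioned = {VERSION_SUFFIX.sub("", t) for t in transcripts}
    exact, relaxed = Counter(), Counter()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # skip malformed records
        attrs = dict(re.findall(r'(\S+) "(.*?)(?<!\\)";', cols[8]))
        exact.update(k for k, v in attrs.items() if v in transcripts)
        relaxed.update(k for k, v in attrs.items() if v in unversioned)
    if exact:
        return exact.most_common(1)[0][0], False
    if relaxed:
        # Only reached when no attribute matched the IDs as given.
        return relaxed.most_common(1)[0][0], True
    raise ValueError("No GTF attribute matches the transcript IDs")
```

The boolean tells the caller whether the quantification IDs need their suffix removed before being joined to the GTF.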
Thank you for the report, and sorry for the delayed response.
I'm not 100% sure that a change in code is required here, with the additional complexity that would entail. I've had difficulties handling this before, because some communities use .1 etc. in the actual identifiers, not as version suffixes. I think it's reasonable to expect that the transcript ID matches between the FASTA and the GTF.
You could strip the versions from the transcripts before generating your indices manually. But perhaps the easier thing to do would be to allow the pipeline to build the indices for you, and save them using --save_reference. You don't need to supply a transcriptome FASTA; one will be made for you dynamically based on the GTF, and this will ensure that the identifiers match.
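For the first option, stripping versions from the FASTA headers before indexing, a minimal sketch might look like this. The function names and file paths are hypothetical, and it assumes gzip-compressed input in which the version is a trailing `.<digits>` on the first header token, as in the Ensembl cDNA file quoted earlier.

```python
import gzip
import re

# Matches ">ID.version" at the start of a header, up to the first whitespace.
VERSIONED_ID = re.compile(r"^>(\S+?)\.\d+(?=\s|$)")

def strip_header_version(header):
    """Drop a trailing ".<digits>" from the first token of a FASTA header."""
    return VERSIONED_ID.sub(r">\1", header)

def strip_fasta_versions(in_path, out_path):
    """Copy a gzipped FASTA, rewriting only the header lines."""
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        for line in src:
            if line.startswith(">"):
                line = strip_header_version(line)
            dst.write(line)
```

This is only safe when the suffix really is a version. For annotations where `.1` is part of the identifier itself, the headers should be left alone.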
I'm going to close this for now, we can reopen if further discussion is necessary (or if @drpatelh disagrees with my assessment).
Description of the bug
I'm running nf-core/rnaseq with the following command:
When I run it I get the following error:
At first I thought that something might be wrong with my gtf or fasta files. However, when I run the "same" command using STAR + salmon I don't get any error:
So I don't really know where it fails.
Command used and terminal output
No response
Relevant files
nexflow.log
System information
Nextflow version: 23.10.0 Hardware: Desktop Executor: local Container engine: docker OS: Linux Version of nf-core/rnaseq: 3.14.0