pmelsted / pizzly

Fast fusion detection using kallisto
BSD 2-Clause "Simplified" License

Error, could not find any transcript sequences #16

Open ndaniel opened 7 years ago

ndaniel commented 7 years ago

Hello,

when using Pizzly 0.37.3 (SeqAn 2.2.0) and Kallisto 0.43.1 with Ensembl 81, one gets this error message from Pizzly:

Error, could not find any transcript sequences check that the ids in the FASTA file and GTF file match

when running this:

wget ftp://ftp.ensembl.org/pub/release-81/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

wget ftp://ftp.ensembl.org/pub/release-81/gtf/homo_sapiens/Homo_sapiens.GRCh38.81.gtf.gz

wget https://sourceforge.net/projects/fusioncatcher/files/test/reads_1.fq.gz

wget https://sourceforge.net/projects/fusioncatcher/files/test/reads_2.fq.gz

kallisto index -k 31 -i transcriptome.idx  Homo_sapiens.GRCh38.cdna.all.fa.gz

kallisto quant -i transcriptome.idx  -o output_kallisto --fusion  reads_1.fq.gz reads_2.fq.gz

pizzly -k 31 --gtf Homo_sapiens.GRCh38.81.gtf.gz --cache cache.txt  --align-score 2 --insert-size 400 --fasta Homo_sapiens.GRCh38.cdna.all.fa.gz --output output_pizzly output_kallisto/fusion.txt

which worked just fine with: pizzly version: 0.37.1, SeqAn version: 2.2.0, kallisto 0.43.1 (as shown here: https://github.com/pmelsted/pizzly/issues/7 )

johanneskoester commented 5 years ago

We have the same problem here. @pmelsted any plans on fixing this?

RaqManzano commented 4 years ago

Three years after this issue was opened, I have run into the same problem. Are there any plans for pizzly, @pmelsted, or is this project finished?

mkabza commented 3 years ago

I've encountered this problem today and it seems to stem from the fact that Ensembl GTF files now have gene_version and transcript_version tags that are added to feature identifiers by Pizzly. They can be easily removed with sed:

zcat ensembl.gtf.gz | sed -r 's/(gene|transcript)_version "([0-9]+)";//g' > ensembl.gtf

The uncompressed GTF file can now be used with Pizzly.

sbamin commented 3 years ago

@pmelsted I tried @mkabza's workaround, but pizzly still fails with the Ensembl GTF. I even tried older GTF files from releases 87 and 95. I believe the issue board has been inactive for several months now. If pizzly is still under active development, please consider addressing this issue. Thanks!

jowkar commented 2 years ago

It seems that the transcriptome GTF and FASTA files both need to contain transcript version numbers. If only the GTF includes these, but not the FASTA, then the fusion.txt file generated by kallisto contains transcript names without version numbers. However, pizzly (0.37.3) creates the cache by essentially reformatting the GTF, and in doing so appends transcript versions to the end of the transcript names, which causes a mismatch with the kallisto output.
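To check whether a given GTF/FASTA pair is affected, one could compare the IDs each file would contribute. The following is only a rough sketch: `fasta_ids`, `gtf_ids`, and `count_shared_ids` are hypothetical helper names (not part of pizzly or kallisto), and it assumes that pizzly joins the transcript ID and `transcript_version` with a dot, as described above.

```shell
#!/usr/bin/env bash
# Hypothetical diagnostic helpers; not part of pizzly or kallisto.

# Print the transcript ID of every FASTA header read on stdin
# (the text between ">" and the first whitespace).
fasta_ids() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' | sort -u
}

# Print the transcript ID of every GTF "transcript" line read on stdin.
# Assumption: when a transcript_version attribute is present it is
# appended as ".<version>", mimicking what pizzly appears to do.
gtf_ids() {
    awk -F '\t' '$3 == "transcript" {
        id = ""; ver = ""
        if (match($9, /transcript_id "[^"]+"/))
            id = substr($9, RSTART + 15, RLENGTH - 16)
        if (match($9, /transcript_version "[^"]+"/))
            ver = substr($9, RSTART + 20, RLENGTH - 21)
        if (id == "") next
        if (ver != "") id = id "." ver
        print id
    }' | sort -u
}

# Count the IDs present in both an uncompressed FASTA ($1) and GTF ($2).
count_shared_ids() {
    tmp_fa=$(mktemp)
    tmp_gtf=$(mktemp)
    fasta_ids < "$1" > "$tmp_fa"
    gtf_ids < "$2" > "$tmp_gtf"
    comm -12 "$tmp_fa" "$tmp_gtf" | wc -l | tr -d ' '
    rm -f "$tmp_fa" "$tmp_gtf"
}
```

For example, `count_shared_ids transcripts.fa annotation.gtf` (both uncompressed) printing 0 would be consistent with the mismatch described here, while a count close to the number of transcripts would point to a different cause.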

One possible workaround is to take the cache generated by a (failed) run of pizzly and remove transcript versions from it to create a new cache file compatible with the kallisto input. For instance, first run pizzly:

#!/usr/bin/env bash
pizzly \
    -k 31 \
    --gtf "/data/bin/bcbio/genomes/Hsapiens/hg38/rnaseq/ref-transcripts.gtf" \
    --cache index.cache.txt \
    --insert-size 400 \
    --fasta "/data/bin/bcbio/genomes/Hsapiens/hg38/rnaseq/ref-transcripts.fa" \
    --output "test" \
    "fusion.txt"

Then reformat the cache file (R code):

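#!/usr/bin/env Rscript
# Read the cache written by the failed pizzly run (tab-separated, no header).
x <- read.table("index.cache.txt", sep = "\t", stringsAsFactors = FALSE)

# Strip version suffixes such as ".1" or ".12" from the ID columns.
x$V2 <- gsub("\\.[0-9]+", "", x$V2)
x$V9[x$V1 == "GENE"] <- gsub("\\.[0-9]+", "", x$V9[x$V1 == "GENE"])
x$V3[x$V1 == "TRANSCRIPT"] <- gsub("\\.[0-9]+", "", x$V3[x$V1 == "TRANSCRIPT"])

# Write the cleaned cache without quotes, row names, or a header line.
write.table(x, file = "index.cache.fixed.txt", quote = FALSE, sep = "\t", row.names = FALSE, col.names = FALSE)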

Then run pizzly with this new cache instead:

#!/usr/bin/env bash
pizzly \
    -k 31 \
    --gtf "/data/bin/bcbio/genomes/Hsapiens/hg38/rnaseq/ref-transcripts.gtf" \
    --cache index.cache.fixed.txt \
    --insert-size 400 \
    --fasta "/data/bin/bcbio/genomes/Hsapiens/hg38/rnaseq/ref-transcripts.fa" \
    --output "final" \
    "fusion.txt"

A more elegant solution, though, is probably to rename the transcripts in the input FASTA file (so that they include version numbers) and rerun kallisto.
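Renaming the FASTA entries could be sketched roughly as follows. This is an untested illustration: `add_versions_to_fasta` is a hypothetical helper, it expects uncompressed GTF and FASTA files, and it assumes the versioned name pizzly expects is `<transcript_id>.<transcript_version>`. Headers whose ID is not found in the GTF, or that already carry a version, are left unchanged.

```shell
#!/usr/bin/env bash
# Hypothetical helper; not part of pizzly or kallisto.
# usage: add_versions_to_fasta annotation.gtf transcripts.fa > versioned.fa
add_versions_to_fasta() {
    awk -F '\t' '
        # First file (GTF): remember the version of every transcript.
        NR == FNR {
            if ($3 == "transcript" && match($9, /transcript_id "[^"]+"/)) {
                id = substr($9, RSTART + 15, RLENGTH - 16)
                if (match($9, /transcript_version "[^"]+"/))
                    version[id] = substr($9, RSTART + 20, RLENGTH - 21)
            }
            next
        }
        # Second file (FASTA): append the version to unversioned header IDs
        # and keep the rest of the header line as it is.
        /^>/ {
            split($0, parts, " ")
            id = substr(parts[1], 2)
            if (id in version)
                $0 = ">" id "." version[id] substr($0, length(parts[1]) + 1)
        }
        { print }
    ' "$1" "$2"
}
```

The kallisto index would then have to be rebuilt from the rewritten FASTA, and `kallisto quant --fusion` rerun, so that fusion.txt carries the versioned names before pizzly is called again.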