Open ndaniel opened 7 years ago
We have the same problem here. @pmelsted any plans on fixing this?
Three years after this issue was opened, I have encountered the same problem. Are there any plans for pizzly, @pmelsted, or is this project finished?
I've encountered this problem today and it seems to stem from the fact that Ensembl GTF files now have gene_version and transcript_version tags that are added to feature identifiers by Pizzly. They can be easily removed with sed:
zcat ensembl.gtf.gz | sed -r 's/(gene|transcript)_version "([0-9]+)";//g' > ensembl.gtf
The resulting uncompressed GTF file can now be used with Pizzly.
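To see what this kind of substitution does to a single record, here is a minimal sketch. It is a slight variant of the command above (it also consumes the space in front of each removed attribute), and both the function name `strip_gtf_versions` and the sample attribute string are made up for illustration.

```shell
#!/usr/bin/env bash
# Remove gene_version / transcript_version attributes from GTF lines on stdin.
# The optional leading space is consumed too, so no double spaces are left behind.
strip_gtf_versions() {
  sed -E 's/ ?(gene|transcript)_version "[0-9]+";//g'
}

# Hypothetical attribute column, for illustration only.
sample='gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2";'
printf '%s\n' "$sample" | strip_gtf_versions
```

On a real annotation this would be used as `zcat ensembl.gtf.gz | strip_gtf_versions > ensembl.gtf`.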
@pmelsted I tried @mkabza's workaround but am still failing to get pizzly working with an Ensembl GTF. I even tried older GTFs from versions 87 and 95. The issue board seems to have been inactive for several months now. If pizzly is still under active development, please consider addressing this issue. Thanks!
It seems that the transcriptome GTF and FASTA files both need to contain transcript version numbers. If only the GTF includes these, but not the FASTA, then the fusion.txt file generated by kallisto contains transcript names without version numbers. However, pizzly (0.37.3) creates the cache by essentially reformatting the GTF, and in the process of doing so appends transcript versions to the end of the transcript names, causing a mismatch to the kallisto output.
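A quick way to check which side of this mismatch you are on is to count how many FASTA headers carry a version suffix. The sketch below assumes the transcript ID is the first whitespace-delimited token of each header and that versions look like `.<digits>` at the end of that token; the function name is made up.

```shell
#!/usr/bin/env bash
# Report how many FASTA headers (read from stdin) have an ID ending in ".<digits>".
count_versioned_headers() {
  awk '/^>/ { n++; id = substr($1, 2); if (id ~ /\.[0-9]+$/) v++ }
       END  { printf "%d of %d headers carry a version suffix\n", v, n }'
}

# Usage (path is a placeholder):
#   count_versioned_headers < ref-transcripts.fa
# For the GTF side, something like
#   grep -c 'transcript_version' ref-transcripts.gtf
# shows whether the annotation has version attributes at all.
```

If the FASTA reports zero versioned headers while the GTF has `transcript_version` attributes, you are most likely in the situation described above.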
One possible workaround is to take the cache generated by a (failed) run of pizzly and remove transcript versions from it to create a new cache file compatible with the kallisto input. For instance, first run pizzly:
#!/usr/bin/env bash
pizzly \
-k 31 \
--gtf "/data/bin/bcbio/genomes/Hsapiens/hg38/rnaseq/ref-transcripts.gtf" \
--cache index.cache.txt \
--insert-size 400 \
--fasta "/data/bin/bcbio/genomes/Hsapiens/hg38/rnaseq/ref-transcripts.fa" \
--output "test" \
"fusion.txt"
Then reformat the cache file (R code):
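#!/usr/bin/env Rscript
# Read the cache written by the failed pizzly run; keep all fields as plain strings.
x <- read.table("index.cache.txt", sep = "\t", stringsAsFactors = FALSE)
# Strip version suffixes such as ".2" or ".12" (the "+" handles multi-digit versions).
x$V2 <- gsub("\\.[0-9]+", "", x$V2)
x$V9[x$V1 == "GENE"] <- gsub("\\.[0-9]+", "", x$V9[x$V1 == "GENE"])
x$V3[x$V1 == "TRANSCRIPT"] <- gsub("\\.[0-9]+", "", x$V3[x$V1 == "TRANSCRIPT"])
# Write the cleaned cache without quotes, row names, or a header line.
write.table(x = x, file = "index.cache.fixed.txt", quote = FALSE, sep = "\t", row.names = FALSE, col.names = FALSE)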
Then run pizzly with this new cache instead:
#!/usr/bin/env bash
pizzly \
-k 31 \
--gtf "/data/bin/bcbio/genomes/Hsapiens/hg38/rnaseq/ref-transcripts.gtf" \
--cache index.cache.fixed.txt \
--insert-size 400 \
--fasta "/data/bin/bcbio/genomes/Hsapiens/hg38/rnaseq/ref-transcripts.fa" \
--output "final" \
"fusion.txt"
A more elegant solution, however, is probably to rename the transcripts in the input FASTA file (so that they include version numbers) and rerun kallisto.
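I have not tested this against a real pizzly run, but the renaming could be sketched roughly as below. It assumes that the GTF has `transcript` feature lines with `transcript_id` and `transcript_version` attributes, and that each FASTA header starts with the unversioned transcript ID. The function name and the paths in the usage comment are placeholders.

```shell
#!/usr/bin/env bash
# Append ".<version>" to FASTA transcript IDs, using versions found in a GTF.
# $1 = GTF file, $2 = FASTA file; the rewritten FASTA goes to stdout.
add_versions_to_fasta() {
  awk -F'\t' '
    # First file (GTF): remember transcript_id -> transcript_version.
    NR == FNR {
      if ($3 == "transcript" && match($9, /transcript_id "[^"]+"/)) {
        id = substr($9, RSTART + 15, RLENGTH - 16)
        if (match($9, /transcript_version "[0-9]+"/))
          ver[id] = substr($9, RSTART + 20, RLENGTH - 21)
      }
      next
    }
    # Second file (FASTA): rewrite headers whose ID has a known version.
    /^>/ {
      header = substr($0, 2)
      split(header, parts, /[ \t]/)
      id = parts[1]
      if (id in ver)
        print ">" id "." ver[id] substr(header, length(id) + 1)
      else
        print
      next
    }
    { print }   # sequence lines pass through unchanged
  ' "$1" "$2"
}

# Usage (placeholder paths):
#   add_versions_to_fasta ref-transcripts.gtf ref-transcripts.fa > ref-transcripts.versioned.fa
```

Headers with no matching GTF entry are left as they are. The kallisto index and the fusion.txt file would then need to be regenerated from the renamed FASTA before running pizzly again.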
Hello,
when using Pizzly 0.37.3 (SeqAn 2.2.0) and Kallisto 0.43.1 with Ensembl 81, one gets this error message from Pizzly:
when running this:
which worked just fine with: pizzly version: 0.37.1, SeqAn version: 2.2.0, kallisto 0.43.1 (as shown here: https://github.com/pmelsted/pizzly/issues/7 )