pachterlab / kallisto

Near-optimal RNA-Seq quantification
https://pachterlab.github.io/kallisto
BSD 2-Clause "Simplified" License

Does -gtf support gff3? #332

Open outpaddling opened 2 years ago

outpaddling commented 2 years ago

I've been unable to find any documentation or threads about whether GTF and GFF3 files are interchangeable with the --gtf flag. I would guess they are if kallisto is using an external library to read them.

If that's the case, this should probably be documented. The manual currently states that you need a GTF file and doesn't say whether or not a GFF3 will work in its place.

outpaddling commented 2 years ago

I ran an alignment using

--gtf=Data/3-reference/Mus_musculus.GRCm$build.$release.chr.gff3

and it seemed to work fine. If there are no issues with doing this that I'm not aware of, perhaps someone could change "GTF" to "GTF or GFF3" in the documentation.

outpaddling commented 2 years ago

Actually, it turned out I had a corrupt chromosome lengths TSV file, which led to invalid BAM headers. Somehow, with the corrupted TSV file, kallisto ran to completion with --gtf pointing to a GFF3, though the resulting pseudoalignments.bam files were unusable. Now that I've fixed the TSV, kallisto aborts when it tries to read a GFF3. Apparently two wrongs almost make a right.

So at this point I would recommend explicitly stating in the docs that GFF3 files are not (yet) supported. It would be nice to be able to do everything with the GFF3, since it's the newer format. Right now I need the GTF for kallisto quant, and I get the chromosome sizes from the GFF3, since chromosomes are not listed as features in the GTF.

pmelsted commented 2 years ago

While GTF and GFF3 are mostly the same, they differ in how the free-format attributes field is formatted and used. kallisto currently supports only GTF files. If you are using a standard annotation such as Ensembl, you can download the GTF rather than the GFF3. Custom annotations can be converted to GTF; see https://www.biostars.org/p/45791/#90168
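To illustrate the attribute-field difference, here is a minimal Python sketch. It is not kallisto's parser, and the function names and sample attribute strings are made up for the example. It assumes the common conventions: GTF writes `key "value";` pairs, while GFF3 writes `key=value` pairs separated by semicolons.

```python
import re

def parse_gtf_attributes(field):
    """Parse a GTF attribute column: key "value"; key "value"; ...

    Simplification: only quoted values are recognised, and a repeated
    key keeps its last value.
    """
    return dict(re.findall(r'(\S+)\s+"([^"]*)"\s*;', field))

def parse_gff3_attributes(field):
    """Parse a GFF3 attribute column: key=value;key=value;..."""
    result = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            result[key.strip()] = value.strip()
    return result

# Hypothetical attribute strings in each style.
gtf_field = 'gene_id "G1"; transcript_id "T1";'
gff3_field = "ID=transcript:T1;Parent=gene:G1"

print(parse_gtf_attributes(gtf_field))
print(parse_gff3_attributes(gff3_field))
# A parser written for one format finds nothing usable in the other.
print(parse_gtf_attributes(gff3_field))
```

A reader that looks for `transcript_id "..."` in the last column therefore gets no transcript names from a GFF3 line, even though the first eight columns have the same layout in both formats.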

Can you show or send the misformatted chromosome length text file so I can fix the issue?

outpaddling commented 2 years ago

Thanks for the update.

The old script that generated the chromosomes TSV was a temporary hack based on a GTF header from 2019. That header has since changed, so the script no longer works.

Coincidentally, just today I added a permanent solution to biolibc-tools that generates the TSV file directly from the reference genome, so the GFF3 is no longer needed:

https://github.com/auerlab/biolibc-tools/blob/main/chrom-lens.c

Using the reference fasta instead of a hopefully compatible GFF eliminates a source of errors and simplifies the pipeline.
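For anyone who wants the same idea without building the C tool, here is a rough Python sketch of deriving chromosome lengths from a FASTA. It is not a port of chrom-lens.c. The function names are hypothetical, and it assumes the sequence name is the first whitespace-delimited word of each header line and that the TSV is one `name<TAB>length` row per sequence.

```python
import io

def chrom_lengths(fasta_lines):
    """Return {sequence_name: length} from an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: keep only the first word as the sequence name.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            # Sequence line: add its residues to the current sequence.
            lengths[name] += len(line)
    return lengths

def write_tsv(lengths, out):
    """Write one name<TAB>length row per sequence, in input order."""
    for name, length in lengths.items():
        out.write(f"{name}\t{length}\n")

# Tiny made-up FASTA for demonstration.
example = [">1 dna:chromosome", "ACGT", "AC", ">MT", "GG"]
buffer = io.StringIO()
write_tsv(chrom_lengths(example), buffer)
print(buffer.getvalue(), end="")
```

For a real genome, pass an open file handle instead of the list, for example `chrom_lengths(open(path))`, where `path` is your reference FASTA.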