Request: documentation of required gtf format for kallisto quant --genomebam option

SolKatzman commented 1 year ago

kallisto v0.48.0

The documentation for the kallisto quant --genomebam option states that a gtf file is required, but doesn't specify anything about the contents of that gtf file. For example, which record types are required; are there any ordering requirements among the records?

I have a gtf file containing only "exon" type records. Each such record specifies the "gene_id" and "transcript_id" For example:

chrI tmp.gp exon 87286 87387 . + . gene_id "SNC1"; transcript_id "SNC1"; exon_number "1"; exon_id "SNC1.1"; chrI tmp.gp exon 87501 87752 . + . gene_id "SNC1"; transcript_id "SNC1"; exon_number "2"; exon_id "SNC1.2";

Now this file can be loaded and correctly displayed (with transcript_id annotation) on the UCSC Genome Browser.

Indeed, to generate a kallisto target index, I loaded that gtf file, then used the Table Browser to extract the transcript sequences into a fasta file containing the sequences for each transcript_id.

But if I use the above gtf file as the --gtf option with --genomebam, the output file (pseudoalignments.bam) has no coordinates. That is, all bam records start like this:

E00503:224:HC333CCX2:2:1203:24160:2979 67 * 0 0

When using --pseudobam, the transcript_ids in the alignment records, as well as in the '@SQ' header lines, match the transcript_ids in the gtf, which shows that the fasta and kallisto index are in agreement with the gtf definitions.

There have been various questions about this in the forum, but using trial and error to figure out what should be in the gtf could be averted with some additional documentation. I still don't know what is missing/incorrect.

Thanks for your attention, /Sol Katzman

SolKatzman commented 1 year ago

Perhaps I can clarify with a bit more detail regarding one (paired-end) read and 2 isoforms of one gene.

Here are the gtf entries for the SNC1 (2 exons) and SNC1_2 (single exon) isoforms of the SNC1 gene:

chrI tmp.gp exon 87286 87387 . + . gene_id "SNC1"; transcript_id "SNC1"; exon_number "1"; exon_id "SNC1.1"; chrI tmp.gp exon 87501 87752 . + . gene_id "SNC1"; transcript_id "SNC1"; exon_number "2"; exon_id "SNC1.2"; chrI tmp.gp exon 87286 87752 . + . gene_id "SNC1"; transcript_id "SNC1_2"; exon_number "1"; exon_id "SNC1_2.1";

And here are the results for a single paired end read in

pseudobam: E00503:224:HC333CCX2:2:2107:17178:38456 339 SNC1_2 1 255 29S91M = 1 -145 E00503:224:HC333CCX2:2:2107:17178:38456 419 SNC1_2 1 255 54S66M = 1 145 E00503:224:HC333CCX2:2:2107:17178:38456 83 SNC1 1 255 29S91M = 1 -145 E00503:224:HC333CCX2:2:2107:17178:38456 163 SNC1 1 255 54S66M = 1 145

genomebam: E00503:224:HC333CCX2:2:2107:17178:38456 67 * 0 0 120M * 0 0 E00503:224:HC333CCX2:2:2107:17178:38456 131 * 0 0 120M * 0 0

SolKatzman commented 1 year ago

I have done various experiments to determine what is required in the gtf. Here is my best estimate of how kallisto processes the gtf in order to generate the genomebam. Perhaps the authors can confirm this.

It seems like kallisto looks for "transcript type" records and then scans the subsequent "exon type" records until the next "transcript type" record. It does not look at the gtf "attributes" of the exons, specifically not relying on the attributes such as:

"transcript_id", "exon_number", "exon_id".

I will start other posts to indicate how this can cause various errors in the output genomebam. Although it might take too much effort to change the program, I feel that the program should at least be aware of when it is in danger of producing erroneous output and then crash with an error message.

Thank you for your attention. Sol Katzman UCSC Genomics Institute

Yenaled commented 1 year ago

Yes, the GTF file needs to have transcript records. Only having exon records will not work.

As for fixing kallisto's issue to better handle GTF input (or better documentation of what's necessary in the GTF file), I agree that it would be useful for users of the genomebam feature. That said, we haven't really been actively maintaining genomebam. It was not usable in 0.46.2 but we added it back in 0.48.0 and we have been discussing whether to keep it in the upcoming 0.49.0 (though we likely will). We still consider it an auxillary function of kallisto since kallisto is mostly designed for probabilistic assignment, a model that cannot be used with BAM files. I'm hoping to develop a better solution to producing BAM files (rather than just looking at where mapping k-mers exist along transcripts), integrating the output from BUS files (which is now our standard output) -- in which case I'll update the GTF parser.

Currently, the GTF parsing works with standard GTF files you'd download online from, say, ensembl -- it's just not very good at error-handling, especially for custom GTF files.

Anyway, just wanted to be honest with you about the state of genomebam.

SolKatzman commented 1 year ago

I appreciate your frank comments about the support for the genomebam feature. I had not used kallisto for several years, so I was excited to see the release notes for v.044.0 which highlighted "BAM!" (and which is why I really would appreciate release note updates on the download page as I noted in issue #366 )

In the same frank spirit, I confess that I have been reluctant to use kallisto without a feature such as genomebam because the PIs that I work for are generally much more comfortable if they can visualize their data on the genome browser. I typically convert STAR bam files into bigWig coverage tracks (using bedtools genomecov). This allows us to look for any anomalies in expected patterns of rnaseq expression. Without such visualization, we can only view kallisto as a black box that might or might not be producing valid results.

Aside: I discovered and debugged the issues in kallisto processing of my gtf (issues #371 and #372) when I was converting genomebam to a coverage track. By good fortune, (a) I run separate jobs for the plus and minus strand coverage tracks and (b) bedtools took 36 hours to process the minus strand due to the erroneous bam (issue #372), and only 1/2 hour for the plus strand.

Having said all that, I recognize that kallisto may be mostly intended to speed up processing of hundreds or thousands of deeply sequenced samples in common model species. My projects typically involve only moderate sequencing depth in tens of samples in less well-trod species (with occasional idiosyncratic gene-models!) so it might not be worth your while to support my use case. Of course, complete documentation (and assertion checking) is always a good idea.

Thanks for your attention.

Sol Katzman UCSC Genomics Institute

Yenaled commented 1 year ago

That's fair, and I do hope to work on an improved genomebam-like feature in the near future, because I actually do need visual tracks for some of my projects (e.g. ribosome profiling). I'm currently working on getting version 0.49.0 released (which involves some major changes to the kallisto engine) and, after that, I'll work on better documentation as well.

Thanks for raising these issues -- I'll leave them open because it gives me some things to work on when I do get around to improving how kallisto generates coverage tracks. Likely, while we'll keep genomebam in (as genomebam is useful to see where the mapped k-mers come from), I'm hoping for a complete overhaul of how kallisto generates coverage tracks.

pachterlab / kallisto

Request: documentation of required gtf format for kallisto quant --genomebam option #368