timbitz / Whippet.jl

Lightweight and Fast; RNA-seq quantification at the event-level
MIT License
105 stars 21 forks source link

GTF load error #52

Closed AndrewLangvt closed 6 years ago

AndrewLangvt commented 6 years ago

I have been attempting to use Whippet to analyze a dataset for alternative splicing, but I am having difficulty getting the program to even read in my gtf.

[al2025@premise isoform]$ julia bin/whippet-index.jl --fasta /mnt/lustre/macmaneslab/al2025/isoform/genome_compare/GCF_000337935.1_Cliv_1.0_genomic.fna --gtf /mnt/lustre/macmaneslab/al2025/isoform/genome_compare/Columba_livia_1.0.gtf 
Whippet v0.8 loading and compiling... 
WARNING: Method definition count() in module Iterators at deprecated.jl:49 overwritten in module IterTools at deprecated.jl:49.
WARNING: Method definition count(Number) in module Iterators at deprecated.jl:49 overwritten in module IterTools at deprecated.jl:49.
WARNING: Method definition count(Number, Number) in module Iterators at deprecated.jl:49 overwritten in module IterTools at deprecated.jl:49.
 8.847757 seconds.
Loading GTF file...
warning: parsing line table prologue at 0x0001cd5c should have ended at 0x0001d0f0 but it ended at 0x0001ce35
warning: parsing line table prologue at 0x0000e1c8 should have ended at 0x0000e52f but it ended at 0x0000e294
warning: parsing line table prologue at 0x00003c27 should have ended at 0x00003ffc but it ended at 0x00003cf3
warning: parsing line table prologue at 0x0000a9c0 should have ended at 0x0000ace8 but it ended at 0x0000aa8c
warning: parsing line table prologue at 0x0001357a should have ended at 0x0001390c but it ended at 0x00013646
warning: parsing line table prologue at 0x00007565 should have ended at 0x00007914 but it ended at 0x00007631
ERROR: LoadError: InexactError()
 in convert(::Type{Bool}, ::UInt32) at ./bool.jl:6
 in (::Whippet.#private_add_transcript!#12{Dict{String,Whippet.GeneInfo},Dict{String,Tuple{Vararg{UInt32,N}}},Dict{String,Tuple{Vararg{UInt32,N}}},Dict{String,Tuple{Vararg{UInt32,N}}},Dict{String,Tuple{Vararg{UInt32,N}}},Dict{String,Array{Whippet.RefTx,1}},Dict{String,Float64},Dict{String,Int64}})(::SubString{String}, ::String, ::SubString{String}, ::Char, ::Array{UInt32,1}, ::Array{UInt32,1}, ::Int64) at /mnt/lustre/software/anaconda/colsa/envs/whippet-0.8/local/src/Whippet.jl/src/refset.jl:215
 in #load_gtf#11(::Bool, ::Bool, ::Function, ::IOStream) at /mnt/lustre/software/anaconda/colsa/envs/whippet-0.8/local/src/Whippet.jl/src/refset.jl:258
 in (::Whippet.#kw##load_gtf)(::Array{Any,1}, ::Whippet.#load_gtf, ::IOStream) at ./<missing>:0
 in macro expansion at /mnt/lustre/software/anaconda/colsa/envs/whippet-0.8/local/src/Whippet.jl/src/timer.jl:5 [inlined]
 in main() at /mnt/lustre/software/anaconda/colsa/envs/whippet-0.8/local/src/Whippet.jl/bin/whippet-index.jl:68
 in include_from_node1(::String) at ./loading.jl:488
 in process_options(::Base.JLOptions) at ./client.jl:265
 in _start() at ./client.jl:321
while loading /mnt/lustre/software/anaconda/colsa/envs/whippet-0.8/local/src/Whippet.jl/bin/whippet-index.jl, in expression starting on line 86

my gtf file is formatted as follows

NC_013978.1 RefSeq  exon    1   69  .   +   .   transcript_id "rna19564";
NC_013978.1 RefSeq  exon    70  1042    .   +   .   transcript_id "rna19565";
NC_013978.1 RefSeq  exon    1043    1114    .   +   .   transcript_id "rna19566";
NC_013978.1 RefSeq  exon    1115    2700    .   +   .   transcript_id "rna19567";
NC_013978.1 RefSeq  exon    2701    2774    .   +   .   transcript_id "rna19568";
NC_013978.1 RefSeq  CDS 2786    3751    .   +   0   transcript_id "gene15879"; gene_id "gene15879"; gene_name "ND1";
NC_013978.1 RefSeq  exon    3769    3839    .   +   .   transcript_id "rna19569";
NC_013978.1 RefSeq  exon    3845    3915    .   -   .   transcript_id "rna19570";
NC_013978.1 RefSeq  exon    3915    3983    .   +   .   transcript_id "rna19571";
NC_013978.1 RefSeq  CDS 3984    5023    .   +   0   transcript_id "gene15880"; gene_id "gene15880"; gene_name "ND2";
NC_013978.1 RefSeq  exon    5024    5094    .   +   .   transcript_id "rna19572";
NC_013978.1 RefSeq  exon    5096    5164    .   -   .   transcript_id "rna19573";
NC_013978.1 RefSeq  exon    5167    5239    .   -   .   transcript_id "rna19574";
NC_013978.1 RefSeq  exon    5242    5308    .   -   .   transcript_id "rna19575";
NC_013978.1 RefSeq  exon    5308    5379    .   -   .   transcript_id "rna19576";

Any insight into what is the issue here?

harpur commented 6 years ago

Hey, unless the author spots something else, I suspect this is because your gtf is the wrong format. check out /anno/gencode_hg19.v25.tsl1.gtf.gz and you'll see you're missing some lines in your gtf (transcript_id and gene_id), and your formatting of the lines is slightly off. Again, not the author so there might be another issue, but from my experience with this error it is almost certainly a mis-formatted gtf.

chr1    HAVANA  exon    11869   12227   .   +   .   gene_id "ENSG00000223972.5_1"; transcript_id "ENST00000456328.2_1"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 1; exon_id "ENSE00002234944.1_1"; level 2; transcript_support_level 1; tag "basic"; havana_gene "OTTHUMG00000000961.2_1"; havana_transcript "OTTHUMT00000362751.1_1"; remap_original_location "chr1:+:11869-12227"; remap_status "full_contig";
chr1    HAVANA  exon    12613   12721   .   +   .   gene_id "ENSG00000223972.5_1"; transcript_id "ENST00000456328.2_1"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 2; exon_id "ENSE00003582793.1_1"; level 2; transcript_support_level 1; tag "basic"; havana_gene "OTTHUMG00000000961.2_1"; havana_transcript "OTTHUMT00000362751.1_1"; remap_original_location "chr1:+:12613-12721"; remap_status "full_contig";
AndrewLangvt commented 6 years ago

@harpur thanks for the comment. I used gffread through Cufflinks to generate this GTF from a GFF3, and have compared the column formatting to some other GTFs as well. My thought was that that final column of feature aspects shouldn't have much effect in identifying splicing events. That should primarily rely on the accession number to identify scaffold or chromosome, and then indices of the actual feature. Obviously the gene and transcript IDs will allow for identification of which gene, or transcript in which the splicing is occuring, but I wouldn't think this should be required for determining splicing events and exon usage. I have somewhat of an understanding of how Whippet generates these splice graphs, though I will certainly admit it is a rough understanding at best. Can you perhaps clarify where there is some issue in my understanding?

timbitz commented 6 years ago

Hi @AndrewLangvt, I think @harpur is right here. Whippet assumes a GENCODE style GTF file, consistent with GTF2.2 format where both "gene_id" and "transcript_id" are mandatory attributes for every exon line.

That being said. I'm not sure if thats the entirety of what is going on here. Your warnings warning: parsing line table prologue at are worrisome. I've never seen them before, and they appear to be some kind of pointer related warning thrown by LLVM (see line 269).

Did you try building an index with the default supplied GTF file for hg19? Does julia test/runtests.jl in the Whippet directory run cleanly?

timbitz commented 6 years ago

OK, upon closer inspection, I am now fairly certain @harpur is correct and the fatal error is due to the requirement of exon lines to contain both the gene_id and transcript_id attributes. The error itself was not intended to be thrown for this case, but rather built out of the assumption that those two attributes would be the same type (which they would be if they were always both present).

So I apologize for this cryptic error, but it is good that it happened, otherwise your splice graphs would not have built properly. (Whippet's splice graphs are built on the gene level, using the annotated gene_id attributes to define the splice sites that belong in each splice graph)

So I propose adding a few checks in a bugfix:

(1) Explicitly checking these mandatory attributes to force GTF2.2 format.
(2) Updated documentation to make this format assumption verbose.

@AndrewLangvt, Please let me know if this solves the problem, and I would also be very open to any other suggestions that you think might help others avoid a similar problem in the future?

AndrewLangvt commented 6 years ago

@timbitz I really appreciate your timely reply. I will dig further into the GTF formatting issue; I didn't realize the gene_id and transcript_id were required attributes!

julia test/runtests.jl does execute properly, and it would seem that adjusting the format of my GTF would be the next course of action. I'll see what I can do to remedy this and let you know how it goes!

timbitz commented 6 years ago

Yah, sorry about the lack of documentation on that, I think perhaps the original GTF format was without a true specification, and so there are a lot of different types of older GTF files out there. For now, I have added a note to the documentation (in master) to reflect Whippets requirement of GTF2.2.

timbitz commented 6 years ago

Hey @AndrewLangvt, This isn't still an issue is it?

AndrewLangvt commented 6 years ago

Nope. All set now!