pachterlab / kma

Keep Me Around: Intron Retention Detection
GNU General Public License v2.0
29 stars 20 forks source link

Pre-processing with generate_introns.py failing #7

Open skasowitz opened 8 years ago

skasowitz commented 8 years ago

When generate_introns.py begins grouping transcripts by gene I receive the following:

Traceback (most recent call last):
  File "/home/kasowitz/R/x86_64-redhat-linux-gnu-library/3.2/kma/pre-process/generate_introns.py", line 154, in <module>
    main()
  File "/home/kasowitz/R/x86_64-redhat-linux-gnu-library/3.2/kma/pre-process/generate_introns.py", line 104, in main
    g2t = intron_ops.reduce_to_gene(gtf_list)
  File "/home/kasowitz/R/x86_64-redhat-linux-gnu-library/3.2/kma/pre-process/intron_ops.py", line 45, in reduce_to_gene
    cur_gene = trans.gene_id
AttributeError: 'NoneType' object has no attribute 'gene_id'

I am using genome and annotations from Ensembl. Looking back at the previous part of this processing I see an occassional line reading 'NoneType' object has no attribute 'add_exon' which I assume are the same objects causing the intron_ops exit.

slebedeva commented 8 years ago

I have the same issue using danRer10 Ensembl annotation and genome. The working example provided in the vignette worked fine with my settings.

fgypas commented 8 years ago

Similar issue here. Is it possible to provide us (or describe how to get) the annotation that you used in this publication http://nar.oxfordjournals.org/content/44/2/838.long ? Thank you in advance.

jthuff commented 8 years ago

Basically, same exact problem trying to use Arabidopsis TAIR10/Araport11 data

fgypas commented 8 years ago

In my case at the end I used the gencode annotation where I renamed the genome file to contain only the chromosome names without any suffixes.

For example in this case: ">chr1 1" i just kept ">chr1"

jthuff commented 8 years ago

In my case at the end I used the gencode annotation where I renamed the genome file to contain only the chromosome names without any suffixes.

Thanks for the suggestion! I got rid of extra stuff on each chromosome > line as @fgypas suggests, i.e. >Chr1 CHROMOSOME dumped from ADB: Jun/20/09 14:53; last updated: 2009-02-02 to >Chr1

But it still didn't work...

Then I tried switching around the transcript_id and gene_id fields in column 9 of the GTF, i.e. Chr1 Araport11 5UTR 3631 3759 . + . transcript_id "AT1G01010.1"; gene_id "AT1G01010"; to Chr1 Araport11 5UTR 3631 3759 . + . gene_id "AT1G01010"; transcript_id "AT1G01010.1";

And then it worked! Either having the extra annotations on the FASTA description lines or having transcript_id followed by gene_id in the GTF caused failure, but gave different errors. I finally got it to work by both cleaning up the FASTA description lines (as @fgypas suggests) AND swapping the GTF column 9 fields to gene_id followed by transcript_id.

stephenfloor commented 7 years ago

This is caused by a failure of gtf_parser.py to correctly process Ensembl-formatted GTFs (or GTFs in general). The attribute field of GTFs has no inherent order, but gtf_parser.py is assuming an order. This can be fixed by replacing this code block around line 220:

            transcript_id = (                                                                                                                
                gtf_line[8].split(";")[1].split(" ")[2].replace(                                                                             
                    "\"",                                                                                                                    
                    ""))  

with this:

            for attr in gtf_line[8].split(";"):                                                                                               
                attr_name = attr.strip().split(" ")[0]                                                                                        
                if attr_name == "transcript_id":                                                                                              
                    transcript_id = attr.strip().split(" ")[1].replace(                                                                       
                        "\"",                                                                                                                 
                        "") 
                    break

A similar fix can be applied later in gtf_parser.py to the "gene_id" attribute using the same code block but replacing the attr_name test with "gene_id" instead of "transcript_id".

pimentel commented 7 years ago

thanks, @stephenfloor ! I'm going to be making a bunch of changes in the coming days and hopefully fixing most of the reported bugs.

really sorry to those who have been waiting for a while. I've been quite busy with other projects.