gx-health / TAGET


load gtf annotation: IndexError: list index out of range #6

Open Ural-Yunusbaev opened 8 months ago

Ural-Yunusbaev commented 8 months ago

Hi

I am getting an error in the load gtf annotation stage:

python /homes/ural/soft/TAGET/TransAnnot.py -f $rna_iso_seq -g genome.fa -o $outDir -a $genomeGFF -p $cpu --use_hisat2 0

run minimap2
find minimap2 result, use it!
load gtf annotation
Traceback (most recent call last):
  File "/homes/ural/soft/TAGET/TransAnnot.py", line 83, in <module>
    main()
  File "/homes/ural/soft/TAGET/TransAnnot.py", line 73, in main
    DB = gtf2db.dict_make(config['GTF_ANNOTATION'])
  File "/homes/ural/soft/TAGET/gtf2db.py", line 17, in dict_make
    context_parse(line)
  File "/homes/ural/soft/TAGET/gtf2db.py", line 59, in context_parse
    gene_id = re_find_keyword(line, 'gene_name') # gene_id # for ENSEMBL & NCBI & GENCODE
  File "/homes/ural/soft/TAGET/gtf2db.py", line 94, in re_find_keyword
    return re.findall(keyword + ' "(.*?)"', strings)[0]
IndexError: list index out of range

Here is the gtf file:

head -5 $genomeGFF
Chr1M Liftoff gene 1321858 1322094 . + . ID=TraesCS1B03G0001600;previous_id=TraesCS1B02G000600;Name=TraesCS1B03G0001600;primconf=HC;cds=CDS_OK;mapping=fullPerfectMatch;coverage=0.323;sequence_ID=0.317;valid_ORFs=0;extra_copy_number=0;copy_num_ID=TraesCS1B03G0001600_0;partial_mapping=True;low_identity=True
Chr1M Liftoff mRNA 1321858 1322094 . + . ID=TraesCS1B03G0001600.1;Parent=TraesCS1B03G0001600;Note=TraesCS1B01G000600;secconf=HC1;Name=TraesCS1B03G0001600.1;primconf=HC;cds=CDS_OK;mapping=fullPerfectMatch;previous_id=TraesCS1B02G000600.1;matches_ref_protein=False;valid_ORF=False;missing_start_codon=True;extra_copy_number=0
Chr1M Liftoff exon 1321858 1322094 . + . ID=TraesCS1B03G0001600.1.exon2;Parent=TraesCS1B03G0001600.1;extra_copy_number=0
Chr1M Liftoff CDS 1321858 1322090 . + . ID=TraesCS1B03G0001600.1.CDS2;Parent=TraesCS1B03G0001600.1;extra_copy_number=0
Chr1M Liftoff three_prime_UTR 1322091 1322094 . + . ID=TraesCS1B03G0001600.1.utr3p1;Parent=TraesCS1B03G0001600.1;extra_copy_number=0
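
For readers following the thread: the traceback shows that re_find_keyword searches for the pattern keyword + ' "(.*?)"', which is the GTF attribute style (gene_name "..."). The lines shown by head use GFF3-style key=value attributes, so the pattern finds nothing and the [0] index fails. A minimal sketch that reproduces this outside TAGET is below; find_keyword is a stand-in written for this example (the same pattern without the [0]), and the gtf_attrs string is made up for contrast, not taken from any real file.

```python
import re


def find_keyword(keyword, strings):
    # Same pattern as in the traceback (gtf2db.py, re_find_keyword),
    # but returning the whole match list instead of indexing [0].
    return re.findall(keyword + ' "(.*?)"', strings)


# Attribute column copied (shortened) from the first line of the head output.
gff3_attrs = (
    "ID=TraesCS1B03G0001600;previous_id=TraesCS1B02G000600;"
    "Name=TraesCS1B03G0001600;primconf=HC"
)

# Hypothetical GTF-style attribute column, for comparison only.
gtf_attrs = 'gene_id "G1"; gene_name "ABC1"; gene_biotype "protein_coding";'

gff3_hits = find_keyword("gene_name", gff3_attrs)
gtf_hits = find_keyword("gene_name", gtf_attrs)

print(gff3_hits)  # no matches, so indexing [0] would raise IndexError
print(gtf_hits)
```

If the first list is empty for the annotation being used, the file is most likely not in the attribute format that the parser expects.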

bingqiWu commented 6 months ago

@Ural-Yunusbaev I ran into the same problem. If your gtf annotation was downloaded from ENSEMBL, you should comment out the line "gene_id = re_find_keyword(line, 'gene_name') # gene_id # for ENSEMBL & NCBI & GENCODE" in the file "gtf2db.py" and use the line "gene_biotype = re_find_keyword(line, 'gene_biotype') # for ENSEMBL" instead.

And vice versa for NCBI / GENCODE annotations.
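
Whichever line is active, the lookup still raises IndexError when the keyword is absent from a line. A rough sketch of a more tolerant lookup follows. It is not part of TAGET: find_attr is a hypothetical helper that tries the GTF style first, then the GFF3 key=value style, and returns a default instead of raising. Whether the rest of gtf2db.py can work with GFF3 input is not established in this thread, so treat this only as a way to inspect what a given annotation file actually contains.

```python
import re


def find_attr(attributes, keyword, default=None):
    """Return the first value of `keyword` in a GTF or GFF3 attribute
    string, or `default` when the keyword is not present."""
    key = re.escape(keyword)

    # GTF style: keyword "value"; the key must start the string or
    # follow a semicolon or whitespace.
    match = re.search(r'(?:^|[;\s])' + key + r' "(.*?)"', attributes)
    if match:
        return match.group(1)

    # GFF3 style: keyword=value; the key must start the string or follow
    # a semicolon, so "ID" does not match inside "copy_num_ID".
    match = re.search(r'(?:^|;)\s*' + key + r'=([^;]*)', attributes)
    if match:
        return match.group(1)

    return default


# Attribute column copied (shortened) from the issue's first GFF3 line.
gff3_attrs = (
    "ID=TraesCS1B03G0001600;previous_id=TraesCS1B02G000600;"
    "Name=TraesCS1B03G0001600;primconf=HC"
)
# Hypothetical GTF-style attributes, for comparison only.
gtf_attrs = 'gene_id "G1"; gene_name "ABC1";'

print(find_attr(gff3_attrs, "Name"))
print(find_attr(gff3_attrs, "gene_name"))
print(find_attr(gtf_attrs, "gene_name"))
```

Returning a default keeps a single unexpected line from aborting the whole load, at the cost of having to handle missing values further downstream.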