vipints / GFFtools-GX

gfftools - Galaxy toolshed repository
BSD 3-Clause "New" or "Revised" License

gtf -> gff conversion truncates chromosome names #1

Closed dwinter closed 8 years ago

dwinter commented 8 years ago

Thanks very much for making these tools, which have helped me greatly working between different builds/tools/versions of data.

I've just run into a surprising bug. When I convert a GTF file to GFF, some of the long chromosome names are truncated. An example GTF:

chrUn_AAWZ02036000  anoCar2_ensGene stop_codon  3899    3901    0.000000    -   .   gene_id "ENSACAT00000000307.2"; transcript_id "ENSACAT00000000307.2"; 
chrUn_AAWZ02036000  anoCar2_ensGene CDS 3902    5131    0.000000    -   0   gene_id "ENSACAT00000000307.2"; transcript_id "ENSACAT00000000307.2"; 
chrUn_AAWZ02036000  anoCar2_ensGene exon    3899    5131    0.000000    -   .   gene_id "ENSACAT00000000307.2"; transcript_id "ENSACAT00000000307.2"; 
chrUn_AAWZ02036000  anoCar2_ensGene CDS 25336   25522   0.000000    -   1   gene_id "ENSACAT00000000307.2"; transcript_id "ENSACAT00000000307.2"; 
chrUn_AAWZ02036000  anoCar2_ensGene exon    25336   25522   0.000000    -   .   gene_id "ENSACAT00000000307.2"; transcript_id "ENSACAT00000000307.2"; 
chrUn_AAWZ02036000  anoCar2_ensGene CDS 25602   26479   0.000000    -   0   gene_id "ENSACAT00000000307.2"; transcript_id "ENSACAT00000000307.2"; 
chrUn_AAWZ02036000  anoCar2_ensGene start_codon 26477   26479   0.000000    -   .   gene_id "ENSACAT00000000307.2"; transcript_id "ENSACAT00000000307.2"; 
chrUn_AAWZ02036000  anoCar2_ensGene exon    25602   26479   0.000000    -   .   gene_id "ENSACAT00000000307.2"; transcript_id "ENSACAT00000000307.2"; 
chrUn_AAWZ02036001  anoCar2_ensGene stop_codon  1674    1676    0.000000    -   .   gene_id "ENSACAT00000001077.3"; transcript_id "ENSACAT00000001077.3"; 
chrUn_AAWZ02036001  anoCar2_ensGene CDS 1677    1805    0.000000    -   0   gene_id "ENSACAT00000001077.3"; transcript_id "ENSACAT00000001077.3";

gives rise to

./gtf_to_gff.py test.gtf 
##gff-version 3
chrUn_AAWZ02036 anoCar2_ensGene gene    1674    1805    .   -   .   ID=ENSACAT00000001077.3;Name=ENSACAT00000001077.3
chrUn_AAWZ02036 anoCar2_ensGene mRNA    1677    1805    .   -   .   ID=Transcript:ENSACAT00000001077.3;Parent=ENSACAT00000001077.3
chrUn_AAWZ02036 anoCar2_ensGene CDS 1674    1805    .   -   0   Parent=Transcript:ENSACAT00000001077.3
chrUn_AAWZ02036 anoCar2_ensGene three_prime_UTR 1677    1805    .   -   .   Parent=Transcript:ENSACAT00000001077.3
chrUn_AAWZ02036 anoCar2_ensGene exon    1677    1805    .   -   .   Parent=Transcript:ENSACAT00000001077.3
chrUn_AAWZ02036 anoCar2_ensGene gene    3899    26479   .   -   .   ID=ENSACAT00000000307.2;Name=ENSACAT00000000307.2
chrUn_AAWZ02036 anoCar2_ensGene mRNA    3899    26479   .   -   .   ID=Transcript:ENSACAT00000000307.2;Parent=ENSACAT00000000307.2
chrUn_AAWZ02036 anoCar2_ensGene CDS 3899    5131    .   -   0   Parent=Transcript:ENSACAT00000000307.2
chrUn_AAWZ02036 anoCar2_ensGene CDS 25336   25522   .   -   1   Parent=Transcript:ENSACAT00000000307.2
chrUn_AAWZ02036 anoCar2_ensGene CDS 25602   26479   .   -   0   Parent=Transcript:ENSACAT00000000307.2
chrUn_AAWZ02036 anoCar2_ensGene exon    3899    5131    .   -   .   Parent=Transcript:ENSACAT00000000307.2
chrUn_AAWZ02036 anoCar2_ensGene exon    25336   25522   .   -   .   Parent=Transcript:ENSACAT00000000307.2
chrUn_AAWZ02036 anoCar2_ensGene exon    25602   26479   .   -   .   Parent=Transcript:ENSACAT00000000307.2

Note that the final three digits of each "chromosome" (actually contig) name are missing. This doesn't seem to affect sequences with shorter names.
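The truncated names in the output above are 15 characters long, so one way to see which names are at risk is to scan the first column of the GTF for anything longer than that. The following is only a sketch: `long_seqnames` is a hypothetical helper written for illustration, not part of GFFtools-GX, and the width of 15 is taken from the observed output.

```python
def long_seqnames(lines, width=15):
    """Map each truncated name to the set of full names that collapse onto it.

    `width` is an assumed field width, inferred from the truncated output.
    """
    collapsed = {}
    for line in lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        # The sequence name is the first whitespace-delimited column.
        seqname = line.split(None, 1)[0]
        if len(seqname) > width:
            collapsed.setdefault(seqname[:width], set()).add(seqname)
    return collapsed


# Two records abbreviated from the example GTF, plus one short name.
example = [
    'chrUn_AAWZ02036000\tanoCar2_ensGene\tCDS\t3902\t5131\t0.000000\t-\t0\tgene_id "ENSACAT00000000307.2";',
    'chrUn_AAWZ02036001\tanoCar2_ensGene\tCDS\t1677\t1805\t0.000000\t-\t0\tgene_id "ENSACAT00000001077.3";',
    'chr1\tanoCar2_ensGene\tCDS\t1\t10\t0.000000\t+\t0\tgene_id "hypothetical";',
]

for short, originals in long_seqnames(example).items():
    print(short, "<-", sorted(originals))
```

In this example both contig names map onto the same 15-character prefix, which is why the two genes appear under one sequence name in the converted output.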

vipints commented 8 years ago

The default length defined for a chromosome name was 15 characters; I have updated that to 25 in the internal struct that holds the gene information. Thanks for reporting this. I have pushed the change; could you please pull it? I tested it on a test data set. Let me know if it doesn't fix the problem.
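To illustrate the effect of a fixed-width name field, here is a minimal sketch using Python's standard `struct` module. It is an assumption that this mirrors the tool's behaviour: the actual internal struct in GFFtools-GX may be implemented differently, and `store_name` is a hypothetical helper. The widths 15 and 25 are the ones mentioned above.

```python
import struct


def store_name(name, width):
    """Pack a name into a fixed-width byte field and read it back.

    struct's 's' format silently truncates input longer than `width`
    and pads shorter input with null bytes.
    """
    packed = struct.pack(f"{width}s", name.encode("ascii"))
    return packed.rstrip(b"\x00").decode("ascii")


name = "chrUn_AAWZ02036000"
print(store_name(name, 15))  # chrUn_AAWZ02036
print(store_name(name, 25))  # chrUn_AAWZ02036000
```

With a field of this kind, any name longer than the new width would still be cut, so unusually long contig names remain worth checking.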

dwinter commented 8 years ago

Perfect -- many thanks for your quick response. I can confirm it works on a complete GTF with these long names.

vipints commented 8 years ago

Fixed -