i2bc / ORFmine

ORFmine is an open-source tool for identifying and analyzing all Open Reading Frames (ORFs) in genomic data, focusing on their sequences, structures, evolution and translation activities.
https://i2bc.github.io/ORFmine/
MIT License

Cumbersome features #1

Open Proginski opened 2 years ago

Proginski commented 2 years ago

Some GFF files contain features that cover almost an entire sequence. For example, in the GFF file for GCF_000247795.1 (enclosed), there is a feature of type "match" that fully overlaps the first chromosome:

NC_032650.1 RefSeq region 1 161108492 . + . ID=NC_032650.1:1..161108492;Dbxref=taxon:9915;Name=1;breed=Nelore;chromosome=1;country=Brazil;gb-synonym=Bos taurus indicus;gbkey=Src;genome=chromosome;isolate=QUIL7308;mol_type=genomic DNA;note=animal owned by Agropecuaria Quilombo Inc.;sex=male;tissue-type=peripheral blood mononuclear cells

Line 37235:

NC_032650.1 RefSeq match 1 161108492 . + . ID=aln0;Target=NC_032650.1 1 161108492 +;gap_count=0;num_mismatch=0;pct_coverage=100;pct_identity_gap=100

As a consequence, orfget is not able to define any purely intergenic ORF:

NC_032650.1

| ORF type | Quantity | Average length (aa) |
|---|---|---|
| c_CDS | 7649 | 100.45 |
| nc_ovp_opp-CDS | 19987 | 58.68 |
| nc_ovp_opp-cDNA_match | 201 | 39.65 |
| nc_ovp_opp-match | 1983772 | 46.8 |
| nc_ovp_same-CDS | 11740 | 52.03 |
| nc_ovp_same-cDNA_match | 713 | 39.64 |
| nc_ovp_same-lnc_RNA | 15831 | 42.05 |
| nc_ovp_same-mRNA | 439133 | 44.33 |
| nc_ovp_same-match | 2449854 | 46.35 |
| nc_ovp_same-pseudogene | 10750 | 48.33 |
| nc_ovp_same-tRNA | 16 | 68.0 |
| nc_ovp_same-transcript | 281 | 65.47 |

Would it be possible, as a preliminary step in orftrack, to exclude features whose coverage of their region exceeds, let's say, 90%, to avoid this behavior?
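To make the request concrete, here is a minimal sketch of such a coverage filter. It is not orftrack code: the function names, the 0.9 default, and the choice to read each sequence's length from its `region` or `chromosome` feature are all assumptions for illustration, and the `gene0` record in the sample is made up.

```python
def parse_gff_line(line):
    """Return (seqid, type, start, end) for a GFF record, or None for comments/blank lines."""
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 5:
        return None
    return cols[0], cols[2], int(cols[3]), int(cols[4])


def oversized_features(lines, max_coverage=0.9, length_types=("region", "chromosome")):
    """List features spanning more than max_coverage of the sequence they sit on."""
    records = [r for r in map(parse_gff_line, lines) if r is not None]

    # Assumption: the sequence length is the largest end coordinate among
    # the 'region'/'chromosome' features declared for that seqid.
    seq_len = {}
    for seqid, ftype, start, end in records:
        if ftype in length_types:
            seq_len[seqid] = max(seq_len.get(seqid, 0), end)

    flagged = []
    for seqid, ftype, start, end in records:
        # Skip the features that define the length, and seqids of unknown length.
        if ftype in length_types or seqid not in seq_len:
            continue
        coverage = (end - start + 1) / seq_len[seqid]
        if coverage > max_coverage:
            flagged.append((seqid, ftype, start, end))
    return flagged


sample = [
    "NC_032650.1\tRefSeq\tregion\t1\t161108492\t.\t+\t.\tID=NC_032650.1:1..161108492",
    "NC_032650.1\tRefSeq\tmatch\t1\t161108492\t.\t+\t.\tID=aln0",
    "NC_032650.1\tRefSeq\tgene\t1000\t2000\t.\t+\t.\tID=gene0",  # hypothetical record
]
print(oversized_features(sample))
```

On this sample only the `match` record is flagged, so it could be dropped before ORFs are classified.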

In the meantime, since the only 6 genomes in which I have identified this error so far all contain a 'match' feature, I suggest simply adding 'match' to line 597 of gff_parser.py:

if element_type not in ['chromosome', 'region', 'match']:
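As a standalone illustration of that stopgap (not the actual gff_parser.py code; `EXCLUDED_TYPES` and `keep_feature` are hypothetical names), the type-based exclusion amounts to something like:

```python
# Feature types treated as whole-sequence placeholders rather than annotations.
# 'match' is the proposed addition; the helper name is an assumption.
EXCLUDED_TYPES = {"chromosome", "region", "match"}


def keep_feature(element_type, excluded=EXCLUDED_TYPES):
    """Return True if a feature of this type should be kept as an annotation."""
    return element_type not in excluded


feature_types = ["region", "match", "mRNA", "CDS", "cDNA_match"]
kept = [t for t in feature_types if keep_feature(t)]
print(kept)
```

Note that the comparison is on the exact type string, so 'cDNA_match' features are still kept.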

nchenche commented 2 years ago

Hi Paul,

This is an old and now-resolved issue, but yes, you were right.

Thanks!