jorvis / biocode

Bioinformatics code libraries and scripts
MIT License

AttributeError: 'Gene' object has no attribute 'add_CDS' #62

Closed PlantDr430 closed 5 years ago

PlantDr430 commented 5 years ago

Hello, I am trying to get intron and exon statistics using both your 'report_gff3_statistics.py' and 'report_gff_intron_and_intergenic_stats.py' and I am getting the AttributeError that is in the title.

stephenwyka@bspmgenomics:/data/wyka/Reference_genomes/originals$ /data/wyka/report_gff3_statistics.py -i Claviceps_purpurea_20_1.gff -o exon_report.txt
Traceback (most recent call last):
  File "/data/wyka/report_gff3_statistics.py", line 110, in <module>
    main()
  File "/data/wyka/report_gff3_statistics.py", line 30, in main
    (assemblies, features) = gff.get_gff3_features(args.input_file)
  File "/data/wyka/biocode/lib/biocode/gff.py", line 350, in get_gff3_features
    parent_feat.add_CDS(CDS)
AttributeError: 'Gene' object has no attribute 'add_CDS'

I downloaded this gff3 from GenBank and below is an example of the contents.

CAGA01000191.1  EMBL    region  1   224490  .   +   .   ID=id0;Dbxref=taxon:1111077;clone=scaffold00051;gbkey=Src;mol_type=genomic DNA;strain=20.1
CAGA01000191.1  EMBL    gene    3223    3902    .   -   .   ID=gene0;Name=CPUR_06801;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CPUR_06801
CAGA01000191.1  EMBL    CDS 3642    3902    .   -   0   ID=cds0;Parent=gene0;Dbxref=NCBI_GP:CCE35373.1;Name=CCE35373.1;Note=CP_06801.1;gbkey=CDS;product=uncharacterized protein;protein_id=CCE35373.1
CAGA01000191.1  EMBL    CDS 3223    3315    .   -   0   ID=cds0;Parent=gene0;Dbxref=NCBI_GP:CCE35373.1;Name=CCE35373.1;Note=CP_06801.1;gbkey=CDS;product=uncharacterized protein;protein_id=CCE35373.1
CAGA01000191.1  EMBL    exon    3223    3315    .   -   .   ID=id1;Parent=gene0;gbkey=exon
CAGA01000191.1  EMBL    exon    3642    3902    .   -   .   ID=id2;Parent=gene0;gbkey=exon
CAGA01000191.1  EMBL    gap 7156    7946    .   +   .   ID=id3;estimated_length=791;gbkey=gap
CAGA01000191.1  EMBL    gene    11485   11880   .   +   .   ID=gene1;Name=CPUR_06802;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CPUR_06802
CAGA01000191.1  EMBL    CDS 11485   11880   .   +   0   ID=cds1;Parent=gene1;Dbxref=NCBI_GP:CCE35374.1;Name=CCE35374.1;Note=CP_06802.1;gbkey=CDS;product=uncharacterized protein;protein_id=CCE35374.1
CAGA01000191.1  EMBL    exon    11485   11880   .   +   .   ID=id4;Parent=gene1;gbkey=exon
CAGA01000191.1  EMBL    gene    11895   12257   .   -   .   ID=gene2;Name=CPUR_06803;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CPUR_06803
CAGA01000191.1  EMBL    CDS 11895   12257   .   -   0   ID=cds2;Parent=gene2;Dbxref=NCBI_GP:CCE35375.1;Name=CCE35375.1;Note=CP_06803.1;gbkey=CDS;product=uncharacterized protein;protein_id=CCE35375.1
CAGA01000191.1  EMBL    exon    11895   12257   .   -   .   ID=id5;Parent=gene2;gbkey=exon
CAGA01000191.1  EMBL    gene    13574   15125   .   -   .   ID=gene3;Name=CPUR_06804;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CPUR_06804
CAGA01000191.1  EMBL    CDS 14956   15125   .   -   0   ID=cds3;Parent=gene3;Dbxref=NCBI_GP:CCE35376.1;Name=CCE35376.1;Note=CP_06804.1;gbkey=CDS;product=probable dis1-suppressing protein kinase dsk1;protein_id=CCE35376.1
CAGA01000191.1  EMBL    CDS 14507   14850   .   -   1   ID=cds3;Parent=gene3;Dbxref=NCBI_GP:CCE35376.1;Name=CCE35376.1;Note=CP_06804.1;gbkey=CDS;product=probable dis1-suppressing protein kinase dsk1;protein_id=CCE35376.1
CAGA01000191.1  EMBL    CDS 14135   14454   .   -   2   ID=cds3;Parent=gene3;Dbxref=NCBI_GP:CCE35376.1;Name=CCE35376.1;Note=CP_06804.1;gbkey=CDS;product=probable dis1-suppressing protein kinase dsk1;protein_id=CCE35376.1
CAGA01000191.1  EMBL    CDS 13822   14062   .   -   0   ID=cds3;Parent=gene3;Dbxref=NCBI_GP:CCE35376.1;Name=CCE35376.1;Note=CP_06804.1;gbkey=CDS;product=probable dis1-suppressing protein kinase dsk1;protein_id=CCE35376.1
CAGA01000191.1  EMBL    CDS 13574   13758   .   -   2   ID=cds3;Parent=gene3;Dbxref=NCBI_GP:CCE35376.1;Name=CCE35376.1;Note=CP_06804.1;gbkey=CDS;product=probable dis1-suppressing protein kinase dsk1;protein_id=CCE35376.1
CAGA01000191.1  EMBL    exon    13574   13758   .   -   .   ID=id6;Parent=gene3;gbkey=exon
CAGA01000191.1  EMBL    exon    13822   14062   .   -   .   ID=id7;Parent=gene3;gbkey=exon
CAGA01000191.1  EMBL    exon    14135   14454   .   -   .   ID=id8;Parent=gene3;gbkey=exon
CAGA01000191.1  EMBL    exon    14507   14850   .   -   .   ID=id9;Parent=gene3;gbkey=exon
CAGA01000191.1  EMBL    exon    14956   15125   .   -   .   ID=id10;Parent=gene3;gbkey=exon
jorvis commented 5 years ago

Yeah, NCBI is putting out some strange GFF here, with exon/CDS features attached directly to a gene rather than to any type of RNA. This goes against what the model organism databases (from which GFF sprang) have done, but I suppose GBK and Ensembl are big enough that we'll have to modify the code to handle whatever they export, even if it is incorrect.
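
To check whether a given file has this layout before running the scripts, a small standalone check along these lines may help. This is only a sketch and not part of biocode: the function name `find_gene_parented_features` is made up for illustration, and it assumes a tab-delimited GFF3 file with nine columns and `key=value` attributes separated by semicolons.

```python
import io


def find_gene_parented_features(handle, child_types=("CDS", "exon")):
    """Return (line_number, feature_type, parent_id) for every CDS/exon
    whose Parent attribute points directly at a gene feature.

    Hypothetical helper for illustration; it does not use biocode.
    """
    gene_ids = set()
    candidates = []
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a feature row
        attrs = dict(
            pair.split("=", 1) for pair in cols[8].split(";") if "=" in pair
        )
        if cols[2] == "gene" and "ID" in attrs:
            gene_ids.add(attrs["ID"])
        elif cols[2] in child_types and "Parent" in attrs:
            # Parent may list several IDs separated by commas
            for parent_id in attrs["Parent"].split(","):
                candidates.append((line_number, cols[2], parent_id))
    # Filter at the end so a child listed before its gene is still caught
    return [row for row in candidates if row[2] in gene_ids]


# Two rows modelled on the file in this issue, with tabs restored
sample = "\n".join([
    "CAGA01000191.1\tEMBL\tgene\t3223\t3902\t.\t-\t.\tID=gene0;Name=CPUR_06801",
    "CAGA01000191.1\tEMBL\tCDS\t3642\t3902\t.\t-\t0\tID=cds0;Parent=gene0",
    "CAGA01000191.1\tEMBL\texon\t3223\t3315\t.\t-\t.\tID=id1;Parent=gene0",
])
offenders = find_gene_parented_features(io.StringIO(sample))
for line_number, feature_type, parent_id in offenders:
    print(f"line {line_number}: {feature_type} is parented by gene {parent_id}")
```

An empty result means every CDS/exon in the file hangs off something other than a gene, which is the gene -> RNA -> exon/CDS layout the parser expects.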

jorvis commented 5 years ago

Hmm, I checked their format doc, and it looks like your file also violates what they say they accept. Their own documentation specifies the gene -> RNA -> exon/CDS parentage.

https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/#formatting

jorvis commented 5 years ago

Closing this since it appears to be an issue with the file not being in NCBI's published format.
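
For anyone who still needs to process a file like this, one possible workaround is to preprocess it so that each affected gene gets a placeholder mRNA between itself and its CDS/exon children. The sketch below is an assumption-laden illustration, not something biocode provides: the function name `add_placeholder_mrnas` and the `<gene ID>.mRNA` naming are invented here, it assumes tab-delimited GFF3 with each gene listed before its children, and it labels every patched gene's transcript as mRNA, which would be wrong for non-coding genes. Whether the parser then accepts the output has not been verified here.

```python
def add_placeholder_mrnas(lines, child_types=("CDS", "exon")):
    """Return GFF3 lines where CDS/exon rows parented by a gene are
    re-parented to a synthetic mRNA spanning that gene.

    Hypothetical helper for illustration; it does not use biocode.
    """
    genes = {}       # gene ID -> column list of the gene row
    patched = set()  # gene IDs that already received a placeholder mRNA
    out = []
    for line in lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 9:
            out.append(line)
            continue
        attrs = dict(
            pair.split("=", 1) for pair in cols[8].split(";") if "=" in pair
        )
        if cols[2] == "gene" and "ID" in attrs:
            genes[attrs["ID"]] = cols
        elif cols[2] in child_types and attrs.get("Parent") in genes:
            gene_id = attrs["Parent"]
            mrna_id = f"{gene_id}.mRNA"  # made-up naming scheme
            if gene_id not in patched:
                # Emit the mRNA once, copying the gene's coordinates and strand
                gene_cols = genes[gene_id]
                mrna_cols = (
                    gene_cols[:2] + ["mRNA"] + gene_cols[3:8]
                    + [f"ID={mrna_id};Parent={gene_id}"]
                )
                out.append("\t".join(mrna_cols))
                patched.add(gene_id)
            # Point the child at the placeholder mRNA instead of the gene
            cols[8] = ";".join(
                f"Parent={mrna_id}" if pair.startswith("Parent=") else pair
                for pair in cols[8].split(";")
            )
        out.append("\t".join(cols))
    return out


rows = [
    "CAGA01000191.1\tEMBL\tgene\t3223\t3902\t.\t-\t.\tID=gene0;Name=CPUR_06801",
    "CAGA01000191.1\tEMBL\tCDS\t3642\t3902\t.\t-\t0\tID=cds0;Parent=gene0",
    "CAGA01000191.1\tEMBL\texon\t3223\t3315\t.\t-\t.\tID=id1;Parent=gene0",
]
fixed = add_placeholder_mrnas(rows)
print("\n".join(fixed))
```

Genes that already have RNA children are left alone, since the placeholder is only emitted when a CDS or exon is found pointing straight at a gene.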

PlantDr430 commented 5 years ago

Hmm okay. It is an older file, so perhaps they had different formats back then and recently updated their standards.

On Sun, Jul 7, 2019 at 8:38 PM Joshua Orvis notifications@github.com wrote:

Closing this since it appears to be an issue with the file not being in NCBI's published format.
