labgem / PPanGGOLiN

Build a partitioned pangenome graph from microbial genomes
https://ppanggolin.readthedocs.io

len(contig) == NoneType #299

Open StefanFrankBio opened 6 days ago

StefanFrankBio commented 6 days ago

Hey PPanGGOLiN-Team,

I get the error below when running ppanggolin all with version 2.1.2. I want to switch from version 2.0.5, where analysis of the same input data works fine.

Traceback (most recent call last):
  File "/Users/stefanfrank/Projects/PanAdapt/work/conda/ppanggolin-3b3a97db1cb8123576ebce8a9e97f31b/lib/python3.12/site-packages/ppanggolin/annotate/annotate.py", line 1163, in read_anno_file
    org, has_fasta = read_org_gff(
                     ^^^^^^^^^^^^^
  File "/Users/stefanfrank/Projects/PanAdapt/work/conda/ppanggolin-3b3a97db1cb8123576ebce8a9e97f31b/lib/python3.12/site-packages/ppanggolin/annotate/annotate.py", line 956, in read_org_gff
    correct_putative_overlaps(org.contigs)
  File "/Users/stefanfrank/Projects/PanAdapt/work/conda/ppanggolin-3b3a97db1cb8123576ebce8a9e97f31b/lib/python3.12/site-packages/ppanggolin/annotate/annotate.py", line 1096, in correct_putative_overlaps
    if gene.stop > len(contig):
                   ^^^^^^^^^^^
TypeError: 'NoneType' object cannot be interpreted as an integer

This is an example of the gff files provided.

##gff-version 3
# Liftoff v1.6.3
# /Users/stefanfrank/Projects/PanAdapt/work/conda/liftoff-f0c55c74ebf8bf74c2ec2ca40350a2c8/bin/liftoff -db /Users/stefanfrank/Projects/PanAdapt/work/90/2a19bed407a1fa28bbd5a0b0031a09/NC_045512_2.db -o NC_045512_2.gff -f types.tmp -cds -polish NC_045512_2.fasta /Users/stefanfrank/Projects/data/SC2/NC_045512_2.fasta
NC_045512.2 Liftoff region  1   29903   .   +   .   ID=NC_045512.2:1..29903;Dbxref=taxon:2697049;collection-date=Dec-2019;country=China;gb-acronym=SARS-CoV-2;gbkey=Src;genome=genomic;isolate=Wuhan-Hu-1;mol_type=genomic RNA;nat-host=Homo sapiens;old-name=Wuhan seafood market pneumonia virus;coverage=1.0;sequence_ID=1.0;extra_copy_number=0;copy_num_ID=NC_045512.2:1..29903_0
NC_045512.2 Liftoff gene    266 21555   .   +   .   ID=gene-GU280_gp01;Dbxref=GeneID:43740578;Name=ORF1ab;gbkey=Gene;gene=ORF1ab;gene_biotype=protein_coding;locus_tag=GU280_gp01;coverage=1.0;sequence_ID=1.0;matches_ref_protein=False;valid_ORF=False;missing_stop_codon=True;valid_ORFs=0;extra_copy_number=0;copy_num_ID=gene-GU280_gp01_0
NC_045512.2 Liftoff mature_protein_region_of_CDS    266 805 .   +   .   ID=id-YP_009724389.1:1..180;Note=nsp1; produced by both pp1a and pp1ab;Parent=cds-YP_009724389.1;gbkey=Prot;product=leader protein;protein_id=YP_009725297.1;extra_copy_number=0
NC_045512.2 Liftoff mature_protein_region_of_CDS    806 2719    .   +   .   ID=id-YP_009724389.1:181..818;Note=produced by both pp1a and pp1ab;Parent=cds-YP_009724389.1;gbkey=Prot;product=nsp2;protein_id=YP_009725298.1;extra_copy_number=0
NC_045512.2 Liftoff mature_protein_region_of_CDS    2720    8554    .   +   .   ID=id-YP_009724389.1:819..2763;Note=former nsp1; conserved domains are: N-terminal acidic (Ac), predicted phosphoesterase, papain-like proteinase, Y-domain, transmembrane domain 1 (TM1), adenosine diphosphate-ribose 1''-phosphatase (ADRP); produced by both pp1a and pp1ab;Parent=cds-YP_009724389.1;gbkey=Prot;product=nsp3;protein_id=YP_009725299.1;extra_copy_number=0
NC_045512.2 Liftoff mature_protein_region_of_CDS    8555    10054   .   +   .   ID=id-YP_009724389.1:2764..3263;Note=nsp4B_TM; contains transmembrane domain 2 (TM2); produced by both pp1a and pp1ab;Parent=cds-YP_009724389.1;gbkey=Prot;product=nsp4;protein_id=YP_009725300.1;extra_copy_number=0
NC_045512.2 Liftoff mature_protein_region_of_CDS    10055   10972   .   +   .   ID=id-YP_009724389.1:3264..3569;Note=nsp5A_3CLpro and nsp5B_3CLpro; main proteinase (Mpro); mediates cleavages downstream of nsp4. 3D structure of the SARSr-CoV homolog has been determined (Yang et al., 2003); produced by both pp1a and pp1ab;Parent=cds-YP_009724389.1;gbkey=Prot;product=3C-like proteinase;protein_id=YP_009725301.1;extra_copy_number=0
NC_045512.2 Liftoff mature_protein_region_of_CDS    10973   11842   .   +   .   ID=id-YP_009724389.1:3570..3859;Note=nsp6_TM; putative transmembrane domain; produced by both pp1a and pp1ab;Parent=cds-YP_009724389.1;gbkey=Prot;product=nsp6;protein_id=YP_009725302.1;extra_copy_number=0
NC_045512.2 Liftoff mature_protein_region_of_CDS    11843   12091   .   +   .   ID=id-YP_009724389.1:3860..3942;Note=produced by both pp1a and pp1ab;Parent=cds-YP_009724389.1;gbkey=Prot;product=nsp7;protein_id=YP_009725303.1;extra_copy_number=0
NC_045512.2 Liftoff mature_protein_region_of_CDS    12092   12685   .   +   .   ID=id-YP_009724389.1:3943..4140;Note=produced by both pp1a and pp1ab;Parent=cds-YP_009724389.1;gbkey=Prot;product=nsp8;protein_id=YP_009725304.1;extra_copy_number=0
NC_045512.2 Liftoff mature_protein_region_of_CDS    12686   13024   .   +   .   ID=id-YP_009724389.1:4141..4253;Note=ssRNA-binding protein; produced by both pp1a and pp1ab;Parent=cds-YP_009724389.1;gbkey=Prot;product=nsp9;protein_id=YP_009725305.1;extra_copy_number=0
NC_045512.2 Liftoff mature_protein_region_of_CDS    13025   13441   .   +   .   ID=id-YP_009724389.1:4254..4392;Note=nsp10_CysHis; formerly known as growth-factor-like protein (GFL); produced by both pp1a and pp1ab;Parent=cds-YP_009724389.1;gbkey=Prot;product=nsp10;protein_id=YP_009725306.1;extra_copy_number=0
NC_045512.2 Liftoff mature_protein_region_of_CDS    13442   13468   .   +   .   ID=id-YP_009724389.1:4393..5324;Note=nsp12; NiRAN and RdRp; produced by pp1ab only;Parent=cds-YP_009724389.1;gbkey=Prot;product=RNA-dependent RNA polymerase;protein_id=YP_009725307.1;extra_copy_number=0
NC_045512.2 Liftoff mature_protein_region_of_CDS    13468   16236   .   +   .   ID=id-YP_009724389.1:4393..5324;Note=nsp12; NiRAN and RdRp; produced by pp1ab only;Parent=cds-YP_009724389.1;gbkey=Prot;product=RNA-dependent RNA polymerase;protein_id=YP_009725307.1;extra_copy_number=0
NC_045512.2 Liftoff mature_protein_region_of_CDS    16237   18039   .   +   .   ID=id-YP_009724389.1:5325..5925;Note=nsp13_ZBD, nsp13_TB, and nsp_HEL1core; zinc-binding domain (ZD), NTPase/helicase domain (HEL), RNA 5'-triphosphatase; produced by pp1ab only;Parent=cds-YP_009724389.1;gbkey=Prot;product=helicase;protein_id=YP_009725308.1;extra_copy_number=0
NC_045512.2 Liftoff mature_protein_region_of_CDS    18040   19620   .   +   .   ID=id-YP_009724389.1:5926..6452;Note=nsp14A2_ExoN and nsp14B_NMT; produced by pp1ab only;Parent=cds-YP_009724389.1;gbkey=Prot;product=3'-to-5' exonuclease;protein_id=YP_009725309.1;extra_copy_number=0
NC_045512.2 Liftoff mature_protein_region_of_CDS    19621   20658   .   +   .   ID=id-YP_009724389.1:6453..6798;Note=nsp15-A1 and nsp15B-NendoU; produced by pp1ab only;Parent=cds-YP_009724389.1;gbkey=Prot;product=endoRNAse;protein_id=YP_009725310.1;extra_copy_number=0
NC_045512.2 Liftoff mature_protein_region_of_CDS    20659   21552   .   +   .   ID=id-YP_009724389.1:6799..7096;Note=nsp16_OMT; 2'-o-MT; produced by pp1ab only;Parent=cds-YP_009724389.1;gbkey=Prot;product=2'-O-ribose methyltransferase;protein_id=YP_009725311.1;extra_copy_number=0
NC_045512.2 Liftoff CDS 266 13480   .   +   .   ID=cds-YP_009725295.1;Parent=gene-GU280_gp01;Dbxref=GenBank:YP_009725295.1,GeneID:43740578;Name=YP_009725295.1;Note=pp1a;gbkey=CDS;gene=ORF1ab;locus_tag=GU280_gp01;product=ORF1a polyprotein;protein_id=YP_009725295.1;extra_copy_number=0
NC_045512.2 Liftoff mature_protein_region_of_CDS    266 805 .   +   .   ID=id-YP_009725295.1:1..180;Note=nsp1; produced by both pp1a and pp1ab;Parent=cds-YP_009725295.1;gbkey=Prot;product=leader protein;protein_id=YP_009742608.1;extra_copy_number=0
NC_045512.2 Liftoff mature_protein_region_of_CDS    806 2719    .   +   .   ID=id-YP_009725295.1:181..818;Note=produced by both pp1a and pp1ab;Parent=cds-YP_009725295.1;gbkey=Prot;product=nsp2;protein_id=YP_009742609.1;extra_copy_number=0
NC_045512.2 Liftoff mature_protein_region_of_CDS    2720    8554    .   +   .   ID=id-YP_009725295.1:819..2763;Note=former nsp1; conserved domains are: N-terminal acidic (Ac), predicted phosphoesterase, papain-like proteinase, Y-domain, transmembrane domain 1 (TM1), adenosine diphosphate-ribose 1''-phosphatase (ADRP); produced by both pp1a and pp1ab;Parent=cds-YP_009725295.1;gbkey=Prot;product=nsp3;protein_id=YP_009742610.1;extra_copy_number=0
NC_045512.2 Liftoff mature_protein_region_of_CDS    8555    10054   .   +   .   ID=id-YP_009725295.1:2764..3263;Note=nsp4B_TM; contains transmembrane domain 2 (TM2); produced by both pp1a and pp1ab;Parent=cds-YP_009725295.1;gbkey=Prot;product=nsp4;protein_id=YP_009742611.1;extra_copy_number=0
NC_045512.2 Liftoff mature_protein_region_of_CDS    10055   10972   .   +   .   ID=id-YP_009725295.1:3264..3569;Note=nsp5A_3CLpro and nsp5B_3CLpro; main proteinase (Mpro); mediates cleavages downstream of nsp4. 3D structure of the SARSr-CoV homolog has been determined (Yang et al., 2003); produced by both pp1a and pp1ab;Parent=cds-YP_009725295.1;gbkey=Prot;product=3C-like proteinase;protein_id=YP_009742612.1;extra_copy_number=0
NC_045512.2 Liftoff mature_protein_region_of_CDS    10973   11842   .   +   .   ID=id-YP_009725295.1:3570..3859;Note=nsp6_TM; putative transmembrane domain; produced by both pp1a and pp1ab;Parent=cds-YP_009725295.1;gbkey=Prot;product=nsp6;protein_id=YP_009742613.1;extra_copy_number=0
NC_045512.2 Liftoff mature_protein_region_of_CDS    11843   12091   .   +   .   ID=id-YP_009725295.1:3860..3942;Note=produced by both pp1a and pp1ab;Parent=cds-YP_009725295.1;gbkey=Prot;product=nsp7;protein_id=YP_009742614.1;extra_copy_number=0
NC_045512.2 Liftoff mature_protein_region_of_CDS    12092   12685   .   +   .   ID=id-YP_009725295.1:3943..4140;Note=produced by both pp1a and pp1ab;Parent=cds-YP_009725295.1;gbkey=Prot;product=nsp8;protein_id=YP_009742615.1;extra_copy_number=0
NC_045512.2 Liftoff mature_protein_region_of_CDS    12686   13024   .   +   .   ID=id-YP_009725295.1:4141..4253;Note=ssRNA-binding protein; produced by both pp1a and pp1ab;Parent=cds-YP_009725295.1;gbkey=Prot;product=nsp9;protein_id=YP_009742616.1;extra_copy_number=0
NC_045512.2 Liftoff mature_protein_region_of_CDS    13025   13441   .   +   .   ID=id-YP_009725295.1:4254..4392;Note=nsp10_CysHis; formerly known as growth-factor-like protein (GFL); produced by both pp1a and pp1ab;Parent=cds-YP_009725295.1;gbkey=Prot;product=nsp10;protein_id=YP_009742617.1;extra_copy_number=0
NC_045512.2 Liftoff mature_protein_region_of_CDS    13442   13480   .   +   .   ID=id-YP_009725295.1:4393..4405;Note=produced by pp1a only;Parent=cds-YP_009725295.1;gbkey=Prot;product=nsp11;protein_id=YP_009725312.1;extra_copy_number=0
NC_045512.2 Liftoff CDS 13468   21555   .   +   .   ID=cds-YP_009724389.1;Parent=gene-GU280_gp01;Dbxref=GenBank:YP_009724389.1,GeneID:43740578;Name=YP_009724389.1;Note=pp1ab; translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=ORF1ab;locus_tag=GU280_gp01;product=ORF1ab polyprotein;protein_id=YP_009724389.1;extra_copy_number=0
NC_045512.2 Liftoff gene    21563   25384   .   +   .   ID=gene-GU280_gp02;Dbxref=GeneID:43740568;Name=S;gbkey=Gene;gene=S;gene_biotype=protein_coding;gene_synonym=spike glycoprotein;locus_tag=GU280_gp02;coverage=1.0;sequence_ID=1.0;matches_ref_protein=True;valid_ORF=True;valid_ORFs=1;extra_copy_number=0;copy_num_ID=gene-GU280_gp02_0
NC_045512.2 Liftoff CDS 21563   25384   .   +   .   ID=cds-YP_009724390.1;Parent=gene-GU280_gp02;Dbxref=GenBank:YP_009724390.1,GeneID:43740568;Name=YP_009724390.1;Note=structural protein; spike protein;gbkey=CDS;gene=S;locus_tag=GU280_gp02;product=surface glycoprotein;protein_id=YP_009724390.1;extra_copy_number=0
NC_045512.2 Liftoff gene    25393   26220   .   +   .   ID=gene-GU280_gp03;Dbxref=GeneID:43740569;Name=ORF3a;gbkey=Gene;gene=ORF3a;gene_biotype=protein_coding;locus_tag=GU280_gp03;coverage=1.0;sequence_ID=1.0;matches_ref_protein=True;valid_ORF=True;valid_ORFs=1;extra_copy_number=0;copy_num_ID=gene-GU280_gp03_0
NC_045512.2 Liftoff CDS 25393   26220   .   +   .   ID=cds-YP_009724391.1;Parent=gene-GU280_gp03;Dbxref=GenBank:YP_009724391.1,GeneID:43740569;Name=YP_009724391.1;gbkey=CDS;gene=ORF3a;locus_tag=GU280_gp03;product=ORF3a protein;protein_id=YP_009724391.1;extra_copy_number=0
NC_045512.2 Liftoff gene    26245   26472   .   +   .   ID=gene-GU280_gp04;Dbxref=GeneID:43740570;Name=E;gbkey=Gene;gene=E;gene_biotype=protein_coding;locus_tag=GU280_gp04;coverage=1.0;sequence_ID=1.0;matches_ref_protein=True;valid_ORF=True;valid_ORFs=1;extra_copy_number=0;copy_num_ID=gene-GU280_gp04_0
NC_045512.2 Liftoff CDS 26245   26472   .   +   .   ID=cds-YP_009724392.1;Parent=gene-GU280_gp04;Dbxref=GenBank:YP_009724392.1,GeneID:43740570;Name=YP_009724392.1;Note=ORF4; structural protein; E protein;gbkey=CDS;gene=E;locus_tag=GU280_gp04;product=envelope protein;protein_id=YP_009724392.1;extra_copy_number=0
NC_045512.2 Liftoff gene    26523   27191   .   +   .   ID=gene-GU280_gp05;Dbxref=GeneID:43740571;Name=M;gbkey=Gene;gene=M;gene_biotype=protein_coding;locus_tag=GU280_gp05;coverage=1.0;sequence_ID=1.0;matches_ref_protein=True;valid_ORF=True;valid_ORFs=1;extra_copy_number=0;copy_num_ID=gene-GU280_gp05_0
NC_045512.2 Liftoff CDS 26523   27191   .   +   .   ID=cds-YP_009724393.1;Parent=gene-GU280_gp05;Dbxref=GenBank:YP_009724393.1,GeneID:43740571;Name=YP_009724393.1;Note=ORF5; structural protein;gbkey=CDS;gene=M;locus_tag=GU280_gp05;product=membrane glycoprotein;protein_id=YP_009724393.1;extra_copy_number=0
NC_045512.2 Liftoff gene    27202   27387   .   +   .   ID=gene-GU280_gp06;Dbxref=GeneID:43740572;Name=ORF6;gbkey=Gene;gene=ORF6;gene_biotype=protein_coding;locus_tag=GU280_gp06;coverage=1.0;sequence_ID=1.0;matches_ref_protein=True;valid_ORF=True;valid_ORFs=1;extra_copy_number=0;copy_num_ID=gene-GU280_gp06_0
NC_045512.2 Liftoff CDS 27202   27387   .   +   .   ID=cds-YP_009724394.1;Parent=gene-GU280_gp06;Dbxref=GenBank:YP_009724394.1,GeneID:43740572;Name=YP_009724394.1;gbkey=CDS;gene=ORF6;locus_tag=GU280_gp06;product=ORF6 protein;protein_id=YP_009724394.1;extra_copy_number=0
NC_045512.2 Liftoff gene    27394   27759   .   +   .   ID=gene-GU280_gp07;Dbxref=GeneID:43740573;Name=ORF7a;gbkey=Gene;gene=ORF7a;gene_biotype=protein_coding;locus_tag=GU280_gp07;coverage=1.0;sequence_ID=1.0;matches_ref_protein=True;valid_ORF=True;valid_ORFs=1;extra_copy_number=0;copy_num_ID=gene-GU280_gp07_0
NC_045512.2 Liftoff CDS 27394   27759   .   +   .   ID=cds-YP_009724395.1;Parent=gene-GU280_gp07;Dbxref=GenBank:YP_009724395.1,GeneID:43740573;Name=YP_009724395.1;gbkey=CDS;gene=ORF7a;locus_tag=GU280_gp07;product=ORF7a protein;protein_id=YP_009724395.1;extra_copy_number=0
NC_045512.2 Liftoff gene    27756   27887   .   +   .   ID=gene-GU280_gp08;Dbxref=GeneID:43740574;Name=ORF7b;gbkey=Gene;gene=ORF7b;gene_biotype=protein_coding;locus_tag=GU280_gp08;coverage=1.0;sequence_ID=1.0;matches_ref_protein=True;valid_ORF=True;valid_ORFs=1;extra_copy_number=0;copy_num_ID=gene-GU280_gp08_0
NC_045512.2 Liftoff CDS 27756   27887   .   +   .   ID=cds-YP_009725318.1;Parent=gene-GU280_gp08;Dbxref=GenBank:YP_009725318.1,GeneID:43740574;Name=YP_009725318.1;gbkey=CDS;gene=ORF7b;locus_tag=GU280_gp08;product=ORF7b;protein_id=YP_009725318.1;extra_copy_number=0
NC_045512.2 Liftoff gene    27894   28259   .   +   .   ID=gene-GU280_gp09;Dbxref=GeneID:43740577;Name=ORF8;gbkey=Gene;gene=ORF8;gene_biotype=protein_coding;locus_tag=GU280_gp09;coverage=1.0;sequence_ID=1.0;matches_ref_protein=True;valid_ORF=True;valid_ORFs=1;extra_copy_number=0;copy_num_ID=gene-GU280_gp09_0
NC_045512.2 Liftoff CDS 27894   28259   .   +   .   ID=cds-YP_009724396.1;Parent=gene-GU280_gp09;Dbxref=GenBank:YP_009724396.1,GeneID:43740577;Name=YP_009724396.1;gbkey=CDS;gene=ORF8;locus_tag=GU280_gp09;product=ORF8 protein;protein_id=YP_009724396.1;extra_copy_number=0
NC_045512.2 Liftoff gene    28274   29533   .   +   .   ID=gene-GU280_gp10;Dbxref=GeneID:43740575;Name=N;gbkey=Gene;gene=N;gene_biotype=protein_coding;locus_tag=GU280_gp10;coverage=1.0;sequence_ID=1.0;matches_ref_protein=True;valid_ORF=True;valid_ORFs=1;extra_copy_number=0;copy_num_ID=gene-GU280_gp10_0
NC_045512.2 Liftoff CDS 28274   29533   .   +   .   ID=cds-YP_009724397.2;Parent=gene-GU280_gp10;Dbxref=GenBank:YP_009724397.2,GeneID:43740575;Name=YP_009724397.2;Note=ORF9; structural protein;gbkey=CDS;gene=N;locus_tag=GU280_gp10;product=nucleocapsid phosphoprotein;protein_id=YP_009724397.2;extra_copy_number=0
NC_045512.2 Liftoff gene    29558   29674   .   +   .   ID=gene-GU280_gp11;Dbxref=GeneID:43740576;Name=ORF10;gbkey=Gene;gene=ORF10;gene_biotype=protein_coding;locus_tag=GU280_gp11;coverage=1.0;sequence_ID=1.0;matches_ref_protein=True;valid_ORF=True;valid_ORFs=1;extra_copy_number=0;copy_num_ID=gene-GU280_gp11_0
NC_045512.2 Liftoff CDS 29558   29674   .   +   .   ID=cds-YP_009725255.1;Parent=gene-GU280_gp11;Dbxref=GenBank:YP_009725255.1,GeneID:43740576;Name=YP_009725255.1;gbkey=CDS;gene=ORF10;locus_tag=GU280_gp11;product=ORF10 protein;protein_id=YP_009725255.1;extra_copy_number=0
##FASTA
ACCGACTAG...

Can you help me figure out why len(contig) fails with a NoneType error?

Appreciate your help!

axbazin commented 5 days ago

Hi,

I think this comes from the missing "##sequence-region" pragma, which is often found at the beginning of GFF3 files.
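As a rough illustration of how a missing pragma could lead to this exact TypeError, here is a minimal sketch. The `Contig` class, `parse_sequence_regions`, and `describe_len` are hypothetical names made up for this example and are not PPanGGOLiN's actual code; the only assumptions taken from outside the thread are the GFF3 pragma layout `##sequence-region seqid start end` and Python's rule that `__len__` must return an integer.

```python
class Contig:
    """Hypothetical stand-in for a contig; NOT PPanGGOLiN's actual class."""

    def __init__(self, name, length=None):
        self.name = name
        self.length = length  # stays None if no length was ever recorded

    def __len__(self):
        # If this returns None, len() raises the TypeError seen in the traceback.
        return self.length


def parse_sequence_regions(gff_lines):
    """Collect contig lengths from '##sequence-region seqid start end' pragmas."""
    lengths = {}
    for line in gff_lines:
        if line.startswith("##sequence-region"):
            fields = line.split()
            if len(fields) == 4:
                _, seqid, start, end = fields
                lengths[seqid] = int(end) - int(start) + 1
    return lengths


def describe_len(contig):
    """Return len(contig), or the error message if the length was never set."""
    try:
        return len(contig)
    except TypeError as err:
        return str(err)


# A header without the pragma yields no length, so the contig keeps length=None.
header = ["##gff-version 3", "# Liftoff v1.6.3"]
lengths = parse_sequence_regions(header)
contig = Contig("NC_045512.2", lengths.get("NC_045512.2"))
print(describe_len(contig))
```

With a header that does contain `##sequence-region NC_045512.2 1 29903`, the same code would record a length of 29903 and `len(contig)` would work as expected.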

The code is supposed to be resilient to this pragma being missing (since it is not mandatory), but it looks like there was a hiccup in handling it with your particular file.
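One way such resilience could look is sketched below. This is only an illustration under stated assumptions, not the logic PPanGGOLiN uses: `infer_contig_lengths` is a hypothetical helper, it assumes tab-separated feature columns as in the GFF3 format, it assumes a `region` feature spans the whole contig, and it assumes the `##FASTA` section uses `>seqid` headers.

```python
def infer_contig_lengths(gff_text):
    """Infer contig lengths from GFF3 text (hypothetical helper).

    Sources, in order of priority:
      1. '##sequence-region seqid start end' pragmas
      2. 'region' features (assumed to span the whole contig)
      3. sequences in the '##FASTA' section
    """
    pragma, region, fasta = {}, {}, {}
    in_fasta = False
    current = None
    for raw in gff_text.splitlines():
        line = raw.rstrip()
        if in_fasta:
            if line.startswith(">"):
                parts = line[1:].split()
                current = parts[0] if parts else None
                if current is not None:
                    fasta[current] = 0
            elif current is not None:
                fasta[current] += len(line.strip())
        elif line.startswith("##FASTA"):
            in_fasta = True
        elif line.startswith("##sequence-region"):
            fields = line.split()
            if len(fields) == 4:
                pragma[fields[1]] = int(fields[3]) - int(fields[2]) + 1
        elif line and not line.startswith("#"):
            cols = line.split("\t")
            if len(cols) >= 5 and cols[2] == "region":
                # Use the region's end coordinate as the contig length.
                region[cols[0]] = max(region.get(cols[0], 0), int(cols[4]))
    # Later dicts override earlier ones, so pragmas win over regions,
    # and regions win over FASTA-derived lengths.
    return {**fasta, **region, **pragma}
```

For a file like the one above, which has no pragma but does have a `region` feature from 1 to 29903, this sketch would report a length of 29903 for NC_045512.2. As a workaround on the user side, adding a `##sequence-region NC_045512.2 1 29903` line after `##gff-version 3` may also be worth trying.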

Any chance you could share that one file so we can reproduce the bug and try to find a fix?

Have a nice day,
Adelme