Closed JC-therea closed 2 years ago
Is your GFF structured as GFF3 with gene/mRNA/CDS features?
On Tue, Jun 28, 2022 at 5:59 AM JC-therea @.***> wrote:
Hello,
Describe the bug
I am using the latest version of Funannotate and I am having problems with funannotate update. I think the problem is related to the function gff2pasa and my annotation, but I don't know how to fix it. The locus tags look like: gene-YPR204W
command
funannotate update -f $GENOMEFIX -g $GFF -o Scer --nanopore_mrna $dRNA --no_trimmomatic --pasa_db mysql --stranded F --jaccard_clip --species "Saccharomyces cerevisiae" --cpus 8 --no-progress
Logfiles
[Jun 18 10:20 AM]: OS: CentOS Linux 7, 48 cores, ~ 528 GB RAM. Python: 3.7.12
[Jun 18 10:20 AM]: Running 1.8.9
[Jun 18 10:20 AM]: No NCBI SBT file given, will use default, for NCBI submissions pass one here '--sbt'
ERROR: ID=gene-YIL175W has no CDS features, removing gene model
ERROR: ID=gene-YIL174W has no CDS features, removing gene model
ERROR: ID=gene-YIL171W has no CDS features, removing gene model
ERROR: ID=gene-YIL170W has no CDS features, removing gene model
ERROR: ID=gene-YFL056C has no CDS features, removing gene model
ERROR: ID=gene-YAR061W has no CDS features, removing gene model
ERROR: ID=gene-YPL276W has no CDS features, removing gene model
ERROR: ID=gene-YPL275W has no CDS features, removing gene model
ERROR: ID=gene-YLL017W has no CDS features, removing gene model
ERROR: ID=gene-YLL016W has no CDS features, removing gene model
ERROR: ID=gene-YLR110C has no CDS features, removing gene model
ERROR: ID=gene-YCL075W has no CDS features, removing gene model
ERROR: ID=gene-YCL074W has no CDS features, removing gene model
[Jun 18 10:21 AM]: Previous annotation consists of: 5,948 protein coding gene models and 310 non-coding gene models
[Jun 18 10:21 AM]: Trimmomatic will be skipped
[Jun 18 10:21 AM]: Existing BAM alignments found: Scer/update_misc/transcript.alignments.bam
[Jun 18 10:21 AM]: Running PASA alignment step using 499,999 transcripts
[Jun 23 11:22 PM]: Running PASA annotation comparison step 1
[Jun 23 11:48 PM]: Running PASA annotation comparison step 2
[Jun 24 12:17 AM]: Generating relative expression values to PASA transcripts
[Jun 24 12:19 AM]: Parsing Kallisto results. Keeping alt-splicing transcripts if expressed at least 10.0% of highest transcript per locus.
[Jun 24 12:19 AM]: Wrote 6,140 transcripts derived from 5,961 protein coding loci.
Traceback (most recent call last):
  File "/soft/EB_repo/devel/programs/noarch/miniconda3/2022-05/envs/funannotate/bin/funannotate", line 10, in <module>
    sys.exit(main())
  File "/soft/EB_repo/devel/programs/noarch/miniconda3/2022-05/envs/funannotate/lib/python3.7/site-packages/funannotate/funannotate.py", line 705, in main
    mod.main(arguments)
  File "/soft/EB_repo/devel/programs/noarch/miniconda3/2022-05/envs/funannotate/lib/python3.7/site-packages/funannotate/update.py", line 2324, in main
    alt_transcripts=args.alt_transcripts)
  File "/soft/EB_repo/devel/programs/noarch/miniconda3/2022-05/envs/funannotate/lib/python3.7/site-packages/funannotate/update.py", line 1149, in GFF2tblCombinedNEW
    genenumber = int(genenumber)
ValueError: invalid literal for int() with base 10: '204W'
Is there some way to modify my annotation to solve this?
Thank you very much.
Jason Stajich - @.***
Yes, it has more features but those are included:
$ grep "^#" -v $GFF | cut -f 3 | sort |uniq -c
6229 CDS
6697 exon
6344 gene
5961 mRNA
13 ncRNA
12 pseudogene
16 region
1 RNase_MRP_RNA
1 RNase_P_RNA
12 rRNA
14 sequence_feature
76 snoRNA
6 snRNA
1 SRP_RNA
10 transcript
275 tRNA
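For readers without the shell tools at hand, the same tally can be sketched in Python. This is a minimal illustration, not part of funannotate: the function name `count_feature_types` and the sample records are made up for the example, and it assumes a tab-delimited GFF3 file with the feature type in column 3.

```python
from collections import Counter


def count_feature_types(gff_lines):
    """Count column-3 feature types, skipping comment and blank lines."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3:
            counts[cols[2]] += 1
    return counts


# Hypothetical records, tab-delimited as GFF3 requires.
sample = [
    "##gff-version 3\n",
    "chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=gene-A\n",
    "chr1\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=rna-A;Parent=gene-A\n",
    "chr1\tsrc\tCDS\t1\t100\t.\t+\t0\tID=cds-A;Parent=rna-A\n",
    "chr1\tsrc\tgene\t200\t300\t.\t-\t.\tID=gene-B\n",
]

for feature_type, n in sorted(count_feature_types(sample).items()):
    print(f"{n:7d} {feature_type}")
```

To run it on a real file, pass an open file handle instead of `sample`. A gap between the mRNA count and the number of distinct CDS parents is a first hint that some mRNA features carry no CDS.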
First lines of the file:
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build R64
#!genome-build-accession NCBI_Assembly:GCA_000146045.2
#!annotation-source SGD R64-3-1
##sequence-region BK006935.2 1 230218
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=559292
BK006941.2 tpg region 1 1090940 . + . ID=BK006941.2:1..1090940;Dbxref=taxon:559292;Name=VII;chromosome=VII;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=S288C
BK006941.2 tpg gene 2790 3932 . + . ID=gene-YGL263W;Name=COS12;gbkey=Gene;gene=COS12;gene_biotype=protein_coding;locus_tag=YGL263W
BK006941.2 tpg mRNA 2790 3932 . + . ID=rna-YGL263W;Parent=gene-YGL263W;gbkey=mRNA;gene=COS12;locus_tag=YGL263W;product=Cos12p
BK006941.2 tpg exon 2790 3932 . + . ID=exon-YGL263W-1;Parent=rna-YGL263W;gbkey=mRNA;gene=COS12;locus_tag=YGL263W;product=Cos12p
BK006941.2 tpg CDS 2790 3932 . + 0 ID=cds-DAA07855.1;Parent=rna-YGL263W;Dbxref=SGD:S000003232,NCBI_GP:DAA07855.1;Name=DAA07855.1;Note=Endosomal protein involved in turnover of plasma membrane proteins%3B member of the DUP380 subfamily of conserved%2C often subtelomeric COS genes%3B required for the multivesicular vesicle body sorting pathway that internalizes plasma membrane proteins for degradation%3B Cos proteins provide ubiquitin in trans for nonubiquitinated cargo proteins;experiment=EXISTENCE:direct assay:GO:0000324 fungal-type vacuole [PMID:26928762];gbkey=CDS;gene=COS12;locus_tag=YGL263W;product=Cos12p;protein_id=DAA07855.1
BK006941.2 tpg gene 5312 5839 . + . ID=gene-YGL262W;Name=YGL262W;gbkey=Gene;gene_biotype=protein_coding;locus_tag=YGL262W
BK006941.2 tpg mRNA 5312 5839 . + . ID=rna-YGL262W;Parent=gene-YGL262W;gbkey=mRNA;locus_tag=YGL262W;product=hypothetical protein
BK006941.2 tpg exon 5312 5839 . + . ID=exon-YGL262W-1;Parent=rna-YGL262W;gbkey=mRNA;locus_tag=YGL262W;product=hypothetical protein
BK006941.2 tpg CDS 5312 5839 . + 0 ID=cds-DAA07856.1;Parent=rna-YGL262W;Dbxref=SGD:S000003231,NCBI_GP:DAA07856.1;Name=DAA07856.1;Note=hypothetical protein%3B null mutant displays elevated sensitivity to expression of a mutant huntingtin fragment or of alpha-synuclein%3B YGL262W is not an essential gene;gbkey=CDS;locus_tag=YGL262W;product=hypothetical protein;protein_id=DAA07856.1
BK006941.2 tpg gene 6290 6652 . - . ID=gene-YGL261C;Name=PAU11;gbkey=Gene;gene=PAU11;gene_biotype=protein_coding;locus_tag=YGL261C
BK006941.2 tpg mRNA 6290 6652 . - . ID=rna-YGL261C;Parent=gene-YGL261C;gbkey=mRNA;gene=PAU11;locus_tag=YGL261C;product=seripauperin PAU11
BK006941.2 tpg exon 6290 6652 . - . ID=exon-YGL261C-1;Parent=rna-YGL261C;gbkey=mRNA;gene=PAU11;locus_tag=YGL261C;product=seripauperin PAU11
BK006941.2 tpg CDS 6290 6652 . - 0 ID=cds-DAA07857.1;Parent=rna-YGL261C;Dbxref=SGD:S000003230,NCBI_GP:DAA07857.1;Name=DAA07857.1;Note=hypothetical protein%3B member of the seripauperin multigene family encoded mainly in subtelomeric regions%3B mRNA expression appears to be regulated by SUT1 and UPC2;gbkey=CDS;gene=PAU11;locus_tag=YGL261C;product=seripauperin PAU11;protein_id=DAA07857.1
BK006941.2 tpg gene 6860 7090 . + . ID=gene-YGL260W;Name=YGL260W;gbkey=Gene;gene_biotype=protein_coding;locus_tag=YGL260W
BK006941.2 tpg mRNA 6860 7090 . + . ID=rna-YGL260W;Parent=gene-YGL260W;gbkey=mRNA;locus_tag=YGL260W;product=hypothetical protein
BK006941.2 tpg exon 6860 7090 . + . ID=exon-YGL260W-1;Parent=rna-YGL260W;gbkey=mRNA;locus_tag=YGL260W;product=hypothetical protein
BK006941.2 tpg CDS 6860 7090 . + 0 ID=cds-DAA07858.1;Parent=rna-YGL260W;Dbxref=SGD:S000003229,NCBI_GP:DAA07858.1;Name=DAA07858.1;Note=hypothetical protein%3B transcription is significantly increased in a NAP1 deletion background%3B deletion mutant has increased accumulation of nickel and selenium;gbkey=CDS;locus_tag=YGL260W;product=hypothetical protein;protein_id=DAA07858.1
BK006941.2 tpg gene 8470 8967 . + . ID=gene-YGL259W;Name=YPS5;gbkey=Gene;gene=YPS5;gene_biotype=protein_coding;locus_tag=YGL259W
BK006941.2 tpg mRNA 8470 8967 . + . ID=rna-YGL259W;Parent=gene-YGL259W;gbkey=mRNA;gene=YPS5;locus_tag=YGL259W;product=Yps5p
BK006941.2 tpg exon 8470 8967 . + . ID=exon-YGL259W-1;Parent=rna-YGL259W;gbkey=mRNA;gene=YPS5;locus_tag=YGL259W;product=Yps5p
BK006941.2 tpg CDS 8470 8967 . + 0 ID=cds-DAA07859.1;Parent=rna-YGL259W;Dbxref=SGD:S000003228,NCBI_GP:DAA07859.1;Name=DAA07859.1;Note=Protein with similarity to GPI-anchored aspartic proteases%3B such proteases are Yap1p and Yap3p%3B mCherry fusion protein localizes to the vacuole;gbkey=CDS;gene=YPS5;locus_tag=YGL259W;product=Yps5p;protein_id=DAA07859.1
BK006941.2 tpg gene 9162 9395 . + . ID=gene-YGL258W-A;Name=YGL258W-A;gbkey=Gene;gene_biotype=protein_coding;locus_tag=YGL258W-A
BK006941.2 tpg mRNA 9162 9395 . + . ID=rna-YGL258W-A;Parent=gene-YGL258W-A;gbkey=mRNA;locus_tag=YGL258W-A;product=hypothetical protein
BK006941.2 tpg exon 9162 9395 . + . ID=exon-YGL258W-A-1;Parent=rna-YGL258W-A;gbkey=mRNA;locus_tag=YGL258W-A;product=hypothetical protein
BK006941.2 tpg CDS 9162 9395 . + 0 ID=cds-DAA07860.1;Parent=rna-YGL258W-A;Dbxref=SGD:S000007607,NCBI_GP:DAA07860.1;Name=DAA07860.1;Note=hypothetical protein;gbkey=CDS;locus_tag=YGL258W-A;product=hypothetical protein;protein_id=DAA07860.1
BK006941.2 tpg gene 11110 11730 . + . ID=gene-YGL258W;Name=VEL1;gbkey=Gene;gene=VEL1;gene_biotype=protein_coding;locus_tag=YGL258W
BK006941.2 tpg mRNA 11110 11730 . + . ID=rna-YGL258W;Parent=gene-YGL258W;gbkey=mRNA;gene=VEL1;locus_tag=YGL258W;product=Vel1p
BK006941.2 tpg exon 11110 11730 . + . ID=exon-YGL258W-1;Parent=rna-YGL258W;gbkey=mRNA;gene=VEL1;locus_tag=YGL258W;product=Vel1p
BK006941.2 tpg CDS 11110 11730 . + 0 ID=cds-DAA07861.1;Parent=rna-YGL258W;Dbxref=SGD:S000003227,NCBI_GP:DAA07861.1;Name=DAA07861.1;Note=hypothetical protein%3B highly induced in zinc-depleted conditions and has increased expression in NAP1 deletion mutants%3B VEL1 has a paralog%2C YOR387C%2C that arose from a single-locus duplication;experiment=EXISTENCE:direct assay:GO:0005783 endoplasmic reticulum [PMID:26928762],EXISTENCE:direct assay:GO:0005829 cytosol [PMID:11935221],EXISTENCE:direct assay:GO:0071944 cell periphery [PMID:26928762];gbkey=CDS;gene=VEL1;locus_tag=YGL258W;product=Vel1p;protein_id=DAA07861.1
BK006941.2 tpg gene 12481 14157 . - . ID=gene-YGL257C;Name=MNT2;gbkey=Gene;gene=MNT2;gene_biotype=protein_coding;locus_tag=YGL257C
BK006941.2 tpg mRNA 12481 14157 . - . ID=rna-YGL257C;Parent=gene-YGL257C;gbkey=mRNA;gene=MNT2;locus_tag=YGL257C;product=alpha-1%2C3-mannosyltransferase MNT2
BK006941.2 tpg exon 12481 14157 . - . ID=exon-YGL257C-1;Parent=rna-YGL257C;gbkey=mRNA;gene=MNT2;locus_tag=YGL257C;product=alpha-1%2C3-mannosyltransferase MNT2
BK006941.2 tpg CDS 12481 14157 . - 0 ID=cds-DAA07862.1;Parent=rna-YGL257C;Dbxref=SGD:S000003226,NCBI_GP:DAA07862.1;Name=DAA07862.1;Note=Mannosyltransferase%3B involved in adding the 4th and 5th mannose residues of O-linked glycans;experiment=EXISTENCE:direct assay:GO:0000329 fungal-type vacuole membrane [PMID:26928762],EXISTENCE:direct assay:GO:0005794 Golgi apparatus [PMID:30700649],EXISTENCE:direct assay:GO:0071944 cell periphery [PMID:26928762],EXISTENCE:genetic interaction:GO:0000033 alpha-1%2C3-mannosyltransferase activity [PMID:10521541],EXISTENCE:genetic interaction:GO:0006493 protein O-linked glycosylation [PMID:10521541],EXISTENCE:mutant phenotype:GO:0000033 alpha-1%2C3-mannosyltransferase activity [PMID:10521541],EXISTENCE:mutant phenotype:GO:0006493 protein O-linked glycosylation [PMID:10521541];gbkey=CDS;gene=MNT2;locus_tag=YGL257C;product=alpha-1%2C3-mannosyltransferase MNT2;protein_id=DAA07862.1
I think the error occurs when funannotate tries to name a new gene. It somewhat "rudely" assumes that your gene models are named locustag_number (i.e., FUN_00001 or ANID_000001), so it looks for the highest-numbered existing gene in order to continue the enumeration for any new gene model. I need to see whether this is easily fixable when gene names have a different structure.
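The naming assumption described above can be illustrated with a small sketch. This is not funannotate's actual code: `split_locus_tag`, `last_gene_number`, and the regular expression are hypothetical, and the pattern (a prefix ending in an underscore, followed only by digits) is an assumption drawn from the FUN_00001 and ANID_000001 examples.

```python
import re

# Assumed convention: alphanumeric prefix, an underscore, then digits only.
LOCUS_TAG = re.compile(r"^(?P<prefix>[A-Za-z][A-Za-z0-9]*_)(?P<number>\d+)$")


def split_locus_tag(tag):
    """Return (prefix, number) for tags like 'FUN_00001', else None."""
    match = LOCUS_TAG.match(tag)
    if match is None:
        return None
    return match.group("prefix"), int(match.group("number"))


def last_gene_number(tags):
    """Highest numeric suffix among conforming tags, or 0 if none conform."""
    numbers = [parsed[1] for parsed in map(split_locus_tag, tags) if parsed]
    return max(numbers, default=0)


print(split_locus_tag("FUN_00001"))   # conforming tag
print(split_locus_tag("YPR204W"))     # yeast systematic name does not conform
print(last_gene_number(["YPR204W", "YGL263W"]))
```

Checking the pattern before calling `int()` avoids the `ValueError` from the traceback, where the string '204W' was passed to `int()`. Falling back to 0 when no tag conforms is only one possible design choice; a real fix might prefer to warn or to pick a fresh prefix.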
Meanwhile, do you think that I can simply hardcode a random number in that variable to make it work...? For now, I only want an updated annotation of the UTR regions.
Thank you!
The reason funannotate is stringent about gene names is that it is built to submit to NCBI. These locus tags (i.e., YGL258W) are probably "grandfathered" in; I don't think they would be acceptable now. I tried a very minimal fix, but I'm not sure it will solve the problem yet, as it may break elsewhere.
Are you able to upgrade your funannotate install to the master/main branch and re-run this? I'm not sure it will be fixed, but I need to see what the script identifies as locustag and gene number.
python -m pip install git+https://github.com/nextgenusfs/funannotate.git
You should then be able to re-run your same command from the same location, and it should not re-run steps whose output already exists.
There should be a new line in the log that shows what it autodetected as the locus tag and gene number:
[Jun 29 11:42 AM]: Reannotating Awesome rna, NCBI accession: None
[Jun 29 11:42 AM]: Previous annotation consists of: 1,609 protein coding gene models and 112 non-coding gene models
[Jun 29 11:42 AM]: Existing annotation: locustag=FUN_ genenumber=1721
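If the log is long, the autodetected values can be pulled out with a short script. This is only a sketch: `parse_existing_annotation` is a hypothetical helper, and it assumes the log line keeps the exact `locustag=... genenumber=...` layout shown above.

```python
import re

# Matches the "Existing annotation" line format shown in the log excerpt.
LOG_PATTERN = re.compile(
    r"Existing annotation: locustag=(?P<locustag>\S*)\s+genenumber=(?P<genenumber>\S+)"
)


def parse_existing_annotation(line):
    """Return (locustag, genenumber) as strings, or None if the line differs."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return match.group("locustag"), match.group("genenumber")


line = "[Jun 29 11:42 AM]: Existing annotation: locustag=FUN_ genenumber=1721"
print(parse_existing_annotation(line))
```

The gene number is returned as a string on purpose, so that an unexpected value such as '204W' can be reported rather than crashing the parser.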
Okay I think this is now fixed.
I pulled this genome/annotation from NCBI. The models that get dropped are listed as 'mRNA' features but have no CDS, so they should probably be 'ncRNA'. This will eventually get fixed once I move to my newer gfftk library as the backend for funannotate. The test data for the rna-seq module is actually yeast, so I just re-used that data below as an example.
$ ./funannotate_dev/funannotate-docker update --cpus 5 -f yeast.fna -g yeast.gff3 --species "Saccharomyces cerevisiae" --single rna-seq.illumina.fastq.gz --nanopore_mrna rna-seq.nanopore.fastq.gz --jaccard_clip -o test-update-yeast
-------------------------------------------------------
[Jun 30 08:49 AM]: OS: Debian GNU/Linux 10, 4 cores, ~ 8 GB RAM. Python: 3.8.13
[Jun 30 08:49 AM]: Running 1.8.12
[Jun 30 08:49 AM]: No NCBI SBT file given, will use default, for NCBI submissions pass one here '--sbt'
ERROR: ID=gene-YAR061W has no CDS features, removing gene model
ERROR: ID=gene-YCL075W has no CDS features, removing gene model
ERROR: ID=gene-YCL074W has no CDS features, removing gene model
ERROR: ID=gene-YFL056C has no CDS features, removing gene model
ERROR: ID=gene-YIL175W has no CDS features, removing gene model
ERROR: ID=gene-YIL174W has no CDS features, removing gene model
ERROR: ID=gene-YIL171W has no CDS features, removing gene model
ERROR: ID=gene-YIL170W has no CDS features, removing gene model
ERROR: ID=gene-YLL017W has no CDS features, removing gene model
ERROR: ID=gene-YLL016W has no CDS features, removing gene model
ERROR: ID=gene-YPL276W has no CDS features, removing gene model
ERROR: ID=gene-YPL275W has no CDS features, removing gene model
[Jun 30 08:49 AM]: Previous annotation consists of: 6,016 protein coding gene models and 337 non-coding gene models
[Jun 30 08:49 AM]: Existing annotation: locustag=FUN_ genenumber=6452
[Jun 30 08:49 AM]: Existing Trinity results found: test-update-yeast/update_misc/trinity.fasta
[Jun 30 08:49 AM]: Existing BAM alignments found: test-update-yeast/update_misc/trinity.alignments.bam, test-update-yeast/update_misc/transcript.alignments.bam
[Jun 30 08:49 AM]: Existing Kallisto output found: test-update-yeast/update_misc/kallisto.tsv
[Jun 30 08:49 AM]: Parsing Kallisto results. Keeping alt-splicing transcripts if expressed at least 10.0% of highest transcript per locus.
[Jun 30 08:49 AM]: Wrote 6,022 transcripts derived from 6,012 protein coding loci.
[Jun 30 08:49 AM]: Validating gene models (renaming, checking translations, filtering, etc)
[Jun 30 08:49 AM]: Writing 6,323 loci to TBL format: dropped 0 overlapping, 69 too short, and 17 frameshift gene models
[Jun 30 08:49 AM]: Converting to Genbank format
[Jun 30 08:50 AM]: Collecting final annotation files
[Jun 30 08:50 AM]: Comparing original annotation to updated
original: yeast.gff3
updated: test-update-yeast/update_results/Saccharomyces_cerevisiae.gff3
ERROR: ID=gene-YAR061W has no CDS features, removing gene model
ERROR: ID=gene-YCL075W has no CDS features, removing gene model
ERROR: ID=gene-YCL074W has no CDS features, removing gene model
ERROR: ID=gene-YFL056C has no CDS features, removing gene model
ERROR: ID=gene-YIL175W has no CDS features, removing gene model
ERROR: ID=gene-YIL174W has no CDS features, removing gene model
ERROR: ID=gene-YIL171W has no CDS features, removing gene model
ERROR: ID=gene-YIL170W has no CDS features, removing gene model
ERROR: ID=gene-YLL017W has no CDS features, removing gene model
ERROR: ID=gene-YLL016W has no CDS features, removing gene model
ERROR: ID=gene-YPL276W has no CDS features, removing gene model
ERROR: ID=gene-YPL275W has no CDS features, removing gene model
[Jun 30 08:50 AM]: Updated annotation complete:
-------------------------------------------------------
Total Gene Models: 6,323
Total transcripts: 6,334
New Gene Models: 5
No Change: 5,097
Update UTRs: 1,209
Exons Changed: 12
Exons/CDS Changed: 0
Dropped Models: 70
CDS AED: 0.011
mRNA AED: 0.028
-------------------------------------------------------
[Jun 30 08:50 AM]: Funannotate update is finished, output files are in the test-update-yeast/update_results folder
[Jun 30 08:50 AM]: Your next step might be functional annotation, suggested commands:
-------------------------------------------------------
Run InterProScan (Docker required):
funannotate iprscan -i test-update-yeast -m docker -c 5
Run antiSMASH:
funannotate remote -i test-update-yeast -m antismash -e youremail@server.edu
Annotate Genome:
funannotate annotate -i test-update-yeast --cpus 5 --sbt yourSBTfile.txt
-------------------------------------------------------
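The "has no CDS features" messages in the run above come from mRNA features without any CDS child. To list such records in a GFF3 file before running the update, something like the following sketch could be used. It is not funannotate or gfftk code: `parse_attributes` and `mrnas_without_cds` are hypothetical helpers, and the sketch assumes tab-delimited GFF3 with `ID` and `Parent` attributes in column 9.

```python
def parse_attributes(field):
    """Turn a GFF3 column-9 string into a dict (first value kept per key)."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs.setdefault(key, value)
    return attrs


def mrnas_without_cds(gff_lines):
    """Return IDs of mRNA features that no CDS names as its Parent."""
    mrna_ids = []            # keeps file order
    parents_with_cds = set()
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = parse_attributes(cols[8])
        if cols[2] == "mRNA" and "ID" in attrs:
            mrna_ids.append(attrs["ID"])
        elif cols[2] == "CDS":
            # A CDS may list several parents separated by commas.
            parents_with_cds.update(attrs.get("Parent", "").split(","))
    return [mrna for mrna in mrna_ids if mrna not in parents_with_cds]


# Hypothetical records: rna-B has an exon but no CDS.
records = [
    ["chr1", "src", "mRNA", "1", "100", ".", "+", ".", "ID=rna-A;Parent=gene-A"],
    ["chr1", "src", "CDS", "1", "100", ".", "+", "0", "ID=cds-A;Parent=rna-A"],
    ["chr1", "src", "mRNA", "200", "300", ".", "-", ".", "ID=rna-B;Parent=gene-B"],
    ["chr1", "src", "exon", "200", "300", ".", "-", ".", "ID=exon-B-1;Parent=rna-B"],
]
sample_lines = ["\t".join(cols) + "\n" for cols in records]
print(mrnas_without_cds(sample_lines))
```

Whether the listed features should be retyped as ncRNA, as suggested above, or simply left to be dropped is a curation decision that the sketch does not make.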
Sorry for taking so long to reply, but I am running Funannotate on a cluster and I cannot go as fast as I would like... I applied the update that you mentioned to update.py and it now works! Thank you very much for your hard and fast work on this issue.