nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License
312 stars 83 forks source link

Complete ORFs #481

Closed vcastroagudin closed 3 years ago

vcastroagudin commented 3 years ago

Hello,

is there a way to know which is the percentage of complete sequences among from the ORFs/proteins predicted by Funannotate?

Thanks.

hyphaltip commented 3 years ago

Stops are not in the protein files so it isn't possible to count this directly there but I wrote a short script you can use on the .cds-transcripts.fa file.

git clone https://github.com/hyphaltip/genome-scripts.git
genome-scripts/seq/count_full_ORFs_CDS.py ./Aspergillus_fumigatus_F16311.cds-transcripts.fa

@nextgenusfs - I did notice the translation step in funannotate always drops the last codon even if it isn't a stop codon - we should probably fix that. eg - here's a protein

>F16311_002680-T1 F16311_002680
MAKIYKDTRVDLRPYSPNTVANIQIPNQESSRNRARFSIATSFASLEEPAAKDEEEFSRRYLATQGSIYFRKHKVYPRTF
LWRVVNDNKVLEIQCVDLTKGGIEHHEYDITLRFDFQEEILPSGVDIADLEDHDVLSVFVITASKQLHTLTLRPEFFRRL
AAIDENVSEWCKSCTPAPLAFSHPHRLHASSPLELFISLDSGALLRLTRRAGDD

Here's corresponding coding sequence

>F16311_002680-T1 F16311_002680
ATGGCGAAAATCTATAAGGATACAAGAGTCGACCTTCGGCCATATTCGCCGAACACCGTCGCCAACATCCAGATTCCAAA
CCAAGAGAGCTCGCGGAACCGCGCCCGATTCTCGATAGCCACATCCTTCGCGAGTCTTGAAGAACCAGCAGCGAAAGACG
AGGAGGAGTTTTCTCGCCGATACCTAGCTACACAGGGGTCAATTTACTTCCGGAAACACAAAGTGTATCCCCGAACCTTT
CTCTGGCGGGTGGTTAATGACAACAAGGTCCTGGAGATCCAGTGCGTGGATTTGACGAAGGGAGGGATTGAACATCATGA
ATACGACATCACATTACGCTTCGACTTCCAGGAAGAGATCCTTCCATCCGGGGTTGACATCGCTGACTTGGAGGATCATG
ACGTGTTGAGTGTGTTTGTGATCACGGCTTCCAAGCAGCTGCACACGTTGACTCTGCGCCCGGAGTTCTTTCGCCGACTA
GCGGCAATTGATGAGAATGTATCTGAGTGGTGTAAATCATGCACACCGGCACCTCTGGCTTTCTCGCATCCCCATCGGTT
ACATGCGAGCAGTCCACTCGAGTTGTTTATTTCTCTAGATAGCGGCGCGCTGCTGCGCCTGACTAGGAGGGCTGGTGACG
ATG
nextgenusfs commented 3 years ago

errr. yes def need to fix if that is happening. Is this in the protein fasta output from which step? Or the translate function directly?

nextgenusfs commented 3 years ago

So the code that is writing the protein fasta files from the tbl format is here: https://github.com/nextgenusfs/funannotate/blob/95b18c05125e68b4c38893914486571c77c0d813/funannotate/library.py#L2281-2323

It seems like it is only stripping if endswith('*')?

So that means it would have to be parsing it wrong from the tbl file.... which doesn't seem to be happening??

>>> import funannotate.library as lib
>>> Genes = {}
>>> Genes = lib.tbl2dict('Awesome_rna.tbl', 'Awesome_rna.scaffolds.fa', Genes)
>>> for k,v in Genes.items():
...     for i in range(0, len(v['ids'])):
...             print(v['ids'][i], v['protein'][i])
... 
FUN_000001-T1 MKENELKNEKSVDVLSFKQLESQKTVLPQDVFQNELTWFCYEIYKSLAFRIWMLLWLPLSIWWKLSNNWVYPLIVSLLVLFLGPIFVLVICGLSRKRSLSKQLIQFCKEITKNTPCLDTHDWEVVVANLNSYLYENKAWNTKYFFFNATDCEKMFRTTVLEPFSLKKDKAAKVKSFKDSVPYIEEALGVYFTEVEKHWKLFNTEKSWSPVGLEDAKLPKEAHRSKFTWLLGRIFTIYFLPLCPAFFNCIYTSRNDDLISDFLWTVVIFLFMVWLFRNIRMIVLSVKMEHKMQFLSTIINEQESGANGWDEIARKINRYLFEKKVWNNEEFFFDGIDCEWFFSHFFYRLLSAKKSMWLLPLNVELWPYIKEAQLSRNEESLMKK*
FUN_000002-T1 MVSLANVLTNATAATLSAWSNTVPLETYFHFDEASGFGDYYLNVSVIWMNETLYETRIVPAIINVREWLDHMEANDPSPSVTNPYETSGYYAFSTVVPVLMGNMKVS*
FUN_000003-T1 MSACPCNIVILPVEILKNSSKDTKYSLYTTINRGYDVPRLKYGIIVSPRVHSLETLFSDLGFDKNIEKSSLYLLLNDPTLAYPNFHEHFEQLKGETNKDLSLPTYYIPKVQFLTEAFDSEHTLATIGYKPNNKESYEITGFTSMGNGYGIKLFNYSVIHMMRSHKCKRVVADIIMEHDLLGYYEKKLGFVEVQRFKVLKEQHQVKVFDDKVDFTKDFHVIKMIKELGNHKL*
FUN_000004-T1 MTAAHPVAQLTAEAYPKVKRNPNFKVLDSEDLAYFRSILSNDEILNSQAPEELASFNQDWMKKYRGQSNLILLPNSTDKVSKIMKYCNDKKLAVVPQGGNTDLVGASVPVFDEIVLSLRNMNKVRDFDPVSGTFKCDAGVVMRDAHQFLHDHDHIFPLDLPSRNNCQVGGVVSTNAGGLNFLRYGSLHGNVLGLEVVLPNGEIISNINALRKDNTGYDLKQLFIGAEGTIGVVTGVSIVAAAKPKALNAVFFGIENFDTVQKLFVKAKSELSEILSAFEFMDRGSIECTIEYLKDLPFPLENQHNFYVLIETSGSNKRHDDEKLTAFLKDTTDSKLISEGMMAKDKADYDRLWTWRKSVPTACNSYGGMYKYDMSLQLKDLYSVSAAVTERLNAAGLIGDAPKPVVKSCGYGHVGDGNIHLNIAVREFTKQIEDLLEPFVYEYIASKKGSISAEHGIGFHKKGKLHYTRSDIEIRFMKDIKNHYDPNGILNPYKYI*
FUN_000005-T1 MTKSDETTATSLNAKTLKSFESTLRIPTYPREGVKQGIVHLGVGAFHRSHLAVFMHRLMQEHHLKDWSICGVGLMKADALMRDAMKAQDCLYTLVERGIKDTNAYIVGSITAYMYAPDDPRAVIEKMANPDTHIVSLTVTENGYYHSEATNSLMTDAPEIINDLNHPEKPDTLYGYLYEALLLRYKRGLTPFTIMSCDNMP*
hyphaltip commented 3 years ago

I am just reporting what was in the annotate_results file so I'm not sure.

hyphaltip commented 3 years ago

I see now - there's one extra base at the end -- so it is working. sorry for false alarm, I just assumed the CDS would be in intervals of 3 so saw the trailing codon at the end of the line.

>F16311_002680-T1 F16311_002680
ATGGCGAAAATCTATAAGGATACAAGAGTCGACCTTCGGCCATATTCGCCGAACACCGTCGCCAACATCCAGATTCCAAA
CCAAGAGAGCTCGCGGAACCGCGCCCGATTCTCGATAGCCACATCCTTCGCGAGTCTTGAAGAACCAGCAGCGAAAGACG
AGGAGGAGTTTTCTCGCCGATACCTAGCTACACAGGGGTCAATTTACTTCCGGAAACACAAAGTGTATCCCCGAACCTTT
CTCTGGCGGGTGGTTAATGACAACAAGGTCCTGGAGATCCAGTGCGTGGATTTGACGAAGGGAGGGATTGAACATCATGA
ATACGACATCACATTACGCTTCGACTTCCAGGAAGAGATCCTTCCATCCGGGGTTGACATCGCTGACTTGGAGGATCATG
ACGTGTTGAGTGTGTTTGTGATCACGGCTTCCAAGCAGCTGCACACGTTGACTCTGCGCCCGGAGTTCTTTCGCCGACTA
GCGGCAATTGATGAGAATGTATCTGAGTGGTGTAAATCATGCACACCGGCACCTCTGGCTTTCTCGCATCCCCATCGGTT
ACATGCGAGCAGTCCACTCGAGTTGTTTATTTCTCTAGATAGCGGCGCGCTGCTGCGCCTGACTAGGAGGGCTGGTGACG
ATG
atg gcg aaa atc tat aag gat aca aga gtc gac ctt cgg cca tat tcg ccg aac acc gtc 
 M   A   K   I   Y   K   D   T   R   V   D   L   R   P   Y   S   P   N   T   V  
gcc aac atc cag att cca aac caa gag agc tcg cgg aac cgc gcc cga ttc tcg ata gcc 
 A   N   I   Q   I   P   N   Q   E   S   S   R   N   R   A   R   F   S   I   A  
aca tcc ttc gcg agt ctt gaa gaa cca gca gcg aaa gac gag gag gag ttt tct cgc cga 
 T   S   F   A   S   L   E   E   P   A   A   K   D   E   E   E   F   S   R   R  
tac cta gct aca cag ggg tca att tac ttc cgg aaa cac aaa gtg tat ccc cga acc ttt 
 Y   L   A   T   Q   G   S   I   Y   F   R   K   H   K   V   Y   P   R   T   F  
ctc tgg cgg gtg gtt aat gac aac aag gtc ctg gag atc cag tgc gtg gat ttg acg aag 
 L   W   R   V   V   N   D   N   K   V   L   E   I   Q   C   V   D   L   T   K  
gga ggg att gaa cat cat gaa tac gac atc aca tta cgc ttc gac ttc cag gaa gag atc 
 G   G   I   E   H   H   E   Y   D   I   T   L   R   F   D   F   Q   E   E   I  
ctt cca tcc ggg gtt gac atc gct gac ttg gag gat cat gac gtg ttg agt gtg ttt gtg 
 L   P   S   G   V   D   I   A   D   L   E   D   H   D   V   L   S   V   F   V  
atc acg gct tcc aag cag ctg cac acg ttg act ctg cgc ccg gag ttc ttt cgc cga cta 
 I   T   A   S   K   Q   L   H   T   L   T   L   R   P   E   F   F   R   R   L  
gcg gca att gat gag aat gta tct gag tgg tgt aaa tca tgc aca ccg gca cct ctg gct 
 A   A   I   D   E   N   V   S   E   W   C   K   S   C   T   P   A   P   L   A  
ttc tcg cat ccc cat cgg tta cat gcg agc agt cca ctc gag ttg ttt att tct cta gat 
 F   S   H   P   H   R   L   H   A   S   S   P   L   E   L   F   I   S   L   D  
agc ggc gcg ctg ctg cgc ctg act agg agg gct ggt gac gat 
 S   G   A   L   L   R   L   T   R   R   A   G   D   D 
nextgenusfs commented 3 years ago

Okay, so is that model than 3' partial?

hyphaltip commented 3 years ago

yes it is.

nextgenusfs commented 3 years ago

Hi @vcastroagudin, @hyphaltip solution will work ad hoc, an alternative way would be to parse the tbl and fasta file like this:

>>> import funannotate.library as lib
>>> Genes = {}
>>> Genes = lib.tbl2dict('Awesome_rna.tbl', 'Awesome_rna.scaffolds.fa', Genes)
>>> stats = {'genes': 0, 'trna': 0, 'rrna': 0, 'mrna': 0, 'ncrna': 0, 'transcripts': 0, 'partial5': 0, 'partial3':0}
>>> for k, v in Genes.items():
...   stats['genes'] += 1
...   if v['type'] == 'tRNA':
...     stats['trna'] += 1
...   elif v['type'] == 'rRNA':
...     stats['rrna'] += 1
...   elif v['type'] == 'ncRNA':
...     stats['ncrna'] += 1
...   elif v['type'] == 'mRNA':
...     stats['mrna'] += 1
...   for i in range(0, len(v['ids'])):
...     stats['transcripts'] += 1
...     if v['type'] == 'mRNA':
...       if v['partialStart'][i]:
...         stats['partial5'] += 1
...       if v['partialStop'][i]:
...         stats['partial3'] += 1
... 
>>> stats
{'genes': 1734, 'trna': 110, 'rrna': 0, 'mrna': 1624, 'ncrna': 0, 'transcripts': 1737, 'partial5': 0, 'partial3': 0}
>>> 

Would be relatively trivial to add this as an output from funannotate. Where would it make the most sense? A separate file? Or just print to the log file?

hyphaltip commented 3 years ago

nice - I forgot the tbl file has the info too.

nextgenusfs commented 3 years ago

There is probably a way to parse just the partial information faster from tbl file, basically if there are '>' or '<' in the coordinates in the tbl file then it is partial, the above does a little extra turning it into the funannotate dictionary format -- but then other stats are a little easier to grab.

vcastroagudin commented 3 years ago

Thanks

On Fri, Sep 25, 2020 at 1:03 AM Jon Palmer notifications@github.com wrote:

There is probably a way to parse just the partial information faster from tbl file, basically if there are '>' or '<' in the coordinates in the tbl file then it is partial, the above does a little extra turning it into the funannotate dictionary format -- but then other stats are a little easier to grab.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/nextgenusfs/funannotate/issues/481#issuecomment-698721179, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALNO35QPRBRQG76VD7XWCR3SHQQANANCNFSM4RYR5QAQ .

-- Vanina L. Castroagudín, PhD

Postdoctoral Researcher

Mycology & Nematology Genetic Diversity & Biology Laboratory

USDA - ARS

10300 Baltimore Avenue, Bldg 10A, Room 227

Beltsville, MD 20705

Cell: (479) 313-3780

Vanina.Castroagudin@usda.gov vanina.castroagudin@usda.gov

v.castroagudin@gmail.com v.castroagudin@gmail.com

vcastroagudin commented 3 years ago

Jon, What happens with regulatory sequences in the predicted mRNA? If I am not including RNAseq data in the prediction/annotation, will the predicted mRNA always finish in a stop codon? what if the stop codon is not at the end of the predicted mRNA? wouldn't that interpret a complete protein as incomplete?

Vanina

vcastroagudin commented 3 years ago

Hi @vcastroagudin, @hyphaltip solution will work ad hoc, an alternative way would be to parse the tbl and fasta file like this:

>>> import funannotate.library as lib
>>> Genes = {}
>>> Genes = lib.tbl2dict('Awesome_rna.tbl', 'Awesome_rna.scaffolds.fa', Genes)
>>> stats = {'genes': 0, 'trna': 0, 'rrna': 0, 'mrna': 0, 'ncrna': 0, 'transcripts': 0, 'partial5': 0, 'partial3':0}
>>> for k, v in Genes.items():
...   stats['genes'] += 1
...   if v['type'] == 'tRNA':
...     stats['trna'] += 1
...   elif v['type'] == 'rRNA':
...     stats['rrna'] += 1
...   elif v['type'] == 'ncRNA':
...     stats['ncrna'] += 1
...   elif v['type'] == 'mRNA':
...     stats['mrna'] += 1
...   for i in range(0, len(v['ids'])):
...     stats['transcripts'] += 1
...     if v['type'] == 'mRNA':
...       if v['partialStart'][i]:
...         stats['partial5'] += 1
...       if v['partialStop'][i]:
...         stats['partial3'] += 1
... 
>>> stats
{'genes': 1734, 'trna': 110, 'rrna': 0, 'mrna': 1624, 'ncrna': 0, 'transcripts': 1737, 'partial5': 0, 'partial3': 0}
>>> 

Would be relatively trivial to add this as an output from funannotate. Where would it make the most sense? A separate file? Or just print to the log file?

This would be very useful! I think it could be very useful as a separate file in the annotation or prediction miscellaneous folder. Now, in it you mention ncrna.. does funannotate predict non-coding RNA loci????? Where are they listed????

Vanina

nextgenusfs commented 3 years ago

Jon, What happens with regulatory sequences in the predicted mRNA? If I am not including RNAseq data in the prediction/annotation, will the predicted mRNA always finish in a stop codon? what if the stop codon is not at the end of the predicted mRNA? wouldn't that interpret a complete protein as incomplete?

If there is no RNA-seq, then the pipeline won't predict any UTRs. So the mRNA will be the same as the cds-transcript. It will allow for partial genes, ie those that might span or end near a gap in assembly, end of contig, etc. So generally most of the gene models should have proper starts (ATG) and proper stops (TGA, TAA, TAG).

This would be very useful! I think it could be very useful as a separate file in the annotation or prediction miscellaneous folder. Now, in it you mention ncrna.. does funannotate predict non-coding RNA loci????? Where are they listed????

The code will support rRNA, tRNA, mRNA, ncRNA features -- although it only predicts tRNA and mRNA. The others can be added manually by editing the annotation tbl files for example, and then running funannotate fix to re-generate the proper output files.

vcastroagudin commented 3 years ago

Jon do you have any favorite program for predicting ncRNA (other than tRNA) which results could be eventually parsed to funannotate with the fix command?

Thanks

nextgenusfs commented 3 years ago

Admittedly I've not looked into this very much -- mostly I add some ncRNA's that I find more/less manually from transcriptome data. I'm not sure if there is a pipeline that works well for capturing other ncRNAs?

nextgenusfs commented 3 years ago

Hi @vcastroagudin, I added a *.stats.json output file to the _results folder. There is also now a script in the funannotate util menu that can yield the following results:

$ funannotate util stats

Usage:       funannotate util stats <arguments>
version:     1.8.0

Description: Generate JSON file with genome assembly and annotation stats.

Arguments:
  -f, --fasta              Genome FASTA file (Required)
  -o, --out                Output file (JSON format)
  -g, --gff3               Genome Annotation (GFF3 format)
  -t, --tbl                Genome Annotation (NCBI TBL format)
  --transcript_alignments  Transcript alignments (GFF3 format)
  --protein_alignments     Protein alignments (GFF3 format)

Example run and output:

$ funannotate util stats -f rna-seq/annotate_results/Awesome_rna.scaffolds.fa -o test.json -g rna-seq/annotate_results/Awesome_rna.gff3 
Generating stats from Genome FASTA file and GFF3 annotation
Finished writing JSON stats file: test.json
$ cat test.json 
{
    "assembly": {
        "num_contigs": 6,
        "length": 3776588,
        "mean_length": 629431.33,
        "N50": 766242,
        "L50": 2,
        "N90": 449806,
        "L90": 5,
        "GC_content": 38.56
    },
    "annotation": {
        "genes": 1727,
        "common_name": 1421,
        "mRNA": 1620,
        "tRNA": 110,
        "ncRNA": 0,
        "rRNA": 0,
        "avg_gene_length": 1404.69,
        "transcript-level": {
            "CDS_transcripts": 1620,
            "CDS_five_utr": 0,
            "CDS_three_utr": 73,
            "CDS_no_utr": 1396,
            "CDS_five_three_utr": 151,
            "CDS_complete": 1620,
            "CDS_no-start": 0,
            "CDS_no-stop": 0,
            "CDS_no-start_no-stop": 0,
            "total_exons": 2067,
            "total_cds_exons": 1678,
            "multiple_exon_transcript": 62,
            "single_exon_transcript": 1558,
            "avg_exon_length": 482.04,
            "avg_protein_length": 481.1,
            "functional": {
                "go_terms": 0,
                "interproscan": 0,
                "eggnog": 0,
                "pfam": 1336,
                "cazyme": 35,
                "merops": 60,
                "busco": 369,
                "secretion": 0
            }
        }
    }
vcastroagudin commented 3 years ago

Awesome Jon. Thanks so much!

Vanina

Nitin123-4 commented 3 months ago

All the mRNA reported here are full length genes? Or we can further filter them?