Closed vcastroagudin closed 3 years ago
Stops are not in the protein files so it isn't possible to count this directly there but I wrote a short script you can use on the .cds-transcripts.fa file.
git clone https://github.com/hyphaltip/genome-scripts.git
genome-scripts/seq/count_full_ORFs_CDS.py ./Aspergillus_fumigatus_F16311.cds-transcripts.fa
@nextgenusfs - I did notice the translation step in funannotate always drops the last codon even if it isn't a stop codon - we should probably fix that. eg - here's a protein
>F16311_002680-T1 F16311_002680
MAKIYKDTRVDLRPYSPNTVANIQIPNQESSRNRARFSIATSFASLEEPAAKDEEEFSRRYLATQGSIYFRKHKVYPRTF
LWRVVNDNKVLEIQCVDLTKGGIEHHEYDITLRFDFQEEILPSGVDIADLEDHDVLSVFVITASKQLHTLTLRPEFFRRL
AAIDENVSEWCKSCTPAPLAFSHPHRLHASSPLELFISLDSGALLRLTRRAGDD
Here's corresponding coding sequence
>F16311_002680-T1 F16311_002680
ATGGCGAAAATCTATAAGGATACAAGAGTCGACCTTCGGCCATATTCGCCGAACACCGTCGCCAACATCCAGATTCCAAA
CCAAGAGAGCTCGCGGAACCGCGCCCGATTCTCGATAGCCACATCCTTCGCGAGTCTTGAAGAACCAGCAGCGAAAGACG
AGGAGGAGTTTTCTCGCCGATACCTAGCTACACAGGGGTCAATTTACTTCCGGAAACACAAAGTGTATCCCCGAACCTTT
CTCTGGCGGGTGGTTAATGACAACAAGGTCCTGGAGATCCAGTGCGTGGATTTGACGAAGGGAGGGATTGAACATCATGA
ATACGACATCACATTACGCTTCGACTTCCAGGAAGAGATCCTTCCATCCGGGGTTGACATCGCTGACTTGGAGGATCATG
ACGTGTTGAGTGTGTTTGTGATCACGGCTTCCAAGCAGCTGCACACGTTGACTCTGCGCCCGGAGTTCTTTCGCCGACTA
GCGGCAATTGATGAGAATGTATCTGAGTGGTGTAAATCATGCACACCGGCACCTCTGGCTTTCTCGCATCCCCATCGGTT
ACATGCGAGCAGTCCACTCGAGTTGTTTATTTCTCTAGATAGCGGCGCGCTGCTGCGCCTGACTAGGAGGGCTGGTGACG
ATG
errr. yes def need to fix if that is happening. Is this in the protein fasta output from which step? Or the translate function directly?
So the code that is writing the protein fasta files from the tbl format is here: https://github.com/nextgenusfs/funannotate/blob/95b18c05125e68b4c38893914486571c77c0d813/funannotate/library.py#L2281-2323
It seems like it is only stripping if endswith('*')?
So that means it would have to be parsing it wrong from the tbl file.... which doesn't seem to be happening??
>>> import funannotate.library as lib
>>> Genes = {}
>>> Genes = lib.tbl2dict('Awesome_rna.tbl', 'Awesome_rna.scaffolds.fa', Genes)
>>> for k,v in Genes.items():
... for i in range(0, len(v['ids'])):
... print(v['ids'][i], v['protein'][i])
...
FUN_000001-T1 MKENELKNEKSVDVLSFKQLESQKTVLPQDVFQNELTWFCYEIYKSLAFRIWMLLWLPLSIWWKLSNNWVYPLIVSLLVLFLGPIFVLVICGLSRKRSLSKQLIQFCKEITKNTPCLDTHDWEVVVANLNSYLYENKAWNTKYFFFNATDCEKMFRTTVLEPFSLKKDKAAKVKSFKDSVPYIEEALGVYFTEVEKHWKLFNTEKSWSPVGLEDAKLPKEAHRSKFTWLLGRIFTIYFLPLCPAFFNCIYTSRNDDLISDFLWTVVIFLFMVWLFRNIRMIVLSVKMEHKMQFLSTIINEQESGANGWDEIARKINRYLFEKKVWNNEEFFFDGIDCEWFFSHFFYRLLSAKKSMWLLPLNVELWPYIKEAQLSRNEESLMKK*
FUN_000002-T1 MVSLANVLTNATAATLSAWSNTVPLETYFHFDEASGFGDYYLNVSVIWMNETLYETRIVPAIINVREWLDHMEANDPSPSVTNPYETSGYYAFSTVVPVLMGNMKVS*
FUN_000003-T1 MSACPCNIVILPVEILKNSSKDTKYSLYTTINRGYDVPRLKYGIIVSPRVHSLETLFSDLGFDKNIEKSSLYLLLNDPTLAYPNFHEHFEQLKGETNKDLSLPTYYIPKVQFLTEAFDSEHTLATIGYKPNNKESYEITGFTSMGNGYGIKLFNYSVIHMMRSHKCKRVVADIIMEHDLLGYYEKKLGFVEVQRFKVLKEQHQVKVFDDKVDFTKDFHVIKMIKELGNHKL*
FUN_000004-T1 MTAAHPVAQLTAEAYPKVKRNPNFKVLDSEDLAYFRSILSNDEILNSQAPEELASFNQDWMKKYRGQSNLILLPNSTDKVSKIMKYCNDKKLAVVPQGGNTDLVGASVPVFDEIVLSLRNMNKVRDFDPVSGTFKCDAGVVMRDAHQFLHDHDHIFPLDLPSRNNCQVGGVVSTNAGGLNFLRYGSLHGNVLGLEVVLPNGEIISNINALRKDNTGYDLKQLFIGAEGTIGVVTGVSIVAAAKPKALNAVFFGIENFDTVQKLFVKAKSELSEILSAFEFMDRGSIECTIEYLKDLPFPLENQHNFYVLIETSGSNKRHDDEKLTAFLKDTTDSKLISEGMMAKDKADYDRLWTWRKSVPTACNSYGGMYKYDMSLQLKDLYSVSAAVTERLNAAGLIGDAPKPVVKSCGYGHVGDGNIHLNIAVREFTKQIEDLLEPFVYEYIASKKGSISAEHGIGFHKKGKLHYTRSDIEIRFMKDIKNHYDPNGILNPYKYI*
FUN_000005-T1 MTKSDETTATSLNAKTLKSFESTLRIPTYPREGVKQGIVHLGVGAFHRSHLAVFMHRLMQEHHLKDWSICGVGLMKADALMRDAMKAQDCLYTLVERGIKDTNAYIVGSITAYMYAPDDPRAVIEKMANPDTHIVSLTVTENGYYHSEATNSLMTDAPEIINDLNHPEKPDTLYGYLYEALLLRYKRGLTPFTIMSCDNMP*
I am just reporting what was in the annotate_results file so I'm not sure.
I see now - there's one extra base at the end -- so it is working. sorry for false alarm, I just assumed the CDS would be in intervals of 3 so saw the trailing codon at the end of the line.
>F16311_002680-T1 F16311_002680
ATGGCGAAAATCTATAAGGATACAAGAGTCGACCTTCGGCCATATTCGCCGAACACCGTCGCCAACATCCAGATTCCAAA
CCAAGAGAGCTCGCGGAACCGCGCCCGATTCTCGATAGCCACATCCTTCGCGAGTCTTGAAGAACCAGCAGCGAAAGACG
AGGAGGAGTTTTCTCGCCGATACCTAGCTACACAGGGGTCAATTTACTTCCGGAAACACAAAGTGTATCCCCGAACCTTT
CTCTGGCGGGTGGTTAATGACAACAAGGTCCTGGAGATCCAGTGCGTGGATTTGACGAAGGGAGGGATTGAACATCATGA
ATACGACATCACATTACGCTTCGACTTCCAGGAAGAGATCCTTCCATCCGGGGTTGACATCGCTGACTTGGAGGATCATG
ACGTGTTGAGTGTGTTTGTGATCACGGCTTCCAAGCAGCTGCACACGTTGACTCTGCGCCCGGAGTTCTTTCGCCGACTA
GCGGCAATTGATGAGAATGTATCTGAGTGGTGTAAATCATGCACACCGGCACCTCTGGCTTTCTCGCATCCCCATCGGTT
ACATGCGAGCAGTCCACTCGAGTTGTTTATTTCTCTAGATAGCGGCGCGCTGCTGCGCCTGACTAGGAGGGCTGGTGACG
ATG
atg gcg aaa atc tat aag gat aca aga gtc gac ctt cgg cca tat tcg ccg aac acc gtc
M A K I Y K D T R V D L R P Y S P N T V
gcc aac atc cag att cca aac caa gag agc tcg cgg aac cgc gcc cga ttc tcg ata gcc
A N I Q I P N Q E S S R N R A R F S I A
aca tcc ttc gcg agt ctt gaa gaa cca gca gcg aaa gac gag gag gag ttt tct cgc cga
T S F A S L E E P A A K D E E E F S R R
tac cta gct aca cag ggg tca att tac ttc cgg aaa cac aaa gtg tat ccc cga acc ttt
Y L A T Q G S I Y F R K H K V Y P R T F
ctc tgg cgg gtg gtt aat gac aac aag gtc ctg gag atc cag tgc gtg gat ttg acg aag
L W R V V N D N K V L E I Q C V D L T K
gga ggg att gaa cat cat gaa tac gac atc aca tta cgc ttc gac ttc cag gaa gag atc
G G I E H H E Y D I T L R F D F Q E E I
ctt cca tcc ggg gtt gac atc gct gac ttg gag gat cat gac gtg ttg agt gtg ttt gtg
L P S G V D I A D L E D H D V L S V F V
atc acg gct tcc aag cag ctg cac acg ttg act ctg cgc ccg gag ttc ttt cgc cga cta
I T A S K Q L H T L T L R P E F F R R L
gcg gca att gat gag aat gta tct gag tgg tgt aaa tca tgc aca ccg gca cct ctg gct
A A I D E N V S E W C K S C T P A P L A
ttc tcg cat ccc cat cgg tta cat gcg agc agt cca ctc gag ttg ttt att tct cta gat
F S H P H R L H A S S P L E L F I S L D
agc ggc gcg ctg ctg cgc ctg act agg agg gct ggt gac gat
S G A L L R L T R R A G D D
Okay, so is that model than 3' partial?
yes it is.
Hi @vcastroagudin, @hyphaltip solution will work ad hoc, an alternative way would be to parse the tbl and fasta file like this:
>>> import funannotate.library as lib
>>> Genes = {}
>>> Genes = lib.tbl2dict('Awesome_rna.tbl', 'Awesome_rna.scaffolds.fa', Genes)
>>> stats = {'genes': 0, 'trna': 0, 'rrna': 0, 'mrna': 0, 'ncrna': 0, 'transcripts': 0, 'partial5': 0, 'partial3':0}
>>> for k, v in Genes.items():
... stats['genes'] += 1
... if v['type'] == 'tRNA':
... stats['trna'] += 1
... elif v['type'] == 'rRNA':
... stats['rrna'] += 1
... elif v['type'] == 'ncRNA':
... stats['ncrna'] += 1
... elif v['type'] == 'mRNA':
... stats['mrna'] += 1
... for i in range(0, len(v['ids'])):
... stats['transcripts'] += 1
... if v['type'] == 'mRNA':
... if v['partialStart'][i]:
... stats['partial5'] += 1
... if v['partialStop'][i]:
... stats['partial3'] += 1
...
>>> stats
{'genes': 1734, 'trna': 110, 'rrna': 0, 'mrna': 1624, 'ncrna': 0, 'transcripts': 1737, 'partial5': 0, 'partial3': 0}
>>>
Would be relatively trivial to add this as an output from funannotate. Where would it make the most sense? A separate file? Or just print to the log file?
nice - I forgot the tbl file has the info too.
There is probably a way to parse just the partial information faster from tbl file, basically if there are '>' or '<' in the coordinates in the tbl file then it is partial, the above does a little extra turning it into the funannotate dictionary format -- but then other stats are a little easier to grab.
Thanks
On Fri, Sep 25, 2020 at 1:03 AM Jon Palmer notifications@github.com wrote:
There is probably a way to parse just the partial information faster from tbl file, basically if there are '>' or '<' in the coordinates in the tbl file then it is partial, the above does a little extra turning it into the funannotate dictionary format -- but then other stats are a little easier to grab.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/nextgenusfs/funannotate/issues/481#issuecomment-698721179, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALNO35QPRBRQG76VD7XWCR3SHQQANANCNFSM4RYR5QAQ .
-- Vanina L. Castroagudín, PhD
Postdoctoral Researcher
Mycology & Nematology Genetic Diversity & Biology Laboratory
USDA - ARS
10300 Baltimore Avenue, Bldg 10A, Room 227
Beltsville, MD 20705
Cell: (479) 313-3780
Vanina.Castroagudin@usda.gov vanina.castroagudin@usda.gov
v.castroagudin@gmail.com v.castroagudin@gmail.com
Jon, What happens with regulatory sequences in the predicted mRNA? If I am not including RNAseq data in the prediction/annotation, will the predicted mRNA always finish in a stop codon? what if the stop codon is not at the end of the predicted mRNA? wouldn't that interpret a complete protein as incomplete?
Vanina
Hi @vcastroagudin, @hyphaltip solution will work ad hoc, an alternative way would be to parse the tbl and fasta file like this:
>>> import funannotate.library as lib >>> Genes = {} >>> Genes = lib.tbl2dict('Awesome_rna.tbl', 'Awesome_rna.scaffolds.fa', Genes) >>> stats = {'genes': 0, 'trna': 0, 'rrna': 0, 'mrna': 0, 'ncrna': 0, 'transcripts': 0, 'partial5': 0, 'partial3':0} >>> for k, v in Genes.items(): ... stats['genes'] += 1 ... if v['type'] == 'tRNA': ... stats['trna'] += 1 ... elif v['type'] == 'rRNA': ... stats['rrna'] += 1 ... elif v['type'] == 'ncRNA': ... stats['ncrna'] += 1 ... elif v['type'] == 'mRNA': ... stats['mrna'] += 1 ... for i in range(0, len(v['ids'])): ... stats['transcripts'] += 1 ... if v['type'] == 'mRNA': ... if v['partialStart'][i]: ... stats['partial5'] += 1 ... if v['partialStop'][i]: ... stats['partial3'] += 1 ... >>> stats {'genes': 1734, 'trna': 110, 'rrna': 0, 'mrna': 1624, 'ncrna': 0, 'transcripts': 1737, 'partial5': 0, 'partial3': 0} >>>
Would be relatively trivial to add this as an output from funannotate. Where would it make the most sense? A separate file? Or just print to the log file?
This would be very useful! I think it could be very useful as a separate file in the annotation or prediction miscellaneous folder. Now, in it you mention ncrna.. does funannotate predict non-coding RNA loci????? Where are they listed????
Vanina
Jon, What happens with regulatory sequences in the predicted mRNA? If I am not including RNAseq data in the prediction/annotation, will the predicted mRNA always finish in a stop codon? what if the stop codon is not at the end of the predicted mRNA? wouldn't that interpret a complete protein as incomplete?
If there is no RNA-seq, then the pipeline won't predict any UTRs. So the mRNA will be the same as the cds-transcript. It will allow for partial genes, ie those that might span or end near a gap in assembly, end of contig, etc. So generally most of the gene models should have proper starts (ATG) and proper stops (TGA, TAA, TAG).
This would be very useful! I think it could be very useful as a separate file in the annotation or prediction miscellaneous folder. Now, in it you mention ncrna.. does funannotate predict non-coding RNA loci????? Where are they listed????
The code will support rRNA, tRNA, mRNA, ncRNA features -- although it only predicts tRNA and mRNA. The others can be added manually by editing the annotation tbl files for example, and then running funannotate fix
to re-generate the proper output files.
Jon do you have any favorite program for predicting ncRNA (other than tRNA) which results could be eventually parsed to funannotate with the fix command?
Thanks
Admittedly I've not looked into this very much -- mostly I add some ncRNA's that I find more/less manually from transcriptome data. I'm not sure if there is a pipeline that works well for capturing other ncRNAs?
Hi @vcastroagudin, I added a *.stats.json output file to the _results folder. There is also now a script in the funannotate util
menu that can yield the following results:
$ funannotate util stats
Usage: funannotate util stats <arguments>
version: 1.8.0
Description: Generate JSON file with genome assembly and annotation stats.
Arguments:
-f, --fasta Genome FASTA file (Required)
-o, --out Output file (JSON format)
-g, --gff3 Genome Annotation (GFF3 format)
-t, --tbl Genome Annotation (NCBI TBL format)
--transcript_alignments Transcript alignments (GFF3 format)
--protein_alignments Protein alignments (GFF3 format)
Example run and output:
$ funannotate util stats -f rna-seq/annotate_results/Awesome_rna.scaffolds.fa -o test.json -g rna-seq/annotate_results/Awesome_rna.gff3
Generating stats from Genome FASTA file and GFF3 annotation
Finished writing JSON stats file: test.json
$ cat test.json
{
"assembly": {
"num_contigs": 6,
"length": 3776588,
"mean_length": 629431.33,
"N50": 766242,
"L50": 2,
"N90": 449806,
"L90": 5,
"GC_content": 38.56
},
"annotation": {
"genes": 1727,
"common_name": 1421,
"mRNA": 1620,
"tRNA": 110,
"ncRNA": 0,
"rRNA": 0,
"avg_gene_length": 1404.69,
"transcript-level": {
"CDS_transcripts": 1620,
"CDS_five_utr": 0,
"CDS_three_utr": 73,
"CDS_no_utr": 1396,
"CDS_five_three_utr": 151,
"CDS_complete": 1620,
"CDS_no-start": 0,
"CDS_no-stop": 0,
"CDS_no-start_no-stop": 0,
"total_exons": 2067,
"total_cds_exons": 1678,
"multiple_exon_transcript": 62,
"single_exon_transcript": 1558,
"avg_exon_length": 482.04,
"avg_protein_length": 481.1,
"functional": {
"go_terms": 0,
"interproscan": 0,
"eggnog": 0,
"pfam": 1336,
"cazyme": 35,
"merops": 60,
"busco": 369,
"secretion": 0
}
}
}
Awesome Jon. Thanks so much!
Vanina
All the mRNA reported here are full length genes? Or we can further filter them?
Hello,
is there a way to know which is the percentage of complete sequences among from the ORFs/proteins predicted by Funannotate?
Thanks.