Open JanaSperschneider opened 1 year ago
Hi @JanaSperschneider, certainly looks incorrect from what you posted above. Is it possible to share the FASTA + GFF3 files so I can see if I can replicate? I did pull out the annotation processing part of funannotate to a separate repo/tool -- https://github.com/nextgenusfs/gfftk -- the eventual plan is that gfftk will be a dependency of funannotate. I'm curious if it would have the same behavior.
I think I see where the error is -- the conversion code was originally written to phase/identify de novo alternative transcripts and order them by relative expression value -- so the default funanntoate internal options are enabled in funannotate util gff2tbl
as it just re-uses that function, which I think is likely causing the error. This should hopefully not be happening in gfftk, but if you are unable to test gfftk I can push the change to funannotate repo which could involve adding an option to that script.
If you want to try gfftk, its a single entry point script can get you to transcripts/proteins/etc -- no real need to get to gbk format:
$ gfftk convert
usage: gfftk convert -f -i [-o] [--input-format] [--output-format] [-n] [--debug] [-h] [--version]
convert GFF3/tbl format into another format [output gff3, gtf, tbl, fasta].
Required arguments:
-f , --fasta genome in FASTA format
-i , --input annotation in GFF3 or TBL format
Optional arguments:
-o , --out write converted output to file (default: stdout)
--input-format format of input file [gff3, tbl]. (default: auto)
--output-format format of output file [gff3, gtf, tbl, proteins, transcripts, cds-transcripts]. (default: auto)
-n, --no-stop for proteins output, do not write stop codons (*) (default: False)
--debug write parsing errors to stderr (default: False)
Other arguments:
-h, --help show this help message and exit
--version show program's version number and exit
Thanks @nextgenusfs, I am attaching the files for this example.
It would be great to fix it. I was using the transcripts from gff3 -> gbk -> fasta for Salmon quantification, and thus assigned the wrong expression to the isoforms. That's how I noticed something wasn't right.
Cheers, Jana example.zip
Sorry about the mix-up! This isn't widely known I don't think, but GBK format does not have a mechanism/method to deal with alternative transcripts properly -- meaning that when you create a GBK file that has alternative transcripts there is no way to match up the proper 'mRNA' feature with the correct corresponding 'CDS' feature. Even more frustrating is that using tbl2asn
to generate the GBK files it will sometimes write them out of order, ie the transcript for -T1 followed by the CDS for -T2. I tried many years ago to come up with a way to try to fix this, but I could never entirely figure it out and there were always edge cases that caused problems. You'll be better off writing the transcripts directly from the GFF3 where the alternative transcripts are properly named and are linked to the proper "parents".
Okay, so seems to be correct in gfftk
:
$ cat example.gff3
##gff-version 3
chr1_C GenBank gene 30412 32485 . - . ID=20QLD87_000005;
chr1_C GenBank mRNA 30412 32485 . - . ID=20QLD87_000005-T1;Parent=20QLD87_000005;product=hypothetical protein;
chr1_C GenBank exon 30412 32299 . - . ID=20QLD87_000005-T1.exon1;Parent=20QLD87_000005-T1;
chr1_C GenBank CDS 32462 32485 . - 0 ID=20QLD87_000005-T1.cds;Parent=20QLD87_000005-T1;
chr1_C GenBank CDS 30823 31917 . - 0 ID=20QLD87_000005-T1.cds;Parent=20QLD87_000005-T1;
chr1_C GenBank mRNA 30412 32485 . - . ID=20QLD87_000005-T2;Parent=20QLD87_000005;product=hypothetical protein;
chr1_C GenBank exon 32462 32485 . - . ID=20QLD87_000005-T2.exon1;Parent=20QLD87_000005-T2;
chr1_C GenBank exon 30823 31917 . - . ID=20QLD87_000005-T2.exon2;Parent=20QLD87_000005-T2;
chr1_C GenBank CDS 30823 31872 . - 0 ID=20QLD87_000005-T2.cds;Parent=20QLD87_000005-T2;
$ gfftk convert -f chr1_C.fasta -i example.gff3 -o example.tbl
$ cat example.tbl
>Feature chr1_C
1 40000 REFERENCE
CFMR 12345
32485 30412 gene
locus_tag 20QLD87_000005
32299 30412 mRNA
product hypothetical protein
transcript_id gnl|ncbi|20QLD87_000005-T1_mrna
protein_id gnl|ncbi|20QLD87_000005-T1
32485 32462 CDS
31917 30823
codon_start 1
product hypothetical protein
transcript_id gnl|ncbi|20QLD87_000005-T1_mrna
protein_id gnl|ncbi|20QLD87_000005-T1
32485 32462 mRNA
31917 30823
product hypothetical protein
transcript_id gnl|ncbi|20QLD87_000005-T2_mrna
protein_id gnl|ncbi|20QLD87_000005-T2
31872 30823 CDS
codon_start 1
product hypothetical protein
transcript_id gnl|ncbi|20QLD87_000005-T2_mrna
protein_id gnl|ncbi|20QLD87_000005-T2
$ gfftk convert -f chr1_C.fasta -i example.gff3 -o example.fasta --output-format proteins
$ cat example.fasta
>20QLD87_000005-T1 20QLD87_000005
MDFTQNLEVLFTGIACPNSPLQKMKTWRWLCLLPLFTENIQAGFTRTTASLMTAPEALSSVKPLNAAKDLKVPEASFKSS
QKVAEGLKGAPSVSNSKSQEAITMRAHKDKTANPGNKATEENCPHGRITRRSLVTLRRRGCFGRVGSGLIAASNGLANAG
TGLQRFDKLTAGQKLSWLQRKTINGVKHTVSVLVVKPVKGLAFLGRKTFQAAMAVIRGTGHVAAKTGAVAAKGVFISGYI
VYKAGTGMLKVIGRLALHAWKYFGKGLAATGRGLQKAGKGMEAQSATRLAKKVPHPHPNAPVPPHPYVRAPPYPYAYAPP
PQYAHAPPPPYAHAPPHQFAPTPPHQFTPTPPHQFAPTSPHQVPPPPYQPPK*
>20QLD87_000005-T2 20QLD87_000005
MKTWRWLCLLPLFTENIQAGFTRTTASLMTAPEALSSVKPLNAAKDLKVPEASFKSSQKVAEGLKGAPSVSNSKSQEAIT
MRAHKDKTANPGNKATEENCPHGRITRRSLVTLRRRGCFGRVGSGLIAASNGLANAGTGLQRFDKLTAGQKLSWLQRKTI
NGVKHTVSVLVVKPVKGLAFLGRKTFQAAMAVIRGTGHVAAKTGAVAAKGVFISGYIVYKAGTGMLKVIGRLALHAWKYF
GKGLAATGRGLQKAGKGMEAQSATRLAKKVPHPHPNAPVPPHPYVRAPPYPYAYAPPPQYAHAPPPPYAHAPPHQFAPTP
PHQFTPTPPHQFAPTSPHQVPPPPYQPPK*
Looks like the gfftk convert --output-format nucleotides
needs a quick fix so you can go directly to transcripts for salmon/kallisto from GFF3. Let me see if I can get it working quickly, you could just install in existing funannotate environment. Give me a few minutes....
Great, thanks for this! Yes I agree writing out the FASTAs from the gff3 is the safest and quickest way :-)
Okay you can install the latest from gfftk repo like this:
python -m pip install git+https://github.com/nextgenusfs/gfftk.git
And then it will allow you to convert to either full length transcripts (including the UTRs, etc) or also just cds-transcripts:
$ gfftk convert -f chr1_C.fasta -i example.gff3 --output-format cds-transcripts
>20QLD87_000005-T1 20QLD87_000005
ATGGACTTCACTCAAAACCTCGAGGTCCTATTCACGGGTATCGCCTGTCCAAACTCCCCACTGCAGAAGATGAAGACATG
GAGATGGCTTTGCCTCCTTCCTCTTTTTACAGAAAATATCCAGGCTGGTTTCACTCGCACTACAGCTAGCTTGATGACGG
CCCCCGAAGCGTTATCATCAGTCAAGCCCCTCAACGCGGCAAAGGATCTAAAAGTTCCCGAAGCAAGTTTCAAAAGTAGT
CAAAAAGTAGCGGAAGGTCTTAAAGGAGCTCCTTCGGTGAGCAATTCAAAATCGCAGGAGGCAATTACGATGAGGGCGCA
CAAAGACAAAACTGCAAACCCAGGCAACAAAGCAACAGAAGAAAATTGCCCCCATGGTCGTATCACAAGAAGAAGTCTCG
TTACGCTCAGACGTCGCGGTTGTTTTGGTAGAGTAGGCTCAGGGTTAATTGCTGCGTCTAATGGACTGGCAAATGCAGGA
ACTGGATTGCAGCGTTTCGACAAATTAACTGCTGGTCAGAAACTATCATGGCTTCAAAGAAAAACAATAAACGGAGTTAA
GCACACAGTTTCTGTCCTGGTGGTCAAGCCGGTTAAAGGACTGGCCTTCCTAGGAAGAAAAACGTTTCAAGCAGCCATGG
CGGTTATCCGTGGGACCGGTCACGTGGCCGCAAAAACAGGTGCGGTGGCTGCCAAGGGAGTTTTTATTTCAGGATATATA
GTCTATAAAGCTGGCACGGGAATGCTCAAGGTGATAGGCCGACTGGCCCTTCATGCTTGGAAATATTTCGGAAAGGGATT
GGCAGCAACAGGAAGAGGGCTCCAAAAGGCGGGCAAGGGCATGGAGGCTCAATCGGCGACACGGCTCGCCAAGAAAGTCC
CGCATCCGCATCCGAACGCCCCTGTGCCCCCACATCCGTATGTGCGTGCGCCCCCATATCCGTATGCGTATGCGCCCCCA
CCTCAGTATGCGCATGCGCCCCCACCTCCGTATGCGCATGCGCCCCCACATCAGTTTGCGCCCACTCCCCCACATCAGTT
CACACCCACTCCCCCACATCAGTTCGCGCCCACTTCCCCACATCAAGTCCCGCCGCCGCCGTATCAGCCTCCGAAGTGA
>20QLD87_000005-T2 20QLD87_000005
ATGAAGACATGGAGATGGCTTTGCCTCCTTCCTCTTTTTACAGAAAATATCCAGGCTGGTTTCACTCGCACTACAGCTAG
CTTGATGACGGCCCCCGAAGCGTTATCATCAGTCAAGCCCCTCAACGCGGCAAAGGATCTAAAAGTTCCCGAAGCAAGTT
TCAAAAGTAGTCAAAAAGTAGCGGAAGGTCTTAAAGGAGCTCCTTCGGTGAGCAATTCAAAATCGCAGGAGGCAATTACG
ATGAGGGCGCACAAAGACAAAACTGCAAACCCAGGCAACAAAGCAACAGAAGAAAATTGCCCCCATGGTCGTATCACAAG
AAGAAGTCTCGTTACGCTCAGACGTCGCGGTTGTTTTGGTAGAGTAGGCTCAGGGTTAATTGCTGCGTCTAATGGACTGG
CAAATGCAGGAACTGGATTGCAGCGTTTCGACAAATTAACTGCTGGTCAGAAACTATCATGGCTTCAAAGAAAAACAATA
AACGGAGTTAAGCACACAGTTTCTGTCCTGGTGGTCAAGCCGGTTAAAGGACTGGCCTTCCTAGGAAGAAAAACGTTTCA
AGCAGCCATGGCGGTTATCCGTGGGACCGGTCACGTGGCCGCAAAAACAGGTGCGGTGGCTGCCAAGGGAGTTTTTATTT
CAGGATATATAGTCTATAAAGCTGGCACGGGAATGCTCAAGGTGATAGGCCGACTGGCCCTTCATGCTTGGAAATATTTC
GGAAAGGGATTGGCAGCAACAGGAAGAGGGCTCCAAAAGGCGGGCAAGGGCATGGAGGCTCAATCGGCGACACGGCTCGC
CAAGAAAGTCCCGCATCCGCATCCGAACGCCCCTGTGCCCCCACATCCGTATGTGCGTGCGCCCCCATATCCGTATGCGT
ATGCGCCCCCACCTCAGTATGCGCATGCGCCCCCACCTCCGTATGCGCATGCGCCCCCACATCAGTTTGCGCCCACTCCC
CCACATCAGTTCACACCCACTCCCCCACATCAGTTCGCGCCCACTTCCCCACATCAAGTCCCGCCGCCGCCGTATCAGCC
TCCGAAGTGA
$ gfftk convert -f chr1_C.fasta -i example.gff3 --output-format transcripts
>20QLD87_000005-T1 20QLD87_000005
GGAATATCGACTTCGCAACCCAGCCACCAACCTTCCTCCCTGTTTTCTATCCAACTTTCCGTCTCAATTCACATCAAAAT
TAATCCATTCCCTTCATTCTTTATCGCTGGGCACTTCCGATATCATAACCTGGTGGTTCATTCTGTCGTAGTGACTAAAG
ATCAATTACACTTCACCAGTATCTCGCAAGGAGGTCCAATCGTCAACTTATCAGAACTTTGATAAAATCAACATCGCACT
CGTTAACATTTAACTTATTATACTTCTTTCCTTTGTGATTTTGGGTAGCAGATAAATTTTTAGTCTCCCTTATCTCCACG
TATCAGTTTTCGTAAAAGCAGGGCACCGCAAGATACCAAAGCTCATAACCTGGGTGAAGTAGGTCCTATTCACGGGTATC
GCCTGTCCAAACTCCCCACTGCAGAAGATGAAGACATGGAGATGGCTTTGCCTCCTTCCTCTTTTTACAGAAAATATCCA
GGCTGGTTTCACTCGCACTACAGCTAGCTTGATGACGGCCCCCGAAGCGTTATCATCAGTCAAGCCCCTCAACGCGGCAA
AGGATCTAAAAGTTCCCGAAGCAAGTTTCAAAAGTAGTCAAAAAGTAGCGGAAGGTCTTAAAGGAGCTCCTTCGGTGAGC
AATTCAAAATCGCAGGAGGCAATTACGATGAGGGCGCACAAAGACAAAACTGCAAACCCAGGCAACAAAGCAACAGAAGA
AAATTGCCCCCATGGTCGTATCACAAGAAGAAGTCTCGTTACGCTCAGACGTCGCGGTTGTTTTGGTAGAGTAGGCTCAG
GGTTAATTGCTGCGTCTAATGGACTGGCAAATGCAGGAACTGGATTGCAGCGTTTCGACAAATTAACTGCTGGTCAGAAA
CTATCATGGCTTCAAAGAAAAACAATAAACGGAGTTAAGCACACAGTTTCTGTCCTGGTGGTCAAGCCGGTTAAAGGACT
GGCCTTCCTAGGAAGAAAAACGTTTCAAGCAGCCATGGCGGTTATCCGTGGGACCGGTCACGTGGCCGCAAAAACAGGTG
CGGTGGCTGCCAAGGGAGTTTTTATTTCAGGATATATAGTCTATAAAGCTGGCACGGGAATGCTCAAGGTGATAGGCCGA
CTGGCCCTTCATGCTTGGAAATATTTCGGAAAGGGATTGGCAGCAACAGGAAGAGGGCTCCAAAAGGCGGGCAAGGGCAT
GGAGGCTCAATCGGCGACACGGCTCGCCAAGAAAGTCCCGCATCCGCATCCGAACGCCCCTGTGCCCCCACATCCGTATG
TGCGTGCGCCCCCATATCCGTATGCGTATGCGCCCCCACCTCAGTATGCGCATGCGCCCCCACCTCCGTATGCGCATGCG
CCCCCACATCAGTTTGCGCCCACTCCCCCACATCAGTTCACACCCACTCCCCCACATCAGTTCGCGCCCACTTCCCCACA
TCAAGTCCCGCCGCCGCCGTATCAGCCTCCGAAGTGAATTTTATCATAATGATTTTGGAACCAAACTCTGTAATCCTCAA
GGCTTTCCTTCCCTTGATGCTCCTCCTTGGATTTCACCGGAATTCTTCGATAAAAAGATGCAATGGACAATCTACAAACA
CTGTTACAATGGACATTCTGGTTTGCAAGCAGACAACCAACATGTATTTACATACATACAAGACAAAAGGAGACATGGGT
TAGAATAGGTCCAGTAGATGGCAATCTTTGGCTGTTTGGATTAGATACCTCACCATTAGGCTTTCAGCTGATGATTGTCA
CAACCTGCTATGACAGCTGCATTTCATCTTCCTGGTTTCTTTTTCTTGAATGTACTTCAAATTGGTGAGGGGAAGCTTTG
CTTTTCCAGCTGCTTGGTTTTTCTTTTGGAGTGGAGAATGAATCACTA
>20QLD87_000005-T2 20QLD87_000005
ATGGACTTCACTCAAAACCTCGAGGTCCTATTCACGGGTATCGCCTGTCCAAACTCCCCACTGCAGAAGATGAAGACATG
GAGATGGCTTTGCCTCCTTCCTCTTTTTACAGAAAATATCCAGGCTGGTTTCACTCGCACTACAGCTAGCTTGATGACGG
CCCCCGAAGCGTTATCATCAGTCAAGCCCCTCAACGCGGCAAAGGATCTAAAAGTTCCCGAAGCAAGTTTCAAAAGTAGT
CAAAAAGTAGCGGAAGGTCTTAAAGGAGCTCCTTCGGTGAGCAATTCAAAATCGCAGGAGGCAATTACGATGAGGGCGCA
CAAAGACAAAACTGCAAACCCAGGCAACAAAGCAACAGAAGAAAATTGCCCCCATGGTCGTATCACAAGAAGAAGTCTCG
TTACGCTCAGACGTCGCGGTTGTTTTGGTAGAGTAGGCTCAGGGTTAATTGCTGCGTCTAATGGACTGGCAAATGCAGGA
ACTGGATTGCAGCGTTTCGACAAATTAACTGCTGGTCAGAAACTATCATGGCTTCAAAGAAAAACAATAAACGGAGTTAA
GCACACAGTTTCTGTCCTGGTGGTCAAGCCGGTTAAAGGACTGGCCTTCCTAGGAAGAAAAACGTTTCAAGCAGCCATGG
CGGTTATCCGTGGGACCGGTCACGTGGCCGCAAAAACAGGTGCGGTGGCTGCCAAGGGAGTTTTTATTTCAGGATATATA
GTCTATAAAGCTGGCACGGGAATGCTCAAGGTGATAGGCCGACTGGCCCTTCATGCTTGGAAATATTTCGGAAAGGGATT
GGCAGCAACAGGAAGAGGGCTCCAAAAGGCGGGCAAGGGCATGGAGGCTCAATCGGCGACACGGCTCGCCAAGAAAGTCC
CGCATCCGCATCCGAACGCCCCTGTGCCCCCACATCCGTATGTGCGTGCGCCCCCATATCCGTATGCGTATGCGCCCCCA
CCTCAGTATGCGCATGCGCCCCCACCTCCGTATGCGCATGCGCCCCCACATCAGTTTGCGCCCACTCCCCCACATCAGTT
CACACCCACTCCCCCACATCAGTTCGCGCCCACTTCCCCACATCAAGTCCCGCCGCCGCCGTATCAGCCTCCGAAGTGA
So for clarification, the problem here is going through GBK format and that is causing the identifiers to be inverted due to limitation in the the Genbank format for alternative transcripts.
$ funannotate util gff2tbl -g example.gff3 -f chr1_C.fasta
>Feature chr1_C
1 40000 REFERENCE
CFMR 12345
32485 30412 gene
locus_tag 20QLD87_000005
32299 30412 mRNA
product hypothetical protein
transcript_id gnl|ncbi|20QLD87_000005-T1_mrna
protein_id gnl|ncbi|20QLD87_000005-T1
32485 32462 CDS
31917 30823
codon_start 1
product hypothetical protein
transcript_id gnl|ncbi|20QLD87_000005-T1_mrna
protein_id gnl|ncbi|20QLD87_000005-T1
32485 32462 mRNA
31917 30823
product hypothetical protein
transcript_id gnl|ncbi|20QLD87_000005-T2_mrna
protein_id gnl|ncbi|20QLD87_000005-T2
31872 30823 CDS
codon_start 1
product hypothetical protein
transcript_id gnl|ncbi|20QLD87_000005-T2_mrna
protein_id gnl|ncbi|20QLD87_000005-T2
Are you using the latest release? 1.8.13
Hi there,
I am using funannotate to convert a gff3 to a .tbl file, then to convert the .tbl file to a .gbk file, and then to spit out the transcript/protein FASTAs.
However I discovered a mix-up between the T1/T2 sequences using this .tbl file:
When I translate the CDS of 20QLD87_000005-T1 (32462-32485 + 30823-31917) I get this protein:
MDFTQNLEVLFTGIACPNSPLQKMKTWRWLCLLPLFTENIQAGFTRTTASLMTAPEALSSVKPLNAAKDLKVPEASFKSSQKVAEGLKGAPSVSNSKSQEAITMRAHKDKTANPGNKATEENCPHGRITRRSLVTLRRRGCFGRVGSGLIAASNGLANAGTGLQRFDKLTAGQKLSWLQRKTINGVKHTVSVLVVKPVKGLAFLGRKTFQAAMAVIRGTGHVAAKTGAVAAKGVFISGYIVYKAGTGMLKVIGRLALHAWKYFGKGLAATGRGLQKAGKGMEAQSATRLAKKVPHPHPNAPVPPHPYVRAPPYPYAYAPPPQYAHAPPPPYAHAPPHQFAPTPPHQFTPTPPHQFAPTSPHQVPPPPYQPPK
When I predict the ORF in the mRNA of 20QLD87_000005-T1 (30412-32299) I get this protein, which is actually 20QLD87_000005-T2!
MKTWRWLCLLPLFTENIQAGFTRTTASLMTAPEALSSVKPLNAAKDLKVPEASFKSSQKVAEGLKGAPSVSNSKSQEAITMRAHKDKTANPGNKATEENCPHGRITRRSLVTLRRRGCFGRVGSGLIAASNGLANAGTGLQRFDKLTAGQKLSWLQRKTINGVKHTVSVLVVKPVKGLAFLGRKTFQAAMAVIRGTGHVAAKTGAVAAKGVFISGYIVYKAGTGMLKVIGRLALHAWKYFGKGLAATGRGLQKAGKGMEAQSATRLAKKVPHPHPNAPVPPHPYVRAPPYPYAYAPPPQYAHAPPPPYAHAPPHQFAPTPPHQFTPTPPHQFAPTSPHQVPPPPYQPPK
It looks to me like the mRNA coordinates in the .tbl file are swapped around, which leads to T1/T2 having swapped transcript sequences?
The gff3 file looks like this and you can see that the .tbl file resulting from a conversion with funannotate util gff2tbl has the T1/T2 assignments swapped around.
Could there be a bug in the funannotate util gff2tbl script?
Thanks! Jana