nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License
322 stars 87 forks source link

Utils to .faa to primarytranscript.py #717

Closed DDrabeck closed 2 years ago

DDrabeck commented 2 years ago

Hi!

I have a used funannotate now to annotate ~14 genomes. I am hoping to use these annotations as input to orthofinder. However the first step in this pipeline is to use a script primarytranscript.py to select the longest transcript for each gene to use as input.

When I use this on pep.all or protein.fa files from Ensembl or NCBI it works just fine. However it doesn't alter the final proteins.fasta file because the protein file generated is not in NCBI or Ensembl format (doesn't include a gene tag)

ex:

ENSAPOP00000010247.1 pep primary_assembly:ASM210954v1:MVNR01007420.1:817:873:1 gene:ENSAPOG00000012658.1 transcript:ENSAPOT00000000162.1 gene_biotype:IG_J_gene transcript_biotype:IG_J_gene SYFDYWGKGTQVTVTSGK

I'm trying to mess with a python script to write a new pep file... but this seems like a problem that others might have already had/solved. Is there any good way to extract a pep file in NCBI format from the final products (gbk) of funannotate?

nextgenusfs commented 2 years ago

So why can't you just use the final output *.proteins.fa file and modify the headers? The header is >transcript_id gene_id.

DDrabeck commented 2 years ago

hmmm ok maybe this is a data issue... So my proteins.fa file has output:

MBRI_000001-T1 MBRI_000001

Which I would need to look like this

MBRI_000001-T1 gene:MBRI_000001

But in doing a quick ctl+F I see that there are no T2s in the entire file. So each funannotate gene is only showing one transcript per gene. Which is great except when I run orthofinder my funannotate genomes have a ton of 'duplications' at these taxa (at the tips of the tree), which to me is indicating that I have a lot of repeated transcripts per gene.

I don't know why funnannotate is giving just one transcript per gene (if this is a data issue or a data extraction issue). For example primarytranscript.py uses several steps to identify duplicates (gene ids, presence of the string 'isoform' in a header line etc...). But I can't tell the difference because I don't have a .fa file with all the same info in it from my funannotate pipeline.

ex:

ENSAPOP00000034950.1 pep primary_assembly:ASM210954v1:MVNR01000001.1:4690:289744:1 gene:ENSAPOG00000024566.1 transcript:ENSAPOT00000035488.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:si:ch211-186j3.6 description:neural-cadherin-like [Source:NCBI gene;Acc:110950811] MHTWISARHWRFSAVEPKRYPVMIWVAFLPFLVSCAVGNRTTEQVPFFHGHVFENSSPGS....

Does that make sense?

hyphaltip commented 2 years ago

yeah it is a short python/perl script to get longest from the protein file eg here's a perl script that works for ensembl formatted headers https://github.com/hyphaltip/genome-scripts/blob/master/seq/get_longest_peptide.pl

hyphaltip commented 2 years ago

If there aren't -T2 then the duplications you see in the gene trees relate to gene level duplications not isoforms. But as to how this is detected it really depends on how the isoform/gene information is encoded in the FastA header, which doesn't have an enforced standard, hence hacky solutions for different data sources. funannotate uses the >transcript gene model of information so you can just read each protein out of the file and use the gene as key in the dictionary, keep the longest version, and print that out at the end. I'll post a simple script if you need.

hyphaltip commented 2 years ago

funannotate can only predict multiple isoforms if you run with RNAseq and do the update step with RNAseq to allow PASA to predict isoforms

DDrabeck commented 2 years ago

I see, so my thought was that if I extracted a file that was like this:

MBRI001-T1 MBRI001 PTCHD1_1 JHDFLKJSDFLKJSDFH

then I might see many funannotate genes with the same gene abbreviations (gene="PTCHD1_1") but it sounds like this is not the case, and that because I don't have RNA-seq data for these I will likely have to assume that my high numbers of gene in species specific orthogroups in because I probably have lots of isoforms being annotated as similar genes?

hyphaltip commented 2 years ago

PTCHD1_1 is about gene "functional" assignment but MBRI001 is the gene locus name - take a look at your GFF file and you can see gene location info - isoforms will be multiple mRNA features for the same gene. grep PTCHD1_1 [YOURGFFFILE] will give locations of the features with that annotation in the file grep MBRI001 [YOURGFFFILE] will give you the location of the MBRI001 locus and by virtue of the shared name with the mRNA transcript you'll see if there are multiple isoforms.

DDrabeck commented 2 years ago

While I still am hoping to get a file that is uniform with NCBI format I'm gathering that perhaps submitting these to NCBI with the funannotate output will generate the files I want. (for the sake of also having a uniform pipeline with NCBI-derived genomes and our newly generated genomes)?

Perhaps a better question to ask here is why I might see excessive species specific orthogroups?

DDrabeck commented 2 years ago

(From just a cursory glance it seems that all the individual locus tags have one RNA each, which makes sense why there would then be only one transcript per gene in my .fa file).

hyphaltip commented 2 years ago

While I still am hoping to get a file that is uniform with NCBI format I'm gathering that perhaps submitting these to NCBI with the funannotate output will generate the files I want. (for the sake of also having a uniform pipeline with NCBI-derived genomes and our newly generated genomes)?

Can you expand on this - i don't understand this question. What do you need that isn't uniform?

hyphaltip commented 2 years ago

Perhaps a better question to ask here is why I might see excessive species specific orthogroups?

DDrabeck commented 2 years ago

Ah so (and maybe this doesn't matter after all given that there are not actually isoforms in this data as I had originally thought)

If I wget from NCBI NCBI proteome https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/109/545/GCF_002109545.1_ASM210954v1/GCF_002109545.1_ASM210954v1_protein.faa.gz

XP_022042467.1 MAP7 domain-containing protein 1 isoform X5 [Acanthochromis polyacanthus] MNKRTESEGIAAGMEENTLPQATYPCSVLNTNATRPDPEGESSPLKTDNTQRMESPVKKDLDRRLTTPTKSDVIPRSPGS PASPRSKRDVMKSEERQKLAKERREEKAKYLAAKKTQWLEKEEKARQLREQQLEERRRKLEEQRIKAEKRRAALEERQKQ

From Ensembl (link http://ftp.ensembl.org/pub/release-106/fasta/acanthochromis_polyacanthus/pep/Acanthochromis_polyacanthus.ASM210954v1.pep.all.fa.gz)

ENSAPOP00000010247.1 pep primary_assembly:ASM210954v1:MVNR01007420.1:817:873:1 gene:ENSAPOG00000012658.1 transcript:ENSAPOT00000000162.1 gene_biotype:IG_J_gene transcript_biotype:IG_J_gene SYFDYWGKGTQVTVTSGK

.fa file from funannotate:

MBRI_000001-T1 MBRI_000001 MELEHLGGGGVKPKRDSSRLYRIALVMCVCLSAALLLALILVSQKSSQEEFCLTPECIEAAGSIITKVDQSINPCEDFYR

Using this script to parse the longest transcript (https://gitlab.mbb.univ-montp2.fr/remy/orthofinder/-/blob/87d3c77ff3ad8194c3ea21ca01856ac9cc1f256d/orthofinder/tools/primary_transcript.py) I'm sort of a python newbie so I can't tell entirely which attributes/tags this script is using to parse the final list. But I know that for example (acanthochromis_polyacanthus) .fa here this script is removing many multiple transcripts (28GB to 12GBB). However, for my funannotate .fa it does not remove any/recognize any to remove.

DDrabeck commented 2 years ago
  • It depends on how close your comparison species are.
  • Are you masking repeats well, TEs can often show up as species-specific genes and will be high copy number
  • have you look at the Pfam domains of the species-specific high copy-number families? Screen these multi-copy families for what the interpro/pfam results showed - this would be as simple as looking at the .annotations.txt file in the annotate_results folder and using grep to search for some of your gene loci
  • have you plotted the locations of the species-specific genes on the chromosomes, are they clustered or dispersed? Species-specific families can be found in the sub-telomeric regions of many species
  • Centromere located and TE families will often appear as large species-specific families in ortholog tables so you would want to try and make sure these are properly masked

This is extremely helpful! Thank you! I will dive into this!

nextgenusfs commented 2 years ago

Lots of good info here thanks @hyphaltip.

A little more clarification on when funannotate calls multiple transcripts -T1, -T2, -T3: as @hyphaltip mentioned this only happens when you have RNA-seq data and there is actual evidence of multiple transcripts at a given locus/gene. The transcripts are then ordered based on expression (TPM) from the RNA-seq data, ie -T1 is the most abundant transcript for that particular locus. So even if you did have RNA-seq data and annotation with multiple alternatively transcribed transcripts -- the "take the longest transcript" approach doesn't actually make a lot of sense to me (ie you could very well be selecting some very rare isoform at a gene/locus that perhaps isn't relevant).

Another aside is that it can be difficult to compare genomes that are annotated with different methods, ie NCBI, ensemble, etc -- often the annotation method is not described well and/or is rather simplistic (people often just run augustus pre-trained on a somewhat close species and submit to NCBI). So it should be fairly easy to imagine that comparing protein predictions that are pulled from annotations of varying qualities could be problematic. So just something that you might want to keep in mind when interpreting these types of data/comparisons. As always, it depends on what questions you are asking from the data on how important this is to your analysis.

DDrabeck commented 2 years ago

Thank you both for this! It helps tremendously to clarify how these are called and the various things i need to be aware of in using this data!!

On Tue, Apr 26, 2022 at 7:16 PM Jon Palmer @.***> wrote:

Lots of good info here thanks @hyphaltip https://github.com/hyphaltip.

A little more clarification on when funannotate calls multiple transcripts -T1, -T2, -T3: as @hyphaltip https://github.com/hyphaltip mentioned this only happens when you have RNA-seq data and there is actual evidence of multiple transcripts at a given locus/gene. The transcripts are then ordered based on expression (TPM) from the RNA-seq data, ie -T1 is the most abundant transcript for that particular locus. So even if you did have RNA-seq data and annotation with multiple alternatively transcribed transcripts -- the "take the longest transcript" approach doesn't actually make a lot of sense to me (ie you could very well be selecting some very rare isoform at a gene/locus that perhaps isn't relevant).

Another aside is that it can be difficult to compare genomes that are annotated with different methods, ie NCBI, ensemble, etc -- often the annotation method is not described well and/or is rather simplistic (people often just run augustus pre-trained on a somewhat close species and submit to NCBI). So it should be fairly easy to imagine that comparing protein predictions that are pulled from annotations of varying qualities could be problematic. So just something that you might want to keep in mind when interpreting these types of data/comparisons. As always, it depends on what questions you are asking from the data on how important this is to your analysis.

— Reply to this email directly, view it on GitHub https://github.com/nextgenusfs/funannotate/issues/717#issuecomment-1110368976, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADTRQPEUFWC4CM7WKO55APLVHCBOXANCNFSM5UM5OACA . You are receiving this because you authored the thread.Message ID: @.***>

-- Danielle H Drabeck PhD

Postdoctoral fellowDepartment of Ecology, Evolution, and BehaviorUniversity of Minnesota

@. @.

She/Her/Hers