molevol-ub / bitacora

BITACORA: A Bioinformatics tool for gene family annotation
GNU General Public License v3.0

How to identify pseudogenes in the output? #14

Closed: fangrunzhao closed this issue 2 weeks ago

fangrunzhao commented 3 weeks ago

Hi Vizueta,

I have the following questions after reading your articles ("Comparative Genomics Reveals Thousands of Novel Chemosensory Genes and Massive Changes in Chemoreceptor Repertories across Chelicerates" and "Evolutionary History of Major Chemosensory Gene Families across Panarthropoda").

  1. I see many short protein sequences in the BITACORA output. Does the output include pseudogenes?
  2. If the output contains pseudogenes, does BITACORA discriminate between pseudogenes and genes?
  3. If the output does not contain pseudogenes, how can I identify the sequences with premature stop codons, as described in your article?

Sincerely, Run Zhao Fang

Vizueta commented 3 weeks ago

Hi Run Zhao Fang,

  1. BITACORA annotates all homologous sequences for the input gene family. Some of the annotated partial sequences might be pseudogenes, while others might be putatively complete copies that could not be fully annotated because of genome fragmentation.
  2. No, it does not discriminate between pseudogenes and complete genes; the output includes all homologous sequences.
  3. I have a script, "Scripts/Tools/get_genes_partial_pseudo.pl", that separates complete from partial genes based on the minimum expected protein length for complete copies.
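
To illustrate the idea behind point 3, here is a minimal Python sketch of a length-based split. It is not the Perl script itself and does not reproduce its options or output format; the function names (`read_fasta`, `split_by_length`, `has_internal_stop`), the toy sequences, and the threshold of 10 residues are all hypothetical. It also assumes that stop codons, if present in a protein FASTA, are written as `*`, which may not hold for every output file.

```python
from typing import Dict, Iterable, Tuple


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {sequence_id: sequence}."""
    seqs: Dict[str, str] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the identifier.
            name = line[1:].split()[0]
            seqs[name] = ""
        elif name is not None:
            seqs[name] += line
    return seqs


def split_by_length(
    seqs: Dict[str, str], min_len: int
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split sequences into (complete, partial) using a minimum length.

    Trailing '*' characters (terminal stops) are not counted.
    """
    complete: Dict[str, str] = {}
    partial: Dict[str, str] = {}
    for name, seq in seqs.items():
        length = len(seq.rstrip("*"))
        if length >= min_len:
            complete[name] = seq
        else:
            partial[name] = seq
    return complete, partial


def has_internal_stop(seq: str) -> bool:
    """Return True if a '*' occurs before the end of the sequence."""
    return "*" in seq.rstrip("*")


# Toy input (hypothetical sequences, not real BITACORA output).
example = [
    ">geneA some description",
    "MKTLLVAGG",
    "HHQ*",
    ">geneB",
    "MKT",
    ">geneC",
    "MKT*LLVAGGHH",
]

seqs = read_fasta(example)
complete, partial = split_by_length(seqs, min_len=10)
flagged = sorted(name for name, seq in seqs.items() if has_internal_stop(seq))

print("complete:", sorted(complete))
print("partial:", sorted(partial))
print("internal stop:", flagged)
```

A length threshold alone cannot tell a pseudogene from a copy truncated by genome fragmentation, so the sequences in `partial` and those flagged for internal stops are only candidates for manual inspection.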

In addition, if you want to check the partial sequences further and identify pseudogenes, you can load the BITACORA annotations into a genome browser such as Apollo to manually inspect and curate the gene family annotations. I recommend reading our detailed book chapter: https://doi.org/10.1016/bs.mie.2020.05.015

Best, Joel

fangrunzhao commented 2 weeks ago

Hi Vizueta,

Thank you for your detailed answer!

Best, Run Zhao Fang