pfenninglab / halLiftover-postprocessing


HALPER to transfer genes #13

Open imu93 opened 1 week ago

imu93 commented 1 week ago

Hi @imk1, although I know that the main purpose of HALPER is not to transfer protein-coding genes, I wonder whether HALPER has been tested for this task. Let me explain my problem: I have two non-model species with 3 assemblies of different individuals. After genome alignment using Progressive Cactus, I would like to determine the orthologs of species A assembly 1 on species B assembly 1. I was thinking of using the gene annotation as "regulatory regions" and exons as "summits". I wonder if this makes sense or if I'm misinterpreting how HALPER can be used.

I would appreciate any comments on this idea! Regards, IMU

imk1 commented 2 days ago

The summit needs to be an individual position. You could use the most mappable base of an exon, the most conserved base of an exon, or the center of an exon as the summit.

Some members of the Zoonomia Consortium did use halLiftover with a different set of post-processing steps for genes. Here is the quote from the supplement describing what they did:

"Identifying genes in Zoonomia genomes using halLiftover

For the halLiftover-based ortholog identification, we began with human protein-coding and exon sequence annotation from ENSEMBL (BioMart v99; Downloaded March 2020) (182). For each gene investigated, we chose the longest transcript. We used halLiftover in conjunction with Zoonomia Cactus alignment (4) to identify sequences orthologous to the protein-coding sequence of each exon across each of the 241 assemblies. These sequences were prone to frameshift mutations, internal stop codons, and likely premature truncations. Those issues were likely due to missing sequence or small errors in the sequence alignments and amplified by the large evolutionary distance. We therefore performed two variations of post-processing steps.

For transcripts marked “Pfenning,” we first choose the ortholog of the annotations from the most closely related of human, goat, or mouse, a set of high-quality genomes and annotations that span a large segment of the mammalian evolutionary tree. We mapped the set of ENSEMBL exons for the reference sequence to the target species using halLiftover on the Zoonomia Cactus alignment (11, 181). We matched each lifted over exon sequence fragment to reference exons by translating it into an amino acid sequence and shifting start and end points to match exon boundaries. We then merged exon boundaries only if they did not create a frameshift mutation, did not create an internal stop codon, or did not increase the size of the exon by greater than or equal to three-fold.
For transcripts marked “Broad,” we use the halLiftover outputs from mapping the human gene coordinates to each species. We smoothed the halLiftover output for each protein-coding gene/species/contig combination by making a single interval from the first and last coordinates for each gene ortholog in the halLiftover output. We padded both ends with 500bp and then, for each species, extracted the genome sequence in that interval. We then applied Exonerate protein2genome (183) to translate the original human gene into protein sequences and then to predict exons and introns in each species by finding sequences matching the human sequences within the smoothed, padded halLiftover outputs while accounting for splice site sequences. For both methods, we considered a transcript to be “valid” if the predicted protein sequence started with methionine, was contained on a single contig, and was within 90-110% of the length of the human reference protein. If a transcript was not “valid,” we repeated this process for the next-longest transcript for the gene; if there were no additional transcripts in a species, we did not report an annotation for that gene, species combination. If “valid” transcripts were found for multiple contigs, we reported only the first valid transcript found. We reported transcripts along with their identity and similarity scores (from Exonerate) on both the exon and transcript level. We also reported insertions and deletions (from Exonerate) at the exon level. For our final annotations, we excluded all annotations of genes that do not have human orthologs. Given that genes are sporadically missing from each genome due to genome quality issues, cataloging “essential” genes found in all placental mammals is not feasible. We annotated just 116 genes in all 240 Zoonomia species.
When we consider only the highest-quality assembly in each order, this increases to 2,718 genes, still far fewer than the 9,226 included in the BUSCO version odb10 gene set of mammalian single-copy orthologs (184)."

Please let me know if you have any specific questions about this, and I will pass the questions on to the researchers who did this.
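
For the "center of an exon" option mentioned above, here is a minimal sketch of how single-base summits could be derived from exon intervals. It assumes 0-based, half-open BED-style coordinates; the function name `exon_center_summits` and the example exon names are hypothetical and not part of HALPER. How the summits are then passed to HALPER depends on the input format you use, which this sketch does not cover.

```python
from typing import Iterable, Iterator, Tuple

# (chrom, start, end, name) in 0-based, half-open BED-style coordinates (assumed)
Bed = Tuple[str, int, int, str]


def exon_center_summits(exons: Iterable[Bed]) -> Iterator[Bed]:
    """Yield a 1-bp interval at the center of each exon (hypothetical helper)."""
    for chrom, start, end, name in exons:
        if end <= start:
            raise ValueError(f"empty or inverted interval for {name}")
        # Integer midpoint; for even-length exons this picks the base just
        # right of the exact center.
        center = start + (end - start) // 2
        yield (chrom, center, center + 1, name)


# Made-up example exons, for illustration only
exons = [("chr1", 100, 201, "geneA_exon1"), ("chr1", 500, 504, "geneA_exon2")]
summits = list(exon_center_summits(exons))
for record in summits:
    print("\t".join(str(field) for field in record))
```

The most mappable or most conserved base could be substituted for the midpoint by replacing the `center` calculation with a lookup into a per-base score track.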
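
The "Broad" smoothing step and the "valid" transcript check from the quote can be sketched roughly as follows. This is only an illustration of the description in the supplement, not the code the researchers used: the function names, the tuple layout of the halLiftover hits, and the `n_contigs` argument are all assumptions made for the example.

```python
from typing import Dict, Iterable, Tuple

# (gene, contig, start, end) for one lifted-over fragment (assumed layout)
Hit = Tuple[str, str, int, int]


def smooth_and_pad(hits: Iterable[Hit], pad: int = 500) -> Dict[Tuple[str, str], Tuple[int, int]]:
    """Collapse all fragments per (gene, contig) into one padded interval."""
    spans: Dict[Tuple[str, str], Tuple[int, int]] = {}
    for gene, contig, start, end in hits:
        key = (gene, contig)
        if key in spans:
            lo, hi = spans[key]
            spans[key] = (min(lo, start), max(hi, end))
        else:
            spans[key] = (start, end)
    # Pad both ends; the start is clamped at 0. The end is NOT clamped here
    # because contig lengths are not available in this sketch.
    return {k: (max(0, lo - pad), hi + pad) for k, (lo, hi) in spans.items()}


def is_valid(protein: str, ref_len: int, n_contigs: int = 1) -> bool:
    """Check the three criteria quoted above: starts with methionine, lies on
    a single contig, and is within 90-110% of the reference protein length."""
    length_ok = 90 * ref_len <= 100 * len(protein) <= 110 * ref_len
    return protein.startswith("M") and n_contigs == 1 and length_ok


# Made-up fragments, for illustration only
hits = [
    ("G1", "ctg1", 1000, 1200),
    ("G1", "ctg1", 1500, 1800),
    ("G1", "ctg2", 50, 90),
]
intervals = smooth_and_pad(hits)
print(intervals)
print(is_valid("M" + "A" * 94, ref_len=100))
```

In the described workflow, the genome sequence in each padded interval would then be extracted and passed to Exonerate protein2genome; that step is omitted here.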