hillerlab / TOGA

TOGA (Tool to infer Orthologs from Genome Alignments): implements a novel paradigm to infer orthologous genes. TOGA integrates gene annotation, inferring orthologs and classifying genes as intact or lost.
MIT License

Question about output alignments #157

Open d6948739 opened 7 months ago

d6948739 commented 7 months ago

Hello! Hope you are all doing well! For the output alignments (nucleotide.fasta, codon.fasta, and protein.fasta), which gene projections/orthology classes are included? In other words, what does TOGA filter on before preparing these alignments? Do they include only I/PI/UL, or all inferred loss/gain categories (including PG)? I am asking because I would like to validate TOGA's orthology inferences against OrthoFinder, to get an idea of how well it infers orthology for my study group (a family of butterflies).

MichaelHiller commented 7 months ago

We decided to include all projections (meaning a transcript mapped to an orthologous query locus) in these alignments, irrespective of the classification.

E.g., here is a lost projection (class L) that is still present in codon.fasta:

grep ENST00000372915.MACF1.572 loss_summ_data.tsv
PROJECTION  ENST00000372915.MACF1.572   L

grep ENST00000372915.MACF1.572 codon.fasta -A 1| h
>ENST00000372915.MACF1.572 | CODON | REFERENCE
ATG TCT TCC TCA GAT GAA GAG ACG CTC AGT GAG CGG TCA TGT CGG AGT GAG CGG TCT TGT CGG AGT GAG CGA TCT TAC AGG AGC GAG CGG TCG GGG AGC CTG TCT CCC TGT CCC CCA GGG GAC ACC TTG CCC TGG AAC CTG CCA CTG CAT GAG CAG AAA AAG CGG AAA AGC CAG GAT TCG GTG CTG GAC CCT GCA GAG CGT GCT GTG GTC AGA GTC GCT GAT GAA CGG GAC CGG GTT CAG AAG AAA ACG TTC ACC AAG TGG GTC AAC AAG CAC TTA ATG AAG GTC CGC AAG CAC ATC AAT GAT CTT TAT GAA GAT CTG CGG GAT GGC CAT AAC CTG ATC TCT CTG TTG GAG GTC CTC TCA GGC ATC AAA CTG CCC CGG GAG AAG GGC AGG ATG CGT TTT CAT AGG CTG CAG AAT GTG CAG ATT GCC CTG GAC TTC CTA AAG CAG CGA CAG GTG AAA CTA GTG AAT ATT CGC AAT GAT GAC ATC ACA GAT GGC AAC CCC AAG TTG ACC CTG GGT CTG ATC TGG ACC ATT ATT TTG CAT TTC CAG ATC TCT GAC ATC TAC ATT AGT GGA GAA TCA GGG GAT ATG TCA GCC AAG GAG AAA CTA CTC CTG TGG ACC CAG AAG GTG ACA GCT GGT TAC ACA GGA ATC AAA TGC ACC AAC TTT TCC TCC TGC TGG AGT GAT GGG AAG ATG TTC AAT GCA CTC ATT CAC CGA TAC CGA CCC GAT CTA GTA GAC ATG GAG AGG GTG CAA ATC CAA AGT AAC CGA GAG AAT CTG GAA CAG GCT TTT GAA GTG GCA GAA AGA CTG GGG GTC ACT CGC CTG CTG GAT GCA GAA GAT GTG GAT GTG CCA TCT CCA GAT GAA AAG TCT GTA ATC ACT TAT GTG TCT TCG ATT TAT GAT GCC TTC CCT AAA GTT CCT GAG GGT GGA GAA GGG ATC AGT GCT ACG GAA GTG GAC TCC AGG TGG CAA GAA TAC CAA AGC CGA GTG GAC TCC CTC ATT CCC TGG ATC AAA CAG CAT ACA ATA CTG ATG TCA GAT AAA ACT TTT CCC CAA AAC CCT GTT GAA CTA AAG GCA CTT TAT AAC CAA TAT ATA CAC TTC AAA GAA ACA GAA ATT CTG GCC AAG GAG AGA GAA AAA GGA AGA ATT GAG GAA TTA TAT AAA TTA CTA GAG GTG TGG ATT GAA TTT GGC CGA ATT AAA CTG CCT CAA GGT TAT CAC CCT AAT GAT GTG GAA GAA GAG TGG GGA AAG CTC ATC ATA GAG ATG CTG GAA CGA GAG AAA TCA CTT CGG CCG GCT GTG GAG AGG CTG GAA TTG CTG CTA CAG ATT GCA AAC AAA ATC CAG AAT GGT GCT TTG AAC TGT GAA GAA AAA CTG ACA CTA GCT AAG AAT ACA CTG CAG GCT GAT GCT GCT CAC CTG GAA TCA GGA CAA CCG GTA CAA TGT GAG TCA GAT GTC ATT ATG TAC ATT CAG GAG TGT GAA GGT CTC ATC AGG CAG CTG CAG GTG GAT CTC CAG ATC CTG CGG GAT GAG AAT TAC TAC 
CAG CTA GAA GAG CTG GCT TTT AGG GTC ATG CGT CTT CAG GAT GAG CTG GTC A

The idea is that this lets users filter the projections themselves, by whichever classes they need. Hope that helps.
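
For readers who want to do that filtering, here is a minimal Python sketch, not part of TOGA itself. It assumes that loss_summ_data.tsv has three whitespace-separated columns (entry type, ID, class), as in the grep output above, and that every FASTA header starts with the projection ID followed by " | ", as in the codon.fasta header above. The function names `read_projection_classes` and `filter_fasta` are hypothetical, and the default class set (I, PI, UL) is just the example from the question.

```python
def read_projection_classes(lines):
    """Map projection ID -> class from loss_summ_data.tsv lines.

    Assumption: each line is "<type> <id> <class>", and only lines whose
    type is PROJECTION describe individual projections.
    """
    classes = {}
    for line in lines:
        fields = line.split()
        if len(fields) == 3 and fields[0] == "PROJECTION":
            classes[fields[1]] = fields[2]
    return classes


def filter_fasta(lines, classes, keep=frozenset({"I", "PI", "UL"})):
    """Yield only the FASTA lines whose projection has a class in `keep`.

    Assumption: the projection ID is the header text before the first "|".
    All records sharing that ID are kept or dropped together, and
    sequences wrapped over several lines are handled.
    """
    keep_record = False
    for line in lines:
        if line.startswith(">"):
            projection_id = line[1:].split("|")[0].strip()
            keep_record = classes.get(projection_id) in keep
        if keep_record:
            yield line


# Example usage (file names as in the thread; the output name is made up):
#
# with open("loss_summ_data.tsv") as summary:
#     classes = read_projection_classes(summary)
# with open("codon.fasta") as fasta, open("codon.filtered.fasta", "w") as out:
#     out.writelines(filter_fasta(fasta, classes))
```

Passing a different `keep` set (for example `{"L"}`) selects other classes, so the same alignments can be compared with and without lost projections.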