hillerlab / TOGA

TOGA (Tool to infer Orthologs from Genome Alignments): implements a novel paradigm to infer orthologous genes. TOGA integrates gene annotation, inferring orthologs and classifying genes as intact or lost.
MIT License
160 stars 22 forks source link

Gene symbol assignment in the query species #87

Closed dariotommasini closed 10 months ago

dariotommasini commented 1 year ago

I'd like to use the TOGA-derived orthologs from your paper. I'm confused as to how I would use the TOGA annotations found in orthologsClassification.tsv.gz because the gene symbol of the query species is not available. From my understanding, your group did not supply the ENSEMBL annotations of the query so I'm unsure of how to map the orthologs found by TOGA to the query species' gene symbol.

Perhaps the way I should be using TOGA is to simply identify the human orthologous genes and then use the ENSEMBL orthology tables to find the gene symbols in the query species.

Regardless, your help would be much appreciated!

MichaelHiller commented 1 year ago

Hi Dario,

This file has the orthology type (1:1, 1:many etc), defined as the number of reference genes vs. the number of intact orthologs in the query.

TOGA assigns each query gene a identifier reg_number. If you have a 1:1, you can simply keep the gene symbol, e.g. ENSG00000137672 ENST00000344327.TRPC6 reg_12370 ENST00000344327.TRPC6.22 one2one ENSG00000137672 ENST00000348423.TRPC6 reg_12370 ENST00000348423.TRPC6.22 one2one ENSG00000137672 ENST00000360497.TRPC6 reg_12370 ENST00000360497.TRPC6.22 one2one ENSG00000137672 ENST00000532133.TRPC6 reg_12370 ENST00000532133.TRPC6.22 one2one reg_12370 is TRPC6 and there are 4 transcripts.

If you have a 1:many, this would also work, e.g. ENSG00000184840 ENST00000332598.TMED9 reg_4419 ENST00000332598.TMED9.56373 one2many ENSG00000184840 ENST00000332598.TMED9 reg_4495 ENST00000332598.TMED9.127 one2many --> two TMED9 genes in the query.

If you have many:1 or many:many, it it not obvious which symbol to assign, e.g. ENSG00000172062 ENST00000625245.SMN1 reg_12660 ENST00000625245.SMN1.709 many2one ENSG00000172062 ENST00000506163.SMN1 reg_12660 ENST00000506163.SMN1.709 many2one ENSG00000172062 ENST00000380707.SMN1 reg_12660 ENST00000380707.SMN1.709 many2one ENSG00000172062 ENST00000503079.SMN1 reg_12660 ENST00000503079.SMN1.709 many2one ENSG00000205571 ENST00000628696.SMN2 reg_12660 ENST00000628696.SMN2.1564 many2one ENSG00000205571 ENST00000380743.SMN2 reg_12660 ENST00000380743.SMN2.1564 many2one ENSG00000205571 ENST00000380742.SMN2 reg_12660 ENST00000380742.SMN2.1564 many2one ENSG00000205571 ENST00000638794.SMN2 reg_12660 ENST00000638794.SMN2.1564 many2one ENSG00000205571 ENST00000626847.SMN2 reg_12660 ENST00000626847.SMN2.1564 many2one SMN is a human duplication. You can call the single SMN gene in the query SMN1 or SMN2 (randomly pick a name), but it is not easy to correctly automate this.

If you already have a query annotation from Ensembl, then yes, you could overlap the TOGA transcripts with the Ensembl annotation and keep Ensembl's gene symbols. Ensembl also provides orthology information.

Hope that helps

dariotommasini commented 1 year ago

Thank you, this is very helpful.

Could you expand on what overlapping the TOGA transcripts with the Ensembl annotation would look like? Would it be like comparing the gtf from ENSEMBL to the TOGA gtf, and then transferring the ENSEMBL identifiers onto the TOGA file? Or is it much simpler than that: using the ENSEMBL orthology table to go from human gene to the ortholog in query species, and ignoring ENSEMBL's orthology classification since we're using TOGA's instead?

MichaelHiller commented 1 year ago

Hmm, I would probably prefer the first option, as this makes sure that TOGA and Ensembl agree on the exact locus in the query genome.

dariotommasini commented 1 year ago

Okay thanks for the help!

dariotommasini commented 1 year ago

Is there a quick way to identify the assembly used for each of the mammalian species in the human_hg38_reference directory? I couldn't find all of them in the mammalianDNAZooAssemblies, ie the chinese treeshrew tupChi1.

Also, can TOGA be used to retrieve orthologous sequences that are not necessarily genes, like introns and intergenic regions. I know it uses intronic and intergenic similarity in the model but it seems like the main output is orthologous genes and protein alignments.

MichaelHiller commented 1 year ago

Pls have a look at https://genome.senckenberg.de/download/TOGA/human_hg38_reference/overview.table.tsv Every directory has the species name + a UCSC or our internal identifier (starting with HL). The tsv file lists the assembly accession and source, taxonomic lineage and more.

2) In principle yes, although this is not the primary output of TOGA. I guess the best is to use the query_annotation.bed file, which lists exons (and implicitly introns) in the query.

dariotommasini commented 1 year ago

Thank you!

  1. I see, I can get introns from inverting the exon coordinates. But intergenic regions are not recoverable?
MichaelHiller commented 1 year ago

Harder ... Why not use a whole genome alignment for that? E.g. alignment nets or Cactus?

dariotommasini commented 1 year ago

Okay, that makes sense. Thank you for the help, I'll definitely try using the TOGA ortholog annotations!

dariotommasini commented 1 year ago

Hi again,

I was trying to make a CellRanger reference package for tupChi1 using the TOGA annotation file and the ncbi source FASTA. According to the overview table, the NCBI accession is "GCF_000334495.1", so I downloaded that FASTA from NCBI. However, I'm encountering an issue because the contigs in the TOGA GTF do not match those in the FASTA:

mkref has failed: error building reference package
Error while parsing GTF file /clusterfs/kslab/REFERENCE/TreeShrew_TOGA/TreeShrew.gtf.gz
Invalid contig name encountered on GTF line 1: KB320809. The FASTA file has contigs:
['NW_006159427.1', 'NW_006159428.1', 'NW_006159429.1', 'NW_006159430.1', ...

Contigs in the FASTA look like "NW_006159706.1" whereas the TOGA contigs look like "KB320653". Do I need to convert the TOGA-generated contigs somehow?

MichaelHiller commented 1 year ago

Hi Dario,

the tupChi1 assembly comes from UCSC and according to UCSC UCSC Genome Browser assembly ID: tupChi1 Sequencing/Assembly provider ID: Beijing Genomics Institute TupChi_1.0 Assembly date: Jan. 2013 Accession ID: GCA_000334495.1 NCBI Genome ID: 13028 (Tupaia chinensis) NCBI Assembly ID: 531728 NCBI BioProject ID: 221634 NCBI BioSample ID: SAMN01084043

It has this GCA ID.

NCBI distinguishes between RefSeq and Genbank and they have the habit of renaming scaffolds. I assume the GCA and GCF are actually identical, but they just have different names. Can you pls try to use the GCA assembly?

dariotommasini commented 1 year ago

Thank you, that worked! The GTF was still missing the ".1" suffix, ie "KB320809.1", but that was easy enough to add to the GTF.

MichaelHiller commented 1 year ago

UCSC is always stripping the .1 We typically do the same, as some tools have problems correctly processing these genomes.

dariotommasini commented 1 year ago

Hi again!

Using the TOGA GTF for mapping reads using Cellranger leads to this following downstream issue:

The top-level of the GTF is transcripts, not genes, ie:

KB321016.1  stdin   transcript  1542828 1777300 .   -   .   gene_id "ENST00000359726.APP.304"; transcript_id "ENST00000359726.APP.304";

This is problematic because then STAR will toss the multi-mapping reads when they map to two similar transcripts of the same gene. I am considering adding a "gene" row for each gene so that STAR recognizes that it should lump all the transcripts of "APP" together. The gene I suppose would be the query gene identifiers supplied in the orthologyAnnotations.tsv ie "reg_14844" for APP in tupChi1:

ENSG00000142192 ENST00000346798.APP reg_14844 ENST00000346798.APP.304         one2one 

Does this sound reasonable to you? I'm guessing there is a reason why your group decided to omit "gene" level information in the GTFs.

MichaelHiller commented 1 year ago

Sorry, I don't have much experience with GTF/GFF (horrible formats, genePred or bed12 is much better). But here is a consideration. TOGA does not give you UTRs. Most of your scRNA reads will map to UTRs and thus in a TOGA-only annotation downstream of a gene.

It would help to also generate RNA-seq and/or IsoSeq from the same tissue to add UTRs to the TOGAs. Then you have proper gene models to map the scRNA-seq reads to.