hillerlab / TOGA

TOGA (Tool to infer Orthologs from Genome Alignments): implements a novel paradigm to infer orthologous genes. TOGA integrates gene annotation, inferring orthologs and classifying genes as intact or lost.
MIT License
160 stars 22 forks source link

Counting the number of copies of genes in orthology_classification.tsv file #38

Closed lnyawen closed 1 year ago

lnyawen commented 1 year ago

Dear @MichaelHiller @kirilenkobm :

I want to infer how many copies of a gene are in a species, do I just need to count the number of q_gene in the orthology_classification.tsv file to assume it is the number of copies of that gene? For expample, I want to calculate the copy number of the Amylase gene, so I grep all record about AMY with command grep 'AMY' orthologsClassification.tsv. The third column contains three genes, reg_14433, reg_10659 and reg_14509, so I think there are three copies of the Amylase gene in this species, is that right?

$ grep 'AMY' orthologsClassification.tsv 
ENSG00000237763 ENST00000370083.AMY1A   None    None    one2zero
ENSG00000187733 ENST00000370079.AMY1C   reg_14433   ENST00000370079.AMY1C.1087156   many2many
ENSG00000187733 ENST00000370079.AMY1C   reg_10659   ENST00000370079.AMY1C.7434  many2many
ENSG00000187733 ENST00000370079.AMY1C   reg_14509   ENST00000370079.AMY1C.38    many2many
ENSG00000187733 ENST00000684141.AMY1C   reg_10659   ENST00000684141.AMY1C.7434  many2many
ENSG00000187733 ENST00000684141.AMY1C   reg_14433   ENST00000684141.AMY1C.1087156   many2many
ENSG00000187733 ENST00000684141.AMY1C   reg_14509   ENST00000684141.AMY1C.38    many2many
ENSG00000240038 ENST00000361355.AMY2B   reg_14433   ENST00000361355.AMY2B.38    many2many
ENSG00000240038 ENST00000361355.AMY2B   reg_10659   ENST00000361355.AMY2B.22907 many2many
ENSG00000174876 ENST00000370080.AMY1B   reg_14509   ENST00000370080.AMY1B.7434  many2many
ENSG00000174876 ENST00000370080.AMY1B   reg_14433   ENST00000370080.AMY1B.7580  many2many
ENSG00000174876 ENST00000370080.AMY1B   reg_10659   ENST00000370080.AMY1B.38    many2many
ENSG00000243480 ENST00000622339.AMY2A   reg_10659   ENST00000622339.AMY2A.7580  many2many
ENSG00000243480 ENST00000622339.AMY2A   reg_14509   ENST00000622339.AMY2A.10004 many2many
ENSG00000243480 ENST00000622339.AMY2A   reg_14433   ENST00000622339.AMY2A.38397 many2many
ENSG00000243480 ENST00000414303.AMY2A   reg_14509   ENST00000414303.AMY2A.10004 many2many
ENSG00000243480 ENST00000414303.AMY2A   reg_14433   ENST00000414303.AMY2A.38397 many2many
ENSG00000243480 ENST00000414303.AMY2A   reg_10659   ENST00000414303.AMY2A.7580  many2many

Besides, I want to identify the location of this gene. Can I determine the location of this gene by calculating the position of the longest transcript of a gene in the query genome? For example, to identify the location of gene reg_14433, I want to calculate the longest of ENST00000370079.AMY1C.1087156, ENST00000684141.AMY1C.1087156, ENST00000361355.AMY2B.38, ENST00000370080.AMY1B.7580, ENST00000622339.AMY2A.38397 and ENST00000414303.AMY2A.38397 to represent this gene, is that right?

grep 'reg_14433' orthologsClassification.tsv 
ENSG00000187733 ENST00000370079.AMY1C   reg_14433   ENST00000370079.AMY1C.1087156   many2many
ENSG00000187733 ENST00000684141.AMY1C   reg_14433   ENST00000684141.AMY1C.1087156   many2many
ENSG00000240038 ENST00000361355.AMY2B   reg_14433   ENST00000361355.AMY2B.38    many2many
ENSG00000174876 ENST00000370080.AMY1B   reg_14433   ENST00000370080.AMY1B.7580  many2many
ENSG00000243480 ENST00000622339.AMY2A   reg_14433   ENST00000622339.AMY2A.38397 many2many
ENSG00000243480 ENST00000414303.AMY2A   reg_14433   ENST00000414303.AMY2A.38397 many2many

Thank you for your help

Yawen

kirilenkobm commented 1 year ago

Hi @lnyawen

"The third column contains three genes, reg_14433, reg_10659 and reg_14509, so I think there are three copies of the Amylase gene in this species, is that right?" yes, this is generally right. Just make sure that your grep expression does not catch any additional genes. But grep ${gene ID in reference} | cut -f3 | sort -u should give you a list of orthologous genes in the query.

Regarding this "For example, to identify the location of gene reg_14433, I want to calculate the longest of ENST00000370079.AMY1C.1087156, ENST00000684141.AMY1C.1087156, ENST00000361355.AMY2B.38, ENST00000370080.AMY1B.7580, ENST00000622339.AMY2A.38397 and ENST00000414303.AMY2A.38397 to represent this gene, is that right?"

Actually, depends on what exactly you need. I would recommend finding transcripts with those IDs in the query_annotation.bed and select some isoform. Maybe also look at gene_loss_summary.tsv - maybe, the longest isoform is not fully intact (can be a single inactivating mutation such as fremeshifting indel which is not enough to call the gene "Lost", but definitely something to take into account)

Best, Bogdan

lnyawen commented 1 year ago

Hello @kirilenkobm

Thanks for your quick reply! I got your point, you've completely solved my problem.

thanks! Yawen