Closed lnyawen closed 1 year ago
Hi @lnyawen
"The third column contains three genes, reg_14433, reg_10659 and reg_14509, so I think there are three copies of the Amylase gene in this species, is that right?" yes, this is generally right. Just make sure that your grep expression does not catch any additional genes. But grep ${gene ID in reference} | cut -f3 | sort -u should give you a list of orthologous genes in the query.
Regarding this "For example, to identify the location of gene reg_14433, I want to calculate the longest of ENST00000370079.AMY1C.1087156, ENST00000684141.AMY1C.1087156, ENST00000361355.AMY2B.38, ENST00000370080.AMY1B.7580, ENST00000622339.AMY2A.38397 and ENST00000414303.AMY2A.38397 to represent this gene, is that right?"
Actually, depends on what exactly you need. I would recommend finding transcripts with those IDs in the query_annotation.bed and select some isoform. Maybe also look at gene_loss_summary.tsv - maybe, the longest isoform is not fully intact (can be a single inactivating mutation such as fremeshifting indel which is not enough to call the gene "Lost", but definitely something to take into account)
Best, Bogdan
Hello @kirilenkobm
Thanks for your quick reply! I got your point, you've completely solved my problem.
thanks! Yawen
Dear @MichaelHiller @kirilenkobm :
I want to infer how many copies of a gene are in a species, do I just need to count the number of q_gene in the orthology_classification.tsv file to assume it is the number of copies of that gene? For expample, I want to calculate the copy number of the Amylase gene, so I grep all record about AMY with command
grep 'AMY' orthologsClassification.tsv
. The third column contains three genes, reg_14433, reg_10659 and reg_14509, so I think there are three copies of the Amylase gene in this species, is that right?Besides, I want to identify the location of this gene. Can I determine the location of this gene by calculating the position of the longest transcript of a gene in the query genome? For example, to identify the location of gene reg_14433, I want to calculate the longest of ENST00000370079.AMY1C.1087156, ENST00000684141.AMY1C.1087156, ENST00000361355.AMY2B.38, ENST00000370080.AMY1B.7580, ENST00000622339.AMY2A.38397 and ENST00000414303.AMY2A.38397 to represent this gene, is that right?
Thank you for your help
Yawen