tipputa / Circular-genome-visualizer

Possible to annotate/identify orfs in plot? #2

Open thorellk opened 7 years ago

thorellk commented 7 years ago

Hi again,

I am playing around with the output of the visualizer and wondered whether you have any method to annotate, in the plots, which genes have the highest tendency to be found at different coordinates in the genome?

Best wishes,

Kaisa

tipputa commented 7 years ago

Thank you very much for using my program and for your question. I am afraid I don't have any method to annotate genes in the plots, and I have no plans to implement one.

You can check gene annotations in the tsv files, data/locusTag*.tsv and data/output*consensus.tsv. The output_*consensus.tsv file includes the gene locations (angles) and the consensus position. The locusTag*.tsv file includes the locus_tag of each gene. Because the number of rows and the order of clusters are the same in both files, please combine them into one file, row by row, and check the gene clusters whose locations have a high standard deviation.
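A minimal sketch of that row-by-row combination, in Python with only the standard library. It is not part of the program. It assumes that header rows and any non-numeric columns have already been removed, so that each row of the consensus file holds only per-genome angles, and that the angles are in degrees; check both against your own files. The function names are hypothetical. Because locations on a circular genome wrap around at 0/360, the sketch ranks clusters by the circular standard deviation rather than the ordinary one.

```python
import csv
import math


def read_tsv(path):
    """Read a tab-separated file into a list of rows (lists of strings)."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle, delimiter="\t"))


def circular_sd_degrees(angles):
    """Circular standard deviation, in degrees, of angles given in degrees."""
    radians = [math.radians(a) for a in angles]
    mean_cos = sum(math.cos(r) for r in radians) / len(radians)
    mean_sin = sum(math.sin(r) for r in radians) / len(radians)
    # Mean resultant length; clamp to 1.0 to absorb floating-point error.
    resultant = min(1.0, math.hypot(mean_cos, mean_sin))
    if resultant == 0.0:
        return math.inf
    return math.degrees(math.sqrt(-2.0 * math.log(resultant)))


def rank_clusters(locus_rows, angle_rows):
    """Pair rows by position and sort clusters by decreasing angular spread."""
    if len(locus_rows) != len(angle_rows):
        raise ValueError("both files must have the same number of rows")
    ranked = []
    for tags, angles in zip(locus_rows, angle_rows):
        values = [float(a) for a in angles]  # assumes numeric columns only
        ranked.append((circular_sd_degrees(values), tags, values))
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked
```

With the two files loaded through `read_tsv` and trimmed as described, `rank_clusters(locus_rows, angle_rows)` lists the clusters with the most variable positions first, together with their locus tags.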

Best regards, Ipputa

thorellk commented 7 years ago

Thank you for your fast response. I have looked through the other files as you suggested, but I wonder a bit about one feature of the results.

In the locusTag*.tsv core genome list, several genes within each strain occur in several different gene clusters. For example, in one of the strains I am analysing, 235 genes occur in more than one gene cluster, each in between 2 and 12 clusters (average 3.16). Could you please explain this to me? In all standard core genome approaches I have used before, each gene is assigned to only one orthologous group/gene cluster. With the way the analysis is done in your software, the size of the core genome is inflated, and it also means that one gene can be present in several different locations in the plots? I can understand the rationale for this from a plotting perspective, but I think it becomes a bit difficult to interpret. On the other hand, I understand that the alternative, putting several genes from the same strain into the same cluster, also leads to problems when calculating the consensus and plotting. Have I understood this correctly, and could you explain in a bit more detail when this occurs?
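For reference, a count of this kind can be reproduced roughly as follows. This is only a sketch in Python: it assumes that one strain's column has already been extracted from the locusTag*.tsv file as a list of locus tags, one entry per gene cluster, and the function name is hypothetical.

```python
from collections import Counter


def multi_cluster_genes(column):
    """Return {locus_tag: number of clusters} for tags seen in more than one cluster.

    `column` is one strain's column: one locus tag per gene cluster.
    Empty entries are ignored.
    """
    counts = Counter(tag for tag in column if tag)
    return {tag: n for tag, n in counts.items() if n > 1}


def mean_clusters_per_gene(repeated):
    """Average number of clusters per repeated gene (0.0 if there are none)."""
    if not repeated:
        return 0.0
    return sum(repeated.values()) / len(repeated)
```

`len(multi_cluster_genes(column))` gives the number of genes that occur in more than one cluster, and `mean_clusters_per_gene` gives the average number of clusters per such gene.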

Best wishes, Kaisa

tipputa commented 7 years ago

Thank you for these very important questions. As you have noticed, my program's orthologous gene clustering has some problems, but they do not affect the visualization. In fact, a gene that occurs in several different gene clusters is plotted only once.

My program is not suitable for reporting the size of the core- and pan-genome because it doesn't merge similar gene clusters. For example, in other orthologous clustering tools such as GET_HOMOLOGUES, duplicated genes are grouped into one gene cluster, as follows:

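|          | genome1        | genome2                         | genome3        | consensus |
|----------|----------------|---------------------------------|----------------|-----------|
| clusterA | gen1A(loc: 30) | gen2A(loc: 30), gen2B(loc: 34)* | gen3A(loc: 30) | loc: 31   |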

*: They are duplicated genes

On the other hand, my program doesn't merge similar gene clusters; instead, it removes genes that appear more than once. In the following case, the genes in clusterA1 and gen2B are plotted, but gen1A and gen3A in clusterA2 are not used in the visualization.

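|           | genome1        | genome2        | genome3        | consensus |
|-----------|----------------|----------------|----------------|-----------|
| clusterA1 | gen1A(loc: 30) | gen2A(loc: 30) | gen3A(loc: 30) | loc: 30   |
| clusterA2 | gen1A(loc: 30) | gen2B(loc: 34) | gen3A(loc: 30) | loc: 31   |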
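The effect on the case above can be illustrated with a short Python sketch. This is not the program's actual code, only an illustration of the "plot each gene once" behaviour described here, and it assumes that clusters are processed in file order and that the first occurrence of a gene is the one kept. The function name is hypothetical.

```python
def genes_to_plot(clusters):
    """Keep each gene only the first time it appears across the clusters.

    `clusters` is a list of (cluster_name, [gene, ...]) pairs in file order.
    """
    seen = set()
    plotted = []
    for name, genes in clusters:
        kept = [gene for gene in genes if gene not in seen]
        seen.update(kept)
        plotted.append((name, kept))
    return plotted


clusters = [
    ("clusterA1", ["gen1A", "gen2A", "gen3A"]),
    ("clusterA2", ["gen1A", "gen2B", "gen3A"]),
]
for name, genes in genes_to_plot(clusters):
    print(name, genes)
```

For this input, all three genes of clusterA1 are kept, while only gen2B is kept from clusterA2.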

Strictly speaking, this handling of duplicated genes is a problem. However, this tool was developed for rough visualization, to make complicated genome structures easy to understand, and it works well for that purpose.

Best regards, Ipputa