how to interpret the output plot

aaannaw commented 3 months ago

Hello Professor,

I downloaded the TOGA result files for several species from https://genome.senckenberg.de/download/TOGA/mouse_mm10_reference/Rodentia/. I focused on the gene ENSMUSG00000057170, which has a loss status in the loss_summ_data.tsv file. The plot for the two projections of this gene is shown for one of the species:

The detailed mutations for the two projections are listed in the geneInactivatingMutations.tsv file, which shows that all exons are deleted (masked) and the intact proportion is 0. However, according to the README, blue represents the intact status, so the colours in the mutation plot confuse me. I also see a red box, which represents a lost exon (Mask). Is there a difference between a deleted exon that is masked and one that is not masked?

I am also confused about the criteria for the different gene-level statuses. The NSMUSG0000001572 gene is reported as Missing, and its plot is shown. Perhaps a gene is called missing when its intact proportion is below a specified value? In addition, I see that some genes are not assigned any status in several genomes.

Could you give me any suggestions? I look forward to your reply. Best wishes! Na Wan

ReverendCasy commented 3 months ago

Hello Na Wan, the colour code you cited refers to the projection representation in the UCSC browser and denotes projection (= transcript) loss status. In the figures you showed, exons are coloured according to exon loss status, with the following colour designation:

cyan: exon is present;
blue: exon is deleted and masked (deletion does not shift the reading frame);
red: exon is deleted and not masked (deletion causes frameshift);
grey: exon is missing.

For example, in the last figure attached, the upper transcript has exons 1-6 and 11-14 missing, exons 7, 8, and 10 present, and exon 9 ’safely’ deleted and therefore masked. Hope that helps.
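The legend above can be written down as a small lookup table. This is only an illustrative sketch: the dictionary, the function name `describe_exons`, and the colour list are hypothetical and not part of TOGA; the colour list simply re-encodes the upper transcript described in the example.

```python
# Exon colour legend as given in the reply above (names are illustrative).
EXON_COLOUR_LEGEND = {
    "cyan": "present",
    "blue": "deleted and masked (reading frame preserved)",
    "red": "deleted and not masked (frameshift)",
    "grey": "missing",
}


def describe_exons(colours):
    """Map per-exon colours (exon 1 first) to (exon_number, status) pairs."""
    return [(number, EXON_COLOUR_LEGEND[colour])
            for number, colour in enumerate(colours, start=1)]


# Upper transcript from the example: exons 1-6 and 11-14 grey,
# exons 7, 8 and 10 cyan, exon 9 blue.
example = ["grey"] * 6 + ["cyan", "cyan", "blue", "cyan"] + ["grey"] * 4

for number, status in describe_exons(example):
    print(f"exon {number}: {status}")
```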

aaannaw commented 3 months ago

@ReverendCasy Hi, thanks for your reply.
In the output files, a gene has multiple transcripts and multiple projections with different statuses, so how should I judge the status of the gene? In the figure from "A genomics approach reveals insights into the importance of gene losses for mammalian adaptations", only a single status is shown per gene. Which transcript or projection should I choose to represent the status of the corresponding gene?

Also, I see that some genes are not assigned any status in some species in the output file. Why is that?

Best wishes! Na Wan

ReverendCasy commented 3 months ago

> In the output files, a gene has multiple transcripts and multiple projections with different statuses, so how should I judge the status of the gene?

Gene presence status can be found in the loss_summ_data.tsv file. Apparently, transcript data are not summarised at gene level in these plots.

> Also, I see that some genes are not assigned any status in some species in the output file. Why is that?

Do you mean that genes in loss_summ_data.tsv have the ’N’ loss status?

aaannaw commented 3 months ago

Hi @ReverendCasy, I am sorry for the late response.

> Gene presence status can be found in the loss_summ_data.tsv file. Apparently, transcript data are not summarised at gene level in these plots.

As in the following picture, the Nlrp5 gene has three transcripts: ENSMUST00000086341, ENSMUST00000015866 and ENSMUST00000108441. Each transcript has the same two projections: 43188 and 674755. If I want to show the status of the Nlrp5 gene, which one in the picture should I choose? The loss_summ_data.tsv file only shows the final, simple statuses (I, PL, UL, M/PM, L and PG).

> Do you mean that genes in loss_summ_data.tsv have the ’N’ loss status?

I cannot find the "N" status in the loss_summ_data.tsv file. For example, when I search for the status of the ENSMUSG00000057170 gene, I cannot find it in loss_summ_data.tsv.

Moreover, I found that the intact proportion is reported in two ways: INTACT_PERC_IGNORE_M and INTACT_PERC_INTACT_M. Which one is statistically corrected? In the geneInactivatingMutations.tsv file, one gene has multiple intact values, as in the following picture. Which one should be chosen to represent the intact value of the gene?

Best wishes! Na Wan

ReverendCasy commented 3 months ago

> Each transcript has the same two projections: 43188 and 674755. If I want to show the status of the Nlrp5 gene, which one in the picture should I choose? The loss_summ_data.tsv file only shows the final, simple statuses (I, PL, UL, M/PM, L and PG).

Not sure I understand what you mean by ‘simple statuses’ here. I/PI/L/etc. denote entity loss status in the query genome. Transcripts get the best status (in terms of presence) of their respective projections, and genes are classified by the best status of their respective isoforms. These data are not reflected in the plots you attached and should be looked up in loss_summ_data.tsv.
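A minimal sketch of that "best status wins" rule, for illustration only. The ranking in `STATUS_RANK` is an assumption about which status counts as "more present" and is not taken from this thread; the function names and the projection statuses in the example are hypothetical, not real TOGA results for Nlrp5.

```python
# ASSUMED ranking from least to most "present"; check it against the TOGA
# documentation before relying on it.
STATUS_RANK = {"N": 0, "PG": 1, "PM": 2, "L": 3, "M": 4, "UL": 5, "PI": 6, "I": 7}


def best_status(statuses):
    """Return the status with the highest presence rank."""
    return max(statuses, key=STATUS_RANK.__getitem__)


def gene_status(projections_by_transcript):
    """Collapse projection statuses to transcript level, then to gene level."""
    per_transcript = {
        transcript: best_status(projection_statuses)
        for transcript, projection_statuses in projections_by_transcript.items()
    }
    return best_status(per_transcript.values()), per_transcript


# Hypothetical statuses for the two projections of each transcript.
nlrp5 = {
    "ENSMUST00000086341": ["L", "UL"],
    "ENSMUST00000015866": ["M", "L"],
    "ENSMUST00000108441": ["PI", "L"],
}

gene, per_transcript = gene_status(nlrp5)
print(per_transcript)
print("gene status:", gene)
```

In practice the gene-level value is already written to loss_summ_data.tsv, so this only shows how a single gene status can arise from several projections.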

> For example, when I search for the status of the ENSMUSG00000057170 gene, I cannot find it in loss_summ_data.tsv.

Please remove whitespace from the grep pattern; the gene is definitely present in the file.
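If a scripted lookup is preferred over grep, the same pitfall can be avoided by stripping whitespace from the query before comparing. The sketch below assumes that each row of loss_summ_data.tsv holds three tab-separated fields (level, identifier, status); that layout, the function name `find_status`, and the sample rows are assumptions made for illustration.

```python
import io


def find_status(lines, query):
    """Return (level, status) pairs for rows whose identifier matches the query.

    ASSUMED row layout: level <TAB> identifier <TAB> status.
    """
    wanted = query.strip()  # drop stray spaces, tabs and newlines from the query
    hits = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        level, identifier, status = fields[:3]
        if identifier == wanted:
            hits.append((level, status))
    return hits


# Made-up sample rows standing in for the real file.
sample = io.StringIO(
    "GENE\tENSMUSG00000057170\tL\n"
    "GENE\tENSMUSG00000000001\tI\n"
)

# The padded query still matches because it is stripped first.
print(find_status(sample, " ENSMUSG00000057170 "))
```

With a real file, the same function can be called on an open file handle, for example `with open("loss_summ_data.tsv") as handle: find_status(handle, gene_id)`.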

> Moreover, I found that the intact proportion is reported in two ways: INTACT_PERC_IGNORE_M and INTACT_PERC_INTACT_M. Which one is statistically corrected?

INTACT_PERC_IGNORE_M stands for the intact fraction ignoring the missing portion of the projection altogether, and INTACT_PERC_INTACT_M is calculated by assuming the missing portion to be present and intact. These values are used for projection loss status classification, and I would not recommend using them as an overall sequence conservation measure. Also, note that these values correspond to projections, not genes.

MichaelHiller commented 5 days ago

Hi all,

Sorry, I was on vacation when this was discussed. Thx ReverendCasy for explaining everything.

Here is a supplementary figure that visually explains the intact RF measures:

hillerlab / TOGA

how to interpret the output plot #176