Closed alexyfyf closed 11 months ago
Thank you for using it! Each cluster (from isONclust) represents all transcripts expressed in the dataset from a gene (or gene family with similar gene copies). isONclust does not do transcript-level clustering yet. Therefore, the clusters corrected with isONcorrect are still at 'gene level'.
We are currently working on the consensus generation of each transcript. Will get back to you, and this issue, when we have something stable.
Thank you so much for your reply.
I am a bit confused. I thought each cluster would correspond to a transcript; could you explain why it is only at gene level? I might have missed that in your paper, but do you mean all spliced isoforms from the same gene would be in the same cluster? Could you point me to the part where you described this? Sorry if I should have asked in the isONclust repo.
Thanks a lot again.
Yes, spliced isoforms from the same gene should be in the same cluster. isONclust clusters at the gene level; all isoforms from a gene are put in the same cluster (that is the goal at least). It is mentioned briefly in the abstract and in the last section of the introduction in the isONclust paper. In the isONcorrect paper it is mentioned at the start of the "Algorithm overview" section, as well as implied in e.g. Figure 1 and in the discussion.
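As a toy illustration of what gene-level clustering implies, the sketch below groups a few hypothetical reads either into one cluster per gene or one cluster per isoform, and counts how many reads support each exon within a cluster. The read labels, exon names, and the helper `exon_coverage` are made up for illustration; this is not isONclust or isONcorrect code.

```python
from collections import Counter

# Hypothetical toy data: one read per isoform of a single gene,
# each annotated with the exons it spans.
reads = [
    ("iso1", ["e1", "e2", "e3"]),
    ("iso2", ["e1", "e3"]),
    ("iso3", ["e2", "e3", "e4"]),
]


def exon_coverage(read_group):
    """Count how many reads in a cluster cover each exon."""
    coverage = Counter()
    for _isoform, exons in read_group:
        coverage.update(exons)
    return coverage


# Gene-level clustering: all reads from the gene share one cluster.
gene_cov = exon_coverage(reads)

# Isoform-level clustering: one cluster per isoform.
iso_clusters = {}
for isoform, exons in reads:
    iso_clusters.setdefault(isoform, []).append((isoform, exons))
iso_cov = {iso: exon_coverage(group) for iso, group in iso_clusters.items()}

print("gene-level:", dict(gene_cov))
print("isoform-level:", {iso: dict(cov) for iso, cov in iso_cov.items()})
```

In this toy case the shared exon `e3` is supported by three reads in the gene-level cluster, while every isoform-level cluster sees each exon only once.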
The "why" is that it was designed this way for the correction step: isONcorrect gets more coverage per exon to infer what are errors and what are mutations, leading to a lower post-correction error rate. We write this in the isONcorrect paper:
"One of the underlying strengths of the isONcorrect algorithm is its ability to error correct reads even if there are as little as one read per transcript. The idea is to leverage exons that are shared between different splice isoforms. To achieve this, we pre-process the reads using our isONclust clustering algorithm, which clusters reads according to the gene family of origin. This strategy is in sharp contrast to approaches which cluster based on the isoform of origin. Such clustering results in low read coverage per transcript [24], particularly for genes expressing multiple isoforms with variable start and stop sites and makes error correction unable to utilize full coverage over shared exons."
Thanks for developing this tool. I'm wondering: each cluster basically represents an isoform/transcript of a gene, but how can we get the final CDS after correction for each cluster (i.e. transcript)?
Thanks for your help