How to obtain consensus CDS after correction

alexyfyf commented 1 year ago

Thanks for developing this tool. As I understand it, each cluster basically represents an isoform/transcript of a gene. How can we get the final CDS after correction for each cluster (i.e. transcript)?

Thanks for your help.

ksahlin commented 1 year ago

Thank you for using it! Each cluster (from isONclust) represents all transcripts expressed in the dataset from a gene (or from a gene family with similar gene copies). isONclust does not do transcript-level clustering yet. Therefore, the clusters corrected with isONcorrect are still at 'gene level'.

We are currently working on the consensus generation of each transcript. Will get back to you, and this issue, when we have something stable.

alexyfyf commented 1 year ago

Thank you so much for your reply.

I am a bit confused. I thought each cluster would correspond to a transcript. Could you explain why it is only at gene level? I might have missed that in your paper, but do you mean that all spliced isoforms from the same gene would be in the same cluster? Could you point me to the part where you describe this? Sorry if I should have asked in the isONclust repo.

Thanks a lot again.

ksahlin commented 1 year ago

Yes, spliced isoforms from the same gene should be in the same cluster. isONclust clusters at the gene level; all isoforms from a gene are put in the same cluster (that is the goal at least). It is mentioned briefly in the abstract and in the last section of the introduction in the isONclust paper. In the isONcorrect paper it is mentioned at the start of the "Algorithm overview" section, and it is implied in e.g. Figure 1 and in the discussion.

The "why" is that it was designed that way for the correction step: isONcorrect gets more coverage per exon to infer what are errors and what are mutations, leading to a lower post-correction error rate. We write this in the isONcorrect paper:

"One of the underlying strengths of the isONcorrect algorithm is its ability to error correct reads even if there are as little as one read per transcript. The idea is to leverage exons that are shared between different splice isoforms. To achieve this, we pre-process the reads using our isONclust clustering algorithm, which clusters reads according to the gene family of origin. This strategy is in sharp contrast to approaches which cluster based on the isoform of origin. Such clustering results in low read coverage per transcript [24], particularly for genes expressing multiple isoforms with variable start and stop sites and makes error correction unable to utilize full coverage over shared exons."

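To make the coverage argument concrete, here is a minimal toy sketch in Python. It is not isONclust or isONcorrect code; the gene, the isoform names, the exon names, and the helper functions are all made up for illustration. It only counts how many reads cover each exon when reads are grouped per gene versus per isoform, in the extreme case of one read per transcript.

```python
from collections import Counter

# Hypothetical toy gene: three isoforms sharing exons (all names are made up).
ISOFORM_EXONS = {
    "iso_A": ["e1", "e2", "e3"],
    "iso_B": ["e1", "e3"],
    "iso_C": ["e2", "e3", "e4"],
}

# One read per isoform: the low-coverage case described in the quote above.
READS = ["iso_A", "iso_B", "iso_C"]


def exon_coverage(reads):
    """Count how many reads in a single cluster span each exon."""
    coverage = Counter()
    for isoform in reads:
        coverage.update(ISOFORM_EXONS[isoform])
    return dict(coverage)


def isoform_level_coverage(reads):
    """Group reads by isoform of origin, then count exon coverage per group."""
    clusters = {}
    for isoform in reads:
        clusters.setdefault(isoform, []).append(isoform)
    return {name: exon_coverage(members) for name, members in clusters.items()}


# Gene-level clustering: all reads from the gene end up in one cluster.
gene_level = exon_coverage(READS)
# Isoform-level clustering: each isoform gets its own cluster.
isoform_level = isoform_level_coverage(READS)

print("gene-level cluster:", gene_level)
print("isoform-level clusters:", isoform_level)
```

In this toy setup every exon in an isoform-level cluster is covered by a single read, while in the gene-level cluster the shared exons are covered by two or three reads. That extra coverage over shared exons is what the correction step can use.
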
ksahlin commented 11 months ago

I now consider this issue solved: we have developed isONform, which takes reads corrected with isONcorrect and produces predicted transcripts, which is what the OP sought.

ksahlin / isONcorrect

how to obtain consensus cds after correction #23