tangerzhang / ALLHiC

ALLHiC: phasing and scaffolding polyploid genomes based on Hi-C data

Why is ALLHiC allele-aware? #110

Closed: Yutang-ETH closed this issue 2 years ago

Yutang-ETH commented 2 years ago

Hi,

Maybe this is a silly question, but I really don't understand how ALLHiC can connect contigs from the same haplotype.

For example, take a heterozygous diploid genome with two regions on a chromosome, A and B. Each region is assembled twice, so there are two allelic contigs per region: A1 and A2 for region A, B1 and B2 for region B (A1 and B1 are on one haplotype, A2 and B2 on the other). My question is: how does ALLHiC know that A1 should be linked to B1 and A2 to B2? What information does ALLHiC use to phase the alleles between two loci? I checked the wiki page but didn't find a satisfactory answer. Please help me understand this.

Thank you very much.

Best wishes, Yutang

tangerzhang commented 2 years ago

The basic concept of Hi-C scaffolding is that loci on the same chromosome are more likely to contact each other than loci on different chromosomes. In other words, if A1 and B1 are from the same chromosome, they should share more Hi-C link signals with each other than either of them shares with A2 or B2. Please find more details in our publication: Assembly of allele-aware, chromosomal-scale autopolyploid genomes based on Hi-C data
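To make the idea concrete, here is a minimal Python sketch of comparing link counts between contig pairs. It is only an illustration of the principle described above, not ALLHiC's actual implementation; the function names `count_links` and `best_partner` and the toy read-pair counts are made up for this example.

```python
from collections import Counter


def count_links(read_pairs):
    """Count Hi-C read pairs that link two different contigs.

    read_pairs: iterable of (contig_of_mate1, contig_of_mate2) tuples.
    Pairs whose mates fall on the same contig are ignored, since they
    say nothing about how contigs relate to each other.
    """
    links = Counter()
    for contig_a, contig_b in read_pairs:
        if contig_a != contig_b:
            # frozenset makes the key independent of mate order
            links[frozenset((contig_a, contig_b))] += 1
    return links


def best_partner(contig, candidates, links):
    """Return the candidate contig sharing the most links with `contig`."""
    return max(candidates, key=lambda c: links[frozenset((contig, c))])


# Hypothetical toy data: same-haplotype pairs get more links than
# cross-haplotype pairs.
toy_pairs = (
    [("A1", "B1")] * 8
    + [("A1", "B2")] * 2
    + [("B2", "A2")] * 7
    + [("A2", "B1")] * 1
    + [("A1", "A1")] * 5  # intra-contig pairs, ignored
)

links = count_links(toy_pairs)
partner_of_a1 = best_partner("A1", ["B1", "B2"], links)
partner_of_a2 = best_partner("A2", ["B1", "B2"], links)
print(partner_of_a1, partner_of_a2)
```

In this toy data A1 pairs with B1 and A2 with B2 simply because those pairs have the higher counts. Real data would also need normalization (for example by contig length or restriction-site count), which this sketch leaves out.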

Yutang-ETH commented 2 years ago

Thank you very much for your reply. More precisely, my question is: how do you know whether a Hi-C contact is inter- or intra-chromosomal? Do you rely on any SNP information? I understand that more Hi-C contacts should occur within the same chromosome than between homologous chromosomes, but when you map Hi-C reads to the genome, how do you know that the reads are mapped to exactly the chromosome they came from?

Best wishes, Yutang

tangerzhang commented 2 years ago

OK, that's why we need to extract only high-quality read mappings. One solution is to retain only uniquely mapped reads.
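As a rough illustration of that filtering step, the Python sketch below keeps a read pair only when both mates have a high mapping quality. A read that aligns equally well to both allelic contigs typically receives a low mapping quality, so it is dropped and cannot create a false cross-haplotype link. The threshold of 30, the function name `keep_confident_pairs`, and the toy records are assumptions for this example; how "uniquely mapped" is encoded depends on the aligner, and this is not ALLHiC's own filtering code.

```python
def keep_confident_pairs(pairs, min_mapq=30):
    """Keep read pairs in which both mates map with MAPQ >= min_mapq.

    pairs: iterable of ((contig1, mapq1), (contig2, mapq2)) tuples.
    Returns a list of (contig1, contig2) tuples for the retained pairs.
    The min_mapq default is an illustrative choice, not a recommendation.
    """
    kept = []
    for (contig1, mapq1), (contig2, mapq2) in pairs:
        if mapq1 >= min_mapq and mapq2 >= min_mapq:
            kept.append((contig1, contig2))
    return kept


# Hypothetical toy records: (contig, MAPQ) for each mate of a pair.
toy_records = [
    (("A1", 60), ("B1", 60)),  # both mates placed confidently: kept
    (("A1", 0), ("B2", 60)),   # mate 1 fits A1 and A2 equally well: dropped
    (("A2", 42), ("B2", 37)),  # both above the threshold: kept
    (("A2", 60), ("B1", 3)),   # mate 2 is ambiguous: dropped
]

confident = keep_confident_pairs(toy_records)
print(confident)
```

The retained pairs are the ones whose mates overlap sequence differences between the alleles, which is what lets the link counts distinguish the two haplotypes.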

Yutang-ETH commented 2 years ago

Thank you very much for your explanation. I think it makes sense to me.

By the way, I read the Tieguanyin tea genome paper, where you said that some switch errors remain in the haplotype-resolved genome assembly. I am wondering whether these switch errors were introduced by Canu during the original assembly or by ALLHiC during Hi-C scaffolding. Could you please shed some light on that?

Best wishes, Yutang

tangerzhang commented 2 years ago

Both Canu and ALLHiC may introduce switch errors. However, based on our simulation data, a highly accurate contig-level assembly greatly reduces the number of switch errors.

Yutang-ETH commented 2 years ago

Thank you very much for your patient explanation. OK, I think I got it.

Have a nice evening if you're in China.

Best wishes, Yutang