tangerzhang / ALLHiC

ALLHiC: phasing and scaffolding polyploid genomes based on Hi-C data

Unexpected number of groups in ALLHiC results #61

Closed LQHHHHH closed 4 years ago

LQHHHHH commented 4 years ago

Hi, @tangerzhang

I have finished my contig-level assembly using hifiasm with HiFi reads. Because Purge-dups is built into hifiasm, I used the purged contigs directly. Following your tutorial (https://github.com/tangerzhang/ALLHiC/wiki), I skipped the Prune step. After the Partition step, ALLHiC gave only 6 groups, even though -k was set to 17 or higher. Do my contigs contain too many misjoins, or did I do one of the steps wrong?

tangerzhang commented 4 years ago

What is the coverage of your Hi-C reads? I suspect this is due to low Hi-C sequencing coverage or to mapping issues. Please also check your mapping BAM. Does it look normal or disrupted?

LQHHHHH commented 4 years ago

Hi, @tangerzhang

The coverage of my Hi-C data is ~200x. I checked the sample.clean.bam file generated by the "Map Hi-C reads to draft assembly" step. After filtering, 99% of my Hi-C reads mapped to the 17 longest contigs, while the other, shorter contigs had almost no reads mapped; my chromosome number is 17. I then checked the raw aligned BAM generated by bwa aln && bwa sampe and found that reads do map to the short contigs there. ALLHiC only produced 6 groups, which included 12 contigs; however, these contigs were short and had few Hi-C reads mapped. It's very strange.
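For anyone wanting to repeat the per-contig check described above, here is a minimal sketch in Python. It assumes you have already saved the tab-separated output of samtools idxstats for the BAM (columns: contig name, length, mapped reads, unmapped reads). The function name mapped_fraction_in_longest and the toy numbers are made up for illustration; this is not part of ALLHiC.

```python
def mapped_fraction_in_longest(idxstats_text, n_longest=17):
    """Fraction of mapped reads that fall on the n_longest longest contigs.

    idxstats_text is assumed to be the text output of `samtools idxstats`:
    name <TAB> length <TAB> mapped <TAB> unmapped, one contig per line.
    """
    records = []
    for line in idxstats_text.splitlines():
        if not line.strip():
            continue
        name, length, mapped, _unmapped = line.split("\t")
        if name == "*":  # summary line for reads with no reference
            continue
        records.append((name, int(length), int(mapped)))

    total = sum(mapped for _, _, mapped in records)
    if total == 0:
        return 0.0

    # Sort by contig length, longest first, and sum reads on the top N.
    records.sort(key=lambda rec: rec[1], reverse=True)
    top = sum(mapped for _, _, mapped in records[:n_longest])
    return top / total


# Toy input (hypothetical contigs, not real data).
example = "ctgA\t1000\t990\t0\nctgB\t800\t5\t0\nctgC\t100\t5\t0\n*\t0\t0\t12\n"
fraction = mapped_fraction_in_longest(example, n_longest=2)
print(f"{fraction:.3f}")
```

Running it on both the filtered and the raw BAM statistics would show whether the filtering step is what removes reads from the short contigs.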

tangerzhang commented 4 years ago

OK, I guess that the 17 longest contigs are actually a chromosome-level assembly. Perhaps you do not need Hi-C reads for scaffolding.

LQHHHHH commented 4 years ago

Dr. Zhang, Thank you for your reply.

But why can't ALLHiC work in this case? I ran SALSA2, which corrected some misjoins in my contigs and gave me final scaffolds. But it gave me over 19 scaffolds.

tangerzhang commented 4 years ago

You can also use ALLHiC_corrector to correct the chimeric contigs and then use ALLHiC to build the chromosome-level assembly. What I meant before is that the 17 longest contigs likely represent 17 pseudo-chromosomes. If they account for a large proportion of the genome sequence (e.g. >90%), you will not need to perform scaffolding.
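The ">90%" check mentioned above can be done with a few lines of Python. This is only a rough sketch: it reads a FASTA from any file-like object, and the helper names (contig_lengths, top_fraction) and the toy sequences are hypothetical.

```python
import io


def contig_lengths(handle):
    """Return {contig_name: length} from a FASTA file-like object."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep the ID up to the first space
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def top_fraction(lengths, n):
    """Fraction of total assembly length held by the n longest contigs."""
    total = sum(lengths.values())
    if total == 0:
        return 0.0
    longest = sorted(lengths.values(), reverse=True)[:n]
    return sum(longest) / total


# Toy FASTA (made-up sequences): one 18 bp contig and two 1 bp contigs.
toy = io.StringIO(">ctg1 long\nACGTACGTAC\nGTACGTAC\n>ctg2\nA\n>ctg3\nC\n")
lengths = contig_lengths(toy)
print(top_fraction(lengths, 1))
```

For a real assembly you would pass open("contigs.fa") (a placeholder file name) and n=17, then compare the result with 0.9.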

LQHHHHH commented 4 years ago

Thank you! One last question: should I remove these short contigs (length < 500 kb) before performing the mapping step?

tangerzhang commented 4 years ago

There is no need to remove the short contigs as these contigs have very limited restriction sites (cutoff: 25) and thus will not be included in the Hi-C scaffolding.
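To get a rough idea of which contigs would fall under a restriction-site cutoff like the one mentioned above, something like the following Python sketch can help. It is an illustration only, not ALLHiC's own counting code: the recognition site GATC is an assumption (use the site of the enzyme in your library), the function names are made up, and ALLHiC's exact counting rule may differ.

```python
def count_sites(sequence, site="GATC"):
    """Count non-overlapping occurrences of a recognition site (assumed GATC)."""
    return sequence.upper().count(site.upper())


def contigs_passing_cutoff(sequences, site="GATC", cutoff=25):
    """Return names of contigs with at least `cutoff` restriction sites.

    `sequences` maps contig name -> sequence string. The default cutoff of 25
    is the value quoted in the thread.
    """
    return [
        name
        for name, seq in sequences.items()
        if count_sites(seq, site) >= cutoff
    ]


# Toy sequences (not real data): one site-rich contig, one site-poor contig.
toy_contigs = {
    "long_ctg": "GATC" * 30,
    "short_ctg": "ACGT" * 10 + "GATC" * 3,
}
kept = contigs_passing_cutoff(toy_contigs)
print(kept)
```

Contigs missing from the returned list are the ones you would expect to be left out of scaffolding under this assumption, so removing them beforehand should not be necessary.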

LQHHHHH commented 4 years ago

Thank you!