tangerzhang / ALLHiC

ALLHiC: phasing and scaffolding polyploid genomes based on Hi-C data

What is different between ALLHiC and 3D-DNA? #53

Closed sunnycqcn closed 4 years ago

sunnycqcn commented 4 years ago

Hello, I tried to use ALLHiC and 3D-DNA to scaffold my draft genome. My genome has 9 chromosomes. With 3D-DNA I only got 5 chromosomes. With ALLHiC I set 9 groups and got 9 chromosomes (> 25 Mb).

Minimum    Number      Number     Total          Total          Scaffold
Scaffold   of          of         Scaffold       Contig         Contig
Length     Scaffolds   Contigs    Length         Length         Coverage
--------   ---------   -------    -----------    -----------    --------
All        30          633        567,657,168    567,596,868    99.99%
10 KB      30          633        567,657,168    567,596,868    99.99%
25 KB      27          630        567,585,544    567,525,244    99.99%
50 KB      13          616        567,047,389    566,987,089    99.99%
100 KB     10          613        566,860,262    566,799,962    99.99%
250 KB     10          613        566,860,262    566,799,962    99.99%
500 KB     10          613        566,860,262    566,799,962    99.99%
1 MB       10          613        566,860,262    566,799,962    99.99%
2.5 MB     10          613        566,860,262    566,799,962    99.99%
5 MB       10          613        566,860,262    566,799,962    99.99%
10 MB      9           612        559,898,565    559,838,265    99.99%
25 MB      9           612        559,898,565    559,838,265    99.99%
50 MB      5           483        401,298,811    401,251,011    99.99%
100 MB     1           52         155,342,789    155,337,689    100.00%

My question is: what is the difference between ALLHiC and 3D-DNA? Thanks, Fuyou
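One way to double-check a count such as "9 scaffolds above 25 MB" is to count the FASTA records that reach a length threshold. The following is only a minimal sketch: the function name `count_long_seqs` is a hypothetical helper, not part of ALLHiC or 3D-DNA, and the file name in the usage note is a placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ALLHiC or 3D-DNA): count FASTA records
# whose total sequence length is at least MIN_LEN bases.
count_long_seqs() {
    local fasta="$1" min_len="$2"
    awk -v min="$min_len" '
        /^>/ {                      # a header line starts a new record
            if (seen && len >= min) n++
            len = 0; seen = 1
            next
        }
        { len += length($0) }       # sequence line: add its length
        END {
            if (seen && len >= min) n++
            print n + 0             # print 0 rather than an empty string
        }
    ' "$fasta"
}
```

For example, `count_long_seqs scaffolds.fasta 25000000` prints the number of sequences of at least 25 Mb in `scaffolds.fasta` (a placeholder name), which can then be compared with the expected chromosome number.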

tangerzhang commented 4 years ago

Wow, this is a big question and very difficult to answer in a few words. Briefly, ALLHiC is designed for polyploid genome scaffolding and can also be used for diploid genome scaffolding. The advantage of ALLHiC is that it can phase haplotypes in polyploid genomes and performs better when scaffolding short contigs. 3D-DNA is definitely a wonderful tool and is able to correct mis-assembled contigs, thus producing reliable Hi-C scaffolding for simple diploid genomes. However, in my experience 3D-DNA is not able to generate a chromosome-level assembly in some cases (including yours). If so, ALLHiC could be an option that might solve the problem you are facing.

sunnycqcn commented 4 years ago

Hello, I much appreciate your explanation. My genome is a plant genome, Brassica oleracea, with highly homologous regions. I think that is why I cannot get good results with 3D-DNA. Do you have experience with scaffolding with ALLHiC first and then using 3D-DNA to correct misassemblies? Thanks, Fuyou

tangerzhang commented 4 years ago

Hi Fuyou, We recently developed ALLHiC_corrector, which has a similar function to the 3D-DNA pipeline for correcting mis-assembled contigs. The code can be found here: https://github.com/tangerzhang/ALLHiC/blob/master/bin/ALLHiC_corrector

ALLHiC_corrector -m mapping.sorted.bam -r seq.fasta -o seq.correct2.fasta -t 12

Please note that this script requires pysam and numpy to be installed in your Python environment, and the BAM file should be sorted and indexed using samtools. After the correction, you can use the ALLHiC pipeline to generate a new round of Hi-C scaffolding.
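The prerequisites mentioned above (a sorted, indexed BAM plus pysam and numpy) can be checked before launching the corrector. The following is a rough sketch, not part of ALLHiC: the function names are hypothetical, the file names are placeholders, and the `PYTHON` variable is an assumption for pointing at whichever interpreter runs ALLHiC_corrector on your system.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helpers (not part of ALLHiC).

# Return 0 if the BAM file exists and an index file sits next to it.
bam_is_indexed() {
    local bam="$1"
    [ -f "$bam" ] || return 1
    # samtools index writes <file>.bam.bai by default (or .csi with -c)
    [ -f "${bam}.bai" ] || [ -f "${bam}.csi" ] || [ -f "${bam%.bam}.bai" ]
}

# Coordinate-sort and index a BAM file with samtools.
sort_and_index() {
    local in_bam="$1" out_bam="$2" threads="${3:-4}"
    samtools sort -@ "$threads" -o "$out_bam" "$in_bam" && samtools index "$out_bam"
}

# Return 0 if pysam and numpy can be imported by the chosen interpreter.
# PYTHON is an assumed override; it defaults to "python".
python_deps_ok() {
    "${PYTHON:-python}" -c 'import pysam, numpy' 2>/dev/null
}
```

A possible use would be `sort_and_index mapping.bam mapping.sorted.bam 12`, followed by `bam_is_indexed mapping.sorted.bam && python_deps_ok` before running the ALLHiC_corrector command shown above.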

sunnycqcn commented 4 years ago

Hello, Thanks. I have tried this pipeline. I still find some misassemblies relative to the reference genome. The attachment is my result: the final scaffolds compared with the reference genome using LAST. [image: BOC]

tangerzhang commented 4 years ago

Hi Fuyou, Thanks for sharing the results. It looks like mis-scaffolding happened in the partition stage. The good news is that you have a reference genome, so you can use reference-assisted assembly to help cluster the corrected contigs. Below are my suggestions:

1) Use RaGOO to generate a reference-guided genome assembly (https://github.com/malonge/RaGOO).

2) Go to the directory that contains the ordering results and collect the ordering files for the chromosome-level sequences, with commands like the ones below:

$ cd ragoo_output/orderings
$ find . -name '*_orderings.txt' | grep -v Scaffold > orderings.list

3) After that, you can use my wrapper script to optimize the ordering and orientation of each group based on ALLHiC. The script can be found here: https://github.com/tangerzhang/ALLHiC/blob/master/scripts/ragoo2ALLHiC.pl

perl ragoo2ALLHiC.pl -l orderings.list -r seq.HiCcorrected.fasta -b sample.bwa_mem.bam

The Hi-C corrected fasta (seq.HiCcorrected.fasta) could come from ALLHiC_corrector. Alternatively, if this does not give you a good result, you can also try the corrected fasta from 3D-DNA. Since ALLHiC_corrector is still in development, we are not sure whether it can produce a result as good as 3D-DNA's. That's why we have not officially released the code. Hope this helps!
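The file-collection step in suggestion 2 can be wrapped so that the number of collected ordering files is visible before ragoo2ALLHiC.pl is run. This is only a sketch under assumptions: `collect_orderings` is a hypothetical helper, not part of ALLHiC or RaGOO, and it keeps the same `find`/`grep` filter as the commands in the thread.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the find/grep step from the thread.
# Writes <dir>/orderings.list and prints the number of files listed.
collect_orderings() {
    local dir="$1"
    (
        cd "$dir" || exit 1
        # Keep *_orderings.txt files, drop any whose name contains "Scaffold"
        find . -name '*_orderings.txt' | grep -v Scaffold | sort > orderings.list
        # Report how many ordering files were kept
        wc -l < orderings.list | tr -d ' '
    )
}
```

Running `collect_orderings ragoo_output/orderings` prints the count of listed files. If that number differs from the expected chromosome number (9 for the genome discussed here), the reference sequence names or the `grep` filter probably need a second look.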

sunnycqcn commented 4 years ago

Hello, I finished the assembly based on your suggestions. One is from the uncorrected contigs, which is better. The other is from the contigs corrected by 3D-DNA. This is from the contigs corrected by 3D-DNA: [image: BO3R] This is from the draft contigs assembled using smartdenovo with corrected long reads: [image: BOR]

I think it is not consistent with the reference genome. Thanks,

Fuyou

sunnycqcn commented 4 years ago

Hello, If I only use RaGOO or 3D-DNA for scaffolding, it looks like I can get good results, but I am not sure they are correct. This is the 3D-DNA-corrected contigs scaffolded with RaGOO: [image: BORR] This is the scaffolds from 3D-DNA, but I need to break some scaffolds: [image: BO3F]

This is the uncorrected contigs scaffolded with RaGOO: [image: BOOR]

Thanks, Fuyou

tangerzhang commented 4 years ago

Hi Fuyou, Since you are working on the same species as your reference genome, I think it is totally OK to use RaGOO to anchor your contigs. The RaGOO results look good. You can also plot a Hi-C heatmap to validate the RaGOO assembly.

sunnycqcn commented 4 years ago

Hello, I think I got it. Thanks, Fuyou

tinyfallen commented 2 years ago

Hi dear developer,

I noticed that RaGOO has been superseded by RagTag, which has better efficiency and accuracy. Could you please update the ragoo2ALLHiC.pl script to support RagTag, if you would like to? Thanks!

Best!