How to improve grouping and scaffolding?

ddelgadillod commented 3 years ago

Hi

I'm running ALLHiC to phase and scaffold a tetraploid potato genome, starting from a draft assembly of 6692 contigs totaling 2.4 Gb. I ran every step in the wiki tutorial for sugarcane. How can I improve grouping and scaffolding, and how can I avoid one extremely long contig? The intermediate group1 files (.txt, .tour) are much longer than the other groups' files.

I ran:

bwa index -a bwtsw canu_asm.cntgs_v1.fasta  
samtools faidx canu_asm.cntgs_v1.fasta

##'Aligning Hi-C reads to the draft assembly'

bwa aln -t 20 canu_asm.cntgs_v1.fasta HiC_R1.fastq.gz > sample_R1.sai  
bwa aln -t 20 canu_asm.cntgs_v1.fasta HiC_R2.fastq.gz > sample_R2.sai  
bwa sampe canu_asm.cntgs_v1.fasta sample_R1.sai sample_R2.sai HiC_R1.fastq.gz HiC_R2.fastq.gz > sample.bwa_aln_canu_v1_HiC.sam  

##'Filtering SAM file'

PreprocessSAMs.pl sample.bwa_aln_canu_v1_HiC.sam canu_asm.cntgs_v1.fasta MBOI
filterBAM_forHiC.pl sample.bwa_aln_canu_v1_HiC.REduced.paired_only.bam sample.clean.sam
samtools view -bt canu_asm.cntgs_v1.fasta.fai sample.clean.sam > sample.clean.bam

##'Make Allele.ctg.table'
## Following the instructions in issue 16, using a diploid potato genome annotation file
gmap2AlleleTable.pl RH89-039-16_potato_gene_models.v3.gff3

##'AllHiC Prune'

ALLHiC_prune -i Allele.ctg.table -b sample.clean.bam -r canu_asm.cntgs_v1.fasta

At the partition step I tried k = 4, 6, 12, and 48. The restriction enzyme is DpnII (GATC), the same recognition site as MboI.

##'AllHiC partition'
ALLHiC_partition -r canu_asm.cntgs_v1.fasta -b prunning.bam -e GATC -k 12 -m 25
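Since partitioning depends on restriction-site counts per contig, it can be useful to see how many sites each contig carries for a given motif, for example to compare GATC with the HindIII motif AAGCTT. The following is a minimal sketch in plain awk, not ALLHiC's own counting code, so its numbers may differ from the counts file. The function name `count_re_sites` is hypothetical, and joining each record in memory makes it best suited to spot checks on a subset of contigs.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ALLHiC): count non-overlapping
# occurrences of a recognition motif in every record of a FASTA file.
# $1 = FASTA file, $2 = motif in upper case (e.g. GATC or AAGCTT)
count_re_sites() {
    awk -v motif="$2" '
        # Print the finished record: name, then number of motif matches.
        function emit() {
            if (name != "") {
                n = gsub(motif, "&", seq)   # gsub returns the match count
                printf "%s\t%d\n", name, n
            }
        }
        /^>/ { emit(); name = substr($1, 2); seq = ""; next }
        # Join wrapped lines so sites spanning a line break are counted.
        { seq = seq toupper($0) }
        END { emit() }
    ' "$1"
}

# Usage (file name taken from the commands above):
# count_re_sites canu_asm.cntgs_v1.fasta GATC | sort -k2,2n | head
```

Sorting by the second column lists the contigs with the fewest sites first, which are the ones that contribute the least Hi-C signal to clustering.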

##'AllHiC rescue' 
ALLHiC_rescue -r canu_asm.cntgs_v1.fasta -b sample.clean.bam -c prunning.clusters.txt -i prunning.counts_GATC.txt

##'AllHiC optimize'
allhic extract sample.clean.bam canu_asm.cntgs_v1.fasta --RE GATC
for K in {1..12};do allhic optimize group${K}.txt sample.clean.clm;done

##'AllHiC Build' 

ALLHiC_build canu_asm.cntgs_v1.fasta

At this point, I ran a QUAST analysis and saw that, with every k value, I obtain one supercontig of 2.2 Gb out of a total length of 2.4 Gb in groups.asm.fasta.

I reviewed every step and the resulting logs, and everything ran successfully.
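A quick way to quantify the oversized group without rerunning QUAST is to index groups.asm.fasta with samtools faidx and summarize the resulting .fai file, whose first two columns are the sequence name and its length. The sketch below is only an inspection aid, not part of ALLHiC, and the function name `largest_seq_share` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report the longest sequence in a FASTA index (.fai)
# and its share of the total length.
# $1 = .fai file (column 1 = sequence name, column 2 = length in bases)
largest_seq_share() {
    awk -F'\t' '
        { total += $2; if ($2 > max) { max = $2; name = $1 } }
        END {
            # Print the length with %s so multi-gigabase values are not
            # truncated by awk implementations with 32-bit %d.
            if (total > 0)
                printf "%s\t%s\t%.1f%%\n", name, max, 100 * max / total
        }
    ' "$1"
}

# Usage (assumes: samtools faidx groups.asm.fasta has been run):
# largest_seq_share groups.asm.fasta.fai
```

Comparing this figure across the runs with different k values shows whether any setting actually splits the large group.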

tangerzhang commented 3 years ago

Hi @ddelgadillod. When scaffolding the tetraploid sugarcane genome, I first assigned contigs to 10 homologous groups based on its close relative, the sorghum genome, following this wiki guideline (https://github.com/tangerzhang/ALLHiC/wiki/ALLHiC:-scaffolding-an-auto-polyploid-sugarcane-genome). After that, the contigs in each group were run through the ALLHiC phasing pipeline as you showed above. Is that the method you used when you say you ran every step in the wiki tutorial for sugarcane? In our experience, a couple of factors may affect phasing and scaffolding:

chimeric assembly errors in contigs
collapsed assembly in contigs
We recently found that HindIII performs better than DpnII/MboI.

Can you perform a dotplot analysis between the ALLHiC scaffolds and the diploid reference? This will help you check for chimeric scaffolds in the ALLHiC results.
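Alongside a visual dotplot, the same alignments can be summarized as text. Assuming the ALLHiC scaffolds have been aligned to the diploid reference with an aligner that writes PAF (minimap2 is one example), the sketch below totals the alignment block length (column 11) per scaffold and reference sequence (columns 1 and 6) and lists scaffolds whose best-matching reference sequence holds less than a chosen share of the aligned bases. The function name `flag_mixed_scaffolds` and the 0.8 default cutoff are hypothetical illustrations, not ALLHiC recommendations.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list scaffolds whose alignments are spread over
# several reference sequences, as a text companion to a dotplot.
# $1 = PAF file, $2 = minimum share for the top reference (default 0.8)
flag_mixed_scaffolds() {
    awk -F'\t' -v cutoff="${2:-0.8}" '
        # Accumulate aligned block length per (scaffold, reference) pair.
        { pair[$1 SUBSEP $6] += $11; total[$1] += $11 }
        END {
            # Find the reference sequence with the most aligned bases.
            for (key in pair) {
                split(key, part, SUBSEP)
                q = part[1]
                if (pair[key] > best[q]) { best[q] = pair[key]; top[q] = part[2] }
            }
            # Report scaffold, top reference, and that reference share.
            for (q in total)
                if (total[q] > 0 && best[q] / total[q] < cutoff)
                    printf "%s\t%s\t%.2f\n", q, top[q], best[q] / total[q]
        }
    ' "$1" | sort
}

# Usage (scaffolds_vs_ref.paf is a placeholder file name):
# flag_mixed_scaffolds scaffolds_vs_ref.paf 0.8
```

A scaffold listed here is only a candidate. Repeats and true homology between chromosomes also spread alignments, so the dotplot remains the place to confirm a chimeric join.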

phrh commented 3 years ago

Hi, the problem was the assignment of contigs to the homologous groups. We divided the contigs among the 12 homologous chromosomes, each split in two, giving 24 groups (chri_1, chri_2), because our reference is a haplotype assembly of a diploid potato. For sugarcane, were you able to recover the haplotype assembly for each chromosome? What would you suggest for obtaining a haplotype assembly of a tetraploid potato when a diploid reference assembly is used to separate the contigs?

Did you use purge_haplotigs or something similar after using allhic?

Looking forward to an answer. Best regards

tangerzhang / ALLHiC

How to improve grouping and scaffolding? #80