tangerzhang / ALLHiC

ALLHiC: phasing and scaffolding polyploid genomes based on Hi-C data

how to convert the ALLHIC output files into .hic files and .assembly files #68

Closed jinxin112233 closed 3 years ago

jinxin112233 commented 4 years ago

Hi, I used ALLHiC to scaffold my draft genome. The overall result looks good, but some contigs are mis-assembled, and no high-quality whole-genome sequence of a related species is currently available to correct them against. I would therefore like to correct the ALLHiC results manually in Juicebox. However, I don't know how to convert the ALLHiC output files into .hic and .assembly files. Could you provide scripts for this, so that the tools work together more smoothly?

Best, JX

tangerzhang commented 4 years ago

Phase Genomics provides scripts that generate the .hic and .assembly files, which can be used directly for Juicebox adjustment. You can find them at the following link: https://github.com/phasegenomics/juicebox_scripts Hope this is useful.

jinxin112233 commented 4 years ago

Your suggestion was useful and solved my problem very well. Thank you!

Best, JX

jinxin112233 commented 4 years ago

Hi, I am very interested in your recently developed ALLHiC_corrector. Since the wiki does not describe it yet, I have some questions about this command:
ALLHiC_corrector -m mapping.sorted.bam -r seq.fasta -o seq.correct2.fasta -t 12

Q1: What is mapping.sorted.bam, and how can I obtain it? Is it sample.clean.bam?
Q2: What is seq.fasta? Is it groups.asm.fasta?
Q3: After the correction I will get seq.correct2.fasta. Should I replace draft.asm.fasta with seq.correct2.fasta and then run a new round of Hi-C scaffolding?

Thank you for the help!

Best, JX

tangerzhang commented 4 years ago

Hi @jinxin112233
A1: You can use bwa mem and samtools sort to generate the sorted BAM file.
A2: seq.fasta is the draft contig file produced by a PacBio/Nanopore assembler.
A3: Yes, seq.correct2.fasta is the corrected contig file, which can be used for further Hi-C scaffolding.
ALLHiC_corrector uses the core algorithm from 3D-DNA to correct the initial contig-level assembly, but runs faster. I will update the wiki once I get a chance.
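As a rough illustration of A1, the sorted BAM could be produced along the following lines. This is only a sketch: the helper name `make_sorted_bam`, the read file names in the usage note, and the thread default are placeholders, and the coordinate sort plus the `samtools index` step are assumptions about what ALLHiC_corrector expects rather than documented requirements.

```shell
# Hypothetical helper (not part of ALLHiC): map paired-end Hi-C reads to the
# draft contigs with bwa mem and write a sorted BAM with samtools sort.
# Arguments: <draft contigs fasta> <reads R1> <reads R2> <output bam> [threads]
make_sorted_bam() (
    ref=$1; r1=$2; r2=$3; out=$4; threads=${5:-12}

    # Build the bwa index only if it is not there yet.
    [ -s "$ref.bwt" ] || bwa index "$ref" || exit 1

    # Align, then sort by coordinate (the samtools sort default).
    bwa mem -t "$threads" "$ref" "$r1" "$r2" |
        samtools sort -@ "$threads" -o "$out" - || exit 1

    # Assumption: an index is wanted for random access; drop this if it is not.
    samtools index "$out"
)
```

A call such as `make_sorted_bam seq.fasta hic_R1.fastq.gz hic_R2.fastq.gz mapping.sorted.bam 12` would then yield the `mapping.sorted.bam` passed to `ALLHiC_corrector -m`.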

jinxin112233 commented 4 years ago

Hi, thank you for the suggestion. I have been trying this script over the past few days, and I look forward to seeing how ALLHiC_corrector improves the Hi-C scaffolding results.

Best JX

yilunhuangyue commented 3 years ago

Hi jinxin, have you successfully generated a .hic file from the ALLHiC result? I followed the pipeline at https://github.com/phasegenomics/juicebox_scripts, but matlock fails with the error "FATAL: something went wrong in process_pair". Could you give me some pointers? Which BAM file should be used here?

Thanks in advance for any help~

baozg commented 3 years ago

Workflow for converting the ALLHiC result for Juicebox:

# sort the BAM by read name
samtools sort -n -@ 12 -o aligned.sort_name.bam aligned.bam

# matlock: convert the name-sorted BAM to Juicer format
matlock bam2 juicer aligned.sort_name.bam mnd.txt

# convert the AGP to a .assembly file
python agp2assembly.py in.agp out.assembly

# build the .hic file with 3d-dna (the .assembly file goes first, then the mnd file)
bash 3d-dna/visualize/run-assembly-visualizer.sh -q 1 -p true out.assembly mnd.txt

yilunhuangyue commented 3 years ago

Thank you so much for your quick reply. I still get the error "FATAL: something went wrong in process_pair" when using matlock. I think there might be a problem with my sorted BAM file...

amit4mchiba commented 3 years ago

Hello,

I am writing here to ask a question based on this thread.

I performed scaffolding with both ALLHiC and 3D-DNA, and ALLHiC gave the best results, so I want to use it for my final assembly. I then performed gap filling with PBJelly and realized that the final assembly contains a 170 Mb scaffold, which is not correct. I checked the ALLHiC result and saw the same thing there, so I wanted to correct it with the Juicer approach.

I followed the method described above: I used the .bwa_aln.bam file to produce aligned.sort_name.bam and then ran matlock as suggested. Next, I created the .assembly file from the .agp file and ran the final step, but I always get the following error:

:) -q flag was triggered, starting calculations for 1 threshold mapping quality
:) -p flag was triggered. Running with GNU Parallel support parameter set to true.
...Remapping contact data from the original contig set to assembly
:( Assembly file does not match cprops file. Exiting!
...Building track files
:( Assembly file does not match cprops file. Exiting!
...Building the hic file
temp.Gg_out.sorted.links.txt.asm_mnd.txt does not exist or does not contain any reads.

Next, I mapped the Hi-C reads with bowtie using the script from HiC-Pro and used the aligned file as suggested above, but I got the same error again. I wonder how to create the cprops file and how to make it match the .assembly file.

I don't understand how to do this. I want to use Juicer to correct and redraw the chromosome boundaries before finalizing the assembly. I would really appreciate your advice. Please let me know if you need any further information in order to help.

Thank you so much in advance,

With best regards,
Amit

sandipmkale commented 2 years ago

Hello yilunhuangyue

Have you solved the issue? I am facing the same problem. I asked on the matlock GitHub, but they don't have a solution either.

Thanks and regards

Sandip

shengxinzhuan commented 2 years ago

Workflow for converting the ALLHiC result for Juicebox

############################################
###            Input Data                ###
# allhic cluster result--- groups.agp
# bwa alignment result--- sample.bwa_aln.bam
############################################

############################################
###            Sort Bam                  ###
############################################
samtools sort -n -@ 2 -o sample.sort.bam sample.bwa_aln.bam

############################################
###            Convert Format            ###
############################################

# matlock bam2 juicer (matlock can be installed with conda)
matlock bam2 juicer sample.sort.bam out.links.txt
# sort out.links.txt
sort -k2,2 -k6,6 out.links.txt > out.sorted.links.txt

# agp to assembly (scripts in https://github.com/phasegenomics/juicebox_scripts)
python agp2assembly.py groups.agp out.assembly

##############################################
###            Generate .hic file          ###
##############################################

# 3d-dna to generate .hic
bash 3d-dna/visualize/run-assembly-visualizer.sh -q 1 -p true out.assembly out.sorted.links.txt

Errors you may encounter:

# When you take the BAM file that ALLHiC has already processed,
# sort it, and try to convert it to out.links.txt:
INFO: converting bam to juicer on test.sort.bam
INFO: detected bam filetype
INFO: reading file "test.sort.bam"
FATAL: something went wrong in process_pair

# When out.links.txt is not sorted and you use
# run-assembly-visualizer.sh to generate the .hic file:
...Remapping contact data from the original contig set to assembly
:( Assembly file does not match cprops file. Exiting!
...Building track files
:( Assembly file does not match cprops file. Exiting!
...Building the hic file
temp.out.txt.asm_out.txt does not exist or does not contain any reads.
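
Since the second error is attributed above to an unsorted out.links.txt, it may be worth confirming the sort order before starting the visualizer. The following is a minimal check, assuming the same sort keys as in the workflow (columns 2 and 6) and the same locale as the original sort; the function name `links_sorted` is made up for this example.

```shell
# Return 0 if the links file is already in the order produced by
# `sort -k2,2 -k6,6`, non-zero otherwise. `sort -c` only checks the order
# and writes no output file.
links_sorted() {
    sort -c -k2,2 -k6,6 "$1" 2>/dev/null
}
```

For example, `links_sorted out.links.txt || sort -k2,2 -k6,6 out.links.txt > out.sorted.links.txt` re-sorts only when the check fails.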

Yujiaxin419 commented 2 years ago

Hi,

Generally speaking, the error "FATAL: something went wrong in process_pair" is caused by unpaired Hi-C read alignments in the .bam file. When I get this error, I filter the unpaired Hi-C alignments out of the .bam file.

Here is the pipeline I usually use to filter the .bam file and then adjust the ALLHiC assembly in Juicebox: https://github.com/Yujiaxin419/ALLHiC/wiki/Manually-refine-ALLHiC-scaffold-assembly-through-juicebox

I hope it is helpful.

Yujiaxin
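
One way to drop unpaired alignments, sketched here under stated assumptions, is to stream a name-sorted SAM through a small awk filter that keeps a read name only when exactly two consecutive records carry it. This is not the pipeline from the wiki linked above; the function name `keep_complete_pairs` and the file names in the usage note are placeholders, and the filter assumes that unmapped, secondary, and supplementary records were removed beforehand.

```shell
# Hypothetical filter: read a name-sorted SAM on stdin, pass header lines
# through, and print alignment records only for read names that occur
# exactly twice in a row (both mates present).
keep_complete_pairs() {
    awk '
        function emit_pair() {
            if (n == 2) { print buf[1]; print buf[2] }
        }
        /^@/ { print; next }                      # SAM header lines
        $1 != name { emit_pair(); name = $1; n = 0 }  # a new read name starts
        { buf[++n] = $0 }                         # collect records of this name
        END { emit_pair() }
    '
}
```

A possible invocation is `samtools view -h -F 0x904 sample.sort.bam | keep_complete_pairs | samtools view -b -o sample.paired.bam -`, where `-F 0x904` excludes unmapped, secondary, and supplementary records before the pairing check.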

zhaotao1987 commented 2 years ago

@shengxinzhuan Indeed, I have a similar problem. The BAM was sorted, but I still failed to generate out.links.txt

INFO: converting bam to juicer on hic_reads.aligned.sorted.cleaned.bam
INFO: detected bam filetype
INFO: reading file "hic_reads.aligned.sorted.cleaned.bam"
INFO: parsed 1000000 read pairs
INFO: parsed 2000000 read pairs
.........
INFO: parsed 126000000 read pairs
:) -p flag was triggered. Running with GNU Parallel support parameter set to false.
...Remapping contact data from the original contig set to assembly
:( Assembly file does not match cprops file. Exiting!
...Building track files
:( Assembly file does not match cprops file. Exiting!
...Building the hic file
temp.hifi_cleanreads.hic.hap1.p_ctg.gfa.fasta.asm_mnd.txt does not exist or does not contain any reads.

Yujiaxin419 commented 2 years ago

@zhaotao1987 Hello,

Can you check whether your .assembly file and your .bam file contain the same contigs?
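
A quick way to run this check, sketched under assumptions: compare the contig names declared in the .assembly file (the lines starting with `>`) with the names in columns 2 and 6 of the links file that matlock produced from the BAM. The column positions follow the `sort -k2,2 -k6,6` step earlier in the thread, and the function name `missing_from_assembly` is invented for this example.

```shell
# Hypothetical check: list contig names that occur in the links/mnd file
# (columns 2 and 6) but are not declared in the .assembly file.
# Arguments: <assembly file> <links or mnd file>
missing_from_assembly() {
    awk '
        NR == FNR {                        # first file: the .assembly
            if ($0 ~ /^>/) known[substr($1, 2)] = 1
            next
        }
        {                                  # second file: the links/mnd file
            if (!($2 in known)) missing[$2] = 1
            if (!($6 in known)) missing[$6] = 1
        }
        END { for (c in missing) print c }
    ' "$1" "$2" | sort
}
```

Running `missing_from_assembly out.assembly out.sorted.links.txt` prints nothing when every contig in the links file is declared in the .assembly file; any names it does print are candidates for the mismatch.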

shengxinzhuan commented 2 years ago

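@zhaotao1987 Which SAM file did you use as the raw input? I got this error because I used a BAM that had been filtered with other parameters. Just use the SAM produced directly by bwa. In short, sort the alignment right after bwa finishes and it will not fail; do not apply any other filtering parameters, because every read must be present in the SAM file to avoid the error.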

Biscuite-wzy commented 2 years ago

Hi, I ran into the same problem: "FATAL: something went wrong in process_pair". I used "sample.clean.bam", the BAM generated by the "Filtering SAM file" step of ALLHiC, to run the command matlock bam2 juicer sample.clean.bam out.links.txt. How can I resolve it?

Yujiaxin419 commented 2 years ago

Hi @Biscuite-wzy, this problem is usually caused by reads that are not properly paired. You can follow this tutorial to solve it: https://github.com/Yujiaxin419/ALLHiC/wiki/Manually-refine-ALLHiC-scaffold-assembly-through-juicebox. Hope it's helpful. Yujiaxin

Biscuite-wzy commented 2 years ago

Hi @Yujiaxin419, I used this tutorial (https://github.com/Yujiaxin419/ALLHiC/wiki/Manually-refine-ALLHiC-scaffold-assembly-through-juicebox) to get a .hic file. My command is:
nohup bash ~/3d-dna-201008/visualize/run-assembly-visualizer.sh -p false groups.assembly out.sorted.links.txt &
After running for a while, the job was reported as "Exit" rather than "Done", but there are no errors in the log file. Does that mean it finished? The contents of the log file are as follows:

:) -p flag was triggered. Running with GNU Parallel support parameter set to false.
...Remapping contact data from the original contig set to assembly
...Building track files
...Building the hic file
Not including fragment map
Start preprocess
Writing header
Writing body
..
Writing footer

Finished preprocess
HiC file version: 8

Calculating norms for zoom BP_2500000
Calculating norms for zoom BP_1000000
Calculating norms for zoom BP_500000
Calculating norms for zoom BP_250000
Calculating norms for zoom BP_100000
Calculating norms for zoom BP_50000
Calculating norms for zoom BP_25000
Calculating norms for zoom BP_10000
Calculating norms for zoom BP_5000
Calculating norms for zoom BP_1000
Writing expected
Writing norms
Finished writing norms

shengxinzhuan commented 2 years ago

@Biscuite-wzy Just use the raw output of bwa mem, sort it into a BAM file, and don't filter anything.