tangerzhang / ALLHiC

ALLHiC: phasing and scaffolding polyploid genomes based on Hi-C data
170 stars 39 forks source link

How to correct heatmap error? #122

Open pxxiao-hz opened 2 years ago

pxxiao-hz commented 2 years ago

Dear Professor. I found a error when evaluating with the heatmapALLHiC_plot, now I want to try flipping them, but I don't know their location and ctg id. How can I know their information and change them. Thanks. Best wishes!

Yujiaxin419 commented 2 years ago

Hello,

There is an interactive software, juicebox , can help us to correct assembly mistakes. Before assembly correction, we should generate the .hic and .assembly file. Here are some brief steps on how to generate these files:

1、Re-mapping HiC reads to contig assembly.

Dependency: chromap

# Build index
chromap -i -r contig.fa -o index
# Re-mapping
chromap --preset hic -x index -r contig.fa -1 HiC_read1.fa -2 HiC_read2.fa -o aln.pair

2、Convert .agp file to .assembly file. Dependency: juicebox_scripts

python agp2assembly.py groups.agp out.assembly

2、Convert .pair file to .hic file Dependency: 3d-dna


grep -v '#' aln.pair |awk '{if($6!="+") $6=16; else $6=0; if($7!="+") $7=16; else $7=0} \
$2<=$4{print $6, $2, $3, 0, $7, $4, $5, 1, "1 - - 1  - - -" } \
$4<$2{print $7, $4, $5, 0, $6, $2, $3, 1, "1 - - 1  - - -" }'  > out.links.txt

3d-dna

bash 3d-dna/visualize/run-assembly-visualizer.sh -p false out.assembly out.links.txt



4、How to use juicebox?
There are some tutorials below which may be useful.
- [The tutorial video for Juicebox Assembly Tools in youtube](https://www.youtube.com/watch?v=Nj7RhQZHM18)
- [Genome assembly cookbook of aidenlab](http://aidenlab.org/assembly/manual_180322.pdf)

Best wishes!
pxxiao-hz commented 2 years ago

Hi,

  1. What's the Re-mapping HiC reads to contig assembly mean? Is ‘contig assembly’ ‘groups.asm.fasta’ or '.assembly' generated by reads?
  2. I just want to modify a chromosome because my genome is too big and the runtime is estimated to be very long. So can I modify only one chromosome by the following steps?

    2.1 A bam file has been generated before running allhic. I want to extract the reads mapped to a chromosome to generate a new bam file. 2.2 Generate .assebmly grep 'read_...g1' groups.agp > chr1.agp python agp2assembly.py chr1.agp chr1.assembly 2.3 Then, matlock bam2 juicer .new.bam out.links.txt sort -k2,2 -k6,6 out.links.txt > out.sorted.links.txt bash run-assembly-visualizer.sh -p false chr1.assembly out.sorted.links.txt

Thanks.

Yujiaxin419 commented 2 years ago

Hello,

  1. It is the ‘contig assembly’ and also be called 'draft.asm.fasta'.

  2. Firstly, the runtime of this pipeline is not very long. A 3GB genome can generate .hic file in about 24h. Secondly, matlock only work with proper-pair hic reads mapping bam. So, you must filter new bam with pairtools, Before work in one chromosomes. Here is the filter steps:

    
    samtools view -h new.bam | \ 
    pairtools parse -c draft.asm.sizes -o parsed.pairsam.gz
    pairtools sort --nproc 8 -o sorted.pairsam.gz parsed.pairsam.gz
    pairtools dedup --mark-dups -o deduped.pairsam.gz sorted.pairsam.gz
    pairtools select \
    '(pair_type == "UU") or (pair_type == "UR") or (pair_type == "RU")' \
    -o filtered.pairsam.gz deduped.pairsam.gz

pairtools split --output-sam new.filtered.bam filtered.pairsam.gz



Best wishes!
Yujiaxin419 commented 2 years ago

Hello,

  1. It is the ‘contig assembly’ and also be called 'draft.asm.fasta'.
  2. Firstly, the runtime of this pipeline is not very long. A 3GB genome can generate .hic file in about 24h. Secondly, matlock only work with proper-pair hic reads mapping bam. So, you must filter new bam with pairtools, Before work in one chromosomes. Here is the filter steps:
samtools view -h new.bam | \ 
    pairtools parse -c draft.asm.sizes -o parsed.pairsam.gz
pairtools sort --nproc 8 -o sorted.pairsam.gz parsed.pairsam.gz
pairtools dedup --mark-dups -o deduped.pairsam.gz sorted.pairsam.gz
pairtools select \
    '(pair_type == "UU") or (pair_type == "UR") or (pair_type == "RU")' \
    -o filtered.pairsam.gz deduped.pairsam.gz

pairtools split --output-sam new.filtered.bam filtered.pairsam.gz

Best wishes!

in addition, you can filter bam file with python script filter_bam_unpaired.py.

samtools sort -n new.bam > new.nameSorted.bam
python filter_bam_unpaired.py new.nameSorted.bam new.filtered.bam