Open ptranvan opened 4 years ago
Hi @ptranvan For the question 1 and 2, the answer is yes. You can follow this pipeline to anchor a diploid genome (https://github.com/tangerzhang/ALLHiC/wiki/ALLHiC:-scaffolding-of-a-simple-diploid-genome). There is no need to use a related species if you are working on a diploid genome. For question 3, raw reads should be OK.
Hi @tangerzhang, I am trying to applied your pipeline:
bwa aln -t 24 draft.asm.fasta reads_R1.fastq.gz > sample_R1.sai
bwa aln -t 24 draft.asm.fasta reads_R2.fastq.gz > sample_R2.sai
bwa sampe draft.asm.fasta sample_R1.sai sample_R2.sai reads_R1.fastq.gz reads_R2.fastq.gz > sample.bwa_aln.sam
PreprocessSAMs.pl sample.bwa_aln.sam draft.asm.fasta MBOI
filterBAM_forHiC.pl sample.bwa_aln.REduced.paired_only.bam sample.clean.sam
samtools view -bt draft.asm.fasta.fai sample.clean.sam > sample.clean.bam
ALLHiC_partition -b sample.clean.bam -r draft.asm.fasta -e GATC -k 12
But got this issue at the end:
12:08:24 skipContigsWithFewREs | NOTICE Marked 619 contigs (avg 15.0 RE sites, len 9977) since they contain too few REs (MinREs = 25)
12:08:24 ReadCSVLines | NOTICE Parse csvfile `sample.clean.pairs.txt`
12:08:24 mustOpen | CRITIC open sample.clean.pairs.txt: no such file or directory
Hi @ptranvan The error is likely caused by incomplete bam file or contig assembly. Could you please double check your bam file and fasta contig file? Otherwise, you can paste several lines of your bam and fasta file. We would be happy to check these files for you. The file "sample.clean.pairs.txt" can be generated using the following command line:
allhic extract sample.clean.bam seq.fasta --RE GATC
After you have checked the bam and fasta file, you can use this command to generate "sample.clean.pairs.txt".
Here are the first lines
samtools view Tps_ALLHIC.clean.bam | head -n 3
GWNJ-0842:675:GW2004173059th:3:1101:9506:1502 113 ctg6251_racon_pilon3 17449 37 150M = 17449 0 AACTCTTTAAAAATAGATTTAGTTATTTTAACCACTCAAATTGACAATCCTTTTGATTAAAATAATTTTATTTGGTAGAACTATTTATTTGAAATAAATTATTTATTAATTTTTGTATTTACTTCTTTACCAATTTAACCGCTCAAATAN F7-777-AFAJJJFA-7---A<A-FJFF77-<FAAAF-JFFF<7--<-7--<-FFF<-JJFF--FA-FF-FA<-A-F<AAJ-JJJFF-F-<JAF-FAFJA<FJJFJJ<FJ-JJJJJJJJFJJJJJJFJJJJJJJJJJJJJ7JJJFA--A# XT:A:U
NM:i:3 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:3 XO:i:0 XG:i:0 MD:Z:2A44A101G0
GWNJ-0842:675:GW2004173059th:3:1101:9506:1502 177 ctg6251_racon_pilon3 17449 37 150M = 17449 0 AACTCTTTAAAAATAGATTTAGTTATTTTAACCACTCAAATTGACAATCCTTTTGATTAAAATAATTTTATTTGGTAGAACTATTTATTTGAAATAAATTATTTATTAATTTTTGTATTTACTTCTTTACCAATTTAACCGCTCAAATAN F7-777-AFAJJJFA-7---A<A-FJFF77-<FAAAF-JFFF<7--<-7--<-FFF<-JJFF--FA-FF-FA<-A-F<AAJ-JJJFF-F-<JAF-FAFJA<FJJFJJ<FJ-JJJJJJJJFJJJJJJFJJJJJJJJJJJJJ7JJJFA--A# XT:A:U
NM:i:3 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:3 XO:i:0 XG:i:0 MD:Z:2A44A101G0
GWNJ-0842:675:GW2004173059th:3:1101:10013:1502 113 ctg216_racon_pilon3 2122428 37 150M = 2122428 0 TATTTTGGATTTGACTATAGACCTTAATTTAAGAGTAATTTATTGATGAGTGATTTTTAAAATAGAATTATAAAAATTGGTTTTCACGATATTGTTTGAAATGCATATACCGGACTCGCTGATCCACTCCCGCTCTCACTGGTTGAGAAN FAA7-777--7<JF<JF<-A-JA7-7AA<A-<<-<FA7<FFFA7--AF7F-<FJFAA-A--<77--FFFA77-A-F-A-7JA<-A-FF<-<JFJFF<JFAJFAJF-F7<JFJAAJFFJJF<7JJAJJ<JFJJFFJ<JJJF7JFFFA<AA# XT:A:U
NM:i:3 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:3 XO:i:0 XG:i:0 MD:Z:8T15C124C0
head genome.fasta
>ctg4_racon_pilon3
TTAATTCATCGGTTTTTAGAGTCGACATTGTTTACTTTCCGTAACAGATGGCGCGTGGTT
GAATGTTGGCGCATGAGCTGATTGGTTCTTCCACGCCGATTTTCAAGGAGAAGCGTGGCC
AAAGCTGAAATTATTTCTACAAATGCAGAAATTTGTACGCAGATCGCGCTGATTTCTTGT
and the size of the files:
ll Tps_ALLHIC.clean.bam
-rw-r--r-- 1 user project_100363-pr-g 12990138177 Aug 14 19:10 Tps_ALLHIC.clean.bam
ll genome.fasta
-rw-r--r-- 1 user project_100363-pr-g 1262484257 Aug 2 16:37 genome.fasta
@ptranvan
Please confirm that you use the correct bam file when running the ALLHiC_partition
command. You used sample.clean.bam
but your input has Tps_ALLHIC.clean.bam
.
Haibao
yes sample.clean.bam
and Tps_ALLHIC.clean.bam
is the same file. @tanghaibao
draft.asm.fasta
and genome.fasta
is the same file too.
@ptranvan
During ALLHiC_partition
, the extract
command gets called (see @tangerzhang response above), which generates file sample.clean.pairs.txt
. Do you see this file in the current directory? If not, run the extract
command again and paste your output here.
Thanks, Haibao
@tanghaibao ok it's working now, I made a mistake.
Now I am this step:
ALLHiC_plot Tps_ALLHIC.clean.bam groups.agp chrn.list 500k pdf
but got this issue:
[14:16:16] Step 1: Get read position based on chromosome
[15:22:19] Step 2: Get chromosome length
[15:22:19] Step 3: Calculating and Drawing heatmap
[15:22:19] Calculating
[15:23:31] Drawing heatmap
[15:23:31] Drawing with bin size 500k
[15:23:34] Draw 500K_Whole_genome
Traceback (most recent call last):
File "ALLHiC/bin/ALLHiC_plot", line 257, in <module>
draw_heatmap(read_count_whole_genome, 'all', bin_size, ext)
File "ALLHiC/bin/ALLHiC_plot", line 194, in draw_heatmap
hmap = ax.imshow(np.log2(data), interpolation='nearest', cmap=cmap, aspect='auto')
File "/software/lib64/python2.7/site-packages/matplotlib/__init__.py", line 1869, in inner
return func(ax, *args, **kwargs)
File "/software/lib64/python2.7/site-packages/matplotlib/axes/_axes.py", line 5501, in imshow
im.set_data(X)
File "/software/lib64/python2.7/site-packages/matplotlib/image.py", line 654, in set_data
raise TypeError("Invalid dimensions for image data")
TypeError: Invalid dimensions for image data
@ptranvan, did you use the latest version of ALLHiC_plot? And can you show your chrn.list and first lines of groups.agp here? Because from the informations you post, we found that ALLHiC_plot did not get the chromosomes from your inputs.
Thanks @sc-zhang I finally managed to run everything. I got good contiguity at the end but the heatmap looks a bit weird no ?
I don't have too much experience on hi-c, but what we should see is only a signal on the diagonal no ?
Thanks for your help.
Hello,
1) I am working with a non model organism (a diploid insect), thus we don't have any Closely related 'diploid' species. Can I still use ALLHIC ?
2) My contig assembly is quite good (4000 contigs, N50: 1.6M, BUSCO:96%), it is polished, decontaminated and supposed to be haploid (I used purge haplotigs). Is there any options, or specific things I should know for ALLHIC ?
3) Do you recommend to filter Hi-C data (with trimmomatic for instance) or I can use raw reads ?
Thanks a lot for your help.