tangerzhang / ALLHiC

ALLHiC: phasing and scaffolding polyploid genomes based on Hi-C data
174 stars 39 forks source link

Non model organism and trim reads ? #56

Open ptranvan opened 4 years ago

ptranvan commented 4 years ago

Hello,

1) I am working with a non model organism (a diploid insect), thus we don't have any Closely related 'diploid' species. Can I still use ALLHIC ?

2) My contig assembly is quite good (4000 contigs, N50: 1.6M, BUSCO:96%), it is polished, decontaminated and supposed to be haploid (I used purge haplotigs). Is there any options, or specific things I should know for ALLHIC ?

3) Do you recommend to filter Hi-C data (with trimmomatic for instance) or I can use raw reads ?

Thanks a lot for your help.

tangerzhang commented 4 years ago

Hi @ptranvan For the question 1 and 2, the answer is yes. You can follow this pipeline to anchor a diploid genome (https://github.com/tangerzhang/ALLHiC/wiki/ALLHiC:-scaffolding-of-a-simple-diploid-genome). There is no need to use a related species if you are working on a diploid genome. For question 3, raw reads should be OK.

ptranvan commented 4 years ago

Hi @tangerzhang, I am trying to applied your pipeline:

bwa aln -t 24 draft.asm.fasta reads_R1.fastq.gz > sample_R1.sai  
bwa aln -t 24 draft.asm.fasta reads_R2.fastq.gz > sample_R2.sai  
bwa sampe draft.asm.fasta sample_R1.sai sample_R2.sai reads_R1.fastq.gz reads_R2.fastq.gz > sample.bwa_aln.sam  

PreprocessSAMs.pl sample.bwa_aln.sam draft.asm.fasta MBOI
filterBAM_forHiC.pl sample.bwa_aln.REduced.paired_only.bam sample.clean.sam
samtools view -bt draft.asm.fasta.fai sample.clean.sam > sample.clean.bam

ALLHiC_partition -b sample.clean.bam -r draft.asm.fasta -e GATC -k 12

But got this issue at the end:

12:08:24 skipContigsWithFewREs | NOTICE  Marked 619 contigs (avg 15.0 RE sites, len 9977) since they contain too few REs (MinREs = 25)
12:08:24 ReadCSVLines | NOTICE  Parse csvfile `sample.clean.pairs.txt`
12:08:24 mustOpen | CRITIC  open sample.clean.pairs.txt: no such file or directory
tangerzhang commented 4 years ago

Hi @ptranvan The error is likely caused by incomplete bam file or contig assembly. Could you please double check your bam file and fasta contig file? Otherwise, you can paste several lines of your bam and fasta file. We would be happy to check these files for you. The file "sample.clean.pairs.txt" can be generated using the following command line:

allhic extract sample.clean.bam seq.fasta --RE GATC

After you have checked the bam and fasta file, you can use this command to generate "sample.clean.pairs.txt".

ptranvan commented 4 years ago

Here are the first lines

samtools view Tps_ALLHIC.clean.bam | head -n 3

GWNJ-0842:675:GW2004173059th:3:1101:9506:1502   113     ctg6251_racon_pilon3    17449   37      150M    =       17449   0       AACTCTTTAAAAATAGATTTAGTTATTTTAACCACTCAAATTGACAATCCTTTTGATTAAAATAATTTTATTTGGTAGAACTATTTATTTGAAATAAATTATTTATTAATTTTTGTATTTACTTCTTTACCAATTTAACCGCTCAAATAN        F7-777-AFAJJJFA-7---A<A-FJFF77-<FAAAF-JFFF<7--<-7--<-FFF<-JJFF--FA-FF-FA<-A-F<AAJ-JJJFF-F-<JAF-FAFJA<FJJFJJ<FJ-JJJJJJJJFJJJJJJFJJJJJJJJJJJJJ7JJJFA--A#        XT:A:U
        NM:i:3  SM:i:37 AM:i:37 X0:i:1  X1:i:0  XM:i:3  XO:i:0  XG:i:0  MD:Z:2A44A101G0
GWNJ-0842:675:GW2004173059th:3:1101:9506:1502   177     ctg6251_racon_pilon3    17449   37      150M    =       17449   0       AACTCTTTAAAAATAGATTTAGTTATTTTAACCACTCAAATTGACAATCCTTTTGATTAAAATAATTTTATTTGGTAGAACTATTTATTTGAAATAAATTATTTATTAATTTTTGTATTTACTTCTTTACCAATTTAACCGCTCAAATAN        F7-777-AFAJJJFA-7---A<A-FJFF77-<FAAAF-JFFF<7--<-7--<-FFF<-JJFF--FA-FF-FA<-A-F<AAJ-JJJFF-F-<JAF-FAFJA<FJJFJJ<FJ-JJJJJJJJFJJJJJJFJJJJJJJJJJJJJ7JJJFA--A#        XT:A:U
        NM:i:3  SM:i:37 AM:i:37 X0:i:1  X1:i:0  XM:i:3  XO:i:0  XG:i:0  MD:Z:2A44A101G0
GWNJ-0842:675:GW2004173059th:3:1101:10013:1502  113     ctg216_racon_pilon3     2122428 37      150M    =       2122428 0       TATTTTGGATTTGACTATAGACCTTAATTTAAGAGTAATTTATTGATGAGTGATTTTTAAAATAGAATTATAAAAATTGGTTTTCACGATATTGTTTGAAATGCATATACCGGACTCGCTGATCCACTCCCGCTCTCACTGGTTGAGAAN        FAA7-777--7<JF<JF<-A-JA7-7AA<A-<<-<FA7<FFFA7--AF7F-<FJFAA-A--<77--FFFA77-A-F-A-7JA<-A-FF<-<JFJFF<JFAJFAJF-F7<JFJAAJFFJJF<7JJAJJ<JFJJFFJ<JJJF7JFFFA<AA#        XT:A:U
        NM:i:3  SM:i:37 AM:i:37 X0:i:1  X1:i:0  XM:i:3  XO:i:0  XG:i:0  MD:Z:8T15C124C0
head genome.fasta

>ctg4_racon_pilon3
TTAATTCATCGGTTTTTAGAGTCGACATTGTTTACTTTCCGTAACAGATGGCGCGTGGTT
GAATGTTGGCGCATGAGCTGATTGGTTCTTCCACGCCGATTTTCAAGGAGAAGCGTGGCC
AAAGCTGAAATTATTTCTACAAATGCAGAAATTTGTACGCAGATCGCGCTGATTTCTTGT

and the size of the files:

ll Tps_ALLHIC.clean.bam
-rw-r--r-- 1 user project_100363-pr-g 12990138177 Aug 14 19:10 Tps_ALLHIC.clean.bam

ll genome.fasta
-rw-r--r-- 1 user project_100363-pr-g 1262484257 Aug  2 16:37 genome.fasta
tanghaibao commented 4 years ago

@ptranvan

Please confirm that you use the correct bam file when running the ALLHiC_partition command. You used sample.clean.bam but your input has Tps_ALLHIC.clean.bam.

Haibao

ptranvan commented 4 years ago

yes sample.clean.bamand Tps_ALLHIC.clean.bam is the same file. @tanghaibao

draft.asm.fasta and genome.fasta is the same file too.

tanghaibao commented 4 years ago

@ptranvan

During ALLHiC_partition, the extract command gets called (see @tangerzhang response above), which generates file sample.clean.pairs.txt. Do you see this file in the current directory? If not, run the extract command again and paste your output here.

Thanks, Haibao

ptranvan commented 4 years ago

@tanghaibao ok it's working now, I made a mistake.

Now I am this step:

ALLHiC_plot Tps_ALLHIC.clean.bam groups.agp chrn.list 500k pdf

but got this issue:

[14:16:16] Step 1: Get read position based on chromosome
[15:22:19] Step 2: Get chromosome length
[15:22:19] Step 3: Calculating and Drawing heatmap
[15:22:19] Calculating
[15:23:31] Drawing heatmap
[15:23:31] Drawing with bin size 500k
[15:23:34] Draw 500K_Whole_genome
Traceback (most recent call last):
  File "ALLHiC/bin/ALLHiC_plot", line 257, in <module>
    draw_heatmap(read_count_whole_genome, 'all', bin_size, ext)
  File "ALLHiC/bin/ALLHiC_plot", line 194, in draw_heatmap
    hmap = ax.imshow(np.log2(data), interpolation='nearest', cmap=cmap, aspect='auto')
  File "/software/lib64/python2.7/site-packages/matplotlib/__init__.py", line 1869, in inner
    return func(ax, *args, **kwargs)
  File "/software/lib64/python2.7/site-packages/matplotlib/axes/_axes.py", line 5501, in imshow
    im.set_data(X)
  File "/software/lib64/python2.7/site-packages/matplotlib/image.py", line 654, in set_data
    raise TypeError("Invalid dimensions for image data")
TypeError: Invalid dimensions for image data
sc-zhang commented 4 years ago

@ptranvan, did you use the latest version of ALLHiC_plot? And can you show your chrn.list and first lines of groups.agp here? Because from the informations you post, we found that ALLHiC_plot did not get the chromosomes from your inputs.

ptranvan commented 4 years ago

Thanks @sc-zhang I finally managed to run everything. I got good contiguity at the end but the heatmap looks a bit weird no ?

I don't have too much experience on hi-c, but what we should see is only a signal on the diagonal no ?

Thanks for your help.

500K_Whole_genome.pdf