tangerzhang / ALLHiC

ALLHiC: phasing and scaffolding polyploid genomes based on Hi-C data

Can this process be applied to diploid plants? #5

Closed rapaJiahe closed 5 years ago

rapaJiahe commented 5 years ago

Hi, @tanghaibao and @tangerzhang ,

From the description, this pipeline seems to work well, and I am wondering whether it can also be applied to diploid plants. If so, could you give me some advice on running it?

Best, He

tangerzhang commented 5 years ago

Yes, ALLHiC can be applied to diploid plants as well. I will post the details of our best practice on simple diploid genomes once I get a chance. But, briefly, please follow the command lines below to anchor a diploid genome:

  1. mapping reads using bwa aln (same as we did in polyploidy)
  2. partition contigs into user pre-defined groups
     ALLHiC_partition -b sample.clean.bam -r draft.asm.fasta -e AAGCTT -k 16
     Note: restriction sites (-e) and number of clusters (-k) should be modified accordingly
  3. Extract CLM file and counts of restriction sites
     allhic extract sample.clean.bam draft.asm.fasta --RE AAGCTT
  4. run optimize for ordering and orientation (can be run in parallel)
     allhic optimize sample.clean.counts_AAGCTT.16g1.txt sample.clean.clm
     allhic optimize sample.clean.counts_AAGCTT.16g2.txt sample.clean.clm
     ...
     allhic optimize sample.clean.counts_AAGCTT.16g16.txt sample.clean.clm
  5. Get chromosomal level assembly
     ALLHiC_build draft.asm.fasta
  6. Heatmap Plot for assembly assessment
     (a) get group length
     perl getFaLen.pl -i groups.asm.fasta -o len.txt
     Note: script can be found here (https://github.com/tangerzhang/my_script/blob/master/getFaLen.pl)
     grep 'merge.clean.counts_GATC' len.txt > chrn.list
     Note: only keep chromosomal level assembly for plotting.
     (b) plotting
     ALLHiC_plot sample.clean.bam groups.agp chrn.list 500k pdf
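
The optimize calls in step 4 follow a regular naming pattern, so they can be generated with a loop instead of being typed out. The following is only a minimal sketch, assuming the prefix sample.clean, the restriction site AAGCTT, and k = 16 from the example commands; the output file name optimize_cmds.sh is hypothetical, and the loop only writes the commands to that file rather than running allhic itself.

```shell
#!/bin/sh
# Assumed values taken from the example commands; adjust to your own data.
PREFIX=sample.clean   # prefix of the BAM/CLM files
RE=AAGCTT             # restriction site passed to --RE / -e
K=16                  # number of groups passed to -k

# Write one "allhic optimize" command per group to a file (dry run only).
: > optimize_cmds.sh
i=1
while [ "$i" -le "$K" ]; do
  echo "allhic optimize ${PREFIX}.counts_${RE}.${K}g${i}.txt ${PREFIX}.clm" >> optimize_cmds.sh
  i=$((i + 1))
done
```

The generated file can then be run serially with sh optimize_cmds.sh, or its lines can be handed to whatever job scheduler is available, since each group is optimized independently.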

rapaJiahe commented 5 years ago

many thanks.

He

Epigenetics-Wang commented 5 years ago

Hi, @tanghaibao and @tangerzhang,

"Heatmap Plot for assembly assessment
(a) get group length
perl getFaLen.pl -i groups.asm.fasta -o len.txt
Note: script can be found here (https://github.com/tangerzhang/my_script/blob/master/getFaLen.pl)
grep 'merge.clean.counts_GATC' len.txt > chrn.list
Note: only keep chromosomal level assembly for plotting."

I want to generate chrn.list for ALLHiC_plot, but the resulting chrn.list is empty. How can I grep "merge.clean.counts_GATC" from a file that consists only of FASTA sequence names and sequence lengths? Could you give me some advice? Thanks.

Best, Sincerely!

tangerzhang commented 5 years ago

Could you please share a couple of lines in the len.txt file?

Epigenetics-Wang commented 5 years ago

I used getFaLen.pl -i groups.asm.fasta -o len.txt to generate the file; its format is shown below. I don't know how to solve this problem. Thanks.

group7 126768508
group8 118096512
group9 117541027
004194F|arrow|pilon 3231
002396F|arrow|pilon 301128
007445F|arrow|pilon 1462
006502F|arrow|pilon 20216
007368F|arrow|pilon 3852
007215F|arrow|pilon 7274
007091F|arrow|pilon 9688
006603F|arrow|pilon 18135

Best, Sincerely!

tangerzhang commented 5 years ago

You can use the following command line:
grep 'group' len.txt > chrn.list
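
To see what this filter does, here is a small self-contained sketch that rebuilds a few of the len.txt lines posted above and applies the same idea. Two things are assumptions rather than confirmed behavior: that getFaLen.pl separates name and length with a tab, and that anchoring the pattern with ^ is wanted so that only names starting with "group" are kept.

```shell
#!/bin/sh
# Recreate a few lines of len.txt from the thread (tab separator is an assumption).
printf '%s\t%s\n' \
  group7 126768508 \
  group8 118096512 \
  group9 117541027 \
  '004194F|arrow|pilon' 3231 \
  '002396F|arrow|pilon' 301128 > len.txt

# Keep only the chromosome-level groups; unanchored contigs are dropped.
grep '^group' len.txt > chrn.list
cat chrn.list
```

With this input, chrn.list should contain only the three group lines, which is the list ALLHiC_plot is given in step 6(b).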

Epigenetics-Wang commented 5 years ago

Thank you very much, @tangerzhang! I will try it and let you know the final results.

Best, Sincerely!

Epigenetics-Wang commented 5 years ago

Hi, @tangerzhang, the code works well! Thank you very much! I finally got the 500k PDF, but it does not look very clear; there is too much background noise. If possible, could you give me some advice on improving it?
500K_Whole_genome.pdf

tangerzhang commented 5 years ago

The result looks good! The reason it does not look very clear is possibly low sequencing depth or a low rate of valid reads. Increasing the coverage should solve this problem.

Epigenetics-Wang commented 5 years ago

Thank you, @tangerzhang. Recently I found something pretty strange: I counted the lengths of the sequences in the final assembly FASTA file using a script, and as the chromosome number increases, the length becomes smaller. I am not sure whether this is correct. Could you give me some advice? Thanks.
groups-length2.txt

tangerzhang commented 5 years ago

I'm not quite sure I understand. Did you mean that when you increase the k value in the partition step, you get more groups and the length of each group decreases?

Epigenetics-Wang commented 5 years ago

Hi, @tangerzhang, I mean that after finishing all the ALLHiC steps, I got the final assembly file groups.asm.fasta and the file groups.agp, which describes the scaffolds. The material has a chromosomal background of 2n = 48 (AABB), so I ended up with 24 "chromosome" sequences. I counted the lengths of those sequences and found that as the chromosome ID number increases, the length becomes smaller. The 24th sequence is only 182 kb. I am not sure whether this is correct. Could you give me some advice? Thanks.
groups-length2.txt

tangerzhang commented 5 years ago

Group23 and group24 are too small, which is not normal. Hi-C technology is not good at partitioning contigs. If you have genetic maps, you can cluster contigs based on linkage groups and then order the contigs within each group. Alternatively, you may try to correct the mis-joined contigs using 3D-DNA and then scaffold the corrected contigs using ALLHiC.
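
A quick way to spot such undersized groups is to compare each group's length with the largest one. The sketch below is only illustrative: the file name example_chrn.list and all lengths except the 182 kb of group24 are made-up placeholders, and the 10% cutoff is an arbitrary assumption, not an ALLHiC recommendation.

```shell
#!/bin/sh
# Hypothetical group lengths; only group24's 182 kb comes from the thread.
printf '%s\t%s\n' \
  group1 150000000 \
  group2 120000000 \
  group23 900000 \
  group24 182000 > example_chrn.list

# Report every group shorter than frac * (length of the largest group).
awk -v frac=0.1 '
  BEGIN { max = 0 }
  { name[NR] = $1; len[NR] = $2 + 0; if (len[NR] > max) max = len[NR] }
  END {
    for (i = 1; i <= NR; i++)
      if (len[i] < frac * max) print name[i], len[i]
  }
' example_chrn.list > small_groups.txt
cat small_groups.txt
```

Groups listed in small_groups.txt are candidates for a closer look, for example by re-clustering with a genetic map or correcting mis-joined contigs before scaffolding, as suggested above.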

baozg commented 5 years ago

Hi, @tangerzhang

If I use 3D-DNA or SALSA2 to correct the mis-joined contigs, should I align the Hi-C reads using bwa mem? Both tools suggest bwa mem for PE 150 reads. Could the ALLHiC mapping pipeline be changed to use bwa mem?

Thanks.

tangerzhang commented 5 years ago

Hi baozg, only filterBAM_forHiC.pl requires bwa aln. You can skip this script if you would like to use bwa mem.

cjchen5 commented 5 years ago

Hi, @tangerzhang, it seems that in this case we don't need Allele.ctg.table, right? Thanks!

tangerzhang commented 5 years ago

You are right. For scaffolding a simple diploid genome, we do not need Allele.ctg.table. Please check the pipeline for scaffolding a diploid genome (https://github.com/tangerzhang/ALLHiC/wiki/ALLHiC:-scaffolding-of-a-simple-diploid-genome).