tangerzhang / ALLHiC

ALLHiC: phasing and scaffolding polyploid genomes based on Hi-C data

scaffolds misjoin detect #8

Closed rapaJiahe closed 5 years ago

rapaJiahe commented 5 years ago

Hi, @tanghaibao and @tangerzhang ,

My scaffolds have many misjoins. I know you have a lot of experience processing Hi-C data. Do you have any suggestions for correcting these errors before running ALLHiC?

best, he

tangerzhang commented 5 years ago

Hi @rapaJiahe, it would be better to run the 3D-DNA pipeline first to correct these misjoined contigs/scaffolds.

rapaJiahe commented 5 years ago

many thanks.

rahulvrane commented 5 years ago

Hi @tangerzhang

I have a similar problem, but after ALLHiC. I have a diploid organism with 6 + sex chromosomes and I ran partitioning with 7 clusters followed by extract, optimise and build (following your suggestion in https://github.com/tangerzhang/ALLHiC/issues/5). I have ended up with 7 large segments after build, 1 of which is clearly a mix of 3 chromosomes based on GBS data. I have 3 questions as such.

  1. How do you decide cluster number parameter for partition (-k)?
  2. Are you suggesting we run juicer + 3d-dna on the output from 'ALLHiC_build' ?
  3. Is pruning necessary for this?

Thanks a ton!

Regards R

tangerzhang commented 5 years ago

Hi Rahul,

Here are my answers to your questions:

  2. Are you suggesting we run juicer + 3d-dna on the output from 'ALLHiC_build' ?

Response: No. I am suggesting running juicer + 3d-dna before running ALLHiC. You can use the corrected contigs from 3d-dna as the input of ALLHiC.

  3. Is pruning necessary for this?

Response: No. Pruning is not necessary in your case.

  1. How do you decide cluster number parameter for partition (-k)?

Response: We choose the parameter k based on the number of chromosomes, or the number of groups we would like to partition the contigs into. In our experience, Hi-C signal alone is not good at partitioning contigs and often creates large groups that mix several chromosomes. I have some suggestions to solve this problem:

(1) Use 3D-DNA to correct mis-joined contigs and then run ALLHiC as you did before (recommended).

(2) Try different k values in the partition step of ALLHiC, especially on the chimeric chromosome you mentioned (not recommended, but you can try).

(3) If you have a genetic map, you can cluster contigs based on the genetic map before running ALLHiC. After that, you can use ALLHiC (rescue, optimize and build) to anchor more contigs into a chromosome-level assembly (recommended).

Hope this is useful!
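Suggestion (3) above leaves open how contigs are assigned to groups from a genetic map. A minimal sketch in Python, assuming each mapped marker gives one (contig, linkage group) pair and that a simple majority vote is acceptable. The function name, the input shape and the tie handling are all hypothetical and are not part of ALLHiC:

```python
from collections import Counter, defaultdict


def assign_contigs_to_groups(marker_hits):
    """Assign each contig to the linkage group supported by most of its markers.

    marker_hits: iterable of (contig, linkage_group) pairs, one per mapped marker.
    Contigs whose top two linkage groups are tied are left unassigned.
    (Majority voting is an assumption of this sketch, not ALLHiC's method.)
    """
    votes = defaultdict(Counter)
    for contig, linkage_group in marker_hits:
        votes[contig][linkage_group] += 1

    assignment = {}
    for contig, counts in votes.items():
        ranked = counts.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            continue  # ambiguous contig: skip it rather than guess
        assignment[contig] = ranked[0][0]
    return assignment


# Hypothetical marker placements, for illustration only.
hits = [
    ("tig1", "LG1"), ("tig1", "LG1"), ("tig1", "LG2"),
    ("tig2", "LG2"),
    ("tig3", "LG1"), ("tig3", "LG2"),
]
groups = assign_contigs_to_groups(hits)
```

Contigs left unassigned here (ties, or contigs without markers) are the ones that the rescue step would then be expected to place.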

rahulvrane commented 5 years ago

Thanks for your response @tangerzhang

I have gathered enough evidence to go with (3). However, I am struggling with the structure of the cluster file.

Could you please explain the format of the 'clusters' file used with the ALLHiC rescue -c option?

Thanks Regards Rahul

tangerzhang commented 5 years ago

Hi Rahul, the cluster file has three columns, as follows:

  1. group name
  2. number of contigs in this group
  3. contig names in each group separated by a blank space

image
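The three columns described above can be sketched as a small Python helper that turns a contig-to-group mapping into cluster file text. This is an illustration only: the tab separator between columns and the `#Group nContigs Contigs` header line are assumptions (the thread only shows the layout in an image), and `format_clusters` is a hypothetical name.

```python
from collections import defaultdict


def format_clusters(assignment):
    """Render a {contig: group} mapping as three-column cluster text.

    Columns: group name, number of contigs, contig names separated by spaces.
    The header line and the tab between columns are assumptions of this sketch.
    """
    members = defaultdict(list)
    for contig, group in assignment.items():
        members[group].append(contig)

    lines = ["#Group\tnContigs\tContigs"]
    for group in sorted(members):
        contigs = sorted(members[group])
        lines.append(f"{group}\t{len(contigs)}\t{' '.join(contigs)}")
    return "\n".join(lines) + "\n"


# Hypothetical assignment, for illustration only.
text = format_clusters({"tig1": "group1", "tig2": "group1", "tig3": "group2"})
print(text, end="")
```

It is worth comparing the result against a clusters file written by the partition step on your own data before passing it to rescue.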

xuzhougeng commented 5 years ago

Hi @tangerzhang

Thanks for your suggestion of "(1) Use 3D-DNA to correct mis-joined contigs and then run ALLHiC as you did before (recommended)." ,

I have a question about it: which output of the 3D-DNA pipeline should I use? Should I use rawchrom.fasta as the input of ALLHiC?

tangerzhang commented 5 years ago

Hi @xuzhougeng, I shared a Perl script (https://github.com/tangerzhang/ALLHiC/blob/master/scripts/release3DDNA.pl) that can be used to process the output of 3D-DNA. After 3D-DNA finishes, you will get a file named like "*.FINAL.fasta" (e.g. seq.FINAL.fasta). My script uses this file to trace, record and rename the corrected contigs generated by 3D-DNA. You can use "tig.HiCcorrected.fasta" as the input of ALLHiC.

xuzhougeng commented 5 years ago

Hi @tangerzhang

ALLHiC works very well for my species. Thanks for your kind help.

image

But some scaffolds are still wrongly connected. Should I manually split them in groups.agp and build the final scaffolds, or try a different k in the partition step?

tangerzhang commented 5 years ago

Hi @xuzhougeng, most of the scaffolds are well organized in your case. Partitioning with different k values will not help you much. I would suggest manually correcting the mis-clustered groups and any large rearrangements that you do not trust. If you have genetic maps, you can integrate the genetic maps and Hi-C maps using the ALLMAPS program to get a more accurate assembly.

wyim-pgl commented 4 years ago

@tangerzhang How can I convert ALLHIC result to ALLMAPS input?

tangerzhang commented 4 years ago

Please use this script to convert ALLHiC result to ALLMAPS input: https://github.com/tangerzhang/ALLHiC/blob/master/scripts/ALLHiC2ALLMAPS.pl

shehongbing commented 4 years ago

Hi,

(1) Does this mean that after running Juicer and 3D-DNA I get seq.FINAL.fasta, then run the command (if k=6): perl ALLHiC2ALLMAPS.pl 6 seq.FINAL.fasta, and after that use ALLHiC rescue, optimize and build?

(2) Another question: if I have a genetic map, how can I use ALLMAPS? I don't know the command.

Thanks

tangerzhang commented 4 years ago

Hi @shehongbing, it seems that you are asking quite a different question. The script ALLHiC2ALLMAPS.pl converts a Hi-C map into the input format for ALLMAPS. ALLMAPS is another program, also developed by our team, that can incorporate multiple maps (e.g. genetic maps, Hi-C maps and synteny maps) into a chromosome-level assembly (https://github.com/tanghaibao/jcvi/wiki/ALLMAPS).

I assume that you are looking for another script (https://github.com/tangerzhang/ALLHiC/blob/master/scripts/release3DDNA.pl). This script is not officially supported. I use it for two functions: 1) renaming the Hi-C corrected contigs (file: tigHiCcorrected.fasta) and 2) extracting the top N scaffolds from the 3D-DNA output and treating them as chromosomes (for example, N=6 if the species has 6 pairs of chromosomes). If the chromosome lengths are as expected, I stop here and use the 3D-DNA results as final. Otherwise, I map the Hi-C reads to the 3D-DNA corrected contigs (tigHiCcorrected.fasta) and run the ALLHiC pipeline. These steps are my best practice for diploid genomes.

For the second question, you can use ALLMAPS to incorporate genetic maps and the Hi-C map. The commands for ALLMAPS can be found at the link above.
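To check whether the top N scaffolds have the expected chromosome lengths, something like the following Python sketch could be used. It is not release3DDNA.pl and does not reproduce its renaming or bookkeeping. `read_fasta` and `top_n_longest` are hypothetical helper names, and the tiny in-memory FASTA stands in for a file such as seq.FINAL.fasta.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the sequence name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def top_n_longest(records, n):
    """Return the n longest (name, sequence) records, longest first."""
    return sorted(records, key=lambda rec: len(rec[1]), reverse=True)[:n]


# Toy input for illustration; with real data, pass an open file handle instead.
toy = io.StringIO(">s1\nACGT\n>s2\nACGTACGT\nAC\n>s3\nA\n")
longest = top_n_longest(read_fasta(toy), 2)
for name, seq in longest:
    print(name, len(seq))
```

With N set to the haploid chromosome number, the printed lengths can be compared with the expected chromosome sizes before deciding whether to stop at the 3D-DNA result.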

leeun67 commented 4 years ago

Dear author, from the genome survey I know that my species' genome is about 480 Mb, with a heterozygosity rate of 1.2% and duplication of 59%. Could you please recommend a method for removing the heterozygous sequences?

tangerzhang commented 4 years ago

Hi @leeun67 There are a couple of programs that can be used to remove heterozygous sequences. You can try purge_haplotigs, which is a read-depth based approach, or our recently developed Khaper program, which is a Kmer-counting based method. https://bitbucket.org/mroachawri/purge_haplotigs/src/master/ https://github.com/lardo/khaper

leeun67 commented 4 years ago

Regarding the image in @xuzhougeng's comment above ("ALLHiC works very well in my species..."): could you tell me the method used to make the graph?

HeQSun commented 4 years ago

Hi @tangerzhang, thanks for suggesting purge_haplotigs and Khaper to @leeun67 above, but I don't fully understand that reply.

I suppose @leeun67 is assembling a heterozygous diploid genome, and there would be two haploid genomes expected to be assembled (however, if he is assembling a mixed haploid genome of two haplotypes, I can see purging would help here). Then, how does purge_haplotigs help here? I see it would group a set of contigs as primary contigs, and the remaining as haplotigs. Now we have two groups of contigs (and each group "behaves" as a haploid set), but this does not mean that in each group, all contigs belong to the same haplotype -- so we still need phasing (and scaffolding later).

How does ALLHiC perform phasing with the purged version of assembly, where (theoretically) a half of haplotypes are missing?

Maybe I misunderstood your reply, but I am looking forward to your further explanation.

Thank you very much in advance!

Best, Hequan

tangerzhang commented 4 years ago

Hi @HeQSun, you cannot phase the two haplotypes after using purge_haplotigs. purge_haplotigs is used to remove heterozygous sequences.

leeun67 commented 4 years ago

Hi @tangerzhang. After purge_haplotigs and 3D-DNA, I ran ALLHiC and skipped the prune step for my highly heterozygous species. In the end, one scaffold is still too large, but the result is better than before. So:

(1) Would you recommend extracting the scaffolds that look good in 3D-DNA and then putting the rest of the contigs into ALLHiC?

(2) How can I manually correct the ALLHiC results?

(3) The purge_haplotigs results include a 450 MB x.fasta file and a 70 MB x.haplotigs.fasta file. Is x.fasta the file I should put into ALLHiC?

tangerzhang commented 4 years ago

Hi @leeun67, for the second question, you can use Juicebox Assembly Tools to correct the assembly results. I have not used it yet, but it looks like a wonderful tool for adjusting Hi-C scaffolds. For the third question, the 450 Mb fasta file contains the primary contigs and is the file to use as the input of ALLHiC. Sorry, I do not quite understand your first question.

leeun67 commented 4 years ago

Hi, dear author, thanks for your answer. (1) My first question is: if n scaffolds from 3D-DNA look good, can I extract those scaffolds as chromosomes and put the remaining contigs into ALLHiC, with k set to the total number of chromosomes minus n?

(2) After purge_haplotigs, can I skip the prune step?

ptranvan commented 4 years ago

Hi @tangerzhang,

I have the same issue and will try what you recommend. I understand that I have to use tig.HiCcorrected.fasta, but what are the files chr.fasta and groups.asm.fasta?

tangerzhang commented 4 years ago

Hi @ptranvan, chr.fasta and groups.asm.fasta should be the same as seq.FINAL.fasta, which contains the scaffolds produced by 3D-DNA. In the script release3DDNA.pl, I extract the N longest sequences (N is the chromosome number defined by the user) from seq.FINAL.fasta as chromosomes (chr.fasta) and then split the chromosomes into contigs (tig.HiCcorrected.fasta). Meanwhile, I record the ordering and orientation of these corrected contigs in *.tour files. After that, ALLHiC_build is used to generate groups.agp.
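The step of splitting chromosomes back into contigs can be pictured with a short Python sketch. It assumes that contigs inside a scaffold are separated by runs of N and that cutting at those runs recovers them. Both the gap convention and the `_part` naming are assumptions made for illustration. The actual script also renames contigs and records *.tour files, which this sketch does not attempt.

```python
import re


def split_at_gaps(name, sequence, min_gap=1):
    """Split a scaffold into pieces at runs of at least min_gap N characters.

    Returns a list of (piece_name, piece_sequence) in scaffold order.
    The naming scheme <name>_part<i> is hypothetical.
    """
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    pieces = [piece for piece in gap.split(sequence) if piece]
    return [(f"{name}_part{i}", piece) for i, piece in enumerate(pieces, start=1)]


# Toy scaffold for illustration only.
parts = split_at_gaps("chr1", "ACGTNNNNGGCCNNTT", min_gap=2)
for part_name, part_seq in parts:
    print(part_name, len(part_seq))
```

Raising `min_gap` keeps short runs of ambiguous bases inside a contig instead of cutting there.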

baishengjun commented 4 years ago

Hi @tangerzhang, I used release3DDNA.pl to generate tig.HiCcorrected.fasta, and it contains some very small contigs. Does this have any effect? The contig names and lengths are:

tig0000001 464605
tig0000002 682812
tig0000003 9
tig0000004 675361
tig0000005 521686
tig0000006 28092
tig0000007 279562
tig0000008 5248
tig0000009 277551
tig0000010 370199
tig0000011 2549853
tig0000012 31419
tig0000013 158136
tig0000014 477158
tig0000015 85271
tig0000016 77000
tig0000017 159330
tig0000018 20052
tig0000019 189254
tig0000020 52732
tig0000021 212013
tig0000022 25000
tig0000023 1020617
tig0000024 82000
tig0000025 15415
tig0000026 2030584
tig0000027 69000
tig0000028 37000
tig0000029 912137
tig0000030 47595
tig0000031 684795
tig0000032 320472
tig0000033 8
tig0000034 8
tig0000035 1
tig0000036 4
tig0000037 3
tig0000038 1866099
tig0000039 305637
tig0000040 218280
tig0000041 133430
tig0000042 16

tangerzhang commented 4 years ago

Hi @baishengjun, this means that your initial genome assembly likely contains many chimeric errors, and 3D-DNA broke up these misjoined contigs. You can filter out the contigs shorter than 2 kb. These small contigs will not affect the ALLHiC results, because short contigs do not have enough restriction sites and thus will not be included in Hi-C scaffolding.
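The length filter suggested above can be sketched in Python as follows. The contig names and lengths are taken from the list posted in this thread. Treating "2 kb" as 2000 bp is an assumption, and `filter_short` is a hypothetical helper name.

```python
def filter_short(lengths, min_length=2000):
    """Keep only contigs whose length is at least min_length bp.

    lengths: {contig_name: length_in_bp}
    min_length=2000 reads "2 kb" as 2000 bp, which is an assumption.
    """
    return {name: size for name, size in lengths.items() if size >= min_length}


# A few of the contigs listed above.
lengths = {
    "tig0000001": 464605,
    "tig0000003": 9,
    "tig0000008": 5248,
    "tig0000033": 8,
    "tig0000042": 16,
}
kept = filter_short(lengths)
print(sorted(kept))
```

The same predicate can be applied to the sequence records themselves when writing a filtered FASTA file for mapping and scaffolding.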