tangerzhang / ALLHiC

ALLHiC: phasing and scaffolding polyploid genomes based on Hi-C data

Question about pruning step and input/output files #94

Open mentorlg opened 3 years ago

mentorlg commented 3 years ago

Hi, I have some questions about the pruning step (how heterozygous alleles are partitioned) and the input/output files.

  1. What is the difference between collapsed regions and chimeric contigs in the pruning step?

  2. What exactly is the input file? If the monoploid (unphased) target genome size is 10 Gb, do I have to provide 20 Gb of assembled contigs so that the heterozygous information is retained? If not, the assembler (Canu, FALCON, etc.) may already have removed heterozygous bubbles while building its graph. In that case, how can the alleles of each haplotype be partitioned without heterozygous regions? Or is there a specific contig assembler for ALLHiC?

  3. Can I use a scaffold-level genome assembly of a closely related species as the reference? I read that a chromosome-level genome is recommended as the reference for constructing the allelic contig table, but I was wondering whether a scaffold-level genome can be used as well.

  4. What are the output files? Can I get a separate assembled file for each haplotype, e.g. four assembled FASTA files for an autotetraploid?

Sincerely, Amelia

tangerzhang commented 3 years ago

Hi Amelia, below are my responses:

> 1. What is the difference between collapsed regions and chimeric contigs in the pruning step?

In polyploid genomes, some homologous regions (i.e., allelic sequences) are highly similar. These sequences are frequently assembled into a single contig because assemblers cannot separate the alleles. Such regions are collapsed regions. Chimeric contigs, on the other hand, contain sequences from different haplotypes or from non-homologous chromosomes. ALLHiC is designed to minimize the negative influence of collapsed contigs: our simulation data showed that ALLHiC can tolerate ~20% collapsed contigs, but only ~5% chimeric contigs.

> 2. What exactly is the input file? If the monoploid (unphased) target genome size is 10 Gb, do I have to provide 20 Gb of assembled contigs so that the heterozygous information is retained?

If you are assembling a haplotype-phased genome, your input contigs should total 20 Gb, twice the genome size (1C) estimated by genome survey or flow cytometry. However, if you would like to generate a monoploid genome, 10 Gb should be enough. In that case, I would recommend the wrapper script for monoploid genome scaffolding (https://github.com/tangerzhang/ALLHiC/blob/master/bin/ALLHiC_pip.sh), which includes correction of misjoined contigs, partition, scaffolding and building.

> If not, the assembler (Canu, FALCON, etc.) may already have removed heterozygous bubbles while building its graph.

Yes. If you are assembling a monoploid genome, the heterozygous sequences should be removed before Hi-C scaffolding. Purge Haplotigs is recommended for a genome that large.

> In that case, how can the alleles of each haplotype be partitioned without heterozygous regions?

I do not think that ALLHiC can partition the alleles of each haplotype without heterozygous regions. However, there might be an alternative strategy for constructing a phased assembly if you only have a monoploid genome assembly. If I remember correctly, the recently released DipAsm and 3D-DNA phasing pipelines first identify and phase SNPs based on PacBio long reads or Hi-C/10x Genomics linked reads. These programs then partition the reads into different haplotypes and generate a chromosome-scale assembly for each haplotype individually.

> Or is there a specific contig assembler for ALLHiC?

Unfortunately, there is no specific contig assembler for ALLHiC.

> 3. Can I use a scaffold-level genome assembly of a closely related species as the reference? I read that a chromosome-level genome is recommended as the reference for constructing the allelic contig table, but I was wondering whether a scaffold-level genome can be used as well.

It is OK to use a scaffold-level genome to construct the allelic contig table. However, the allelic contig table is also used to partition the homologous groups. If the scaffolds are too fragmented, linking them into different haplotypes becomes more complex.

> 4. What are the output files? Can I get a separate assembled file for each haplotype, e.g. four assembled FASTA files for an autotetraploid?

That is what we designed ALLHiC to do.

mentorlg commented 3 years ago

Thank you for your fast reply! There were some misunderstandings about my questions, though, because of my mistakes TT...

First, I want to assemble an autopolyploid genome, not a monoploid one. For question 2, I understand that I have to use contigs totalling twice the estimated genome size (1C). But the assembler (Canu, FALCON, etc.) may already have removed some, though not all, heterozygous bubbles while building its graph. Is it then possible to construct a fully phased genome assembly using contigs from Canu or FALCON?

Also, is there a contig assembler you would recommend? (Not one specific to ALLHiC, just your recommendation.) For ALLHiC, I think the contigs should retain as many heterozygous regions as possible rather than losing them during construction of the assembly graph.

Last, for question 4: do you mean that I can get four FASTA files after running ALLHiC?

Plus, I have a new question about the Hi-C signal. In the pruning step, the strongest signals are selected. How are Hi-C signals calculated, and what do they mean? And how is the allelic contig table used in the pruning step?

Again, I really appreciate your fast and detailed descriptions :)

Sincerely, Amelia

tangerzhang commented 3 years ago

Hi Amelia, please see my responses below:

> First, I want to assemble an autopolyploid genome, not a monoploid one. For question 2, I understand that I have to use contigs totalling twice the estimated genome size (1C). But the assembler (Canu, FALCON, etc.) may already have removed some, though not all, heterozygous bubbles while building its graph. Is it then possible to construct a fully phased genome assembly using contigs from Canu or FALCON?
>
> Also, is there a contig assembler you would recommend? (Not one specific to ALLHiC, just your recommendation.) For ALLHiC, I think the contigs should retain as many heterozygous regions as possible rather than losing them during construction of the assembly graph.

Well, that depends on which sequencing platform was used to generate the reads. If the genome was sequenced in PacBio CLR mode, I recommend using Canu with the polyploid parameters suggested on the official website (https://canu.readthedocs.io/en/latest/faq.html). If it was sequenced in PacBio CCS mode (i.e., HiFi reads), hifiasm or HiCanu would be good choices in my experience.

> Last, for question 4: do you mean that I can get four FASTA files after running ALLHiC?

No, you should be able to get four haplotypes in one single fasta file (groups.asm.fasta).
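
If separate files are still wanted downstream, a multi-FASTA file can be split after the fact. The following is only a minimal sketch, not part of ALLHiC: the function name `split_fasta` and the output directory are hypothetical, and it assumes that each chromosome-scale group is one FASTA record. It produces one file per record, not one file per haplotype; deciding which records belong to the same haplotype set is a separate step.

```python
from pathlib import Path


def split_fasta(fasta_path, out_dir):
    """Write each record of a multi-FASTA file to its own file; return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    handle = None
    with open(fasta_path) as fasta:
        for line in fasta:
            if line.startswith(">"):
                if handle is not None:
                    handle.close()
                # Use the first word of the header as the file name;
                # fall back to a running number if the header is empty.
                parts = line[1:].split()
                name = parts[0] if parts else f"record_{len(written) + 1}"
                path = out_dir / f"{name.replace('/', '_')}.fasta"
                handle = open(path, "w")
                written.append(path)
            if handle is not None:
                handle.write(line)
    if handle is not None:
        handle.close()
    return written


# Hypothetical usage on the ALLHiC output named above:
# split_fasta("groups.asm.fasta", "per_group_fasta")
```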

> Plus, I have a new question about the Hi-C signal. In the pruning step, the strongest signals are selected. How are Hi-C signals calculated, and what do they mean? And how is the allelic contig table used in the pruning step?

Given contigs A and B, we can count the number of paired-end reads in the BAM file that span contig A and contig B. The Hi-C signal can then be calculated with the following simple formula:

Hi-C signal density = No. of read pairs / (length of contig A + length of contig B)

A higher score indicates that the two contigs might be closely linked; with a low score, they should not be grouped or linked together.

The allelic contig table records which contigs are allelic and which are collapsed. ALLHiC_prune uses this information to filter out noisy reads that should not be used in the partition. Does that make sense?
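
The signal density described above can be sketched in a few lines of Python. This is an illustration of the formula only, not the ALLHiC implementation: the function names, contig names and toy numbers are hypothetical, and the read pairs are given directly as pairs of contig names instead of being read from a BAM file.

```python
from collections import Counter


def count_links(read_pairs):
    """Count read pairs whose two mates map to different contigs."""
    links = Counter()
    for contig_1, contig_2 in read_pairs:
        if contig_1 != contig_2:
            # Sort the names so (A, B) and (B, A) count as the same pair.
            links[tuple(sorted((contig_1, contig_2)))] += 1
    return links


def signal_density(links, lengths):
    """Hi-C signal density = read pairs / (length of A + length of B)."""
    return {
        (a, b): n / (lengths[a] + lengths[b])
        for (a, b), n in links.items()
    }


# Toy data (hypothetical): contig lengths in bp, and the contig each mate maps to.
lengths = {"ctgA": 1000, "ctgB": 3000, "ctgC": 1000}
read_pairs = (
    [("ctgA", "ctgB")] * 8
    + [("ctgB", "ctgA")] * 2
    + [("ctgA", "ctgC")]
    + [("ctgA", "ctgA")] * 5  # intra-contig pairs are ignored
)
density = signal_density(count_links(read_pairs), lengths)
```

With these toy numbers, ctgA and ctgB share 10 read pairs over 4000 bp, while ctgA and ctgC share 1 read pair over 2000 bp, so the A-B pair has the higher density and is the stronger candidate for grouping.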

I have two additional suggestions that may be helpful for your project.

  1. I’ve attached the pipeline used to construct a haplotype-resolved genome assembly of an auto-tetraploid sugarcane genome. Please find more details in this link: https://github.com/tangerzhang/ALLHiC/wiki/ALLHiC:-scaffolding-an-auto-polyploid-sugarcane-genome
  2. We have updated the Prune algorithm (attached). The new version is more efficient than the previous one. Installing it requires GCC 6.4 or higher. The new Prune also uses htslib for speed, so htslib must be installed and available in your environment. Attachment: Prune.tar.gz

mentorlg commented 3 years ago

I really appreciate your response! It was very helpful for understanding ALLHiC.

I now understand which contig assemblers can be used and what the Hi-C signal is.

But if the output is just one FASTA file, how can I determine which chromosome sets are homologous?

tangerzhang commented 3 years ago

Hi Amelia, a dot-plot analysis between your target genome and a reference genome (either a close relative or a monoploid assembly) will help you determine which chromosomes are homologous. Please see an example below:

[Image: dotplot]
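
As a rough tabular complement to the dot plot (this is not a method from ALLHiC or from this thread), the same alignments can be summarized per sequence. The sketch below assumes the target-versus-reference alignments are available in PAF format, for example from minimap2, and assigns each target sequence to the reference chromosome with the most aligned bases. All function and sequence names are hypothetical.

```python
from collections import defaultdict


def best_reference_match(paf_lines):
    """Map each query sequence to the reference sequence with the most aligned bases."""
    aligned = defaultdict(lambda: defaultdict(int))
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        query, target = fields[0], fields[5]
        block_length = int(fields[10])  # PAF column 11: alignment block length
        aligned[query][target] += block_length
    return {q: max(t, key=t.get) for q, t in aligned.items()}


def homologous_sets(assignment):
    """Group query sequences that share the same best reference sequence."""
    sets = defaultdict(list)
    for query, target in sorted(assignment.items()):
        sets[target].append(query)
    return dict(sets)


# Toy PAF records (hypothetical names and coordinates).
paf = [
    "group1\t1000\t0\t900\t+\tChr1\t1200\t0\t900\t850\t900\t60",
    "group2\t1000\t0\t800\t+\tChr1\t1200\t100\t900\t760\t800\t60",
    "group2\t1000\t800\t1000\t-\tChr2\t1500\t0\t200\t150\t200\t30",
    "group3\t1100\t0\t1000\t+\tChr2\t1500\t0\t1000\t950\t1000\t60",
]
sets = homologous_sets(best_reference_match(paf))
```

Here group1 and group2 both align best to Chr1, so they would be treated as candidate homologs, while group3 falls under Chr2. A summary like this can miss rearrangements and misjoins that a dot plot makes visible, so it is best used alongside the plot.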