marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Phasing of genome using Canu output #783

Closed amit4mchiba closed 6 years ago

amit4mchiba commented 6 years ago

Dear Koren, Walenz and Phillippy,

This is not an issue but rather a query about Canu bundling phasing ability for diploid genomes. I attended the PAG 2018 conference and had a nice conversation with Koren after the talk on phasing the cow genome, getting two genomes by sequencing one. We discussed how this approach performs better than FALCON-Unzip, and I was very impressed with the neat phasing of the genome.

I am currently working on a plant genome (estimated genome size 400 Mb) and have 123x genome coverage in PacBio reads. I used Canu to assemble this genome and got the entire genome in 200 contigs with an N50 of 10.3 Mb. So I am very excited and want to go a step further and phase the genome. My species is a diploid with 2 × 11 chromosomes. The repetitiveness and heterozygosity that are characteristic of plant genomes are very minimal in my species, which is an advantage. Based on my reading, I have not seen any plant genome with contigs over 10 Mb from PacBio reads alone; of course, adding Hi-C and other datasets allows going beyond 60-70 Mb. Therefore, resolving my genome further makes a lot of sense. I remember you mentioned that the approach to phase a genome using Canu relies on meryl output. If you are going to share the scripts or method to phase the genome, that would be really great.

skoren commented 6 years ago

The method we presented at PAG relies on parent information, so you'd have to have Illumina sequence for the parents of your sample. The method works at as low as human-level divergence (0.1%) but is best above about 0.5% divergence.
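For a rough idea of how the classification works (a minimal sketch, not our actual implementation), assume you've already built sets of k-mers unique to each parent, e.g. with meryl set-difference operations on the parents' Illumina data; the k-mer size and names below are illustrative:

```python
# Sketch of parental k-mer binning: assign each long read to the parent
# whose unique k-mers it contains more of. `mat_kmers` and `pat_kmers`
# are assumed to hold k-mers present in one parent's data but not the
# other's (e.g. from meryl set differences).
K = 21  # illustrative k-mer size

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    comp = str.maketrans("ACGT", "TGCA")
    return min(kmer, kmer.translate(comp)[::-1])

def assign_read(seq, mat_kmers, pat_kmers):
    """Count parent-specific k-mers in a read; return 'mat', 'pat', or 'unknown'."""
    mat_hits = pat_hits = 0
    seq = seq.upper()
    for i in range(len(seq) - K + 1):
        kmer = canonical(seq[i:i + K])
        if kmer in mat_kmers:
            mat_hits += 1
        elif kmer in pat_kmers:
            pat_hits += 1
    if mat_hits > pat_hits:
        return "mat"
    if pat_hits > mat_hits:
        return "pat"
    return "unknown"  # no signal: homozygous region or no parental markers
```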

If you don't have parent information you'd have to use more traditional phasing approaches like HapCUT2 or WhatsHap, both of which can use PacBio reads to phase SNPs across the genome. You could then bin the reads based on their SNPs and assemble haplotigs for each of the contigs in your assembly. These will necessarily be less than or equal in length to the existing contigs, since it may not be possible to phase across a full contig. This was the method used in the recent AK1 genome; you can see more details in that paper: https://www.nature.com/articles/nature20098.
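As a sketch of the binning step (file names are placeholders; this assumes a phased VCF from one of those tools and your reads aligned to the assembly in a BAM, handled with pysam):

```python
import pysam

# Collect phased heterozygous SNPs: (contig, 0-based pos) -> (hap1 base, hap2 base)
sites = {}
for rec in pysam.VariantFile("phased.vcf.gz"):
    sample = next(iter(rec.samples.values()))
    alleles = sample.alleles
    if not sample.phased or alleles is None or None in alleles or len(alleles) != 2:
        continue
    a1, a2 = alleles
    if a1 != a2 and len(a1) == len(a2) == 1:  # biallelic het SNP only
        sites[(rec.chrom, rec.start)] = (a1, a2)

bins = {"hap1": [], "hap2": [], "unassigned": []}
with pysam.AlignmentFile("reads.bam") as bam:
    for read in bam:
        if read.is_unmapped or read.is_secondary or read.is_supplementary:
            continue
        votes = [0, 0]
        # Walk the alignment and vote at every phased SNP the read covers.
        for qpos, rpos in read.get_aligned_pairs(matches_only=True):
            snp = sites.get((read.reference_name, rpos))
            if snp is not None:
                base = read.query_sequence[qpos]
                if base == snp[0]:
                    votes[0] += 1
                elif base == snp[1]:
                    votes[1] += 1
        if votes[0] > votes[1]:
            bins["hap1"].append(read.query_name)
        elif votes[1] > votes[0]:
            bins["hap2"].append(read.query_name)
        else:
            bins["unassigned"].append(read.query_name)
```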

amit4mchiba commented 6 years ago

Thanks a lot, Koren, for your reply. Unfortunately, we do not know the parents in this case, so I guess going for traditional phasing would be the way to go, as you suggested. The only confusion I have is why phasing then improves the overall assembly. I have previously performed genome assembly using FALCON and FALCON-Unzip, and I found that the FALCON-Unzip assembly has better statistics than FALCON alone. I agree that FALCON-Unzip takes the FALCON output and, in principle, unzipping should leave the same number of contigs as the FALCON output, yet I have seen unzipping give me better results. That's why I was excited about your approach and thought the overall assembly would get phased and improved. One way unzipping could improve an assembly is through binning reads based on SNPs: this would reduce the overall repeated or non-unique sequence and might resolve paths in the graph, hence the improved assembly. I am not sure if I am wrong.

skoren commented 6 years ago

FALCON-Unzip doesn't join contigs already built by FALCON, but it will remove redundant sequences and decrease the assembly size and number of contigs, which will change the N50 (but should have only minor effects on the NG50). The primary output of FALCON-Unzip is also not phased over the full contig length, while it is with the parental approach.
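To make the N50 vs. NG50 distinction concrete, here is a toy example (made-up numbers): removing redundant haplotigs shrinks the assembly total that N50 is computed against, while NG50 keeps using the fixed genome-size estimate.

```python
def nx50(lengths, total=None):
    """N50 when total is the assembly size; NG50 when total is the genome size."""
    total = sum(lengths) if total is None else total
    run = 0
    for length in sorted(lengths, reverse=True):
        run += length
        if run >= total / 2:
            return length
    return 0

genome = 20_000_000                            # estimated genome size
inflated = [x * 1_000_000 for x in (9, 5, 4, 4, 4, 4)]  # both haplotypes present
deduped  = [x * 1_000_000 for x in (9, 5, 4, 4)]        # redundant haplotigs removed

print(nx50(inflated), nx50(deduped))                    # N50:  4 Mb -> 5 Mb
print(nx50(inflated, genome), nx50(deduped, genome))    # NG50: 5 Mb both times
```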

amit4mchiba commented 6 years ago

Great, thank you so much. Now it is much clearer.

I will then try phasing with HapCUT2 and will post if something interesting comes up. Thank you for the advice.

cschin commented 6 years ago

@amit4mchiba The algorithm behind FALCON-Unzip is not just simply phasing SNPs. Depending on the genome, there can be various degrees of structural variation. The algorithm is designed to handle SVs and SNPs directly in the assembly graph. Then again, depending on the degree of heterozygosity, a different strategy might be easier from a computational point of view.

skoren commented 6 years ago

Jason is correct. To clarify, what I was suggesting is not just to phase SNPs but to use the SNPs to assign reads to a haplotype; this is essentially what FALCON-Unzip is doing under the hood. You can then assemble haplotype-specific read sets. Whether you include unclassified reads will control whether or not you introduce switches in homozygous regions.
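Concretely, continuing the read-binning sketch from my earlier comment (the file names and Canu invocation are illustrative, and `bins` is the dictionary built there), the assembly step could look like:

```python
import subprocess
import pysam

# True: unclassified reads go into both bins, keeping homozygous regions
# contiguous at the cost of possible switches. False: purer but shorter bins.
INCLUDE_UNASSIGNED = True

def write_bin(names, reads_in, fasta_out):
    """Copy the selected reads from the original read file into a FASTA bin."""
    wanted = set(names)
    with pysam.FastxFile(reads_in) as fh, open(fasta_out, "w") as out:
        for entry in fh:
            if entry.name in wanted:
                out.write(f">{entry.name}\n{entry.sequence}\n")

for hap in ("hap1", "hap2"):
    names = bins[hap] + (bins["unassigned"] if INCLUDE_UNASSIGNED else [])
    write_bin(names, "pacbio.reads.fastq", f"{hap}.fasta")
    # Assemble each haplotype bin on its own (Canu 1.x style invocation).
    subprocess.run(["canu", "-p", hap, "-d", f"{hap}-asm",
                    "genomeSize=400m", "-pacbio-raw", f"{hap}.fasta"],
                   check=True)
```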

amit4mchiba commented 6 years ago

Hi Koren and Jason,

I am back with an update on my genome assembly and to seek your advice. I finished genome assemblies using Canu and FALCON-Unzip for my plant species and have gotten brilliant results. My plant's genome size is 400 Mb and it is a diploid. Using Canu with 120x PacBio coverage, I got the complete genome in 243 contigs with an N50 of 9.83 Mb. The FALCON-Unzip results were better in terms of average contig length: I got the entire genome in 197 contigs, with an N50 of 6.9 Mb.

I tried the HapCUT2 approach to call SNPs and then phase, but was not successful. On the other hand, FALCON-Unzip does not really provide a phased genome, but rather primary and associated contigs (please correct me if I am wrong), not something in the form of chromosomes. I was wondering if there is some way I can completely phase the genome as Koren described at PAG 2018. Also, I am not sure whether I could use the Canu output in FALCON-Unzip to get phasing. I know that FALCON-Unzip requires FALCON results as input for phasing, but I was wondering if there is some trick to change the file format to match FALCON-Unzip's requirements. I am in the process of acquiring Hi-C data for my genome, but I guess phasing the genome first should help me get a nicer assembly after including the Hi-C data. Please advise.

skoren commented 6 years ago

The larger number of contigs in the Canu assembly is likely because Canu is not separating them into primary and secondary. In regions of divergence you'll get both haplotypes in the primary assembly. You can see this if the assembly size is larger than the 400 Mb expected size.
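A quick way to check (a sketch using pysam; adjust the contigs file name to your run):

```python
import pysam

# Compare total assembly span against the expected genome size.
expected = 400_000_000
lengths = [len(rec.sequence) for rec in pysam.FastxFile("canu.contigs.fasta")]
span = sum(lengths)
print(f"{len(lengths)} contigs spanning {span/1e6:.1f} Mb "
      f"({span/expected:.0%} of the expected {expected/1e6:.0f} Mb)")
# A span well above 100% suggests both haplotypes are in the assembly.
```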

FALCON-Unzip is locally phased. That is, the primary contigs have blocks of phased and homozygous regions, and they may switch between haplotypes. See the 10XG description of a pseudohaplotype: https://support.10xgenomics.com/de-novo-assembly/software/pipelines/latest/output/generating. The alts are alternates to the phased blocks included in the primaries. So it's phased, but not across the entirety of a contig or chromosome.

You're not going to get a phased chromosome from PacBio data alone; it's limited by read length and diversity. Again, as I said before, what I described at PAG is based on having parental or ancestral information; if you don't have that, you're not going to get complete phasing across the full genome. If you get other data types (10XG or Hi-C), you can use them to scaffold and get long phase blocks with HapCUT2 and assign reads to a haplotype. This was the approach in the Korean genome paper: https://www.nature.com/articles/nature20098. There is software associated with that paper which can assign PacBio reads to a haplotype based on a SNP list from HapCUT2. You could also cut/rejoin the FALCON-Unzip primary contigs to make the phasing consistent with the Hi-C data, but I'm not aware of any software to do that.

amit4mchiba commented 6 years ago

Thank you, Koren, for your prompt reply. Indeed, the genome size I got was 440 Mb compared to the expected genome size of 400 Mb, so I agree with you.

If I have understood you correctly, you suggest first performing scaffolding using Hi-C or 10XG sequencing and then using HapCUT2 to assign haplotypes based on a SNP list. I agree that PacBio reads have limitations in terms of read length and diversity. I have also acquired Bionano mapping data for my plant species, which I am analyzing now. Initially, my plan was to perform assembly using PacBio; then polish the assembly using Illumina reads (96x coverage data that I have already acquired); then perform phasing using HapCUT2 (I thought the present assembly had long enough phase blocks to give me haplotypes); then scaffold with Hi-C; and then correct the assembly and do SV analysis using the Bionano data. That was my planned assembly pipeline. But after your reply, I am wondering if I should instead perform a hybrid assembly of the PacBio assembly together with Bionano, then phase it using HapCUT2, and then scaffold the phased genome with the Hi-C data. That should give me better phase blocks for analyzing SNPs.

About cutting/rejoining the FALCON-Unzip primary contigs to make the phasing consistent with the Hi-C data: yes, there are no tools available. I am not an expert in programming, but I will try to find a programmer to help implement what you just suggested. I really wish a tool could be developed for doing exactly this in the future; that would be amazing. Thank you so much for your comments.

skoren commented 6 years ago

I would suggest getting the most contiguous/correct assembly you can before running HapCUT2. HapCUT2 expects a collapsed single-haplotype assembly and won't generate a phase block longer than your scaffolds. So your assembly pipeline sounds fine; just don't do phasing until the end, after the BioNano validation.
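If it helps, here's a small sketch for checking phase-block lengths afterwards, assuming a phased VCF that carries the standard PS (phase set) FORMAT tag, as WhatsHap writes (HapCUT2's native block output would need its own parser):

```python
from collections import defaultdict
import pysam

# Group phased variants by (contig, phase-set id); a block can never span
# more than one contig/scaffold, which is why scaffolding first pays off.
blocks = defaultdict(list)
for rec in pysam.VariantFile("phased.vcf.gz"):
    sample = next(iter(rec.samples.values()))
    if sample.phased and "PS" in sample and sample["PS"] is not None:
        blocks[(rec.chrom, sample["PS"])].append(rec.start)

spans = [max(p) - min(p) + 1 for p in blocks.values()]
if spans:
    print(f"{len(spans)} phase blocks, longest {max(spans)/1e6:.2f} Mb")
```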

amit4mchiba commented 6 years ago

Thank you so much. I will try to post my experience and any troubles here once I get to a more advanced stage.