Closed mictadlo closed 7 years ago
Generally, there are a couple of ways of dealing with ploidy. Our preferred method is to avoid collapsing the genome, so you end up with double (or triple) the genome size, as long as your divergence is above about 2% on average. Below this divergence, you'd end up collapsing the variations. The main thing is that your genome is about three times larger than the haploid size, so you want higher coverage for assembly. We've used the following parameters for recent polyploid insect populations with good results (assuming you're using PacBio data):
corOutCoverage=200 errorRate=0.013 "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50"
This will output more corrected reads than the default 40x, which would be only <15x per haplotype. The latter option will be more conservative at picking the error rate to use for the assembly, to try to maintain haplotype separation. If it works, you'll end up with approximately 3x your haploid genome size. You'll have to do post-processing using gene information or other synteny information to remove the redundancy in this assembly, since you'll have 3 copies of large parts of the genome.
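For a quick check of the coverage arithmetic, here is a minimal Python sketch. It assumes reads are spread evenly across the haplotype copies, which is a simplification, and the function name per_haplotype_coverage is made up for illustration (it is not part of Canu). It also treats corOutCoverage=200 as if the full 200x were actually produced, which can only happen if the input data has that much coverage.

```python
def per_haplotype_coverage(total_coverage: float, ploidy: int) -> float:
    """Average coverage per haplotype, assuming reads split evenly across copies."""
    if ploidy < 1:
        raise ValueError("ploidy must be at least 1")
    return total_coverage / ploidy


# Triploid case discussed above.
default_cov = per_haplotype_coverage(40, 3)   # default corrected-read coverage
raised_cov = per_haplotype_coverage(200, 3)   # upper bound with corOutCoverage=200

print(f"default 40x -> {default_cov:.1f}x per haplotype")
print(f"raised 200x -> {raised_cov:.1f}x per haplotype")
```

With the default 40x this gives roughly 13.3x per haplotype (hence the "<15x" above), and with 200x roughly 66.7x.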
The alternative is to try to smash everything together and then do phasing using another approach (like HapCUT2 or whatshap or others). In that case you want to do the opposite, increase the error rate:
corOutCoverage=200 ovlErrorRate=0.15 obtErrorRate=0.15
I will add these options to the documentation.
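As a rough sketch of how the two option sets above might be passed to Canu, the Python snippet below builds the argument list for either strategy and prints it as a shell command. The option strings are copied from this thread. Everything else is an assumption for illustration: the helper name canu_command, the prefix, output directory, genomeSize value, and reads file are placeholders, and the -p / -d / genomeSize= / -pacbio-raw layout follows the Canu 1.x command line (newer releases changed the read-type flags, so check the usage for your version). Note that the whole batOptions string has to reach Canu as a single argument, which is why it is quoted in the shell form.

```python
import shlex

# Option sets quoted from the thread above.
SEPARATE_HAPLOTYPES = [
    "corOutCoverage=200",
    "errorRate=0.013",
    "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50",  # must stay one argument
]
COLLAPSE_HAPLOTYPES = [
    "corOutCoverage=200",
    "ovlErrorRate=0.15",
    "obtErrorRate=0.15",
]


def canu_command(prefix, outdir, genome_size, reads, extra_options):
    """Build a Canu 1.x-style argument list (hypothetical helper, not part of Canu)."""
    return [
        "canu",
        "-p", prefix,
        "-d", outdir,
        f"genomeSize={genome_size}",
        *extra_options,
        "-pacbio-raw", reads,
    ]


# Placeholder values; substitute your own prefix, directory, size, and reads.
cmd = canu_command("asm", "asm-separate", "500m", "reads.fastq", SEPARATE_HAPLOTYPES)
print(shlex.join(cmd))  # shell-quoted form, suitable for copy and paste
```

Swapping SEPARATE_HAPLOTYPES for COLLAPSE_HAPLOTYPES gives the command for the collapse-then-phase approach. To actually run it, the list could be handed to subprocess.run(cmd, check=True).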
@skoren
Hi skoren, thanks for your previous help!
I think I may have made a mistake, so I'm looking for help here. My genome is diploid with 60X total coverage, and I used the default parameters. Do you think I should stop the run? It has already been running for a long time, so I'm hoping there is another solution.
Best wishes!
I can't help much without more context to go on here. It depends on the heterozygosity in the genome, but you can let the current assembly finish and run another one with the FAQ settings.
Hi,
Thanks for your reply! You can think of my genome as similar to human. Does that mean I have to restart the whole assembly? I think the read length cutoff would change, so no intermediate files (including the overlap files) could be reused. Am I right?
Also, should I treat the coverage of my data as 30X rather than 60X?
In addition, the data contains both RSII and Sequel reads. Are there any other parameters I should pay attention to?
Any suggestions would be appreciated!
Best wishes!
Human is not heterozygous enough to matter so default parameters will work fine as will combining Sequel and RSII data.
Hi Dr. Skoren,
Thank you for your reply!
My genome is diploid and the coverage is 60X. Should I treat it as 30X or 60X (30X would be low coverage)?
The organism is a primate closely related to human, so I think I should start a new assembly at the same time, in case it is too heterozygous!
Hi, does Canu have any settings for polyploidy? We are trying to assemble a triploid.
Thank you in advance.
Michal