marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Recommendations in plant genome #2330

Closed enriquepola1996 closed 3 months ago

enriquepola1996 commented 4 months ago

Hello dear Canu developers,

I am assembling a plant genome for the first time and would like to ask for parameter recommendations. The genome is from a diploid plant with high heterozygosity, and my data are from PacBio (80 Gb of fasta.gz). Later, I would like to perform genome phasing. I set the genome size using an Illumina-based assembly of 4.22 Gb as a reference. I'm thinking of trying something like this:

module load canu/1.9

canu \
  useGrid=false \
  -p assembly \
  -d canu_plant \
  genomeSize=4.2g \
  -pacbio-raw all_pacbio.fasta.gz \
  corOutCoverage=200 \
  "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50"

Do you think these parameters would be ideal? Thanks so much.
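
Whether settings such as corOutCoverage=200 have any effect depends on how much read coverage is available relative to genomeSize. If the 80 Gb quoted above is the total number of bases, it works out to roughly 19x of a 4.2 Gb genome; if it is the compressed file size, the real yield will differ. A minimal sketch for checking this, assuming only standard gzip and awk (the function name estimate_coverage is hypothetical and not part of Canu):

```shell
# Hypothetical helper, not a Canu command.
# Sums the sequence lengths in a gzipped FASTA file and divides the total
# by the expected genome size (given in bases) to estimate read coverage.
# Usage: estimate_coverage reads.fasta.gz 4200000000
estimate_coverage() {
  reads="$1"
  genome_size="$2"
  # -c writes to stdout, -d decompresses, -f lets plain files pass through
  gzip -cdf "$reads" | awk -v g="$genome_size" '
    !/^>/ { bases += length($0) }   # skip deflines, count sequence characters
    END   { printf "bases=%.0f coverage=%.1f\n", bases, bases / g }'
}
```

For the command above this would be called as `estimate_coverage all_pacbio.fasta.gz 4200000000`. It reads the whole file once, so it can take a while on a data set of this size.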

skoren commented 4 months ago

No recommendations beyond what's in the FAQ, which you already have. I'd recommend updating to a more recent version than 1.9. Keep in mind for phasing that Canu won't purge duplicates, so you'll get both haplotypes in the assembly output. Some of them may be flagged as bubbles in the FASTA defline, but very diverged sequences won't show up as bubbles; you'd need a tool like purge_dups to separate primary and alt (see the FAQ again and some of the closed issues for suggested usage). Also, the contigs/alts are not guaranteed to preserve phasing. They will be pseudohaplotypes (http://lh3.github.io/2021/04/17/concepts-in-phased-assemblies), since the correction doesn't guarantee it will preserve phase in similar regions. If you want to try to phase them, you may want to look at FALCON-Phase or FALCON-Unzip.
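
The bubble flag mentioned above can be inspected directly in the contig FASTA. The sketch below assumes that recent Canu releases write a `suggestBubble=yes` field in the defline of flagged contigs; check the headers of your own output before relying on that, since the exact fields vary between versions. The function name list_bubble_contigs is hypothetical.

```shell
# Hypothetical helper, not a Canu command.
# Prints the names of contigs whose FASTA defline contains suggestBubble=yes.
# ASSUMPTION: deflines look like ">tigNNN len=... suggestBubble=yes ...".
# Usage: list_bubble_contigs assembly.contigs.fasta
list_bubble_contigs() {
  awk '/^>/ && / suggestBubble=yes( |$)/ {
         sub(/^>/, "", $1)   # drop the leading ">" from the contig name
         print $1
       }' "$1"
}
```

With the `-p assembly -d canu_plant` settings from the original command, the input would be `canu_plant/assembly.contigs.fasta`. This only lists what Canu itself flagged; very diverged haplotypes will not appear here, which is why a tool like purge_dups is still needed.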

skoren commented 3 months ago

Idle