marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

triocanu results: a large number of reads were undistinguishable from haplotypes #2345

Open cheninouc opened 1 day ago

cheninouc commented 1 day ago

Hi, thanks for developing this tool. I have resequencing data for two parents (P19_clean_1.fq.gz, P19_clean_2.fq.gz, P48_clean_1.fq.gz, P48_clean_2.fq.gz) and ONT (R10.4.1) data for their offspring (F1.fq.gz). I want to use TrioCanu for haplotype assembly:

canu -p F1_trio -d canu_trio_assembly \
     genomeSize=600m \
     -nanopore F1.fq.gz \
     -haplotypeP19 ../data/P19_clean.fq.gz \
     -haplotypeP48 ../data/P48_clean.fq.gz \
     useGrid=false

The splitHaplotype step produced the following output:

drwxrwxr-x 3 chcg chcg       4096 Sep 27 19:01 ./
drwxrwxr-x 7 chcg chcg       4096 Sep 28 11:39 ../
drwxrwxr-x 6 chcg chcg       4096 Sep 27 18:53 0-kmers/
-rw-rw-r-- 1 chcg chcg        782 Sep 27 19:01 haplotype.log
-rw-rw-r-- 1 chcg chcg   25287684 Sep 27 19:01 haplotype-P19.fasta.gz
-rw-rw-r-- 1 chcg chcg   39011298 Sep 27 19:01 haplotype-P48.fasta.gz
-rw-rw-r-- 1 chcg chcg 3889944485 Sep 27 19:01 haplotype-unknown.fasta.gz
-rw-rw-r-- 1 chcg chcg          0 Sep 27 19:01 haplotyping.success
-rw-rw-r-- 1 chcg chcg       1955 Sep 27 19:01 splitHaplotype.000001.out
-rwxr-xr-x 1 chcg chcg       2289 Sep 27 18:53 splitHaplotype.sh*

Why could such a large number of reads not be assigned to either haplotype? My species has relatively high heterozygosity, but the parental resequencing data is only 10-15X. Is the low parental coverage the cause?

Thanks in advance.

cheninouc commented 1 day ago

In addition, the P19_clean.fq.gz and P48_clean.fq.gz files are direct concatenations of the paired-end reads:

zcat P19_clean_1.fq.gz P19_clean_2.fq.gz | gzip - > P19_clean.fq.gz
zcat P48_clean_1.fq.gz P48_clean_2.fq.gz | gzip - > P48_clean.fq.gz

skoren commented 1 day ago

The unknown reads are those without any marker k-mers. It's possible your parental sequencing is too sparse or the haplotypes are too closely related. Have you looked at the F1 stats in GenomeScope and the parental marker counts with Merqury? Post the *.out and *.log files from your run.
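
To make the idea of marker k-mers concrete, here is a toy Python sketch of binning reads by parent-specific k-mers. It is not Canu's splitHaplotype implementation: the sequences, the value of `K`, and the `kmers` and `classify` helpers are all made up for illustration, and the real tool works on meryl databases with frequency thresholds. The point is only that a read with no marker k-mers from either parent ends up as "unknown".

```python
def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify(read, markers_a, markers_b, k):
    """Assign a read to parent A, parent B, or 'unknown' by marker k-mer hits."""
    read_kmers = kmers(read, k)
    hits_a = len(read_kmers & markers_a)
    hits_b = len(read_kmers & markers_b)
    if hits_a == hits_b:  # includes the 0 vs 0 case: no marker k-mers at all
        return "unknown"
    return "A" if hits_a > hits_b else "B"


K = 5  # toy value chosen for short example sequences (assumption)

# Hypothetical parental sequences differing at a single position.
parent_a = "ACGTACGGTCAATGC"
parent_b = "ACGTACGCTCAATGC"

# Marker k-mers: present in one parent and absent from the other.
markers_a = kmers(parent_a, K) - kmers(parent_b, K)
markers_b = kmers(parent_b, K) - kmers(parent_a, K)

results = {
    "read_spanning_a_variant": classify("TACGGTCAA", markers_a, markers_b, K),
    "read_spanning_b_variant": classify("TACGCTCAA", markers_a, markers_b, K),
    "read_in_shared_region": classify("ACGTACG", markers_a, markers_b, K),
}

for name, label in results.items():
    print(f"{name}: {label}")
```

In this toy setting, the third read covers only sequence shared by both parents, so it carries no marker k-mers and cannot be assigned. If very few marker k-mers are loaded, most reads fall into that category.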

cheninouc commented 12 hours ago

The contents of the two files are as follows:

splitHaplotype.000001.out:

Found perl:
   /public/home/chcg/anaconda3/envs/mamba/envs/canu/bin/perl
   This is perl 5, version 32, subversion 1 (v5.32.1) built for x86_64-linux-thread-multi

Found java:
   /public/home/chcg/anaconda3/envs/mamba/envs/canu/bin/java
   openjdk version "11.0.13" 2021-10-19

Found canu:
   /public/home/chcg/anaconda3/envs/mamba/envs/canu/bin/canu
   canu 2.2

Running job 1 based on command line options.
--
-- Loading haplotype data, using up to 6 GB memory for each.
--

For 626 distinct 20-mers (with 6 bits used for indexing and 34 bits for tags):
    0.000 GB memory for kmer indices -           64 elements 64 bits wide)
    0.000 GB memory for kmer tags    -          626 elements 34 bits wide)
    0.000 GB memory for kmer values  -          626 elements 12 bits wide)
    0.000 GB memory

Will load 626 kmers.  Skipping 256996778 (too low) and 0 (too high) kmers.
Allocating space for 16754 suffixes of 34 bits each -> 569636 bits (0.000 GB) in blocks of 32.000 MB
                     16754 values   of 12 bits each -> 201048 bits (0.000 GB) in blocks of 32.000 MB
Loaded 626 kmers.  Skipped 256996778 (too low) and 0 (too high) kmers.
--   loaded 626 kmers.

For 1687 distinct 20-mers (with 6 bits used for indexing and 34 bits for tags):
    0.000 GB memory for kmer indices -           64 elements 64 bits wide)
    0.000 GB memory for kmer tags    -         1687 elements 34 bits wide)
    0.000 GB memory for kmer values  -         1687 elements 14 bits wide)
    0.000 GB memory

Will load 1687 kmers.  Skipping 388173722 (too low) and 0 (too high) kmers.
Allocating space for 17815 suffixes of 34 bits each -> 605710 bits (0.000 GB) in blocks of 32.000 MB
                     17815 values   of 14 bits each -> 249410 bits (0.000 GB) in blocks of 32.000 MB
Loaded 1687 kmers.  Skipped 388173722 (too low) and 0 (too high) kmers.
--   loaded 1687 kmers.
-- Data loaded.
--
-- Processing reads in batches of 100 reads each.
--
-- Bye.
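
For readers skimming the output above, the "Loaded ... Skipped ..." lines can be tallied with a short Python sketch. The `LOG` string copies the two summary lines verbatim. The `summarize` helper is hypothetical, and two things are assumptions: that the first block corresponds to P19 and the second to P48 (based on the order in haplotype.log), and that loaded plus skipped k-mers add up to each database's total.

```python
import re

# The two summary lines copied from splitHaplotype.000001.out.
LOG = """\
Loaded 626 kmers.  Skipped 256996778 (too low) and 0 (too high) kmers.
Loaded 1687 kmers.  Skipped 388173722 (too low) and 0 (too high) kmers.
"""

PATTERN = re.compile(
    r"Loaded (\d+) kmers\.\s+Skipped (\d+) \(too low\) and (\d+) \(too high\) kmers\."
)


def summarize(text):
    """Return loaded count, assumed total, and loaded fraction per database."""
    rows = []
    for loaded, low, high in PATTERN.findall(text):
        loaded, low, high = int(loaded), int(low), int(high)
        total = loaded + low + high  # assumption: these three cover every k-mer
        rows.append({"loaded": loaded, "total": total, "fraction": loaded / total})
    return rows


rows = summarize(LOG)

# Assumption: databases are reported in the order P19, then P48.
for name, row in zip(("P19", "P48"), rows):
    print(f"{name}: {row['loaded']} of {row['total']} k-mers loaded "
          f"({row['fraction']:.2e})")
```

Under those assumptions, only a few k-mers per million pass the frequency filter in either database, which matches the very small number of reads that could be assigned.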

haplotype.log:

--  Haplotype './0-kmers/haplotype-P19.meryl':
--   use kmers with frequency at least 1009.
--  Haplotype './0-kmers/haplotype-P48.meryl':
--   use kmers with frequency at least 998.
-- Begin    processing file /public/home/chcg/dowload/BC202408553/BC202408553-ONT-ul-1samples/kw1-1M/pass.all.fq.gz
-- Finished processing file /public/home/chcg/dowload/BC202408553/BC202408553-ONT-ul-1samples/kw1-1M/pass.all.fq.gz with 458589 records
--
--     1907 reads     85737412 bases written to haplotype file ./haplotype-P19.fasta.gz.
--     3060 reads    137291711 bases written to haplotype file ./haplotype-P48.fasta.gz.
--   441892 reads  12043284403 bases written to haplotype file ./haplotype-unknown.fasta.gz.
--
--    11730 reads      9895948 bases filtered for being too short.
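
The read counts in haplotype.log above can be cross-checked and turned into fractions with a small Python sketch. The `LOG` string reproduces the relevant lines, with the input file path shortened to its file name for readability. The regular expressions are written for this particular log excerpt and are not a general parser for Canu output.

```python
import re

# Relevant lines from haplotype.log (input path shortened to the file name).
LOG = """\
-- Finished processing file pass.all.fq.gz with 458589 records
--     1907 reads     85737412 bases written to haplotype file ./haplotype-P19.fasta.gz.
--     3060 reads    137291711 bases written to haplotype file ./haplotype-P48.fasta.gz.
--   441892 reads  12043284403 bases written to haplotype file ./haplotype-unknown.fasta.gz.
--    11730 reads      9895948 bases filtered for being too short.
"""

WRITTEN = re.compile(
    r"--\s+(\d+) reads\s+(\d+) bases written to haplotype file \./haplotype-(\w+)\.fasta\.gz"
)
SHORT = re.compile(r"--\s+(\d+) reads\s+(\d+) bases filtered for being too short")
RECORDS = re.compile(r"with (\d+) records")

# Map each output bin (P19, P48, unknown) to its (reads, bases) counts.
bins = {name: (int(reads), int(bases)) for reads, bases, name in WRITTEN.findall(LOG)}
short_reads = int(SHORT.search(LOG).group(1))
records = int(RECORDS.search(LOG).group(1))

binned_reads = sum(reads for reads, _ in bins.values())
binned_bases = sum(bases for _, bases in bins.values())

unknown_read_frac = bins["unknown"][0] / binned_reads
unknown_base_frac = bins["unknown"][1] / binned_bases

print(f"binned + too short = {binned_reads + short_reads} (records: {records})")
print(f"unknown reads: {unknown_read_frac:.1%} of binned reads")
print(f"unknown bases: {unknown_base_frac:.1%} of binned bases")
```

The three output bins plus the reads filtered for length account for all 458589 records, and roughly 99% of the binned reads (about 98% of the bases) were written to the unknown file.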