marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Trio bin F1 ONT reads using parental Pacbio HIFI data? #2081

Closed TrentPrall closed 2 years ago

TrentPrall commented 2 years ago

Hello,

I have a trio dataset with 30x coverage of ONT long reads and 30x coverage of PacBio HiFi data for all three samples in the trio. Is it possible to bin the F1 ONT reads and/or F1 HiFi reads by substituting the parental HiFi reads for the parental Illumina reads? If so, how would you go about running this?

Forgive me if this has already been addressed but I haven't seen an answer in the issues/documentation.

Thank you.

skoren commented 2 years ago

Yes, HiFi reads should work for binning. Run it the same way you would with Illumina reads, but specify the parental HiFi reads for the haplotyping. I'd run just the haplotyping step to bin the ONT and HiFi reads separately (see the documentation notes on how to run trio binning for HiFi data: https://canu.readthedocs.io/en/latest/quick-start.html#trio-binning-assembly).
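
For instance, a haplotype-only run that bins the F1 ONT reads against parental HiFi data might look like the sketch below; the file names, the Parent1/Parent2 labels, and the genome size are placeholders rather than values taken from your dataset:

canu -haplotype \
  -p F1 -d F1-trio \
  genomeSize=<size> \
  -haplotypeParent1 parent1.hifi.fastq.gz \
  -haplotypeParent2 parent2.hifi.fastq.gz \
  -nanopore F1.ont.fastq.gz

A second run with the F1 HiFi reads in place of the ONT reads would bin those separately (see the linked quick-start for the HiFi-specific notes).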

TrentPrall commented 2 years ago

Thank you for the info. I have tried following the workaround instructions for running Canu trio binning with PacBio HiFi data, but when I run the following command no scripts are created and the data isn't binned. Any thoughts on what may be occurring?

docker run -it -u 0 -v $(pwd):/scratch -w /scratch staphb/canu:latest \
canu -haplotype \
-p cy0695.pacbio \
-genomeSize=1m \
-haplotypecy0355 cy0355.pacbio.kir.fastq.gz \
-haplotypecy0161 cy0161.pacbio.kir.fastq.gz \
-pacbio-raw cy0695.pacbio.kir.fastq.gz

Here's the error:

-- BEGIN HAPLOTYPING
--
-- Running jobs.  First attempt out of 2.
----------------------------------------
-- Starting 'hap' concurrent execution on Wed Feb 16 05:10:43 2022 with 4666.686 GB free disk space (1 processes; 10 concurrently)

    cd haplotype
    ./splitHaplotype.sh 1 > ./splitHaplotype.000001.out 2>&1

-- Finished on Wed Feb 16 05:10:44 2022 (one second) with 4666.686 GB free disk space
----------------------------------------
--
-- Haplotyping jobs failed, retry.
--   job 1 FAILED.
--
--
-- Running jobs.  Second attempt out of 2.
----------------------------------------
-- Starting 'hap' concurrent execution on Wed Feb 16 05:10:44 2022 with 4666.686 GB free disk space (1 processes; 10 concurrently)

    cd haplotype
    ./splitHaplotype.sh 1 > ./splitHaplotype.000001.out 2>&1

-- Finished on Wed Feb 16 05:10:44 2022 (in the blink of an eye) with 4666.686 GB free disk space
----------------------------------------
--
-- Haplotyping jobs failed, tried 2 times, giving up.
--   job 1 FAILED.
--

ABORT:
ABORT: canu 2.2
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:
ABORT: Disk space available:  4666.686 GB
ABORT:
ABORT: Last 50 lines of the relevant log file (haplotype/haplotype.log.WORKING):
ABORT:
ABORT:

Thanks again.

skoren commented 2 years ago

What's in the splitHaplotype.000001.out log?

TrentPrall commented 2 years ago

Here is the log:

cat splitHaplotype.000001.out 

Found perl:
   /usr/bin/perl
   This is perl 5, version 22, subversion 1 (v5.22.1) built for x86_64-linux-gnu-thread-multi

Found java:
   /usr/bin/java
   openjdk version "1.8.0_292"

Found canu:
   /canu-2.2/bin/canu
   canu 2.2

Running job 1 based on command line options.
--
-- Loading haplotype data, using up to 4 GB memory for each.
--

For 69656 distinct 15-mers (with 11 bits used for indexing and 19 bits for tags):
    0.000 GB memory for kmer indices -         2048 elements 64 bits wide)
    0.000 GB memory for kmer tags    -        69656 elements 19 bits wide)
    0.000 GB memory for kmer values  -        69656 elements  7 bits wide)
    0.000 GB memory

Will load 69656 kmers.  Skipping 291816 (too low) and 0 (too high) kmers.
Allocating space for 85784 suffixes of 19 bits each -> 1629896 bits (0.000 GB) in blocks of 32.000 MB
                     85784 values   of 7 bits each -> 600488 bits (0.000 GB) in blocks of 32.000 MB
Loaded 69656 kmers.  Skipped 291816 (too low) and 0 (too high) kmers.
--   loaded 69656 kmers.

For 0 distinct 15-mers (with 0 bits used for indexing and 30 bits for tags):
    0.000 GB memory for kmer indices -            1 elements 64 bits wide)
    0.000 GB memory for kmer tags    -            0 elements 30 bits wide)
    0.000 GB memory for kmer values  -            0 elements  0 bits wide)
  2147483648.000 GB memory

--   loaded 0 kmers.
-- Data loaded.
--
-- Processing reads in batches of 100 reads each.
--

Failed with 'Segmentation fault'; backtrace (libbacktrace):

Failed with 'Segmentation fault'; backtrace (libbacktrace):
utility/src/utility/system-stackTrace.C::82 in _Z17AS_UTL_catchCrashiP7siginfoPv()
Segmentation fault (core dumped)

skoren commented 2 years ago

It looks like one of your haplotypes is empty (or at least had no kmers). That might be an issue with the input data. Post the splitHaplotype.*.sh script and the contents of your 0-kmers directory, specifically the starts of the *.statistics files.
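
For example, something like this (paths assume the default haplotype/ directory shown in your log):

cd haplotype
cat splitHaplotype.sh
head -n 50 0-kmers/reads-*.statistics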

TrentPrall commented 2 years ago

head -n 50 reads-cy0161.statistics
Number of 15-mers that are:
  unique                 311436  (exactly one instance of the kmer is in the input)
  distinct              1287991  (non-redundant kmer sequences in the input)
  present              15635665  (...)
  missing            1072453833  (non-redundant kmer sequences not in the input)

             number of   cumulative   cumulative     presence
              distinct     fraction     fraction   in dataset
frequency        kmers     distinct        total       (1e-6)
--------- ------------ ------------ ------------ ------------
        1       311436       0.2418       0.0199     0.063956
        2        32440       0.2670       0.0241     0.127913
        3        18323       0.2812       0.0276     0.191869
        4        22164       0.2984       0.0333     0.255825
        5        34147       0.3249       0.0442     0.319782
        6        40611       0.3565       0.0598     0.383738
        7        47606       0.3934       0.0811     0.447694
        8        64480       0.4435       0.1141     0.511651
        9        66643       0.4952       0.1524     0.575607
       10        77725       0.5556       0.2021     0.639563
       11        82594       0.6197       0.2602     0.703520
       12        78845       0.6809       0.3207     0.767476
       13        72853       0.7375       0.3813     0.831432
       14        68789       0.7909       0.4429     0.895389
       15        50248       0.8299       0.4911     0.959345
       16        32359       0.8550       0.5242     1.023302
       17        28222       0.8769       0.5549     1.087258
       18        20734       0.8930       0.5788     1.151214
       19        14685       0.9044       0.5966     1.215171
       20        11536       0.9134       0.6114     1.279127
       21         8254       0.9198       0.6225     1.343083
       22         7088       0.9253       0.6324     1.407040
       23         5526       0.9296       0.6406     1.470996
       24         5307       0.9337       0.6487     1.534952
       25         5573       0.9380       0.6576     1.598909
       26         5957       0.9427       0.6675     1.662865
       27         5524       0.9470       0.6771     1.726821
       28         5229       0.9510       0.6864     1.790778
       29         5436       0.9552       0.6965     1.854734
       30         4676       0.9589       0.7055     1.918690
       31         3212       0.9614       0.7119     1.982647
       32         2673       0.9634       0.7173     2.046603
       33         2719       0.9655       0.7231     2.110559
       34         2131       0.9672       0.7277     2.174516
       35         1785       0.9686       0.7317     2.238472
       36         1176       0.9695       0.7344     2.302428
       37         1217       0.9704       0.7373     2.366385
       38         1087       0.9713       0.7399     2.430341
       39          930       0.9720       0.7423     2.494297

head -n 50 reads-cy0355.statistics
Number of 15-mers that are:
  unique                 319722  (exactly one instance of the kmer is in the input)
  distinct              1322622  (non-redundant kmer sequences in the input)
  present              15087376  (...)
  missing            1072419202  (non-redundant kmer sequences not in the input)

             number of   cumulative   cumulative     presence
              distinct     fraction     fraction   in dataset
frequency        kmers     distinct        total       (1e-6)
--------- ------------ ------------ ------------ ------------
        1       319722       0.2417       0.0212     0.066281
        2        36385       0.2692       0.0260     0.132561
        3        28710       0.2910       0.0317     0.198842
        4        34159       0.3168       0.0408     0.265122
        5        50675       0.3551       0.0576     0.331403
        6        66782       0.4056       0.0841     0.397683
        7        58130       0.4495       0.1111     0.463964
        8        56072       0.4919       0.1408     0.530245
        9        63941       0.5403       0.1790     0.596525
       10        67934       0.5916       0.2240     0.662806
       11        62863       0.6392       0.2698     0.729086
       12        66481       0.6894       0.3227     0.795367
       13        64182       0.7380       0.3780     0.861648
       14        63363       0.7859       0.4368     0.927928
       15        58479       0.8301       0.4950     0.994209
       16        32991       0.8550       0.5299     1.060489
       17        28065       0.8762       0.5616     1.126770
       18        24010       0.8944       0.5902     1.193050
       19        18641       0.9085       0.6137     1.259331
       20        14535       0.9195       0.6329     1.325612
       21        11662       0.9283       0.6492     1.391892
       22         9991       0.9358       0.6637     1.458173
       23         8010       0.9419       0.6760     1.524453
       24         6556       0.9469       0.6864     1.590734
       25         5153       0.9508       0.6949     1.657014
       26         5185       0.9547       0.7039     1.723295
       27         3678       0.9575       0.7104     1.789576
       28         3404       0.9600       0.7168     1.855856
       29         2971       0.9623       0.7225     1.922137
       30         3146       0.9647       0.7287     1.988417
       31         2478       0.9665       0.7338     2.054698
       32         2023       0.9681       0.7381     2.120978
       33         1731       0.9694       0.7419     2.187259
       34         1798       0.9707       0.7459     2.253540
       35         1936       0.9722       0.7504     2.319820
       36         1394       0.9732       0.7538     2.386101
       37         1074       0.9741       0.7564     2.452381
       38         1393       0.9751       0.7599     2.518662
       39         1287       0.9761       0.7632     2.584943
       40         1305       0.9771       0.7667     2.651223

cat splitHaplotype.sh
#!/bin/sh

#  Path to Canu.

bin="/canu-2.2/bin"

#  Report paths.

echo ""
echo "Found perl:"
echo "  " `which perl`
echo "  " `perl --version | grep version`
echo ""
echo "Found java:"
echo "  " `which java`
echo "  " `java -showversion 2>&1 | head -n 1`
echo ""
echo "Found canu:"
echo "  " $bin/canu
echo "  " `$bin/canu -version`
echo ""

#  Environment for any object storage.

export CANU_OBJECT_STORE_CLIENT=
export CANU_OBJECT_STORE_CLIENT_UA=
export CANU_OBJECT_STORE_CLIENT_DA=
export CANU_OBJECT_STORE_NAMESPACE=
export CANU_OBJECT_STORE_PROJECT=

#  Discover the job ID to run, from either a grid environment variable and a
#  command line offset, or directly from the command line.
#
if [ x$CANU_LOCAL_JOB_ID = x -o x$CANU_LOCAL_JOB_ID = xundefined -o x$CANU_LOCAL_JOB_ID = x0 ]; then
  baseid=$1
  offset=0
else
  baseid=$CANU_LOCAL_JOB_ID
  offset=$1
fi
if [ x$offset = x ]; then
  offset=0
fi
if [ x$baseid = x ]; then
  echo Error: I need CANU_LOCAL_JOB_ID set, or a job index on the command line.
  exit
fi
jobid=`expr -- $baseid + $offset`
if [ x$baseid = x0 ]; then
  echo Error: jobid 0 is invalid\; I need CANU_LOCAL_JOB_ID set, or a job index on the command line.
  exit
fi
if [ x$CANU_LOCAL_JOB_ID = x ]; then
  echo Running job $jobid based on command line options.
else
  echo Running job $jobid based on CANU_LOCAL_JOB_ID=$CANU_LOCAL_JOB_ID and offset=$offset.
fi

if [ $jobid -gt 1 ]; then
  echo Error: Only 1 job, you asked for $jobid.
  exit 1
fi

#  If the unknown haplotype assignment exists, we're done.

if [ -e ./haplotype.log ] ; then
  echo Read to haplotype assignment already exists.
  exit 0
fi

#  Assign reads to haplotypes.

$bin/splitHaplotype \
  -cl 1000 \
  -threads 4 \
  -memory  8 \
  -R /scratch/cy0695.pacbio.kir.fastq.gz \
  -H ./0-kmers/haplotype-cy0161.meryl ./0-kmers/reads-cy0161.statistics ./haplotype-cy0161.fasta.gz \
  -H ./0-kmers/haplotype-cy0355.meryl ./0-kmers/reads-cy0355.statistics ./haplotype-cy0355.fasta.gz \
  -A ./haplotype-unknown.fasta.gz \
> haplotype.log.WORKING \
&& \
mv -f ./haplotype.log.WORKING ./haplotype.log

exit 0

brianwalenz commented 2 years ago

I think what happened is that the second haplotype has no mers unique to it. These are in 0-kmers/haplotype-cy0161.meryl and 0-kmers/haplotype-cy0355.meryl. You can confirm this by generating statistics for each with {path_to_canu}/bin/meryl statistics {hap}.meryl | head -n 20. The stats shown above were for the reads themselves, before filtering for unique-to-haplotype kmers (and they look fine).
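
Concretely, that check would look something like this, using the Canu path from your splitHaplotype.sh and the DB locations from the splitHaplotype command in it:

/canu-2.2/bin/meryl statistics haplotype/0-kmers/haplotype-cy0161.meryl | head -n 20
/canu-2.2/bin/meryl statistics haplotype/0-kmers/haplotype-cy0355.meryl | head -n 20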

What to do next, I don't know.

skoren commented 2 years ago

I'd suggest running Merqury (https://github.com/marbl/merqury) and getting statistics on the resulting k-mer databases as a sanity check on Canu's run. It also produces more intermediate databases than Canu does, which gives you more counts to check. The resulting databases are compatible with Canu, so you could still run splitHaplotype but give it the DBs from Merqury.
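
As a rough sketch of that sanity check, assuming Merqury's trio helper script hapmers.sh (check the Merqury docs for the exact usage on your install; k=21 here is just a commonly used value, not one chosen for this dataset):

# k-mer DBs from the parental and F1 HiFi reads; meryl ships with both Canu and Merqury
meryl k=21 count output cy0161.meryl cy0161.pacbio.kir.fastq.gz
meryl k=21 count output cy0355.meryl cy0355.pacbio.kir.fastq.gz
meryl k=21 count output cy0695.meryl cy0695.pacbio.kir.fastq.gz

# haplotype-specific (inherited) hapmer DBs plus summary statistics
$MERQURY/trio/hapmers.sh cy0161.meryl cy0355.meryl cy0695.meryl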

If indeed there are no k-mers unique to one parent, that implies an issue with the parental sequencing or incorrect parentage. If the parental data is very low coverage (<15x) it may help to get more sequencing. Otherwise, it's likely you can't use these parents to trio-bin the data.

skoren commented 2 years ago

Idle