Closed TrentPrall closed 2 years ago
Yes, HiFi reads should work for binning. Run it the same way you would with Illumina reads, but specify the parental HiFi reads for the haplotyping. I'd run haplotyping only, to bin the ONT and HiFi reads separately (see the documentation notes on how to run trio binning for HiFi data: https://canu.readthedocs.io/en/latest/quick-start.html#trio-binning-assembly).
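For reference, a haplotyping-only run with HiFi parental reads might look like the sketch below. The file names, haplotype labels, and genome size are placeholders, not from this thread; check the linked quick-start for the exact invocation.

```shell
# Sketch of a haplotype-only (trio-binning) run with HiFi parents.
# All file names, the haplotype names (Mat/Pat), and genomeSize are
# hypothetical placeholders.
canu -haplotype \
  -p asm -d asm-trio \
  genomeSize=1m \
  -haplotypeMat mat-hifi.fastq.gz \
  -haplotypePat pat-hifi.fastq.gz \
  -pacbio-hifi child-hifi.fastq.gz
```

With `-haplotype`, Canu stops after writing the binned reads (one fastq per haplotype plus an "unknown" bin), which you can then assemble separately.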
Thank you for the info. I have tried following the workaround instructions for running Canu trio binning with PacBio HiFi data, but when I run the following command no scripts are created and the data isn't binned. Any thoughts on what may be occurring?
docker run -it -u 0 -v $(pwd):/scratch -w /scratch staphb/canu:latest \
canu -haplotype \
-p cy0695.pacbio \
-genomeSize=1m \
-haplotypecy0355 cy0355.pacbio.kir.fastq.gz \
-haplotypecy0161 cy0161.pacbio.kir.fastq.gz \
-pacbio-raw cy0695.pacbio.kir.fastq.gz
Here's the error:
-- BEGIN HAPLOTYPING
--
-- Running jobs. First attempt out of 2.
----------------------------------------
-- Starting 'hap' concurrent execution on Wed Feb 16 05:10:43 2022 with 4666.686 GB free disk space (1 processes; 10 concurrently)
cd haplotype
./splitHaplotype.sh 1 > ./splitHaplotype.000001.out 2>&1
-- Finished on Wed Feb 16 05:10:44 2022 (one second) with 4666.686 GB free disk space
----------------------------------------
--
-- Haplotyping jobs failed, retry.
-- job 1 FAILED.
--
--
-- Running jobs. Second attempt out of 2.
----------------------------------------
-- Starting 'hap' concurrent execution on Wed Feb 16 05:10:44 2022 with 4666.686 GB free disk space (1 processes; 10 concurrently)
cd haplotype
./splitHaplotype.sh 1 > ./splitHaplotype.000001.out 2>&1
-- Finished on Wed Feb 16 05:10:44 2022 (in the blink of an eye) with 4666.686 GB free disk space
----------------------------------------
--
-- Haplotyping jobs failed, tried 2 times, giving up.
-- job 1 FAILED.
--
ABORT:
ABORT: canu 2.2
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting. If that doesn't work, ask for help.
ABORT:
ABORT: Disk space available: 4666.686 GB
ABORT:
ABORT: Last 50 lines of the relevant log file (haplotype/haplotype.log.WORKING):
ABORT:
ABORT:
Thanks again.
What's in the splitHaplotype.000001.out log?
Here is the log:
cat splitHaplotype.000001.out
Found perl:
/usr/bin/perl
This is perl 5, version 22, subversion 1 (v5.22.1) built for x86_64-linux-gnu-thread-multi
Found java:
/usr/bin/java
openjdk version "1.8.0_292"
Found canu:
/canu-2.2/bin/canu
canu 2.2
Running job 1 based on command line options.
--
-- Loading haplotype data, using up to 4 GB memory for each.
--
For 69656 distinct 15-mers (with 11 bits used for indexing and 19 bits for tags):
0.000 GB memory for kmer indices - 2048 elements 64 bits wide)
0.000 GB memory for kmer tags - 69656 elements 19 bits wide)
0.000 GB memory for kmer values - 69656 elements 7 bits wide)
0.000 GB memory
Will load 69656 kmers. Skipping 291816 (too low) and 0 (too high) kmers.
Allocating space for 85784 suffixes of 19 bits each -> 1629896 bits (0.000 GB) in blocks of 32.000 MB
85784 values of 7 bits each -> 600488 bits (0.000 GB) in blocks of 32.000 MB
Loaded 69656 kmers. Skipped 291816 (too low) and 0 (too high) kmers.
-- loaded 69656 kmers.
For 0 distinct 15-mers (with 0 bits used for indexing and 30 bits for tags):
0.000 GB memory for kmer indices - 1 elements 64 bits wide)
0.000 GB memory for kmer tags - 0 elements 30 bits wide)
0.000 GB memory for kmer values - 0 elements 0 bits wide)
2147483648.000 GB memory
-- loaded 0 kmers.
-- Data loaded.
--
-- Processing reads in batches of 100 reads each.
--
Failed with 'Segmentation fault'; backtrace (libbacktrace):
Failed with 'Segmentation fault'; backtrace (libbacktrace):
utility/src/utility/system-stackTrace.C::82 in _Z17AS_UTL_catchCrashiP7siginfoPv()
Segmentation fault (core dumped)
It looks like one of your haplotypes is empty (or at least had no kmers). That might be an issue with the input data. Post the splitHaplotype.*.sh script and the contents of your 0-kmers directory, specifically the starts of the *statistics files.
head -n 50 reads-cy0161.statistics
Number of 15-mers that are:
unique 311436 (exactly one instance of the kmer is in the input)
distinct 1287991 (non-redundant kmer sequences in the input)
present 15635665 (...)
missing 1072453833 (non-redundant kmer sequences not in the input)
number of cumulative cumulative presence
distinct fraction fraction in dataset
frequency kmers distinct total (1e-6)
--------- ------------ ------------ ------------ ------------
1 311436 0.2418 0.0199 0.063956
2 32440 0.2670 0.0241 0.127913
3 18323 0.2812 0.0276 0.191869
4 22164 0.2984 0.0333 0.255825
5 34147 0.3249 0.0442 0.319782
6 40611 0.3565 0.0598 0.383738
7 47606 0.3934 0.0811 0.447694
8 64480 0.4435 0.1141 0.511651
9 66643 0.4952 0.1524 0.575607
10 77725 0.5556 0.2021 0.639563
11 82594 0.6197 0.2602 0.703520
12 78845 0.6809 0.3207 0.767476
13 72853 0.7375 0.3813 0.831432
14 68789 0.7909 0.4429 0.895389
15 50248 0.8299 0.4911 0.959345
16 32359 0.8550 0.5242 1.023302
17 28222 0.8769 0.5549 1.087258
18 20734 0.8930 0.5788 1.151214
19 14685 0.9044 0.5966 1.215171
20 11536 0.9134 0.6114 1.279127
21 8254 0.9198 0.6225 1.343083
22 7088 0.9253 0.6324 1.407040
23 5526 0.9296 0.6406 1.470996
24 5307 0.9337 0.6487 1.534952
25 5573 0.9380 0.6576 1.598909
26 5957 0.9427 0.6675 1.662865
27 5524 0.9470 0.6771 1.726821
28 5229 0.9510 0.6864 1.790778
29 5436 0.9552 0.6965 1.854734
30 4676 0.9589 0.7055 1.918690
31 3212 0.9614 0.7119 1.982647
32 2673 0.9634 0.7173 2.046603
33 2719 0.9655 0.7231 2.110559
34 2131 0.9672 0.7277 2.174516
35 1785 0.9686 0.7317 2.238472
36 1176 0.9695 0.7344 2.302428
37 1217 0.9704 0.7373 2.366385
38 1087 0.9713 0.7399 2.430341
39 930 0.9720 0.7423 2.494297
head -n 50 reads-cy0355.statistics
Number of 15-mers that are:
unique 319722 (exactly one instance of the kmer is in the input)
distinct 1322622 (non-redundant kmer sequences in the input)
present 15087376 (...)
missing 1072419202 (non-redundant kmer sequences not in the input)
number of cumulative cumulative presence
distinct fraction fraction in dataset
frequency kmers distinct total (1e-6)
--------- ------------ ------------ ------------ ------------
1 319722 0.2417 0.0212 0.066281
2 36385 0.2692 0.0260 0.132561
3 28710 0.2910 0.0317 0.198842
4 34159 0.3168 0.0408 0.265122
5 50675 0.3551 0.0576 0.331403
6 66782 0.4056 0.0841 0.397683
7 58130 0.4495 0.1111 0.463964
8 56072 0.4919 0.1408 0.530245
9 63941 0.5403 0.1790 0.596525
10 67934 0.5916 0.2240 0.662806
11 62863 0.6392 0.2698 0.729086
12 66481 0.6894 0.3227 0.795367
13 64182 0.7380 0.3780 0.861648
14 63363 0.7859 0.4368 0.927928
15 58479 0.8301 0.4950 0.994209
16 32991 0.8550 0.5299 1.060489
17 28065 0.8762 0.5616 1.126770
18 24010 0.8944 0.5902 1.193050
19 18641 0.9085 0.6137 1.259331
20 14535 0.9195 0.6329 1.325612
21 11662 0.9283 0.6492 1.391892
22 9991 0.9358 0.6637 1.458173
23 8010 0.9419 0.6760 1.524453
24 6556 0.9469 0.6864 1.590734
25 5153 0.9508 0.6949 1.657014
26 5185 0.9547 0.7039 1.723295
27 3678 0.9575 0.7104 1.789576
28 3404 0.9600 0.7168 1.855856
29 2971 0.9623 0.7225 1.922137
30 3146 0.9647 0.7287 1.988417
31 2478 0.9665 0.7338 2.054698
32 2023 0.9681 0.7381 2.120978
33 1731 0.9694 0.7419 2.187259
34 1798 0.9707 0.7459 2.253540
35 1936 0.9722 0.7504 2.319820
36 1394 0.9732 0.7538 2.386101
37 1074 0.9741 0.7564 2.452381
38 1393 0.9751 0.7599 2.518662
39 1287 0.9761 0.7632 2.584943
40 1305 0.9771 0.7667 2.651223
cat splitHaplotype.sh
#!/bin/sh
# Path to Canu.
bin="/canu-2.2/bin"
# Report paths.
echo ""
echo "Found perl:"
echo " " `which perl`
echo " " `perl --version | grep version`
echo ""
echo "Found java:"
echo " " `which java`
echo " " `java -showversion 2>&1 | head -n 1`
echo ""
echo "Found canu:"
echo " " $bin/canu
echo " " `$bin/canu -version`
echo ""
# Environment for any object storage.
export CANU_OBJECT_STORE_CLIENT=
export CANU_OBJECT_STORE_CLIENT_UA=
export CANU_OBJECT_STORE_CLIENT_DA=
export CANU_OBJECT_STORE_NAMESPACE=
export CANU_OBJECT_STORE_PROJECT=
# Discover the job ID to run, from either a grid environment variable and a
# command line offset, or directly from the command line.
#
if [ x$CANU_LOCAL_JOB_ID = x -o x$CANU_LOCAL_JOB_ID = xundefined -o x$CANU_LOCAL_JOB_ID = x0 ]; then
baseid=$1
offset=0
else
baseid=$CANU_LOCAL_JOB_ID
offset=$1
fi
if [ x$offset = x ]; then
offset=0
fi
if [ x$baseid = x ]; then
echo Error: I need CANU_LOCAL_JOB_ID set, or a job index on the command line.
exit
fi
jobid=`expr -- $baseid + $offset`
if [ x$baseid = x0 ]; then
echo Error: jobid 0 is invalid\; I need CANU_LOCAL_JOB_ID set, or a job index on the command line.
exit
fi
if [ x$CANU_LOCAL_JOB_ID = x ]; then
echo Running job $jobid based on command line options.
else
echo Running job $jobid based on CANU_LOCAL_JOB_ID=$CANU_LOCAL_JOB_ID and offset=$offset.
fi
if [ $jobid -gt 1 ]; then
echo Error: Only 1 job, you asked for $jobid.
exit 1
fi
# If the unknown haplotype assignment exists, we're done.
if [ -e ./haplotype.log ] ; then
  echo Read to haplotype assignment already exists.
  exit 0
fi
# Assign reads to haplotypes.
$bin/splitHaplotype \
  -cl 1000 \
  -threads 4 \
  -memory 8 \
  -R /scratch/cy0695.pacbio.kir.fastq.gz \
  -H ./0-kmers/haplotype-cy0161.meryl ./0-kmers/reads-cy0161.statistics ./haplotype-cy0161.fasta.gz \
  -H ./0-kmers/haplotype-cy0355.meryl ./0-kmers/reads-cy0355.statistics ./haplotype-cy0355.fasta.gz \
  -A ./haplotype-unknown.fasta.gz \
> haplotype.log.WORKING \
&& \
mv -f ./haplotype.log.WORKING ./haplotype.log
exit 0
I think what happened is that the second haplotype has no mers unique to it. These are in 0-kmers/haplotype-cy0161.meryl and 0-kmers/haplotype-cy0355.meryl. You can confirm this by generating statistics for each with {path_to_canu}/bin/meryl statistics {hap}.meryl | head -n 20. The stats shown before were for the reads themselves, before filtering for unique-to-haplotype kmers (and they look fine).
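To make that check concrete, here is a sketch using the Canu binary path from the script above; the database names match this run, but adjust paths as needed.

```shell
# Inspect the haplotype-specific kmer databases produced by haplotyping.
# An empty (or near-empty) database means no parent-unique kmers survived
# filtering, which would explain the "loaded 0 kmers" line in the log.
cd haplotype/0-kmers
/canu-2.2/bin/meryl statistics haplotype-cy0161.meryl | head -n 20
/canu-2.2/bin/meryl statistics haplotype-cy0355.meryl | head -n 20
```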
What to do next, I don't know.
I'd suggest running Merqury (https://github.com/marbl/merqury) and getting statistics on the resulting k-mer databases as a sanity check on Canu's run. It also produces more intermediate databases than Canu, which you can use to check counts. The resulting databases are compatible with Canu, so you could still run splitHaplotype but give it the DBs from Merqury.
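A hedged sketch of that Merqury check follows; the k value and file names are assumptions (see the Merqury README for exact usage on your data).

```shell
# Build kmer databases for each parent and the child with meryl.
# k=21 is a commonly used default for mammalian-scale genomes; the file
# names here are hypothetical placeholders.
meryl count k=21 output mat.meryl mat-hifi.fastq.gz
meryl count k=21 output pat.meryl pat-hifi.fastq.gz
meryl count k=21 output child.meryl child-hifi.fastq.gz

# Derive haplotype-specific (inherited) kmer sets; hapmers.sh ships
# with Merqury under trio/, with $MERQURY pointing at the install dir.
$MERQURY/trio/hapmers.sh mat.meryl pat.meryl child.meryl
```

If the resulting hap-mer databases are non-empty, they can be handed to splitHaplotype in place of Canu's own 0-kmers databases.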
If indeed there are no k-mers unique to one parent, that implies an issue with the parental sequencing or incorrect parentage. If the parental data is very low coverage (<15x) it may help to get more sequencing. Otherwise, it's likely you can't use these parents to trio-bin the data.
Hello,
I have a trio dataset with 30x coverage of ONT long reads and 30x coverage of PacBio HiFi data for all three samples in the trio. Is it possible to bin the F1 ONT reads and/or F1 HiFi reads by substituting parental HiFi reads in place of parental Illumina reads? If so, how would you go about running this?
Forgive me if this has already been addressed but I haven't seen an answer in the issues/documentation.
Thank you.