Closed shaferab closed 2 years ago
It's strange to have 4000 jobs for such a small genome and low coverage. Can you post the report file to see what the data/assembly look like?
Consensus typically does have one larger partition because it gets all the small contigs, which run faster individually. The steps are all atomic, so if a job fails you'd have to restart the consensus from scratch for that job. You can modify the partitioning parameters to create more jobs, but then you'd have to re-run all the partitions. I'd let the job run and see what progress it is making.
Thanks, yeah I have been monitoring progress and asks; I'm confident I'll get 56/57 finished.
I should clarify perhaps; canu submitted 4300 jobs for the overlapper phase - is that more common? I had previously trimmed the reads. Here is the report.
[shaferab@gra-login1 Sc-pacbio_2nodes]$ cat Sc.report
[TRIMMING/READS]
--
-- In sequence store './Sc.seqStore':
-- Found 27866658 reads.
-- Found 526762853342 bases (195.09 times coverage).
-- Histogram of corrected reads:
--
-- G=526762853342 sum of || length num
-- NG length index lengths || range seqs
-- ----- ------------ --------- ------------ || ------------------- -------
-- 00010 146418 293178 52676307819 || 1156-12583 9973556|--------------------------------------------
-- 00020 87797 754679 105352599017 || 12584-24011 14514741---------------------------------------------------------------
-- 00030 43464 1620381 158028877955 || 24012-35439 1327489|------
-- 00040 24485 3282985 210705147645 || 35440-46867 557061|---
-- 00050 18677 5812040 263381441119 || 46868-58295 302656|--
-- 00060 16309 8845159 316057715387 || 58296-69723 204114|-
-- 00070 14665 12257266 368734001794 || 69724-81151 155621|-
-- 00080 13262 16034317 421410284641 || 81152-92579 127033|-
-- 00090 11513 20264116 474086572273 || 92580-104007 108309|-
-- 00100 1156 27866657 526762853342 || 104008-115435 94085|-
-- 001.000x 27866658 526762853342 || 115436-126863 83503|-
-- || 126864-138291 75657|-
-- || 138292-149719 69024|-
-- || 149720-161147 63564|-
-- || 161148-172575 55134|-
-- || 172576-184003 46713|-
-- || 184004-195431 37359|-
-- || 195432-206859 28051|-
-- || 206860-218287 19543|-
-- || 218288-229715 11511|-
-- || 229716-241143 5946|-
-- || 241144-252571 2667|-
-- || 252572-263999 1206|-
-- || 264000-275427 614|-
-- || 275428-286855 388|-
-- || 286856-298283 308|-
-- || 298284-309711 242|-
-- || 309712-321139 160|-
-- || 321140-332567 119|-
-- || 332568-343995 81|-
-- || 343996-355423 58|-
-- || 355424-366851 34|-
-- || 366852-378279 35|-
-- || 378280-389707 19|-
-- || 389708-401135 12|-
-- || 401136-412563 15|-
-- || 412564-423991 10|-
-- || 423992-435419 5|-
-- || 435420-446847 4|-
-- || 446848-458275 2|-
-- || 458276-469703 2|-
-- || 469704-481131 1|-
-- || 481132-492559 2|-
-- || 492560-503987 0|
-- || 503988-515415 3|-
-- || 515416-526843 0|
-- || 526844-538271 0|
-- || 538272-549699 0|
-- || 549700-561127 0|
-- || 561128-572555 1|-
--
[UNITIGGING/READS]
--
-- In sequence store './Sc.seqStore':
-- Found 27866658 reads.
-- Found 526762853342 bases (195.09 times coverage).
-- Histogram of corrected-trimmed reads:
--
-- G=526762853342 sum of || length num
-- NG length index lengths || range seqs
-- ----- ------------ --------- ------------ || ------------------- -------
-- 00010 146418 293178 52676307819 || 1156-12583 9973556|--------------------------------------------
-- 00020 87797 754679 105352599017 || 12584-24011 14514741---------------------------------------------------------------
-- 00030 43464 1620381 158028877955 || 24012-35439 1327489|------
-- 00040 24485 3282985 210705147645 || 35440-46867 557061|---
-- 00050 18677 5812040 263381441119 || 46868-58295 302656|--
-- 00060 16309 8845159 316057715387 || 58296-69723 204114|-
-- 00070 14665 12257266 368734001794 || 69724-81151 155621|-
-- 00080 13262 16034317 421410284641 || 81152-92579 127033|-
-- 00090 11513 20264116 474086572273 || 92580-104007 108309|-
-- 00100 1156 27866657 526762853342 || 104008-115435 94085|-
-- 001.000x 27866658 526762853342 || 115436-126863 83503|-
-- || 126864-138291 75657|-
-- || 138292-149719 69024|-
-- || 149720-161147 63564|-
-- || 161148-172575 55134|-
-- || 172576-184003 46713|-
-- || 184004-195431 37359|-
-- || 195432-206859 28051|-
-- || 206860-218287 19543|-
-- || 218288-229715 11511|-
-- || 229716-241143 5946|-
-- || 241144-252571 2667|-
-- || 252572-263999 1206|-
-- || 264000-275427 614|-
-- || 275428-286855 388|-
-- || 286856-298283 308|-
-- || 298284-309711 242|-
-- || 309712-321139 160|-
-- || 321140-332567 119|-
-- || 332568-343995 81|-
-- || 343996-355423 58|-
-- || 355424-366851 34|-
-- || 366852-378279 35|-
-- || 378280-389707 19|-
-- || 389708-401135 12|-
-- || 401136-412563 15|-
-- || 412564-423991 10|-
-- || 423992-435419 5|-
-- || 435420-446847 4|-
-- || 446848-458275 2|-
-- || 458276-469703 2|-
-- || 469704-481131 1|-
-- || 481132-492559 2|-
-- || 492560-503987 0|
-- || 503988-515415 3|-
-- || 515416-526843 0|
-- || 526844-538271 0|
-- || 538272-549699 0|
-- || 549700-561127 0|
-- || 561128-572555 1|-
--
[UNITIGGING/MERS]
--
-- 22-mers Fraction
-- Occurrences NumMers Unique Total
-- 1- 1 0 0.0000 0.0000
-- 2- 2 2885432347 ******************************************************** 0.2116 0.0191
-- 3- 4 3543949771 ********************************************************************** 0.3629 0.0396
-- 5- 7 2515499362 ************************************************* 0.5508 0.0771
-- 8- 11 1466841065 **************************** 0.6916 0.1205
-- 12- 16 858693360 **************** 0.7800 0.1615
-- 17- 22 504164170 ********* 0.8345 0.1977
-- 23- 29 301191793 ***** 0.8674 0.2276
-- 30- 37 204394007 **** 0.8877 0.2520
-- 38- 46 162019995 *** 0.9019 0.2740
-- 47- 56 155273537 *** 0.9134 0.2963
-- 57- 67 194683497 *** 0.9248 0.3233
-- 68- 79 228981723 **** 0.9393 0.3647
-- 80- 92 181699269 *** 0.9560 0.4205
-- 93- 106 105088746 ** 0.9688 0.4704
-- 107- 121 64620907 * 0.9761 0.5035
-- 122- 137 51994552 * 0.9807 0.5273
-- 138- 154 45780784 0.9845 0.5495
-- 155- 172 34479164 0.9878 0.5714
-- 173- 191 23918763 0.9903 0.5896
-- 192- 211 18112832 0.9920 0.6038
-- 212- 232 14311213 0.9933 0.6157
-- 233- 254 11004995 0.9943 0.6261
-- 255- 277 8590246 0.9951 0.6349
-- 278- 301 6887950 0.9958 0.6423
-- 302- 326 5556928 0.9963 0.6489
-- 327- 352 4531961 0.9967 0.6546
-- 353- 379 3754143 0.9970 0.6597
-- 380- 407 3139690 0.9973 0.6642
-- 408- 436 2666075 0.9975 0.6682
-- 437- 466 2287324 0.9977 0.6719
-- 467- 497 1996071 0.9978 0.6753
-- 498- 529 1760775 0.9980 0.6785
-- 530- 562 1563917 0.9981 0.6815
-- 563- 596 1398309 0.9982 0.6843
-- 597- 631 1246876 0.9983 0.6870
-- 632- 667 1117426 0.9984 0.6895
-- 668- 704 1000212 0.9985 0.6919
-- 705- 742 906108 0.9986 0.6942
-- 743- 781 831236 0.9987 0.6963
-- 782- 821 769439 0.9987 0.6984
--
-- 0 (max occurrences)
-- 302249381021 (total mers, non-unique)
-- 13638973170 (distinct mers, non-unique)
-- 0 (unique mers)
[UNITIGGING/OVERLAPS]
-- category reads % read length feature size or coverage analysis
-- ---------------- ------- ------- ---------------------- ------------------------ --------------------
-- middle-missing 518623 1.86 16285.48 +- 19560.75 5878.19 +- 8552.03 (bad trimming)
-- middle-hump 95635 0.34 54461.70 +- 34489.89 45865.76 +- 33997.00 (bad trimming)
-- no-5-prime 1337085 4.80 11999.74 +- 14697.89 7278.59 +- 9648.14 (bad trimming)
-- no-3-prime 1203484 4.32 12992.64 +- 14319.36 8560.03 +- 10656.44 (bad trimming)
--
-- low-coverage 6691132 24.01 11068.68 +- 7303.62 37.94 +- 15.70 (easy to assemble, potential for lower quality consensus)
-- unique 298216 1.07 10139.35 +- 3018.58 144.16 +- 63.18 (easy to assemble, perfect, yay)
-- repeat-cont 507529 1.82 10329.98 +- 1852.19 1144.77 +- 803.38 (potential for consensus errors, no impact on assembly)
-- repeat-dove 5287 0.02 16040.81 +- 1680.89 922.50 +- 590.18 (hard to assemble, likely won't assemble correctly or even at all)
--
-- span-repeat 1652304 5.93 10899.87 +- 4466.47 3641.59 +- 3372.03 (read spans a large repeat, usually easy to assemble)
-- uniq-repeat-cont 1441999 5.17 10280.91 +- 3428.68 (should be uniquely placed, low potential for consensus errors, no impact on assembly)
-- uniq-repeat-dove 199969 0.72 13515.45 +- 3845.77 (will end contigs, potential to misassemble)
-- uniq-anchor 58792 0.21 11129.65 +- 2126.80 3941.75 +- 2525.74 (repeat read, with unique section, probable bad read)
[UNITIGGING/ADJUSTMENT]
-- No report available.
[UNITIGGING/ERROR RATES]
--
-- ERROR RATES
-- -----------
-- --------threshold------
-- 12510315 fraction error fraction percent
-- samples (1e-5) error error
-- -------------------------- -------- --------
-- command line (-eg) -> 3500.00 3.5000% (enabled)
-- command line (-ef) -> -----.-- ---.----%
-- command line (-eM) -> 3500.00 3.5000% (enabled)
-- mean + std.dev 358.55 +- 12 * 816.78 -> 10159.91 10.1599%
-- median + mad 0.00 +- 12 * 0.00 -> 0.00 0.0000%
-- 90th percentile -> 1531.00 1.5310%
--
-- BEST EDGE FILTERING
-- -------------------
-- At graph threshold 3.5000%, reads:
-- available to have edges: 19514378
-- with at least one edge: 2149552
--
-- At max threshold 3.5000%, reads:
-- available to have edges: 19514378
-- with at least one edge: 2149552
--
-- At tight threshold 1.5310%, reads with:
-- both edges below error threshold: 1319401 (80.00% minReadsBest threshold = 1719641)
-- one edge above error threshold: 631796
-- both edges above error threshold: 198355
-- at least one edge: 2149552
--
-- At loose threshold 3.5000%, reads with:
-- both edges below error threshold: 2149552 (80.00% minReadsBest threshold = 1719641)
-- one edge above error threshold: 0
-- both edges above error threshold: 0
-- at least one edge: 2149552
--
--
-- INITIAL EDGES
-- -------- ----------------------------------------
-- 8280442 reads are contained
-- 17963337 reads have no best edges (singleton)
-- 18305 reads have only one best edge (spur)
-- 8741 are mutual best
-- 1604574 reads have two best edges
-- 141631 have one mutual best edge
-- 1411337 have two mutual best edges
--
--
-- FINAL EDGES
-- -------- ----------------------------------------
-- 8280442 reads are contained
-- 17983620 reads have no best edges (singleton)
-- 15224 reads have only one best edge (spur)
-- 8918 are mutual best
-- 1587372 reads have two best edges
-- 134864 have one mutual best edge
-- 1411110 have two mutual best edges
--
--
-- EDGE FILTERING
-- -------- ------------------------------------------
-- 0 reads are ignored
-- 922549 reads have a gap in overlap coverage
-- 11092 reads have lopsided best edges
[UNITIGGING/CONTIGS]
-- Found, in version 1, after unitig construction:
-- contigs: 35593 sequences, total length 2784828001 bp (including 34946 repeats of total length 813887353 bp).
-- bubbles: 35145 sequences, total length 848801757 bp.
-- unassembled: 18178076 sequences, total length 220604741407 bp.
--
-- Contig sizes based on genome size 2.7gbp:
--
-- NG (bp) LG (contigs) sum (bp)
-- ---------- ------------ ----------
-- 10 1153548 155 270770228
-- 20 752251 450 540711693
-- 30 533999 878 810266116
-- 40 381898 1478 1080015113
-- 50 261277 2337 1350257365
-- 60 174190 3614 1620014777
-- 70 117024 5519 1890102305
-- 80 69357 8480 2160025284
-- 90 28037 15121 2430008182
-- 100 13722 28410 2700010540
The report says 195x not 45x that you had mentioned originally. That doesn't look like HiFi data though, it's way too long and there's almost no peak in the histogram. The assembly is also quite fragmented. It also looks like you're running with higher error rate and coverage than the defaults for HiFi data, what's your full command?
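The figures in that reply can be checked against the report itself. Below is a minimal Python sketch that uses only numbers copied from the report above; the variable names are illustrative and not part of canu.

```python
# Values copied from the Sc.report output above.
genome_size = 2_700_000_000          # genomeSize=2.7g
total_reads = 27_866_658             # "Found 27866658 reads."
total_bases = 526_762_853_342        # "Found 526762853342 bases"
singleton_reads = 17_983_620         # FINAL EDGES: reads with no best edges
unassembled_bases = 220_604_741_407  # CONTIGS: unassembled total length

# Derived quantities.
coverage = total_bases / genome_size
mean_read_length = total_bases / total_reads
singleton_fraction = singleton_reads / total_reads
unassembled_fraction = unassembled_bases / total_bases

print(f"coverage:             {coverage:.1f}x")
print(f"mean read length:     {mean_read_length:.0f} bp")
print(f"singleton reads:      {singleton_fraction:.1%}")
print(f"unassembled sequence: {unassembled_fraction:.1%}")
```

This gives roughly 195x coverage with a mean read length near 18.9 kb, and about 65% of reads have no best edge, which fits the observation that the input does not behave like ~45x of HiFi data.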
Okay. Here is the data output from the sequencing facility:
Name                   HiFi_Reads  HIFi_BP         HiFi_mean_len
SC7_HTpool_42PM_CELL2  1,316,217   19,255,575,399  14,629
SC7_HTpool_50PM_CELL1  1,339,709   19,554,913,662  14,596
SC7_HTpool_33PM_CELL3  1,780,469   26,160,968,655  14,693
SC7_HTpool_33PM_CELL4  1,810,309   26,597,809,556  14,692
SC7_HTpool_33PM_CELL5  1,654,212   24,236,775,031  14,651
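Summing the facility's per-cell HiFi yields and comparing them with what canu loaded shows the size of the mismatch. This is a small Python sketch using only the numbers from the facility table and the report; the dictionary layout and variable names are illustrative.

```python
GENOME_SIZE = 2_700_000_000  # genomeSize=2.7g

# (HiFi reads, HiFi bases) per SMRT cell, from the facility table.
cells = {
    "SC7_HTpool_42PM_CELL2": (1_316_217, 19_255_575_399),
    "SC7_HTpool_50PM_CELL1": (1_339_709, 19_554_913_662),
    "SC7_HTpool_33PM_CELL3": (1_780_469, 26_160_968_655),
    "SC7_HTpool_33PM_CELL4": (1_810_309, 26_597_809_556),
    "SC7_HTpool_33PM_CELL5": (1_654_212, 24_236_775_031),
}

hifi_reads = sum(reads for reads, _ in cells.values())
hifi_bases = sum(bases for _, bases in cells.values())
hifi_coverage = hifi_bases / GENOME_SIZE

# What canu reported in Sc.seqStore.
store_reads = 27_866_658
store_bases = 526_762_853_342

print(f"facility HiFi: {hifi_reads:,} reads, {hifi_coverage:.1f}x")
print(f"seqStore:      {store_reads:,} reads, {store_bases / GENOME_SIZE:.1f}x")
print(f"read excess:   {store_reads / hifi_reads:.1f}x more reads than HiFi")
```

The facility totals come to about 7.9 million reads and roughly 43x, consistent with the "~45x" in the original post, while the sequence store holds around 3.5 times as many reads.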
I think we might have read in the CLR or additional data somehow. Sounds like I should restart from the beginning? Before that, here is the command:
module load gcc/9.3.0
module load canu/2.2
canu \
  -p Sc -d Sc-pacbio_2nodes \
  genomeSize=2.7g \
  -pacbio-hifi HiFi_reads/trim-cell[1-5]* \
  correctedErrorRate=0.035 \
  utgOvlErrorRate=0.065 \
  trimReadsCoverage=2 \
  trimReadsOverlap=500 \
  -maxMemory=128G \
  -maxThreads=64 \
  -executiveMemory=16 \
  gridOptions="--account=rrg-shaferab --cpus-per-task=32 --time=05:59:00"
Yeah, I'd restart from the beginning. I think what you have is the Sequel IIe output, which contains one read per ZMW even if it's a CLR read. Filter the files for Q20 reads first. You then also shouldn't need the correctedErrorRate=0.035, utgOvlErrorRate=0.065, trimReadsCoverage=2, and trimReadsOverlap=500 parameters.
For Q20 reads - that extracts reads that have an average Q>20, so I can use any QC program?
I much appreciate your quick and detailed feedback; I imagine this will run much smoother once I get rid of the garbage.
I've used the PacBio dataset command before: https://www.pacb.com/wp-content/uploads/SMRT_Tools_Reference_Guide_v10.1.pdf (see page 15). I assume other tools would work. If you have the BAMs for the cells, there should be a tag indicating QV for each read as well that you can use to filter.
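One way to filter on that per-read tag is sketched below in Python. It assumes the tag is `rq` (predicted read accuracy as a float between 0 and 1, as described in the PacBio BAM format) and that records arrive as SAM text, for example from `samtools view -h`. The function names are hypothetical, and this is not the `dataset` command mentioned above.

```python
def q_to_accuracy(q):
    """Convert a Phred-scaled quality to expected accuracy (Q20 is 0.99)."""
    return 1.0 - 10.0 ** (-q / 10.0)


def read_accuracy(sam_line):
    """Return the value of the rq:f: tag of a SAM record, or None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    for tag in fields[11:]:  # optional tags follow the 11 mandatory fields
        if tag.startswith("rq:f:"):
            return float(tag[len("rq:f:"):])
    return None


def filter_sam(lines, min_q=20):
    """Yield header lines plus records whose rq meets the min_q threshold."""
    threshold = q_to_accuracy(min_q)
    for line in lines:
        if line.startswith("@"):  # header lines pass through unchanged
            yield line
            continue
        rq = read_accuracy(line)
        if rq is not None and rq >= threshold:
            yield line


# Tiny in-memory demonstration with two made-up unaligned records.
example = [
    "@HD\tVN:1.5\tSO:unknown\n",
    "m1/1/ccs\t4\t*\t0\t255\t*\t*\t0\t0\tACGT\t~~~~\tnp:i:12\trq:f:0.999\n",
    "m1/2/ccs\t4\t*\t0\t255\t*\t*\t0\t0\tACGT\t!!!!\tnp:i:1\trq:f:0.85\n",
]
kept = list(filter_sam(example))
print(len(kept), "of", len(example), "lines kept")
```

In practice the generator could be fed `sys.stdin` and its output written to `sys.stdout`, placed between `samtools view -h` and a second `samtools view -b` call. Records without an `rq` tag are dropped here, which is a conservative choice and may not suit every dataset.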
Idle, input was not HiFi data and likely needed filtering.
Hi, I have ~45x HiFi data (2.7 Gb genome). I use the grid option but often run individual sbatch scripts due to some incompatibilities with Slurm (i.e. submitting 4000 jobs or OOM events). It's worked well and I'm on the consensus phase; however, one of my partition files is 38G, more than double the next closest. This variation has led me to run individual consensus jobs for 57 arrays. This particular file is getting OOMs but is now running at 32 CPUs @ 8G per CPU.
I guess my question is does this seem out of the ordinary? And it's not clear to me if I can pick up from XXX.cns.WORKING if this times out? And if so, is that from the general canu script or the one specific to the module (below). Much appreciated.