Here is debug output if it helps -- tried with a new set of reads (37x, error-free):
$ abruijn.py --debug reads.37x.bedtools.fasta out 37
[17:00:54] INFO: Running ABruijn
[17:00:55] INFO: Assembling reads
[17:00:55] DEBUG: Reading FASTA
[17:00:55] DEBUG: Hard threshold set to 3
[17:00:55] INFO: Indexing kmers (1/2):
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[17:00:58] INFO: Indexing kmers (2/2):
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[17:00:59] DEBUG: Genome size estimate: 48600
[17:00:59] WARNING: Unable to choose minimum kmer count cutoff. Check if the coverage parameter is correct. Running with default parameter t = 4
[17:00:59] DEBUG: Filtered 0 repetitive kmers
[17:00:59] DEBUG: Estimated minimum kmer coverage: 4, 98402 unique kmers selected
[17:00:59] DEBUG: Initial size: 93382
[17:00:59] DEBUG: Removed 672 entries
[17:00:59] INFO: Building read index
[17:00:59] INFO: Finding overlaps:
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[17:01:15] DEBUG: Overlap-estimated coverage: 112
[17:01:15] DEBUG: 0 sequences were marked as chimeric
[17:01:15] INFO: Extending reads
[17:01:15] DEBUG: Start Read: -lambdaGenome:33412-48412
[17:01:15] DEBUG: Branch index 1.25217
[17:01:15] DEBUG: -lambdaGenome:23931-38931 31 0 0.96 5504 9481
[17:01:15] DEBUG: -lambdaGenome:24038-39038 31 0 1.00 5611 9374
[17:01:15] DEBUG: -lambdaGenome:24699-39699 29 0 0.95 6272 8713
[17:01:15] DEBUG: -lambdaGenome:24811-39811 27 0 0.95 6384 8601
[17:01:15] DEBUG: -lambdaGenome:25228-40228 24 0 0.91 6801 8184
[17:01:15] DEBUG: -lambdaGenome:25740-40740 21 2 0.92 7313 7672
[17:01:15] DEBUG: -lambdaGenome:26178-41178 19 2 0.92 7751 7234
[17:01:15] DEBUG: -lambdaGenome:26299-41299 18 3 0.95 7872 7113
[17:01:15] DEBUG: -lambdaGenome:26310-41310 18 3 0.95 7883 7102
[17:01:15] DEBUG: -lambdaGenome:26511-41511 16 4 0.99 8084 6901
[17:01:15] DEBUG: -lambdaGenome:26577-41577 16 4 0.95 8150 6835
[17:01:15] DEBUG: -lambdaGenome:26676-41676 15 4 0.95 8249 6736
[17:01:15] DEBUG: -lambdaGenome:26823-41823 15 5 0.99 8396 6589
[17:01:15] DEBUG: -lambdaGenome:26922-41922 13 5 0.96 8495 6490
[17:01:15] DEBUG: -lambdaGenome:29481-44481 5 20 1.12 11054 3931
[17:01:15] DEBUG: -lambdaGenome:29519-44519 5 20 1.12 11092 3893
[17:01:15] DEBUG: -lambdaGenome:29831-44831 4 21 1.09 11404 3581
[17:01:15] DEBUG: -lambdaGenome:30484-45484 0 25 1.23 12057 2928
[17:01:15] DEBUG: -lambdaGenome:30600-45600 0 26 1.23 12173 2812
[17:01:15] DEBUG: -lambdaGenome:31072-46072 0 28 1.26 12645 2340
[17:01:15] DEBUG: -lambdaGenome:31835-46835 0 29 1.19 13408 1577
[17:01:15] DEBUG: -lambdaGenome:31823-46823 0 29 1.19 13396 1589
[17:01:15] DEBUG: -lambdaGenome:31452-46452 0 29 1.19 13025 1960
[17:01:15] DEBUG: -lambdaGenome:31334-46334 0 29 1.19 12907 2078
[17:01:15] DEBUG: -lambdaGenome:29048-44048 6 17 1.07 10621 4364
[17:01:15] DEBUG: -lambdaGenome:28794-43794 7 15 1.07 10367 4618
[17:01:15] DEBUG: -lambdaGenome:28461-43461 7 14 1.04 10034 4951
[17:01:15] DEBUG: -lambdaGenome:28377-43377 7 13 1.00 9950 5035
[17:01:15] DEBUG: -lambdaGenome:28332-43332 7 13 1.03 9905 5080
[17:01:15] DEBUG: -lambdaGenome:28104-43104 8 11 0.96 9677 5308
[17:01:15] DEBUG: -lambdaGenome:27928-42928 10 9 0.89 9501 5484
[17:01:15] DEBUG: -lambdaGenome:27887-42887 10 9 0.93 9460 5525
[17:01:15] DEBUG: -lambdaGenome:27784-42784 10 7 0.89 9357 5628
[17:01:15] DEBUG: -lambdaGenome:27520-42520 11 6 0.96 9093 5892
[17:01:15] DEBUG: -lambdaGenome:27380-42380 11 6 0.96 8953 6032
[17:01:15] DEBUG: -lambdaGenome:27052-42052 12 5 0.96 8625 6360
[17:01:15] DEBUG: Extension: -lambdaGenome:27928-42928
[17:01:15] DEBUG: Branch index 0.89
[17:01:15] DEBUG: -lambdaGenome:18375-33375 18 0 0.96 5432 9553
[17:01:15] DEBUG: -lambdaGenome:18964-33964 18 0 0.93 6021 8964
[17:01:15] DEBUG: -lambdaGenome:18977-33977 18 0 0.93 6034 8951
[17:01:15] DEBUG: -lambdaGenome:26178-41178 0 18 0.92 13235 1750
[17:01:15] DEBUG: -lambdaGenome:25740-40740 0 18 0.92 12797 2188
[17:01:15] DEBUG: -lambdaGenome:25228-40228 0 16 0.91 12285 2700
[17:01:15] DEBUG: -lambdaGenome:24811-39811 0 16 0.95 11868 3117
[17:01:15] DEBUG: -lambdaGenome:24699-39699 2 16 0.95 11756 3229
[17:01:15] DEBUG: -lambdaGenome:24038-39038 4 14 1.00 11095 3890
[17:01:15] DEBUG: -lambdaGenome:22455-37455 8 9 0.99 9512 5473
[17:01:15] DEBUG: -lambdaGenome:22911-37911 7 12 1.09 9968 5017
[17:01:15] DEBUG: -lambdaGenome:23050-38050 7 12 1.09 10107 4878
[17:01:15] DEBUG: -lambdaGenome:23931-38931 4 13 0.96 10988 3997
[17:01:15] DEBUG: -lambdaGenome:26299-41299 0 19 0.95 13356 1629
[17:01:15] DEBUG: -lambdaGenome:26310-41310 0 19 0.95 13367 1618
[17:01:15] DEBUG: -lambdaGenome:22276-37276 9 8 0.96 9333 5652
[17:01:15] DEBUG: -lambdaGenome:21268-36268 11 5 0.99 8325 6660
[17:01:15] DEBUG: -lambdaGenome:21137-36137 11 5 1.06 8194 6791
[17:01:15] DEBUG: -lambdaGenome:21109-36109 11 5 1.06 8166 6819
[17:01:15] DEBUG: -lambdaGenome:20950-35950 12 5 1.06 8007 6978
[17:01:15] DEBUG: -lambdaGenome:20611-35611 13 3 0.99 7668 7317
[17:01:15] DEBUG: -lambdaGenome:19805-34805 13 0 1.04 6862 8123
[17:01:15] DEBUG: -lambdaGenome:19804-34804 13 0 1.04 6861 8124
[17:01:15] DEBUG: -lambdaGenome:19411-34411 17 0 1.00 6468 8517
[17:01:15] DEBUG: -lambdaGenome:19345-34345 17 0 0.97 6402 8583
[17:01:15] DEBUG: Extension: -lambdaGenome:22455-37455
[17:01:15] DEBUG: Branch index 0.99
[17:01:15] DEBUG: -lambdaGenome:19805-34805 0 19 1.04 12335 2650
[17:01:15] DEBUG: -lambdaGenome:19804-34804 0 19 1.04 12334 2651
[17:01:15] DEBUG: -lambdaGenome:13493-28493 20 0 1.30 6023 8962
[17:01:15] DEBUG: -lambdaGenome:19411-34411 1 18 1.00 11941 3044
[17:01:15] DEBUG: -lambdaGenome:13336-28336 20 0 1.35 5866 9119
[17:01:15] DEBUG: -lambdaGenome:19345-34345 1 17 0.97 11875 3110
[17:01:15] DEBUG: -lambdaGenome:16874-31874 9 9 0.93 9404 5581
[17:01:15] DEBUG: -lambdaGenome:16551-31551 9 8 0.98 9081 5904
[17:01:15] DEBUG: -lambdaGenome:15555-30555 15 5 1.00 8085 6900
[17:01:15] DEBUG: -lambdaGenome:15447-30447 15 5 1.00 7977 7008
[17:01:15] DEBUG: -lambdaGenome:15156-30156 16 4 0.97 7686 7299
[17:01:15] DEBUG: -lambdaGenome:14671-29671 17 1 1.03 7201 7784
[17:01:15] DEBUG: -lambdaGenome:13527-28527 20 0 1.30 6057 8928
[17:01:15] DEBUG: -lambdaGenome:20611-35611 0 22 0.99 13141 1844
[17:01:15] DEBUG: -lambdaGenome:13732-28732 19 0 1.26 6262 8723
[17:01:15] DEBUG: -lambdaGenome:20950-35950 0 24 1.06 13480 1505
[17:01:15] DEBUG: -lambdaGenome:14072-29072 17 0 1.06 6602 8383
[17:01:15] DEBUG: -lambdaGenome:14282-29282 17 0 1.00 6812 8173
[17:01:15] DEBUG: -lambdaGenome:17181-32181 8 11 1.01 9711 5274
[17:01:15] DEBUG: -lambdaGenome:17536-32536 6 11 0.97 10066 4919
[17:01:15] DEBUG: -lambdaGenome:17562-32562 6 11 0.97 10092 4893
[17:01:15] DEBUG: -lambdaGenome:17761-32761 6 11 0.93 10291 4694
[17:01:15] DEBUG: -lambdaGenome:17857-32857 5 11 0.93 10387 4598
[17:01:15] DEBUG: -lambdaGenome:17916-32916 4 11 0.93 10446 4539
[17:01:15] DEBUG: -lambdaGenome:18375-33375 2 13 0.96 10905 4080
[17:01:15] DEBUG: -lambdaGenome:18964-33964 2 14 0.93 11494 3491
[17:01:15] DEBUG: -lambdaGenome:18977-33977 2 14 0.93 11507 3478
[17:01:15] DEBUG: -lambdaGenome:13151-28151 21 0 1.42 5681 9304
[17:01:15] DEBUG: Extension: -lambdaGenome:16874-31874
[17:01:15] DEBUG: Branch index 0.93
[17:01:15] DEBUG: -lambdaGenome:12167-27167 5 12 1.63 10278 4707
[17:01:15] DEBUG: -lambdaGenome:11916-26916 7 11 1.70 10027 4958
[17:01:15] DEBUG: -lambdaGenome:11579-26579 9 8 1.67 9690 5295
[17:01:15] DEBUG: -lambdaGenome:11465-26465 9 7 1.79 9576 5409
[17:01:15] DEBUG: -lambdaGenome:13151-28151 2 16 1.42 11262 3723
[17:01:15] DEBUG: -lambdaGenome:8591-23591 19 0 3.17 6702 8283
[17:01:15] DEBUG: -lambdaGenome:9062-24062 15 1 3.10 7173 7812
[17:01:15] DEBUG: -lambdaGenome:9217-24217 15 2 3.05 7328 7657
[17:01:15] DEBUG: -lambdaGenome:9971-24971 12 3 2.98 8082 6903
[17:01:15] DEBUG: -lambdaGenome:10221-25221 11 5 2.87 8332 6653
[17:01:15] DEBUG: -lambdaGenome:10256-25256 11 5 2.70 8367 6618
[17:01:15] DEBUG: -lambdaGenome:11275-26275 9 7 1.79 9386 5599
[17:01:15] DEBUG: -lambdaGenome:11217-26217 9 7 1.79 9328 5657
[17:01:15] DEBUG: -lambdaGenome:10490-25490 10 5 2.22 8601 6384
[17:01:15] DEBUG: -lambdaGenome:10262-25262 11 5 2.70 8373 6612
[17:01:15] DEBUG: -lambdaGenome:8518-23518 19 0 3.17 6629 8356
[17:01:15] DEBUG: -lambdaGenome:15156-30156 0 22 0.97 13267 1718
[17:01:15] DEBUG: -lambdaGenome:8292-23292 20 0 3.17 6403 8582
[17:01:15] DEBUG: -lambdaGenome:14671-29671 0 19 1.03 12782 2203
[17:01:15] DEBUG: -lambdaGenome:7583-22583 21 0 3.65 5694 9291
[17:01:15] DEBUG: -lambdaGenome:14282-29282 0 18 1.00 12393 2592
[17:01:15] DEBUG: -lambdaGenome:7533-22533 22 0 3.65 5644 9341
[17:01:15] DEBUG: -lambdaGenome:14072-29072 0 18 1.06 12183 2802
[17:01:15] DEBUG: -lambdaGenome:13493-28493 1 17 1.30 11604 3381
[17:01:15] DEBUG: -lambdaGenome:13527-28527 1 17 1.30 11638 3347
[17:01:15] DEBUG: -lambdaGenome:13732-28732 0 18 1.26 11843 3142
[17:01:15] DEBUG: -lambdaGenome:13336-28336 1 16 1.35 11447 3538
[17:01:15] DEBUG: Extension: -lambdaGenome:11579-26579
[17:01:15] DEBUG: Branch index 1.67
[17:01:15] DEBUG: -lambdaGenome:9971-24971 0 23 2.98 13377 1608
[17:01:15] DEBUG: -lambdaGenome:9062-24062 0 21 3.10 12468 2517
[17:01:15] DEBUG: -lambdaGenome:9217-24217 0 22 3.05 12623 2362
[17:01:15] DEBUG: -lambdaGenome:8591-23591 0 20 3.17 11997 2988
[17:01:15] DEBUG: -lambdaGenome:8518-23518 0 20 3.17 11924 3061
[17:01:15] DEBUG: -lambdaGenome:8292-23292 1 20 3.17 11698 3287
[17:01:15] DEBUG: -lambdaGenome:2944-17944 16 0 0.00 6350 8635
[17:01:15] DEBUG: -lambdaGenome:3217-18217 16 0 0.00 6623 8362
[17:01:15] DEBUG: -lambdaGenome:3452-18452 15 0 0.00 6858 8127
[17:01:15] DEBUG: -lambdaGenome:3623-18623 13 1 20.25 7029 7956
[17:01:15] DEBUG: -lambdaGenome:3775-18775 13 1 20.25 7181 7804
[17:01:15] DEBUG: -lambdaGenome:3850-18850 13 1 20.25 7256 7729
[17:01:15] DEBUG: -lambdaGenome:4974-19974 11 7 4.69 8380 6605
[17:01:15] DEBUG: -lambdaGenome:4887-19887 11 6 4.90 8293 6692
[17:01:15] DEBUG: -lambdaGenome:4274-19274 12 3 6.72 7680 7305
[17:01:15] DEBUG: -lambdaGenome:4024-19024 13 1 20.25 7430 7555
[17:01:15] DEBUG: -lambdaGenome:2851-17851 16 0 0.00 6257 8728
[17:01:15] DEBUG: -lambdaGenome:2568-17568 17 0 0.00 5974 9011
[17:01:15] DEBUG: -lambdaGenome:2028-17028 21 0 0.00 5434 9551
[17:01:15] DEBUG: -lambdaGenome:2550-17550 17 0 0.00 5956 9029
[17:01:15] DEBUG: -lambdaGenome:4976-19976 11 7 4.69 8382 6603
[17:01:15] DEBUG: -lambdaGenome:5677-20677 8 11 4.30 9083 5902
[17:01:15] DEBUG: -lambdaGenome:6356-21356 6 12 4.21 9762 5223
[17:01:15] DEBUG: -lambdaGenome:6548-21548 6 15 3.81 9954 5031
[17:01:15] DEBUG: -lambdaGenome:6767-21767 6 15 3.81 10173 4812
[17:01:15] DEBUG: -lambdaGenome:6769-21769 6 15 3.81 10175 4810
[17:01:15] DEBUG: -lambdaGenome:7533-22533 3 16 3.65 10939 4046
[17:01:15] DEBUG: -lambdaGenome:7583-22583 2 16 3.65 10989 3996
[17:01:15] DEBUG: Extension: -lambdaGenome:3452-18452
[17:01:15] DEBUG: Branch index 0.00
[17:01:15] DEBUG: No extension found
[17:01:15] DEBUG: Changing direction
[17:01:15] DEBUG: Branch index 0.00
[17:01:15] DEBUG: No extension found
[17:01:15] DEBUG: Linear contig
[17:01:15] DEBUG: Made 5 extensions
[17:01:15] INFO: Assembled 0 contigs
[17:01:15] INFO: Generating contig sequences
[17:01:15] INFO: Running Blasr
[17:01:15] DEBUG: Reading contigs file
ERROR, Fail to load FASTA file /gpfs/scratch/jurban/male-ilmn/lambda/abruijn/out/draft_assembly.fasta to virtual memory.
[17:01:15] ERROR: While running blasr: Command '['blasr', 'reads.37x.bedtools.fasta', '/gpfs/scratch/jurban/male-ilmn/lambda/abruijn/out/draft_assembly.fasta', '-bestn', '1', '-minMatch', '15', '-maxMatch', '25', '-m', '5', '-nproc', '1', '-out', '/gpfs/scratch/jurban/male-ilmn/lambda/abruijn/out/alignment.m5']' returned non-zero exit status 1
[17:01:15] ERROR: Error: Error in alignment module, exiting
Then again -- it just worked with ecoli data downloaded here:
wget https://www.dropbox.com/s/tb78i5i3nrvm6rg/m140913_050931_42139_c100713652400000001823152404301535_s1_p0.1.subreads.fasta
wget https://www.dropbox.com/s/v6wwpn40gedj470/m140913_050931_42139_c100713652400000001823152404301535_s1_p0.2.subreads.fasta
wget https://www.dropbox.com/s/j61j2cvdxn4dx4g/m140913_050931_42139_c100713652400000001823152404301535_s1_p0.3.subreads.fasta
Well it got this far so far anyway:
[17:46:09] INFO: Running ABruijn
[17:46:09] INFO: Assembling reads
[17:46:35] INFO: Indexing kmers (1/2):
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[18:43:09] INFO: Indexing kmers (2/2):
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[18:51:41] INFO: Building read index
[18:52:49] INFO: Finding overlaps:
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[19:49:51] INFO: Extending reads
[19:52:36] INFO: Assembled 1 contigs
[19:52:36] INFO: Generating contig sequences
[19:52:44] INFO: Running Blasr
[19:59:19] INFO: Chosen 'pacbio' polishing profile
[19:59:19] INFO: Separating draft genome into bubbles
The ecoli run finished successfully.
Why do you think the toy lambda dataset didn't work?
Hi,
Thanks for the feedback!
The synthetic dataset is likely failing because of an over-simplistic filtering rule that discards contigs that are too short. I will fix this shortly.
It should be fixed now, please check the new version.
I just installed the latest git code. Looks like I get the same error:
[16:50:26] INFO: Running Blasr
ERROR, Fail to load FASTA file /gpfs_home/jurban/software/abruijn/lamda/draft_assembly.fasta to virtual memory.
[16:50:26] ERROR: While running blasr: Command '['blasr', '/users/jurban/scratch/male-ilmn/lambda/abruijn/reads.fasta', '/gpfs_home/jurban/software/abruijn/lamda/draft_assembly.fasta', '-bestn', '1', '-minMatch', '15', '-maxMatch', '25', '-m', '5', '-nproc', '1', '-out', '/gpfs_home/jurban/software/abruijn/lamda/alignment.m5']' returned non-zero exit status 1
[16:50:26] ERROR: Error: Error in alignment module, exiting
Also, in case it is helpful, I noticed if you do:
./ABruijn/abruijn.py reads.fasta lamda 37
…but there is no such file as "reads.fasta", then rather than failing with a helpful message, it just hangs at:
[16:51:26] INFO: Running ABruijn
[16:51:26] INFO: Assembling reads
Well it did just run for Lambda on a new set of randomly generated reads… so perhaps the first set I tried today was just bad somehow…
I think we can close this issue if you'd like.
Thanks for the help. I have it working on Ecoli and 1 lambda example so far.
I tried running it on a 300 Mb genome with ~40X coverage. It exceeded the 100 GB of memory I set for it (using SLURM), so the job was canceled. I will try 200 GB next. Do you have a feel for the memory requirements I might need?
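For reference, the kind of submission I mean is roughly the following (a minimal sbatch sketch; the job name, time limit, and abruijn.py arguments are placeholders rather than my exact script):

#!/bin/bash
#SBATCH --job-name=abruijn
#SBATCH --mem=200G            # memory limit enforced by SLURM; the job is killed if it goes over
#SBATCH --cpus-per-task=32
#SBATCH --time=3-00:00:00

# placeholder arguments: reads, output directory, coverage estimate, threads
abruijn.py reads.fasta out 40 -t 32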
Could you also post the log file for the case when it hung?
For large genomes, the current version is quite inefficient in terms of memory, but we are working on this. I expect we will have an optimized version shortly, and I will let you know when it's ready.
I can get it to hang with literally any string in the reads position -- e.g.:
abruijn.py nonexistent.fasta ecoli 220
abruijn.py --debug 12345 makethisdirforme 10
The log just looks like (no extra info with --debug):
[17:53:38] root: INFO: Running ABruijn
[17:53:38] root: INFO: Assembling reads
-----------Begin assembly log------------
[17:53:38] DEBUG: Reading FASTA
I look forward to the optimized version -- I have at least 500 GB I can throw at it for now. Hopefully that will be enough. Otherwise, I eagerly await the new code.
Will the new memory-efficient version be ready within 1-2 weeks? Or is it longer-term than that?
It is almost ready, we expect to release in a day or two.
The new version is in the master branch, feel free to try it. For a 150 Mb genome with 60x coverage it took about 150 GB of memory and ~3 days of running on 16 cores. For your dataset, the numbers should be relatively similar.
Let us know how it goes, and thanks for the valuable feedback so far!
Downloading -- will try running tonight.
I have commit dc209ee6b18db7961e7df4ec74f819a15c33c92a -- it seems to be the latest. It went above 200 GB overnight and crashed. I will try allotting more.
Yes, this should be the latest version. It will probably take about 200-250 GB.
Thanks. Yeah, I increased it to 300 GB. It has been running for nearly 12 hours, I think, and is further along than it was last night:
## SLURM PROLOG ###############################################################
## Job ID : 12795572
## Job Name : abruijn-sciara
## Nodelist : smp011
## CPUs :
## Mem/Node : 307200 MB
## Directory : /gpfs/scratch/jurban/male-ilmn/abruijn
## Started : Wed Jun 15 05:32:30 EDT 2016
###############################################################################
module: loading 'blasr/2015Oct22-8cc8621'
[05:32:32] INFO: Running ABruijn
[05:32:32] INFO: Assembling reads
[05:41:52] INFO: Counting kmers (1/2):
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[06:21:31] INFO: Counting kmers (2/2):
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[10:29:30] INFO: Building kmer index
0% 10% 20% 30% 40% 50% 60%
Hi again,
It is up to this step:
[05:32:32] INFO: Running ABruijn
[05:32:32] INFO: Assembling reads
[05:41:52] INFO: Counting kmers (1/2):
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[06:21:31] INFO: Counting kmers (2/2):
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[10:29:30] INFO: Building kmer index
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[20:14:04] INFO: Finding overlaps:
0% 10% 20% 30% 40%
There are only 15 hours left in the time I allotted to it; I hope it finishes. Having said that, would it be sensible to add an option for storing intermediate files (e.g. the kmer counts and/or the kmer index) so that a run can be continued from a later step? It would be great if I could do:
abruijn.py reads.fasta out 42 -t 32 --continue
or
abruijn.py reads.fasta out -t 32 --kind kmerindex.file
feature request?
Anyway hopefully this finishes so I can let you know how it went. Otherwise, more waiting for the next run.
Thanks! Yes, the overlap step is usually the most time-consuming -- the rest should take 30-60 minutes at most. Storing intermediate information is part of our future plans (as well as other performance optimizations).
Well, in 48 hours with 300 GB RAM and 32 CPU, it made it to:
[20:14:04] INFO: Finding overlaps:
0% 10% 20% 30% 40% 50% 60% 70% 80%
I am starting over with the same parameters, but allotting 7 days.
Regarding your comment that everything should be faster after the overlap step: that did not appear to be true for a test ecoli run I did before restarting this job. Is that because of the high coverage, or the relatively small dataset and genome size?
[05:26:13] INFO: Running ABruijn
[05:26:13] INFO: Assembling reads
[05:26:21] INFO: Counting kmers (1/2):
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[05:26:42] INFO: Counting kmers (2/2):
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[05:27:31] INFO: Building kmer index
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[05:29:14] INFO: Finding overlaps:
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[05:30:53] INFO: Extending reads
[05:30:58] INFO: Assembled 1 contigs
[05:30:58] INFO: Generating contig sequences
[05:31:00] INFO: Running Blasr
[05:33:14] INFO: Chosen 'pacbio' polishing profile
[05:33:14] INFO: Separating draft genome into bubbles
[05:42:57] INFO: Polishing draft assembly
0% 10% 20% 30% 40%
It is still going.
I should clarify that the polishing step of that ecoli assembly started at 5:42 and was only 40% done at ~5:56, when I wrote the comment. For the ecoli assembly, polishing seems to be the most time-consuming step (>>14 minutes), followed by the "separating the draft genome into bubbles" step (~10 minutes). All other steps combined finished in < 10 minutes.
Thanks!
Yes, I think I may have underestimated the time that polishing can take. It is more sensitive to coverage than to genome size, but it can still be expensive.
The good thing is that once you have a draft assembly, you can always restart from the polishing step (I'll make a fix to make this more convenient).
Resuming at the polishing step when necessary would be a fantastic addition to ABruijn. Looking forward to it.
The draft assembly did pretty well. It was ultimately killed in the polishing step for exceeding its memory allocation:
## SLURM PROLOG ###############################################################
## Job ID : 12809901
## Job Name : abruijn
## Nodelist : smp010
## CPUs :
## Mem/Node : 307200 MB
## Directory : /gpfs/scratch/jurban/male-ilmn/abruijn
## Started : Fri Jun 17 06:38:22 EDT 2016
###############################################################################
module: loading 'blasr/2015Oct22-8cc8621'
[06:38:23] INFO: Running ABruijn
[06:38:23] INFO: Assembling reads
[06:47:45] INFO: Counting kmers (1/2):
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[07:14:33] INFO: Counting kmers (2/2):
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[11:18:22] INFO: Building kmer index
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[20:41:26] INFO: Finding overlaps:
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[18:36:37] INFO: Extending reads
[18:51:32] INFO: Assembled 746 contigs
[18:51:32] INFO: Generating contig sequences
[19:10:03] INFO: Running Blasr
[09:52:32] INFO: Chosen 'pacbio' polishing profile
[09:52:32] INFO: Separating draft genome into bubbles
[20:54:21] INFO: Polishing draft assembly
0% 10% 20% 30% [15:07:59] ERROR: Error: Error while running polish binary: Command '['abruijn-polish', '/gpfs/scratch/jurban/male-ilmn/abruijn/sciara/bubbles.fasta', '/gpfs_home/jurban/software/abruijn/ABruijn/resource/pacbio_substitutions.mat', '/gpfs_home/jurban/software/abruijn/ABruijn/resource/pacbio_homopolymers.mat', '/gpfs/scratch/jurban/male-ilmn/abruijn/sciara/consensus.fasta', '-t', '32']' returned non-zero exit status -9
slurmstepd: Exceeded step memory limit at some point.
If interested, it took 4 days and 8.5 hours to get up to that point -- ~139.3 days of CPU time.
Interesting, it looks like polishing is also a bottleneck for large genomes. However, I think the memory issue should be easy to optimize. I will provide an updated version shortly (also with resuming capabilities).
Hi,
It took us a bit longer than I expected, but the new version is ready. Polishing now takes significantly less memory and is faster (however, there might be some memory-usage peaks, especially near the end).
To restart your assembly, you need to do the following: (1) rename the "alignment.m5" file to "blasr_1.m5", and (2) create an "abruijn.save" file with the following contents:
{"error_profile": "", "stage_name": "alignment", "stage_id": 1}
Then, run ABruijn as before, but with an extra "--resume" option.
Also, we now do two polishing iterations by default - the second usually brings some additional corrections, but their number is minor compared to the first iteration. So, you might want to run with the "--iterations 1" option to save time.
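Concretely, the restart looks something like this (the output directory, read file, coverage value, and thread count are placeholders - use the same arguments as your original run):

cd /path/to/your/output_dir          # the directory that contains draft_assembly.fasta
mv alignment.m5 blasr_1.m5
echo '{"error_profile": "", "stage_name": "alignment", "stage_id": 1}' > abruijn.save

# same arguments as the original run, plus --resume
# (add --iterations 1 if you want to skip the second polishing pass)
abruijn.py reads.fasta /path/to/your/output_dir 40 -t 32 --resume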
Thanks for letting me know. I just re-submitted to the cluster. It looks like it will start tomorrow some time. I'll keep you updated.
Question about ABruijn: what will it do with really bad reads? Ignore them? For example, some nanopore reads are garbage -- e.g. just a pore rattling off random signal that is then basecalled. An assembler like Canu will discard garbage reads during its correction steps since they are not supported. I am wondering whether such reads are automatically discarded in some way by ABruijn, or whether it tries to force them in somehow.
We do not have an explicit step to discard garbage reads, but reads without significant overlaps with other reads will be ignored during contig assembly.
Update: it has been running for a little over 48 hours. The current messages are:
[06:49:13] INFO: Running ABruijn
[06:49:13] INFO: Resuming previous run
[06:52:16] INFO: Chosen 'pacbio' polishing profile
[06:52:16] INFO: Separating draft genome into bubbles
[13:34:38] INFO: Correcting bubbles
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
The messages have not updated for ~42 hours (i.e. the above is all I have seen for that long). I just want to make sure that is normal; I have been wondering whether it's frozen in some loop at the end of bubble correction. In a future update, could you add some kind of message after the "100%" of correcting bubbles, e.g. to indicate whether it is still correcting bubbles or has moved on to something else?
Yes, we noticed that the percentage counter behaves incorrectly on large datasets.
A reliable way to check how much work is left is to look at the files in the working directory: "consensus_1.fasta" should grow over time, and its final size should be approximately 20-30% larger than "draft_assembly.fasta".
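For example, from the working directory something like this lets you keep an eye on it:

# print the sizes of the draft and the growing consensus every 5 minutes
watch -n 300 'ls -lh draft_assembly.fasta consensus_1.fasta'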
[06:49:13] INFO: Running ABruijn
[06:49:13] INFO: Resuming previous run
[06:52:16] INFO: Chosen 'pacbio' polishing profile
[06:52:16] INFO: Separating draft genome into bubbles
[13:34:38] INFO: Correcting bubbles
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[15:37:24] INFO: Done! Your assembly is in file: /gpfs/scratch/jurban/male-ilmn/abruijn/sciara/polished_1.fasta
wooooo!
Amazing. Great program. Can't wait to take a look at the assembly. :)
Great! Let me know how it goes :)
I hope all is well. Recently I have been experimenting with the latest ABruijn code. It requires more memory than before at the bubble-correction step -- close to 400 GB peak so far.
I was wondering if you have any suggestions on the input reads -- should they be quality filtered at all, or should a minimum read length be imposed? Also, I was planning to polish further with Quiver and Pilon. Have you experimented with additional polishing at all?
It would be useful for end users like me to have the option of running ABruijn in separate steps (e.g. counting and indexing kmers, finding overlaps, assembling, polishing, bubble detection/correction, ...), with notes on which steps are parallelized and roughly how much memory each step needs (even just as approximate percentages of the maximum memory used across all steps). That way I could submit batch jobs that don't require hogging all of my resources (e.g. 500 GB RAM, 32 threads) for several days, which makes it impossible to do much other work during that time. If there are steps that don't use the 32 threads and need much less memory, that would be good to know; I could then run those steps with lower resource requests. Are there any plans for something like that?
Are you using the version from the master branch? It might be that we started splitting bubbles in a more conservative way, so they tend to be longer and consume more memory. In any case, we are planning to reduce the memory consumption of the polishing algorithm in the future.
We did not do any read filtering on our datasets. Filtering out short reads (say, shorter than 10-15 kb) might improve running time and memory usage, but it should not affect the assembly very much. Of course, for polishing, the more coverage you have, the more accurate the final assembly will be. Yes, we have tried running Quiver on our polished assembly and it usually improves the quality a bit.
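If you do want to try a length cutoff, a minimal pre-filtering sketch in plain awk would be something like this (the 10 kb threshold and file names are just examples; it assumes a standard FASTA input):

# keep only reads whose sequence is at least 10 kb long
awk -v min=10000 'BEGIN { RS = ">"; ORS = "" }
NR > 1 {
    nl = index($0, "\n")               # end of the header line
    header = substr($0, 1, nl - 1)
    seq = substr($0, nl + 1)
    gsub(/\n/, "", seq)                # join wrapped sequence lines
    if (length(seq) >= min)
        printf ">%s\n%s\n", header, seq
}' reads.fasta > reads.min10k.fasta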
I think it would be a good idea to allow running parts of the pipeline separately - I will add command-line options for that. Most of the current bottlenecks are already parallelized, but yes, some parts still run in a single thread (which could be improved).
I am using the latest master commit.
I can handle the large memory requirement, though I look forward to your updates that reduce memory consumption -- I just thought I'd report what I was seeing. Even though I can technically handle the memory needed for that step, it means I have to strategize how I submit jobs to our cluster (run by SLURM). I could submit a job that requests 500 GB and 32 threads for several days, for example. That would let the ABruijn run finish uninterrupted, but it is extremely wasteful if most of the run doesn't actually need all those resources: they are sequestered for the ABruijn job and no one else can use them until it ends. It affects me even more because it means I max out the amount of RAM I am allowed to request at any one time, so any other job I submit during the run has to wait until it ends. My workaround has been to submit the ABruijn job with a smaller request (e.g. 300 GB RAM), let it run until it exceeds the memory limit and is killed by SLURM, then request more RAM and use --resume. Being able to submit each step individually with appropriate resources would make more sense than planning for the job to fail. If/when you break it up into steps, I could set up a dependency chain that automatically submits the next step when the current one finishes successfully (similar to the Canu and Falcon strategies), as sketched below.
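For what it's worth, the dependency chain I have in mind is just standard sbatch chaining; the per-stage scripts below are hypothetical, since ABruijn does not expose separate stages yet:

# hypothetical per-stage scripts -- ABruijn does not provide these yet
asm_id=$(sbatch --parsable --mem=300G --cpus-per-task=32 abruijn_assemble.sh)
pol_id=$(sbatch --parsable --mem=400G --cpus-per-task=32 --dependency=afterok:${asm_id} abruijn_polish.sh)
echo "assembly job ${asm_id}; polishing job ${pol_id} starts only if it succeeds"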
Thanks for sharing your experience with Quiver and read filtering. It is interesting that removing reads shorter than 10-15 kb would not affect the assembly much, since 10-15 kb reads are pretty long by most standards. I will have to experiment with this in the future, as it would obviously be useful to define a length cutoff based on coverage to reduce the resources required while still getting essentially the same assembly. Do you already have a feel for the minimum coverage from the longest reads that one would need for this?
Do you think the following is just a not-enough-memory error:
[14:07:09] INFO: Running ABruijn
[14:07:09] INFO: Assembling reads
[14:18:30] INFO: Counting kmers (1/2):
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[15:15:43] INFO: Counting kmers (2/2):
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[21:23:15] INFO: Building kmer index
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[10:00:24] INFO: Finding overlaps:
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[10:18:47] INFO: Extending reads
[15:59:49] INFO: Assembled 527 contigs
[15:59:49] INFO: Generating contig sequences
[16:21:09] WARNING: No jump found! -m131208_162217_42177R_c100597222550000001823102305221423_s1_p0/138054/0_7468 : +m140111_035752_42177R_c100623072550000001823111308061464_s1_p0/116378/10930_24284
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
[19:00:38] ERROR: Error: Error in assemble binary: Command '['abruijn-assemble', '-k', '15', '-l', '/gpfs/scratch/jurban/male-ilmn/abruijn/pb_all_ont_every/sciara/abruijn.log', '-t', '32', 'data/pbont.fa', '/gpfs/scratch/jurban/male-ilmn/abruijn/pb_all_ont_every/sciara/reads_order.fasta', '67']' returned non-zero exit status -6
Hi,
Thank you for sharing this, it will be useful for future development directions :) Regarding the minimum coverage - I think it depends on the genome itself. For a relatively simple bacterium (E. coli, for example) it seems that 40x-50x is usually enough. If the bacterium has a complicated repeat structure, you might need more reads just in case (sometimes 100-200x for very complicated ones).
As for the last error log - it looks suspicious; the assembly module should not allocate a lot of memory at that stage, so it might be a bug instead (such as an allocation of 0 bytes). Did you happen to use the newest version from master (the one we pushed on Friday)?
I am using:
commit 250761bac5589d83eeb1e00e19ba7fe78cbc1738
Author: Mikhail Kolmogorov <fenderglass@gmail.com>
Date: Wed Jun 29 17:33:43 2016 -0700
reduced bubbles cache, new verion
This was the latest commit on GitHub until very recently, though I can see you've been pretty busy! :)
Hmmm. Other than performance, how do you think the newer version of the code will affect the outcome (e.g. contig sizes, etc)?
These improvements were mostly focused on homopolymer error correction and detection of chimeric reads. So, if you haven't seen any major problems with these, I think that your results will not change significantly with the newer version.
Well you were right -- it is likely not memory. I tried with 500 GB instead of 400 GB. Same error:
[10:43:42] INFO: Extending reads
[16:13:41] INFO: Assembled 527 contigs
[16:13:41] INFO: Generating contig sequences
[16:35:05] WARNING: No jump found! -m131208_162217_42177R_c100597222550000001823102305221423_s1_p0/138054/0_7468 : +m140111_035752_42177R_c100623072550000001823111308061464_s1_p0/116378/10930_24284
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
[19:08:45] ERROR: Error: Error in assemble binary: Command '['abruijn-assemble', '-k', '15', '-l', '/gpfs/scratch/jurban/male-ilmn/abruijn/pb_all_ont_every/sciara/abruijn.log', '-t', '32', 'data/pbont.fa', '/gpfs/scratch/jurban/male-ilmn/abruijn/pb_all_ont_every/sciara/reads_order.fasta', '67']' returned non-zero exit status -6
Please try the new version; most likely the problem is fixed now.
We've also made some performance improvements - almost everything is now parallel (except some relatively small parts that are harder to parallelize), and I expect the polishing stage to consume less memory on large genomes now (though this still needs to be checked). Finally, if you still want to run assembly separately from polishing, you can use a combination of --iterations 0 and --resume, as in the sketch below.
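In other words, something like the following should work to split the run (read file, output directory, coverage, and thread count are placeholders - use your own arguments):

# first pass: assembly only, no polishing
abruijn.py reads.fasta out 40 -t 32 --iterations 0

# later (e.g. on a bigger-memory node): resume the same output directory to do the polishing
abruijn.py reads.fasta out 40 -t 32 --resume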
Will do - thanks.
Hi,
This is an exciting tool - thanks for developing it. I am eager to get it up and running.
I installed and tried a simple test with 37X coverage of simulated lambda reads. All are 15000 bp long and have no errors. I cannot seem to get abruijn.py to run.
If I try:
I get:
Note that draft_assembly.fasta is an empty file. I saw the warning:
WARNING: Unable to choose minimum kmer count cutoff. Check if the coverage parameter is correct. Running with default parameter t = 4
So I tried adding a minimum cutoff instead of relying on the default auto setting, but when I add any arguments it gives another error:
Any advice to help me get up and running would be appreciated.
best, John