Closed dtzhu337 closed 5 years ago
That sounds like your grid might not be letting Canu submit the jobs. What's the output in correction/1-overlapper (any out and sh files in there)? What's in canu-scripts?
Hi Skoren,
Because we have very limited storage on the server, I have already deleted those files.
The server manager also told me the problem is probably due to job submission and recommended that I use the useGrid=false option. It seems to be running well now; it has been going for more than 10 hours.
Thank you
Hi Skoren,
My job finished after I used the useGrid=false option. The problem is that there are no fasta files with the assembly results. I only have the following directories/files:
antPacbio.seqStore.err
canu-logs
correction
antPacbio.seqStore
antPacbio.seqStore.ssi
canu-scripts
haplotype
What do you think is the problem?
Thank you
If the output didn't get generated but the job stopped, it was probably terminated by your scheduler. You'd have to check the history of the job and the output of stdout/stderr from Canu to get that information.
When you run with useGrid=false, you're restricting Canu to the single node where you requested 120 GB of memory. If you land on a node with more memory than this, Canu might exceed your memory request and fail; it is safer to reserve a full node in these cases. Running with useGrid=false is also going to be much slower than using the grid, so you may still want to ask your IT to diagnose the previous submission issue.
The server manager told me that I should use the useGrid=false option.
Do you have any idea how long Canu needs to assemble the genome? The estimated genome size is ~280 Mb, and I have 36 GB of read data (already converted from .bam files to a .fq file). I am now using 1000 GB of memory to continue the run that stopped previously.
From the instructions on the website, I think it is okay to continue by rerunning the same script. I just want to make sure this is right.
Best wishes and Thank you
Using useGrid=false is the easiest solution, since then you don't have to find out why the grid submission job from Canu was rejected. However, others have run Canu on PBS grids, so it should work. The only issue would be if your run nodes aren't allowed to submit jobs (see FAQ).
A 280 Mb genome is not too big, so I would guess less than a week. Rather than picking a machine with lots of memory, tell Canu how much memory and how many threads it is allowed to use. That is, if you reserve 200 GB/16 cores, add the options maxMemory=200 maxThreads=16 and it will configure itself to fit.
Yes, you can restart with the same script.
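For illustration, here is a minimal sketch of a command line that passes those limits, assuming a 200 GB / 16-core reservation. The original submission script was not posted in this thread, so the output directory (`antPacbio-run`), the relative read path, `genomeSize=280m`, and the `-pacbio-raw` read type are assumptions; only the prefix `antPacbio`, `useGrid=false`, `maxMemory`, and `maxThreads` come from the discussion above.

```shell
#!/bin/sh
# Sketch: build a Canu 1.8 command that stays inside the scheduler reservation.
# ASSUMPTIONS: the output directory, read file path, genomeSize value and
# -pacbio-raw read type are placeholders, not taken from the user's script.
build_canu_cmd() {
    mem_gb=$1      # memory reserved from the scheduler, in GB
    threads=$2     # cores reserved from the scheduler
    echo "canu -p antPacbio -d antPacbio-run" \
         "genomeSize=280m useGrid=false" \
         "maxMemory=${mem_gb} maxThreads=${threads}" \
         "-pacbio-raw allreads.fq"
}

# Print the command for review; paste it into the job script to actually run it.
build_canu_cmd 200 16
```

Keeping the two numbers as arguments makes it harder for the Canu limits to drift away from what the job script actually requests from the scheduler.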
Hi skoren,
The process has been running for several days now. There is still no canu.out output, and I am not sure which step it is currently performing.
There is a file named .seqStore.err showing the following information.
Starting file './antPacbio.seqStore.ssi'.
Loading reads from '/storage/home/d/duz193/work/allreads.fq'
Processed 11204220 lines.
Loaded 18715844996 bp from:
2240844 FASTQ format reads (18715844996 bp).
WARNING: 215253 reads (9.6059%) with 100624002 bp (0.5348%) were too short (< 1000bp) and were ignored.
Finished with:
0 warnings (bad base or qv, too short, too long)
Loaded into store:
18715844996 bp.
2025591 reads.
Skipped (too short):
100624002 bp (0.5348%).
215253 reads (9.6059%).
sqStoreCreate finished successfully.
The canu-scripts directory is empty, and the correction directory only has 0-mercounts and 1-overlapper.
The canu-logs directory shows the following,
Do you think the software is still running well? The last thing I want is for it to produce nothing after so many days of waiting. Thank you in advance for your kind help.
Warm regards
Yes, it's probably fine. There isn't going to be a canu.out when you run with useGrid=false, and the canu-scripts folder will be empty as well. All the logging goes to stdout/stderr, which should be captured by your grid engine and put into whatever file is the default (I didn't see an output file specification in your script).
You should also be able to use your grid engine to monitor the submitted job to see its resource utilization.
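As a rough sketch of how to see which stage the run has reached: Canu's own progress lines start with `--` (as in the logs quoted later in this thread), so filtering the captured stdout/stderr for that prefix gives a quick summary. The file name `canu.log` is hypothetical; on PBS it would be whatever file the scheduler wrote, or the one named with `#PBS -o` (with `#PBS -j oe` to merge stderr into it).

```shell
#!/bin/sh
# Sketch: show the most recent Canu progress lines from a captured log.
# ASSUMPTION: canu.log is a hypothetical name for the scheduler's output file.
canu_progress() {
    log=$1
    # Canu status lines begin with "-- "; keep only the last five of them.
    grep '^-- ' "$log" | tail -n 5
}

# Only report if the (hypothetical) log file exists in the current directory.
if [ -f canu.log ]; then
    canu_progress canu.log
fi
```

For resource utilization, most PBS installations also offer `qstat -f` with the job ID, though the exact fields reported vary by site.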
Hi again,
It turns out the task has finished; however, I could not find any fasta files containing the contigs.
Same as before: if the fasta files are not there, the job did not terminate correctly. Post the stdout/stderr from the submitted job, which should have more details on what happened, along with the job history/accounting for that job (e.g., how long it ran, memory used, etc.).
-- Running jobs. First attempt out of 2.
-- Starting 'cormhap' concurrent execution on Wed Dec 5 11:12:06 2018 with 655197.215 GB free disk space (46 processes; 4 concurrently)
cd correction/1-overlapper
./ 101 > ./mhap.000101.out 2>&1
./ 102 > ./mhap.000102.out 2>&1
./ 103 > ./mhap.000103.out 2>&1
./ 104 > ./mhap.000104.out 2>&1
./ 105 > ./mhap.000105.out 2>&1
./ 106 > ./mhap.000106.out 2>&1
./ 107 > ./mhap.000107.out 2>&1
./ 108 > ./mhap.000108.out 2>&1
./ 109 > ./mhap.000109.out 2>&1
./ 110 > ./mhap.000110.out 2>&1
./ 111 > ./mhap.000111.out 2>&1
./ 112 > ./mhap.000112.out 2>&1
./ 113 > ./mhap.000113.out 2>&1
./ 114 > ./mhap.000114.out 2>&1
./ 115 > ./mhap.000115.out 2>&1
./ 116 > ./mhap.000116.out 2>&1
./ 117 > ./mhap.000117.out 2>&1
./ 118 > ./mhap.000118.out 2>&1
./ 119 > ./mhap.000119.out 2>&1
./ 120 > ./mhap.000120.out 2>&1
./ 121 > ./mhap.000121.out 2>&1
./ 122 > ./mhap.000122.out 2>&1
./ 123 > ./mhap.000123.out 2>&1
./ 124 > ./mhap.000124.out 2>&1
./ 125 > ./mhap.000125.out 2>&1
./ 126 > ./mhap.000126.out 2>&1
./ 127 > ./mhap.000127.out 2>&1
./ 128 > ./mhap.000128.out 2>&1
./ 129 > ./mhap.000129.out 2>&1
./ 130 > ./mhap.000130.out 2>&1
./ 131 > ./mhap.000131.out 2>&1
./ 132 > ./mhap.000132.out 2>&1
./ 133 > ./mhap.000133.out 2>&1
./ 134 > ./mhap.000134.out 2>&1
./ 135 > ./mhap.000135.out 2>&1
./ 136 > ./mhap.000136.out 2>&1
./ 137 > ./mhap.000137.out 2>&1
./ 138 > ./mhap.000138.out 2>&1
./ 139 > ./mhap.000139.out 2>&1
./ 140 > ./mhap.000140.out 2>&1
./ 141 > ./mhap.000141.out 2>&1
./ 142 > ./mhap.000142.out 2>&1
./ 143 > ./mhap.000143.out 2>&1
./ 144 > ./mhap.000144.out 2>&1
./ 145 > ./mhap.000145.out 2>&1
./ 146 > ./mhap.000146.out 2>&1
-- Finished on Thu Dec 6 05:23:29 2018 (65483 seconds, fashionably late) with 655101.323 GB free disk space
-- Mhap overlap jobs failed, retry.
-- job correction/1-overlapper/results/000114.ovb FAILED.
-- job correction/1-overlapper/results/000116.ovb FAILED.
-- job correction/1-overlapper/results/000118.ovb FAILED.
-- job correction/1-overlapper/results/000119.ovb FAILED.
-- job correction/1-overlapper/results/000121.ovb FAILED.
-- job correction/1-overlapper/results/000122.ovb FAILED.
-- Running jobs. Second attempt out of 2.
-- Starting 'cormhap' concurrent execution on Thu Dec 6 05:23:30 2018 with 655101.323 GB free disk space (6 processes; 4 concurrently)
cd correction/1-overlapper
./ 114 > ./mhap.000114.out 2>&1
./ 116 > ./mhap.000116.out 2>&1
./ 118 > ./mhap.000118.out 2>&1
./ 119 > ./mhap.000119.out 2>&1
./ 121 > ./mhap.000121.out 2>&1
./ 122 > ./mhap.000122.out 2>&1
-- Finished on Thu Dec 6 05:24:38 2018 (68 seconds) with 655102.071 GB free disk space
-- Mhap overlap jobs failed, tried 2 times, giving up.
-- job correction/1-overlapper/results/000114.ovb FAILED.
-- job correction/1-overlapper/results/000116.ovb FAILED.
-- job correction/1-overlapper/results/000118.ovb FAILED.
-- job correction/1-overlapper/results/000119.ovb FAILED.
-- job correction/1-overlapper/results/000121.ovb FAILED.
-- job correction/1-overlapper/results/000122.ovb FAILED.
ABORT: Canu 1.8
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting. If that doesn't work, ask for help.
Hi skoren,
This is from the stderr file. I tried to restart Canu and submitted another job. The stderr showed
-- ERROR Limited to at most 1000 GB memory via maxMemory option
-- ERROR Limited to at most 1 threads via maxThreads option
-- ERROR Found 1 machine configuration:
-- ERROR class0 - 1 machines with 1 cores with 1000 GB memory each.
-- ERROR Task hap can't run on any available machines.
-- ERROR It is requesting:
-- ERROR hapMemory=6-12 memory (gigabytes)
-- ERROR hapThreads=8-24 threads
-- ERROR No available machine configuration can run this task.
-- ERROR Possible solutions:
-- ERROR Increase maxMemory
-- ERROR Change hapMemory and/or hapThreads
Does this mean I should change some parameters in the script, or do I need to request more resources?
Thank you and best wishes
A subset of your jobs failed. What is the output in the failing files (correction/1-overlapper/*122*out, for example)?
As for the second error, you restricted Canu to one thread, and I don't think that's what you want. The error is saying that for a genome of your size it wants at least 8 threads, and I assume the 1 TB node you're reserving has more than 1 core.
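One way to collect the output of every failing job at once, sketched under the assumption that the Canu stdout/stderr was saved to a file and that the per-job outputs follow the `mhap.NNNNNN.out` naming seen in the log above. The function names are hypothetical helpers, not part of Canu.

```shell
#!/bin/sh
# Sketch: pull failed job numbers out of a Canu log, then show each job's output.
# ASSUMPTION: failure lines look like
#   "-- job correction/1-overlapper/results/000114.ovb FAILED."
failed_job_ids() {
    # Reads a log on stdin; prints each failed job number once, sorted.
    sed -n 's|^-- *job .*/results/\([0-9][0-9]*\)\.ovb FAILED\.$|\1|p' | sort -u
}

show_failed_outputs() {
    log=$1   # captured Canu stdout/stderr
    dir=$2   # e.g. correction/1-overlapper
    for id in $(failed_job_ids < "$log"); do
        echo "== mhap.$id.out =="
        tail -n 5 "$dir/mhap.$id.out"
    done
}

# Example (file names are placeholders):
#   show_failed_outputs canu.log correction/1-overlapper
```

Because a job that fails on both attempts is reported twice in the log, the `sort -u` step keeps each job from being printed more than once.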
Found perl:
Found java:
openjdk version "1.8.0_171"
Found canu:
Use of implicit split to @_ is deprecated at /storage/home/d/duz193/canu-1.8/Linux-amd64/bin/../lib/site_perl/canu/ line 73.
Canu 1.8
Running job 122 based on command line options.
Fetch blocks/000040.dat
Fetch blocks/000041.dat
Fetch blocks/000042.dat
Fetch blocks/000043.dat
Fetch blocks/000044.dat
Fetch blocks/000045.dat
Fetch blocks/000046.dat
Fetch blocks/000047.dat
Fetch blocks/000048.dat
Fetch blocks/000049.dat
Fetch blocks/000050.dat
Fetch blocks/000051.dat
Fetch blocks/000052.dat
Fetch blocks/000053.dat
Running block 000039 in query 000122
./ line 1001: 106046 Segmentation fault (core dumped) $bin/mhapConvert -S ../../antPacbio.seqStore -o ./results/$qry.mhap.ovb.WORKING ./results/$qry.mhap
Found perl:
Found java:
openjdk version "1.8.0_171"
Found canu:
Use of implicit split to @_ is deprecated at /storage/home/d/duz193/canu-1.8/Linux-amd64/bin/../lib/site_perl/canu/ line 73.
Canu 1.8
Running job 114 based on command line options.
Fetch blocks/000036.dat
Fetch blocks/000037.dat
Fetch blocks/000038.dat
Fetch blocks/000039.dat
Fetch blocks/000040.dat
Fetch blocks/000041.dat
Fetch blocks/000042.dat
Fetch blocks/000043.dat
Fetch blocks/000044.dat
Fetch blocks/000045.dat
Fetch blocks/000046.dat
Fetch blocks/000047.dat
Fetch blocks/000048.dat
Fetch blocks/000049.dat
Running block 000035 in query 000114
writeToFile()-- After writing 14964 out of 818379 'ovFile::writeBuffer::sb' objects (1 bytes each): Disk quota exceeded
Found perl:
Found java:
openjdk version "1.8.0_171"
Found canu:
Use of implicit split to @_ is deprecated at /storage/home/d/duz193/canu-1.8/Linux-amd64/bin/../lib/site_perl/canu/ line 73.
Canu 1.8
Running job 118 based on command line options.
Fetch blocks/000038.dat
Fetch blocks/000039.dat
Fetch blocks/000040.dat
Fetch blocks/000041.dat
Fetch blocks/000042.dat
Fetch blocks/000043.dat
Fetch blocks/000044.dat
Fetch blocks/000045.dat
Fetch blocks/000046.dat
Fetch blocks/000047.dat
Fetch blocks/000048.dat
Fetch blocks/000049.dat
Fetch blocks/000050.dat
Fetch blocks/000051.dat
Running block 000037 in query 000118
mhapConvert: mhap/mhapConvert.C:119: int main(int, char**): Assertion `W.toint32(6) <= W.toint32(7)' failed.
./ line 1001: 105971 Aborted (core dumped) $bin/mhapConvert -S ../../antPacbio.seqStore -o ./results/$qry.mhap.ovb.WORKING ./results/$qry.mhap
I still got over 20 GB in the disk, btw.
It looks like you're out of space and as a result have partial/corrupted output. Even if you say you have 20gb of disk available, the quota error indicates you probably reached that limit or another limit during the run. At least one job complains about writing output exceeding quota. The other errors are likely due to a truncated output file which was caused by the out-of-space issues.
Remove any files named correction/1-overlapper/results/*WORKING* and correction/1-overlapper/results/*mhap*, get your quota increased, and try again.
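The cleanup described above could look roughly like this. It is a sketch, not an official Canu tool: the function name is hypothetical, and it relies on `find` supporting `-maxdepth` and `-delete` (GNU and BSD `find` do). Finished `.ovb` results are left alone because their names match neither pattern.

```shell
#!/bin/sh
# Sketch: remove partial overlap output left behind by out-of-space failures.
# ASSUMPTION: clean_partial_overlaps is a hypothetical helper name.
clean_partial_overlaps() {
    results=$1   # e.g. correction/1-overlapper/results
    # Print each file as it is removed so there is a record of what went away.
    find "$results" -maxdepth 1 -type f \
        \( -name '*WORKING*' -o -name '*mhap*' \) -print -delete
}

# Example (run from the assembly directory, after the quota has been raised):
#   clean_partial_overlaps correction/1-overlapper/results
#   df -h .    # confirm free space on the filesystem before restarting
```

Note that `df` reports filesystem space, which can look fine even when a per-user quota is exhausted, so the quota itself still needs to be checked with whatever tool the site provides.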
Hi skoren,
I have fixed the previous problem, but a new issue has come up.
ERROR:
ERROR: Failed with exit code 1. (rc=256)
ERROR:
Any ideas about what happened? Is it because there are not enough overlaps?
Basically the same problem: job 119 ran out of space writing the output and left an incomplete or even empty output. Remove correction/1-overlapper/results/*0119* and correction/1-overlapper/*files and retry. There might be more than one such job, so check for any empty files in the results/ directory and remove those too!
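A small sketch of the "check for empty files" step, assuming a `find` that supports `-empty` (GNU and BSD `find` do). The function name is a hypothetical helper; it only lists files, so the list can be reviewed before anything is deleted.

```shell
#!/bin/sh
# Sketch: list zero-byte files under the overlap results directory.
# ASSUMPTION: list_empty_results is a hypothetical helper name.
list_empty_results() {
    find "$1" -type f -empty -print
}

# Example (run from the assembly directory):
#   list_empty_results correction/1-overlapper/results
# After reviewing the list, the same files can be removed with:
#   find correction/1-overlapper/results -type f -empty -delete
```

This only catches files that are completely empty; a truncated but non-empty file, like the ones behind the mhapConvert crashes earlier in this thread, will not show up here.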
Later I found that even after I deleted that job's files, it still could not get through. So I think something may have been overwritten during the run because the storage ran out. I tried redoing everything from scratch, but this time the error still comes out like this:
ABORT:
ABORT: Canu 1.8
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting. If that doesn't work, ask for help.
ABORT:
I tried deleting the /correction/1-overlapper/results/000146(all the numbers).working files, but it still didn't work. It's not a space problem this time. Is there anything I can do about this?
Thank you
As you're finding, out of space errors are insidious and really hard to fix.
I'd suggest starting overlaps again. It looks like it thinks nearly every overlap job failed, so a fresh start isn't as drastic as it sounds.
Remove the 1-overlapper directory, and any ovlStore files or directories. This should leave, I think, just 0-mercounts in the correction/ directory.
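A sketch of that fresh start, for illustration only. The function name is hypothetical, and the `*ovlStore*` glob is an assumption based on the wording "any ovlStore files or directories"; list the correction/ directory first and confirm what matches before deleting anything.

```shell
#!/bin/sh
# Sketch: reset the correction-stage overlaps so Canu recomputes them.
# ASSUMPTION: everything named *ovlStore* directly under correction/ is an
# overlap store or a file belonging to one; verify with ls before running.
reset_overlaps() {
    corr=$1   # e.g. correction
    rm -rf "$corr/1-overlapper"
    # An unmatched glob is passed literally, and rm -f ignores missing paths.
    rm -rf "$corr"/*ovlStore*
}

# Example (run from the assembly directory):
#   ls correction            # review what is there first
#   reset_overlaps correction
#   ls correction            # should now show just 0-mercounts
```

After this, rerunning the same Canu command should pick up from the overlap step, since the mer counts in 0-mercounts are kept.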
Gave up? Success? Or still running? Assuming you restarted, and it misbehaves again, open a new issue and refer back to this one.
Hi Brian,
Thank you so much for asking. It seems to have a new problem. I just submitted a new issue. Hope you can help me fix it.
Thank you
I ran Canu on the campus server using the batch submission script below:
I tried several times. Canu automatically created other jobs (with different job IDs). Then, after the cormhap step (I checked the job status and found a job named cormhap_antPacbi, which should have been generated by Canu itself), it could not proceed. The working directory contained only the files below. The canu.out is empty.
Does anyone know how to fix this problem? Is the problem with my script, the Canu software, or the server?
Thank you