That seems like an issue with the FS and not releasing the running script. What if you just re-start the same canu command?
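For reference, a minimal sketch of what re-starting the same canu command looks like; the assembly name, directory, genome size, and read file below are placeholders, not values from this issue:

```bash
# Hypothetical original invocation -- all names and paths are assumptions.
canu -p asm -d /shared/assembly \
     genomeSize=1g \
     useGrid=true \
     -pacbio reads.fastq.gz        # -pacbio-raw on canu releases before 2.0

# Re-running the identical command against the same -d directory lets canu
# resume from the last completed stage instead of starting over.
canu -p asm -d /shared/assembly \
     genomeSize=1g \
     useGrid=true \
     -pacbio reads.fastq.gz
```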
If I don't delete the files in "5-consensus" and just re-start the same canu command, the same "ABORT" appeared.
If I delete the files in "5-consensus" and then re-start the same canu command through sbatch on a single node of the grid, the same "ABORT" appeared again.
But if I delete the files in "5-consensus" and re-start the same canu command on the local node (without sbatch), it worked.
So how can I avoid this problem when running canu on multiple nodes?
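A rough sketch of the clean-up-and-restart workaround described above; the assembly directory and the exact location of "5-consensus" are assumptions and can differ between canu versions:

```bash
# Assumed layout: the consensus stage usually lives under unitigging/.
cd /shared/assembly                      # hypothetical canu -d directory
rm -rf unitigging/5-consensus            # drop the partial consensus results

# Then re-submit the unchanged canu command (or the same sbatch script);
# canu recreates 5-consensus and re-runs only that stage.
```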
"If I delete the files in ''5-consensus" and then re-start the same canu command through sbatch with a single node in grid, same "ABORT" emerged again."
This is not running on multiple nodes and should be equivalent to a single node without sbatch; at the very least, canu can't tell the difference between those two runs. If it behaves differently, I'd expect there is some difference in how the file system is mounted/accessed between your sbatch run and the local run. If you have scratch on the compute nodes, try running on that instead of the shared FS to see if it works. You'd have to work with your cluster IT to figure out what's different in the filesystem between the local node and the sbatch instance; I don't see anything in canu that would fix it.
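A hedged sketch of the "run on node-local scratch" test; the partition, resources, paths, and read option are assumptions, not values from this issue. With node-local scratch the run has to stay on one node (useGrid=false), since other grid jobs cannot see that filesystem:

```bash
#!/bin/bash
#SBATCH --job-name=canu-local-scratch
#SBATCH --nodes=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G

# Work in node-local scratch instead of the shared FS.
WORKDIR=${TMPDIR:-/scratch/$USER}/canu-test
mkdir -p "$WORKDIR"

canu -p asm -d "$WORKDIR" \
     genomeSize=1g \
     useGrid=false \
     -pacbio reads.fastq.gz              # -pacbio-raw on older canu releases

# Copy the results back to shared storage before the job ends.
cp -r "$WORKDIR" /shared/canu-results/
```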
Thanks a lot! I will try as you suggest.
Hi, when I run "slurm.scripts" on a grid as follows:
then
sometimes (roughly half of the runs) I got this 'ABORT'.
Here is the header of the log (one node in the grid):
I checked "5-consensus" and haven't found any abnormal information
Here is content of "consensus.000002.out"
Looking forward for your reply. Thank you!