Open wenduwan opened 1 year ago
Haven't got capacity to work on this yet. The issue still exists. Will take a look in the next few weeks.
FWICT, every error path in the local proc launch includes an ORTE_ERROR_LOG, yet I'm not seeing that in your error case. You might want to ensure that you configured with --enable-debug so you get the full debug log. You can also turn off the verbose settings except for odls_base_verbose and set that one to 100.
For some reason, you are getting an error when trying to locally start a process on one of the nodes. The only place the binding comes into play would be here (starting at line 767):
    /* compute and save bindings of local children */
    if (ORTE_SUCCESS != (rc = orte_rmaps_base_compute_bindings(jdata))) {
        ORTE_ERROR_LOG(rc);
        goto REPORT_ERROR;
    }
Again, you should see that error log, but maybe it is a "silent" error where orte_rmaps_base_compute_bindings was expected to print out the error itself. You might need to put an explicit output line there to see if you are getting an error.
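To illustrate the suggestion above, here is a minimal, self-contained sketch of what adding an explicit output line to that error path could look like. It does not use the real ORTE headers: the `sketch_` functions and `SKETCH_` constants are hypothetical stand-ins, and the idea that a particular return code can be suppressed by the normal error log is an assumption made only for the demonstration.

```c
#include <stdio.h>

/* Hypothetical stand-ins for ORTE return codes; the values are illustrative only. */
#define SKETCH_SUCCESS     0
#define SKETCH_ERR_SILENT  (-43)

/* Hypothetical stub playing the role of orte_rmaps_base_compute_bindings():
 * it can fail without printing anything itself. */
int sketch_compute_bindings(int should_fail)
{
    return should_fail ? SKETCH_ERR_SILENT : SKETCH_SUCCESS;
}

/* Mirrors the shape of the launch path quoted above, with one extra
 * explicit message on the error branch. */
int sketch_launch_local_procs(int should_fail)
{
    int rc;

    /* compute and save bindings of local children */
    if (SKETCH_SUCCESS != (rc = sketch_compute_bindings(should_fail))) {
        /* Explicit output: visible even if the usual error log stays quiet
         * for this return code (assumption for this sketch). */
        fprintf(stderr, "%s:%d: compute_bindings failed, rc=%d\n",
                __FILE__, __LINE__, rc);
        goto REPORT_ERROR;
    }
    return SKETCH_SUCCESS;

REPORT_ERROR:
    return rc;
}
```

Calling `sketch_launch_local_procs(1)` prints the file, line, and return code to stderr before taking the error branch, which is the information needed to confirm whether the binding step is the one that fails. In the real source tree the equivalent would be a temporary debug print next to the existing ORTE_ERROR_LOG call.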
@wenduwan Is this still an issue?
I will try again today/tomorrow and report back.
@jsquyres I checked 4.1.6 and this problem still exists.
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
4.1.5 release tarball, configured with
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
4.1.5 release tarball
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
N/A
Please describe the system on which you are running
Details of the problem
Encountered an ORTE issue when running the command on the head node (36 cores). It works when I run on a compute node, though.
I manually verified two-way tcp traffic by ssh'ing from each compute node to head node. Did not see any issue.
However, I was able to mitigate by providing --bind-to core/socket.
I got hints from this post and followed @rhc54's suggestion to add debugging info. Finally found something interesting. This line failed.
I have to provide these flags:
In the log, the problematic rank seemingly died around here
However, on a successful run (with --bind-to core) the log is different
I'm not clear on the ORTE side and would appreciate some pointers to understand the exact problem.
TIA!