open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org
Other
2.17k stars 861 forks source link

dead coll tuned alltoall mca parameters #12547

Open burlen opened 6 months ago

burlen commented 6 months ago

Background information

I'm tuning OpenMPI's alltoall on our cluster.

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v5.0.3

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Compiled from sources, on tag v5.0.3

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

33e93469e1e1f69904ff3e3827394719aa6b3671 3rd-party/openpmix (v5.0.2) 3a70fac9a21700b31c4a9f9958afa207a627f0fa 3rd-party/prrte (v3.0.5) dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (heads/main)

Please describe the system on which you are running


Details of the problem

The following mca parameters relating to alltoall in coll tuned component are reported by ompi_info but are not used in the code.

coll_tuned_alltoall_small_msg
coll_tuned_alltoall_intermediate_msg
coll_tuned_alltoall_algorithm_segmentsize
coll_tuned_alltoall_algorithm_tree_fanout
coll_tuned_alltoall_algorithm_chain_fanout
coll_tuned_alltoall_large_msg                                                                                                        
coll_tuned_alltoall_min_procs

These may be remnants of a previous implementation, but currently setting these does nothing. Their presence complicates tuning. They should either be removed or at least the description string reported by ompi_info modified so that it's clear that these do nothing.

janjust commented 6 months ago

Can't remove in v5.0.x but can add a message, and those that are not used, to be removed in main There is a PR coming in where a parameter or two is used.

wenduwan commented 6 months ago

https://github.com/open-mpi/ompi/pull/12453/files Will use the fanout parameter(there are 2 fanouts).

burlen commented 5 months ago

@wenduwan would it make sense to use a different mca parameter name to avoid confusion with the original use of the mca fan out parameters? while these features are not very well documented, I assume it is documented somewhere and could thus be confusing with your new use. Assuming the two implementations are different

wenduwan commented 5 months ago

@burlen I'm actually fuzzy on the difference between fanout vs radix. Do you mind providing clarity on the terminology?

burlen commented 5 months ago

@burlen I'm actually fuzzy on the difference between fanout vs radix. Do you mind providing clarity on the terminology?

@wenduwan The Bruck algorithm completes in logarithmic number of rounds. Radix refers to the base of the logarithm. This is the terminology the author of the PR you linked uses for function argument names, and is used in the paper I linked in the review. It is really not about radix vs fanout. The problem is fanout is already in use for a different purpose, which makes overloading it as done in the PR potentially confusing.

wenduwan commented 5 months ago

@burlen I see.

From an engineering perspective, we also need to consider code maintainability. This issue arises from the introduction of additional parameters, which I believe had good reasons at some point - similar to this discussion.

For fan-out vs radix, do you see a case where both can be used together? Is it sufficient to simply document what the parameter means in different algorithms rather than creating an entirely new parameter?

burlen commented 5 months ago

For fan-out vs radix, do you see a case where both can be used together?

Since faninout is not even currently used in any of the alltoall implementations, I don't see how it could be

Is it sufficient to simply document what the parameter means in different algorithms rather than creating an entirely new parameter?

That would certainly be helpful. In my opinion it would be better to have the parameter name directly relate to the algorithm. for both maintainability and usability. the name itself can be part of the documentation (helps both users and maintainers). Such naming could even reduce the burden of writing documentation because you could cite the paper for more information

It's kind of a minor thing, I guess I'm a bit hung up on it (apologies!), because I'm finding these tuning features to be somewhat poorly documented and as a result rather difficult to use.

wenduwan commented 5 months ago

@burlen Thanks for the followup. In this case I suggest we postpone the decision to introduce a new parameter, and enrich the documentation for now. It is easier to add features than deprecate them afterwards.