miharulidze opened 4 years ago
Yes, both of these features are fully supported. Let me first talk about the second one, the configuration file passed through the coll_tuned_dynamic_rules_filename
MCA parameter. The format is described in the paper you pointed to, but there are many examples on our mailing list. It supports all collectives provided by tuned, as long as tuned is the module selected for a particular collective (you can enforce this by setting tuned's priority to 100). For the sake of completeness, here is another example that fiddles with the MPI_Alltoall and MPI_Allreduce collectives:
2 # num of collectives
3 # ID = 3 Alltoall collective (ID in coll_tuned.h)
# 0: ignore, 1: basic linear, 2: pairwise, 3: modified bruck,
# 4: linear with sync, 5: two proc only
1 # number of comm sizes
64 # comm size 64
2 # number of msg sizes
1024 3 0 0 # for message sizes >= 1024, modified bruck (3), no topo, no segmentation
8192 2 0 0 # for message sizes >= 8192 (8K), pairwise (2), no topo or segmentation
# end of first collective
2 # ID = 2 Allreduce collective (ID in coll_tuned.h)
1 # number of comm sizes
8 # comm size 8
2 # number of msg sizes
0 1 0 0 # for message size 0, basic linear (1), no topo, no segmentation
1024 2 0 0 # for message sizes >= 1024, nonoverlapping (2), no topo, no segmentation
# end of collective rule
The first approach, via coll_tuned_<COLL NAME>_algorithm,
might be less flexible for precisely selecting an algorithm for ranges of process counts or message sizes, but it can be changed dynamically during execution by setting the corresponding MCA parameter right before creating a new communicator. As a result, you can finely tailor the algorithm behind each collective on the communicators that matter most to your application.
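As a concrete (hypothetical) launch for the first approach, the following pins MPI_Alltoall to the pairwise algorithm for the whole run; the parameter names come from this thread, and ./my_app is a placeholder application:

```shell
# Force coll/tuned to win the selection, then pick alltoall algorithm 2 (pairwise)
mpirun --mca coll_tuned_priority 100 \
       --mca coll_tuned_use_dynamic_rules 1 \
       --mca coll_tuned_alltoall_algorithm 2 \
       ./my_app
```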
@bosilca Thank you very much for the fast reply! Your examples are very useful.
As far as I understand, the collective operation IDs are not defined explicitly in coll_tuned.h, but they match the order in which the collectives are defined in that file, e.g. Allgather has ID 0, Barrier has ID 5, etc.
Also, in your example with Alltoall operation:
1 # number of com sizes
64 # comm size 8
Is this a mistake in the comment, or am I missing something?
The same thing with Allreduce example:
1 # number of com sizes
1 # comm size 2
You're right, we moved the collective IDs to base/coll_base_functions.h.
As for the comments, they might indeed not be very accurate; I was playing with the files and forgot to update them. I'll fix them in my answer.
@miharulidze all algorithms must specify a rule for message size of zero (https://github.com/open-mpi/ompi/blob/master/ompi/mca/coll/tuned/coll_tuned_dynamic_file.c#L200). Otherwise, coll/tuned will switch to fixed rules.
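As an illustration of that constraint, here is a minimal sketch (plain Python, not OMPI code; the format is inferred from the examples in this thread) that reads a rules file and checks that every comm-size block starts with a rule for message size 0:

```python
# Minimal, hypothetical reader for the coll/tuned dynamic rules format
# as shown in the examples above (NOT the actual OMPI parser).

def parse_rules(text):
    # Drop '#' comments and blank lines, then consume the remaining
    # whitespace-separated integers in file order.
    tokens = []
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()
        if line:
            tokens.extend(int(tok) for tok in line.split())
    it = iter(tokens)
    rules = {}
    for _ in range(next(it)):            # number of collectives
        coll_id = next(it)               # collective ID
        comm_rules = {}
        for _ in range(next(it)):        # number of comm sizes
            comm_size = next(it)
            thresholds = []
            for _ in range(next(it)):    # number of msg sizes
                # msg_size, algorithm, topo, segmentation
                thresholds.append((next(it), next(it), next(it), next(it)))
            comm_rules[comm_size] = thresholds
        rules[coll_id] = comm_rules
    return rules

def starts_at_zero(rules):
    # coll/tuned falls back to fixed rules unless each comm-size block
    # begins with a rule for message size 0.
    return all(thresholds[0][0] == 0
               for comm_rules in rules.values()
               for thresholds in comm_rules.values())
```

For the Scatter file discussed later in this thread, `parse_rules` would return `{15: {64: [(0, 2, 0, 0), (8192, 3, 0, 0)]}}` and `starts_at_zero` would be True.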
@bosilca , @mkurnosov Thank you for support!
Maybe it's a good idea to add some sort of generic template for such a rules file to the documentation? Something like this:
#######################################################################################
############################## RULES FILE TEMPLATE ####################################
#######################################################################################
COLLS_NUM # num of collectives for which rules are specified
#######################################################################################
############################## FIRST COLLECTIVE RULES #################################
#######################################################################################
# Start first collective rules
COLL_ID_1 # First collective operation ID
COMMS_NUM # number of communicator sizes for which you want to define rules
############################## FIRST COMMUNICATOR #####################################
COMM_SIZE_1 # Size of first communicator
MSG_SIZES_NUM # How many thresholds do you want to specify for COMM_SIZE_1?
# Should be at least 1 (for msg_size >= 0),
# otherwise the rules file will be ignored
# Thresholds for COMM_SIZE_1
0 ALG_NUM TOPO SEGM # Use ALG_NUM for msg_size >= 0
M ALG_NUM TOPO SEGM # Use ALG_NUM for msg_size >= M
N ALG_NUM TOPO SEGM # Use ALG_NUM for msg_size >= N
# End of first communicator
############################## SECOND COMMUNICATOR #####################################
COMM_SIZE_2 # Size of second communicator
MSG_SIZES_NUM # How many thresholds do you want to specify for COMM_SIZE_2?
# Should be at least 1 (for msg_size >= 0),
# otherwise the rules file will be ignored
# Thresholds for COMM_SIZE_2
0 ALG_NUM TOPO SEGM # Use ALG_NUM for msg_size >= 0
M ALG_NUM TOPO SEGM # Use ALG_NUM for msg_size >= M
N ALG_NUM TOPO SEGM # Use ALG_NUM for msg_size >= N
# End of second communicator
############################## Nth COMMUNICATOR #######################################
COMM_SIZE_N # Size of last (COMMS_NUMth) communicator
MSG_SIZES_NUM # How many thresholds do you want to specify for COMM_SIZE_N?
# Should be at least 1 (for msg_size >= 0),
# otherwise the rules file will be ignored
# Thresholds for COMM_SIZE_N
0 ALG_NUM TOPO SEGM # Use ALG_NUM for msg_size >= 0
M ALG_NUM TOPO SEGM # Use ALG_NUM for msg_size >= M
N ALG_NUM TOPO SEGM # Use ALG_NUM for msg_size >= N
# End of Nth communicator
# End of COLL_ID_1
#######################################################################################
############################## Nth COLLECTIVE RULES ###################################
#######################################################################################
# Define rules for next collective operation
@miharulidze @bosilca I suggest adding a list of the algorithms.
# List of collective communication algorithms (coll/tuned only):
# COMM_SIZE -- the communicator size
# COUNT -- the count argument of the collective operation
# TYPESIZE -- the datatype size
# SEGSIZE -- the segment size (SEGM in file-based rules)
# FANINOUT -- the degree of the tree (TOPO in file-based rules)
#
# MPI_ALLGATHER (COLL_ID 0)
# Alg ID Algorithm
# 0 Use fixed rules
# 1 Linear
# 2 Bruck
# 3 Recursive doubling (if non-power-of-two number of processes, it will switch to Bruck)
# 4 Ring
# 5 Neighbor Exchange (if odd number of processes, it will switch to Ring)
# 6 Two procs (COMM_SIZE = 2 only)
#
# ALLGATHERV (COLL_ID 1)
# Alg ID Algorithm
# 0 Use fixed rules
# 1 Linear
# 2 Bruck
# 3 Ring
# 4 Neighbor Exchange (if odd number of processes, it will switch to Ring)
# 5 Two procs (COMM_SIZE = 2 only)
#
# ALLREDUCE (COLL_ID 2)
# Alg ID Algorithm
# 0 Use fixed rules
# 1 Linear (reduce linear + bcast linear)
# 2 Nonoverlapping (reduce + bcast)
# 3 Recursive doubling
# 4 Ring (commutative ops only; if COUNT < COMM_SIZE, it will switch to Recursive doubling)
# 5 Segmented ring (commutative ops only; if COUNT < COMM_SIZE * SEGSIZE / TYPESIZE, it will switch to Ring)
# 6 Rabenseifner (if op is non-commutative or COUNT < pow(2, floor(log2(COMM_SIZE))), it will switch to Linear)
#
# ALLTOALL (COLL_ID 3)
# Note: if sbuf = MPI_IN_PLACE, algorithms will switch to Linear inplace
# Alg ID Algorithm
# 0 Use fixed rules
# 1 Linear
# 2 Pairwise
# 3 Modified Bruck
# 4 Linear sync (with limited number of outstanding requests)
# 5 Two proc (COMM_SIZE = 2 only)
#
# ALLTOALLV (COLL_ID 4)
# Note: if sbuf = MPI_IN_PLACE, algorithms will switch to Linear inplace
# Alg ID Algorithm
# 0 Use fixed rules
# 1 Basic linear
# 2 Pairwise
#
# ALLTOALLW (COLL_ID 5) -- not yet implemented
#
# BARRIER (COLL_ID 6)
# Alg ID Algorithm
# 0 Use fixed rules
# 1 Linear
# 2 Double ring
# 3 Recursive doubling
# 4 Bruck
# 5 Two proc (COMM_SIZE = 2 only)
# 6 Tree
#
# BCAST (COLL_ID 7)
# Alg ID Algorithm
# 0 Use fixed rules
# 1 Basic linear
# 2 Chain (FANINOUT chains/pipelines with message segment of SEGSIZE bytes)
# 3 Pipeline (segment of SEGSIZE bytes)
# 4 Split binary tree (segment of SEGSIZE bytes; if SEGSIZE > COUNT/2*TYPESIZE, it will switch to Chain with FANINOUT=1)
# 5 Binary tree (segment of SEGSIZE bytes)
# 6 Binomial tree (segment of SEGSIZE bytes)
# 7 Knomial tree (segment of SEGSIZE bytes)
# 8 Scatter-allgather (if COUNT < COMM_SIZE, it will switch to Linear)
# 9 Scatter-allgather-ring (if COUNT < COMM_SIZE, it will switch to Linear)
#
# EXSCAN (COLL_ID 8)
# Alg ID Algorithm
# 0 Linear
# 1 Linear
# 2 Recursive doubling
#
# GATHER (COLL_ID 9)
# Alg ID Algorithm
# 0 Use fixed rules
# 1 Linear
# 2 Binomial tree
# 3 Linear sync (segment of SEGSIZE bytes)
#
# GATHERV (COLL_ID 10) -- not yet implemented
#
# REDUCE (COLL_ID 11)
# Alg ID Algorithm
# 0 Use fixed rules
# 1 Linear
# 2 Chain (FANINOUT chains/pipelines with message segment of SEGSIZE bytes)
# 3 Pipeline (segment of SEGSIZE bytes)
# 4 Binary tree (segment of SEGSIZE bytes)
# 5 Binomial tree (segment of SEGSIZE bytes)
# 6 In-order binary tree (segment of SEGSIZE bytes)
# 7 Rabenseifner (if op is non-commutative or COUNT < pow(2, floor(log2(COMM_SIZE))), it will switch to Linear)
#
# REDUCESCATTER (COLL_ID 12)
# Alg ID Algorithm
# 0 Use fixed rules
# 1 Non-overlapping (reduce + scatterv)
# 2 Recursive halving (commutative ops only)
# 3 Ring (commutative ops only)
# 4 Butterfly
#
# REDUCESCATTERBLOCK (COLL_ID 13)
# Alg ID Algorithm
# 0 Use fixed rules
# 1 Linear (reduce + scatter)
# 2 Recursive doubling
# 3 Recursive halving (if op is non-commutative, it will switch to Linear)
# 4 Butterfly
#
# SCAN (COLL_ID 14)
# Alg ID Algorithm
# 0 Linear
# 1 Linear
# 2 Recursive doubling (commutative ops only)
#
# SCATTER (COLL_ID 15)
# Alg ID Algorithm
# 0 Use fixed rules
# 1 Linear
# 2 Binomial tree
# 3 Linear nb
#
# SCATTERV (COLL_ID 16) -- not yet implemented
# NEIGHBOR_ALLGATHER (COLL_ID 17) -- not yet implemented
# NEIGHBOR_ALLGATHERV (COLL_ID 18) -- not yet implemented
# NEIGHBOR_ALLTOALL (COLL_ID 19) -- not yet implemented
# NEIGHBOR_ALLTOALLV (COLL_ID 20) -- not yet implemented
# NEIGHBOR_ALLTOALLW (COLL_ID 21) -- not yet implemented
#
Based on prior experience, we are not very good at investing time in maintaining documentation. Instead of listing the algorithms themselves, I would add text explaining how a user can list all algorithms for each collective using ompi_info.
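For instance, something along these lines should dump the per-collective algorithm choices that coll/tuned registers (the exact parameter names and descriptions depend on the Open MPI version installed):

```shell
# Show all coll/tuned MCA parameters, including the *_algorithm
# selectors and their enumerated values, then filter for the
# algorithm-selection ones.
ompi_info --param coll tuned --level 9 | grep -i algorithm
```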
Not sure if it's just me or if it's actually a bug in the file-processing part of the code. I'm trying to play with self-defined dynamic rules for Scatter, because the current fixed decision has the following logic:
if (total_dsize < 512) {
    alg = 2;
} else {
    alg = 3;
}
I'm simply trying to raise the switch point from 512 B to 8192 B, so I have the following definition file:
1 # num of collectives
# first collective
15 # ID = 15 Scatter collective (ID in coll_tuned.h)
1 # number of comm sizes
64 # comm size 64
2 # number of msg sizes
0 2 0 0 # for message size 0, binomial, no topo or segmentation
8192 3 0 0 # for message size 8k+, linear nb, no topo or segmentation
# end of first collective
However, when I run it, it looks to me like only the first character of the message size was parsed, i.e., 8192 -> 8. For example:
$ mpirun -mca pml ucx -mca coll_hcoll_enable 0 -mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_dynamic_rules_filename $PWD/scatter_dyn.conf osu_scatter -f
# OSU MPI Scatter Latency Test v5.6.2
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 2.23 0.88 3.45 1000
2 2.74 1.02 4.14 1000
4 3.60 1.63 5.72 1000
8 178.76 0.18 372.34 1000
16 178.93 0.20 373.37 1000
32 184.07 0.31 381.15 1000
64 191.70 0.30 398.27 1000
128 210.57 0.51 421.86 1000
256 246.15 0.56 493.67 1000
512 259.43 0.68 518.93 1000
1024 284.47 0.92 564.12 1000
2048 338.04 1.27 661.85 1000
4096 468.41 1.94 909.64 1000
8192 725.54 3.35 1402.59 1000
16384 1273.38 4.32 2612.10 100
32768 2578.37 11.49 5301.16 100
65536 5637.87 20.81 11665.03 100
131072 6671.02 24.77 13351.72 100
262144 13379.54 30.36 26665.75 100
I tried multiple times and they all showed the same behavior. Am I missing something here?
Comm size is 64 in your file, so it shouldn’t be using your tuning vars.
@wckzhang Well, apparently it is using the rules, if you look at the osu_scatter output I attached: it just changed the switch point from 512 B to 8 B instead of the 8192 B I intended. Also, my understanding is that the comm size works the same way as the message size, i.e., it should be interpreted as "anything larger than comm size 64". I could be wrong though, since I have not read the code. BTW, I forgot to mention that I also played with multiple comm sizes, with no effect on the above behavior.
Yeah, anything larger than comm size 64 would be using those message sizes. What comm size are you using? I've never seen the dynamic tuning detect the wrong number. The parsing code is fairly sensitive, and if you have slight formatting errors it will disregard the file completely. The function for parsing config files is ompi_coll_tuned_read_rules_config_file; you can fairly easily add some prints or turn on logging in that function to check whether the file is being read properly. My best guess is that the file is being parsed incorrectly. I don't know if the formatting was lost in e-mail, but the newline and '#' handling code (ompi_coll_base_file_getnext_long and skiptonewline) is a bit finicky. I really don't see how rc = fscanf(fptr, "%li", val); could read 8192 as 8.
William
What comm size are you using?
My comm size is 1280 for above testing.
The parsing code is fairly sensitive and if you have slight formatting errors it will disregard the file completely.
This is exactly what I thought as well. If the file were dropped completely, then I would know my format is wrong and could fix it. But if you take a look at the results I posted, it apparently worked for the two message-size regions that I set; the second region just started at 8 B instead of 8192 B.
I played a bit more with this, and it looks like my earlier comment about only the first character of the message size being parsed was wrong. It appeared to behave that way because my comm size is 1280, so that was just a coincidence. The actual config file that gives me what I need looks like this:
1 # num of collectives
# first collective
15 # ID = 15 Scatter collective (ID in coll_tuned.h)
1 # number of comm sizes
64 # comm size 64
2 # number of msg sizes
0 2 0 0 # for message size 0, binomial, no topo or segmentation
10485760 3 0 0 # ???
# end of first collective
The second range, 10485760, is calculated as 8192 * 1280, with 8192 being the message size I'd like the range to start from and 1280 being the comm size. Why? I don't know, because that still looks like a bug to me. My understanding is that this value should be comm-size agnostic.
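To make the arithmetic explicit, here is a tiny sketch of the conversion I ended up doing (the comm-size scaling is inferred from the behavior observed above; I have not verified it against the OMPI source):

```python
def file_threshold(per_rank_bytes, comm_size):
    # Scale an intended per-rank switch point to the value that the
    # rules file apparently gets compared against (per-rank size
    # multiplied by the communicator size) -- inferred, not verified.
    return per_rank_bytes * comm_size

# Intended 8192-byte switch point on a 1280-rank communicator:
print(file_threshold(8192, 1280))  # 10485760
```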
Ah... you're hitting this issue. Message size has a vague definition, and I brought up an issue at one point about the discrepancies; let me see if I can find it.
I don't really like where we're at with the message sizes, but there's a table in that issue you can refer to for correct sizing.
There's also an issue in the collectives-tuning repo - https://github.com/open-mpi/ompi-collectives-tuning/issues/24 - which also describes the discrepancy. The code to generate tuning files doesn't take this into account (I do remember writing that code, but I suppose I never merged it).
Oh, the table is a little outdated now, since I revised scatter and gather to use datatype size * comm size * count.
So this explains it. But I have to say it is very counter-intuitive and inconsistent with the existing fixed decision rules. For example, take a look at the code snippet that I posted for scatter:
if (total_dsize < 512) {
    alg = 2;
} else {
    alg = 3;
}
Suppose I just want to do something similar with a dynamic rules file, but change 512 to 8192, so that for all comm sizes >= 64 algorithm 2 is used when total_dsize < 8192 and algorithm 3 otherwise. How do I achieve that? With the current restrictions of the file format, do we have to list every comm size and calculate the message size for each one?
Yeah, I also think it's counter-intuitive and inconsistent: since the comm size is already taken into account in the tuning file, why does it need to be taken into account again? Unfortunately, there isn't a way to do that with the dynamic code today. @bosilca has major interests in this area; should we re-discuss the message-size issue?
Agreed.
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v4.0.3
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Distribution tarball from open-mpi.org.
./configure --prefix=$(pwd)/build --with-ucx=/path-to-ucx-installation/ --enable-orterun-prefix-by-default
Please describe the system on which you are running
Details of the problem
Dear Open MPI developers,
I'm trying to provide tuned selection of collective algorithms for the
tuned
component. It seems like there are two ways: 1) Through command-line options like --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_<COLL NAME>_algorithm <ALGORITHM ID>
. This method works fine, and I notice a big difference between algorithms while running the OSU benchmarks. At the same time, this method does not allow me to do fine-grained tuning, like specifying a communicator size, message thresholds, etc. 2) Specifying the algorithm selection policy using a file, with the options --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_dynamic_rules_filename <PATH TO ALGORITHM RULES>
. This paper (actually, this is the only example of a rules file I found on Google) shows an example of Alltoall algorithm selection tuning for different use cases. I have also done some experiments with Alltoall, but it showed no difference at all between several algorithms/thresholds. Here are my questions: 1) Is this feature still supported in the tuned component? 2) Are there restrictions on its use, for example, does it work only with the subset of collectives that are implemented in the tuned component? 3) Are there any examples of rules files for other operations?
I'll be grateful for any help. Thank you in advance!