arunedarath opened this issue 1 year ago
The 'auto' thresholds are calculated according to an internal performance estimation of what would be faster, based on fabric speed, estimated memory copy speed, etc. For MPI_Send, the rndv threshold is calculated according to UCX_RNDV_SEND_NBR_THRESH. In general, it's recommended to keep the thresholds at 'auto'.
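As a rough way to picture that estimation (an illustrative model only, not the exact UCX code): treat each protocol's cost as a linear function of message size $s$, with a per-message overhead $o$ and an effective copy bandwidth $B$; the 'auto' threshold is the size where the cost curves cross:

$$t_{eager}(s) = o_e + \frac{s}{B_e}, \qquad t_{rndv}(s) = o_r + \frac{s}{B_r}, \qquad s^{*} = \frac{o_r - o_e}{1/B_e - 1/B_r}$$

A protocol with higher setup cost but higher bandwidth (rendezvous) only wins above the crossover point $s^{*}$.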
@yosefe Thanks for the reply.
From the sources, I can see that the value of eager_short is determined statically from the config defaults (if the corresponding env values are 'auto'):

```c
iface_attr->cap.am.max_short = iface->config.fifo_elem_size - sizeof(uct_mm_fifo_element_t);
```

I can only reduce this value by using a small value for UCX_BCOPY_THRESH.

The value of max_bcopy is likewise determined statically (if the corresponding env values are 'auto'):

```c
iface_attr->cap.am.max_bcopy = iface->config.seg_size;
```

These values are initialized in ucp_ep_config_init(), and they come from the table below:
```c
ucs_config_field_t uct_mm_iface_config_table[] = {
  {"SM", "ALLOC=md,mmap,heap;BW=15360MBs", NULL,
   ucs_offsetof(uct_mm_iface_config_t, super),
   UCS_CONFIG_TYPE_TABLE(uct_sm_iface_config_table)},

  {"FIFO_SIZE", "64",
   "Size of the receive FIFO in the memory-map UCTs.",
   ucs_offsetof(uct_mm_iface_config_t, fifo_size), UCS_CONFIG_TYPE_UINT},

  {"SEG_SIZE", "8256",
   "Size of send/receive buffers for copy-out sends.",
   ucs_offsetof(uct_mm_iface_config_t, seg_size), UCS_CONFIG_TYPE_MEMUNITS},

  {"FIFO_RELEASE_FACTOR", "0.5",
   "Frequency of resource releasing on the receiver's side in the MM UCT.\n"
   "This value refers to the percentage of the FIFO size. (must be >= 0 and < 1).",
   ucs_offsetof(uct_mm_iface_config_t, release_fifo_factor), UCS_CONFIG_TYPE_DOUBLE},

  UCT_IFACE_MPOOL_CONFIG_FIELDS("RX_", -1, 512, 128m, 1.0, "receive",
                                ucs_offsetof(uct_mm_iface_config_t, mp), ""),

  {"FIFO_HUGETLB", "no",
   "Enable using huge pages for internal shared memory buffers.\n"
   "Possible values are:\n"
   " y   - Allocate memory using huge pages only.\n"
   " n   - Allocate memory using regular pages only.\n"
   " try - Try to allocate memory using huge pages and if it fails, allocate regular pages.",
   ucs_offsetof(uct_mm_iface_config_t, hugetlb_mode), UCS_CONFIG_TYPE_TERNARY},

  {"FIFO_ELEM_SIZE", "128",
   "Size of the FIFO element size (data + header) in the MM UCTs.",
   ucs_offsetof(uct_mm_iface_config_t, fifo_elem_size), UCS_CONFIG_TYPE_UINT},

  {"FIFO_MAX_POLL", UCS_PP_MAKE_STRING(UCT_MM_IFACE_FIFO_MAX_POLL),
   "Maximal number of receive completions to pick during RX poll",
   ucs_offsetof(uct_mm_iface_config_t, fifo_max_poll), UCS_CONFIG_TYPE_ULUNITS},

  {"ERROR_HANDLING", "n", "Expose error handling support capability",
   ucs_offsetof(uct_mm_iface_config_t, error_handling), UCS_CONFIG_TYPE_BOOL},

  {NULL}
};
```
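For example, plugging in the defaults above (and assuming, purely for illustration, a 28-byte uct_mm_fifo_element_t and an 8-byte UCP tag header; I have not verified these exact sizes): max_short = 128 - 28 = 100, and subtracting the 8-byte header gives 92, which would match the 0..92 eager-short range in the ucx_info output further down; likewise max_bcopy = SEG_SIZE = 8256, and 8256 - 8 = 8248 would match the 93..8248 eager range.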
Please correct me if the above understanding is wrong. I am bringing up this point because I am not able to find any code that dynamically calculates the above ranges based on the type of CPU (Milan, Genoa, etc.) or the memory copy speed.
--Arun
@arunedarath we've just fixed the ucx_info report in the master branch, can you please check it again? The code for dynamic selection is in src/ucp/proto, mostly proto_select.c and proto_init.c
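To give a feel for what such a selector computes (a minimal standalone sketch with made-up costs, not the actual proto_select.c logic): model each protocol's time as overhead + size/bandwidth, pick the cheapest one at every message size, and the switch points become the per-range table that ucx_info prints:

```c
/* Minimal sketch of performance-model-based protocol selection.
 * Illustrative only; the real logic lives in src/ucp/proto. */
#include <stdio.h>

typedef struct {
    const char *name;
    double      overhead;   /* fixed per-message cost, sec (made-up numbers) */
    double      bandwidth;  /* bytes/sec (made-up numbers) */
} proto_t;

static double cost(const proto_t *p, size_t size)
{
    return p->overhead + (double)size / p->bandwidth;
}

int main(void)
{
    proto_t protos[] = {
        {"eager short",            0.1e-6,  4e9},
        {"eager copy-in copy-out", 0.4e-6,  8e9},
        {"rendezvous mapped copy", 2.0e-6, 20e9},
    };
    size_t n    = sizeof(protos) / sizeof(protos[0]);
    int    prev = -1;

    /* Walk message sizes and report where the cheapest protocol changes;
     * those switch points are the 'auto' thresholds. */
    for (size_t size = 1; size <= (1u << 24); size *= 2) {
        int best = 0;
        for (size_t i = 1; i < n; i++) {
            if (cost(&protos[i], size) < cost(&protos[best], size)) {
                best = (int)i;
            }
        }
        if (best != prev) {
            printf("from %zu bytes: %s\n", size, protos[best].name);
            prev = best;
        }
    }
    return 0;
}
```

With these made-up numbers the sweep switches protocols around 4 KB and 32 KB; the real code feeds in estimated fabric and memory-copy speeds instead of constants.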
@yosefe Yes, the output format has changed now; it is given below.
```
UCP context
    component 0  :  self
    component 1  :  tcp
    component 2  :  sysv
    component 3  :  posix
    component 4  :  cma
    component 5  :  xpmem
    md 0         :  component 2 sysv
    md 1         :  component 3 posix
    md 2         :  component 4 cma
    md 3         :  component 5 xpmem
    resource 0   :  md 0 dev 0 flags -- sysv/memory
    resource 1   :  md 1 dev 0 flags -- posix/memory
    resource 2   :  md 2 dev 0 flags -- cma/memory
    resource 3   :  md 3 dev 0 flags -- xpmem/memory
memory: 0.00MB, file descriptors: 6, create time: 1.254 ms

UCP worker 'lib-ssp-04:3366309'
    address: 180 bytes
memory: 0.00MB, file descriptors: 5, create time: 1.939 ms

UCP endpoint
    peer: lib-ssp-04:3366309
    lane[0]: 0:sysv/memory.0  md[0] -> md[0]/sysv/sysdev[255]  am am_bw#0
    lane[1]: 3:xpmem/memory.0 md[3] -> md[3]/xpmem/sysdev[255] rkey_ptr
    lane[2]: 2:cma/memory.0   md[2] -> md[2]/cma/sysdev[255]   rma_bw#0

+---------------------+-------------------------------------------------------------------+
| ucx_info self cfg#0 | tagged message by ucp_tag_send*(fast-completion) from host memory  |
+---------------------+----------------------------------------------------+--------------+
|               0..92 | eager short                                        | sysv/memory  |
|            93..8248 | eager copy-in copy-out                             | sysv/memory  |
|        8249..262143 | multi-frag eager copy-in copy-out                  | sysv/memory  |
|           256K..inf | (?) rendezvous copy from mapped remote memory      | xpmem/memory |
+---------------------+----------------------------------------------------+--------------+

+---------------------+-------------------------------------------------------------------+
| ucx_info self cfg#0 | tagged message by ucp_tag_send*(multi) from host memory            |
+---------------------+----------------------------------------------------+--------------+
|               0..92 | eager short                                        | sysv/memory  |
|            93..6212 | eager copy-in copy-out                             | sysv/memory  |
|           6213..inf | (?) rendezvous copy from mapped remote memory      | xpmem/memory |
+---------------------+----------------------------------------------------+--------------+

+---------------------+-------------------------------------------------------------------+
| ucx_info self cfg#0 | tagged message by ucp_tag_send* from host memory                   |
+---------------------+----------------------------------------------------+--------------+
|               0..92 | eager short                                        | sysv/memory  |
|            93..3367 | eager copy-in copy-out                             | sysv/memory  |
|           3368..inf | (?) rendezvous copy from mapped remote memory      | xpmem/memory |
+---------------------+----------------------------------------------------+--------------+
```
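For reference, a report like the one above can be printed with ucx_info, and forcing an explicit threshold shifts the ranges accordingly (the exact flags may differ between releases; check `ucx_info --help`):

```
$ ucx_info -e -u t                      # endpoint config + tag-send protocol selection
$ UCX_RNDV_THRESH=64K ucx_info -e -u t  # same report with an explicit rendezvous threshold
```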
Hi,
On my system, the UCX environment has the threshold values below:

```
UCX_MEM_DYNAMIC_MMAP_THRESH=y
UCX_TCP_SENDV_THRESH=2K
UCX_BCOPY_THRESH=auto
UCX_RNDV_THRESH=auto
UCX_RNDV_SEND_NBR_THRESH=256K
UCX_RNDV_THRESH_FALLBACK=inf
UCX_ZCOPY_THRESH=auto
UCX_TM_THRESH=1K
UCX_TM_FORCE_THRESH=8K
UCX_RNDV_PIPELINE_SEND_THRESH=inf
UCX_RNDV_ALIGN_THRESH=64K
```

When I run a simple MPI program with debug enabled, the ranges chosen are (0..92, 93..8248, 8249..262143, 262144..inf).

How are these values calculated? How do I choose the ranges that give the best data transfer rates (single-node shared memory)? Or should I just leave the tunable parameters at 'auto'?
--Arun