openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

How to tune the short, buffered, zcopy ranges on a single node(sm)? #9162

Open arunedarath opened 1 year ago

arunedarath commented 1 year ago

Hi,

On my system, the UCX environment has the threshold values below.

UCX_MEM_DYNAMIC_MMAP_THRESH=y
UCX_TCP_SENDV_THRESH=2K
UCX_BCOPY_THRESH=auto
UCX_RNDV_THRESH=auto
UCX_RNDV_SEND_NBR_THRESH=256K
UCX_RNDV_THRESH_FALLBACK=inf
UCX_ZCOPY_THRESH=auto
UCX_TM_THRESH=1K
UCX_TM_FORCE_THRESH=8K
UCX_RNDV_PIPELINE_SEND_THRESH=inf
UCX_RNDV_ALIGN_THRESH=64K

When I run a simple MPI program with debug logging enabled, the chosen ranges are 0-92, 93-8248, 8249-262143, and 262144-inf:

[1687695876.298579] [lib-ssp-04:3169896:0]   +----------------------------+----------------------------------------------------------------------+
[1687695876.298581] [lib-ssp-04:3169896:0]   | 0x160e1e0 intra-node cfg#1 | tagged message by ucp_tag_send*(fast-completion) from host memory    |
[1687695876.298582] [lib-ssp-04:3169896:0]   +----------------------------+-------------------------------------------------------+--------------+
[1687695876.298584] [lib-ssp-04:3169896:0]   |                      0..92 | eager short                                           | sysv/memory  |
[1687695876.298585] [lib-ssp-04:3169897:0]   |                   93..8248 | eager copy-in copy-out                                | sysv/memory  |
[1687695876.298587] [lib-ssp-04:3169897:0]   |               8249..262143 | multi-frag eager copy-in copy-out                     | sysv/memory  |
[1687695876.298589] [lib-ssp-04:3169897:0]   |                  256K..inf | (?) rendezvous copy from mapped remote memory         | xpmem/memory |
[1687695876.298593] [lib-ssp-04:3169897:0]   +----------------------------+-------------------------------------------------------+--------------+
[1687695876.298764] [lib-ssp-04:3169897:0]   +----------------------------+--------------------------------------------------------------+
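The ranges in a log like the one above can be extracted mechanically, which makes it easier to compare runs. The sketch below is a minimal parser written for this thread, not part of UCX: `ROW_RE`, `to_bytes`, and `parse_rows` are hypothetical helper names, and the row layout is assumed to match the lines shown here.

```python
import re

# Matches one data row of the table, for example:
#   |   93..8248 | eager copy-in copy-out   | sysv/memory  |
ROW_RE = re.compile(
    r"\|\s*(?P<lo>\w+)\.\.(?P<hi>\w+)\s*"
    r"\|\s*(?P<proto>[^|]+?)\s*"
    r"\|\s*(?P<tl>[^|]+?)\s*\|\s*$"
)

UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def to_bytes(token):
    """Convert '92', '256K' or 'inf' to a byte count ('inf' -> float('inf'))."""
    if token == "inf":
        return float("inf")
    suffix = token[-1].upper()
    if suffix in UNITS:
        return int(token[:-1]) * UNITS[suffix]
    return int(token)


def parse_rows(log_text):
    """Return (lo, hi, protocol, transport) for every range row in the log."""
    rows = []
    for line in log_text.splitlines():
        m = ROW_RE.search(line)
        if m:  # header and border lines do not match and are skipped
            rows.append((to_bytes(m["lo"]), to_bytes(m["hi"]),
                         m["proto"], m["tl"]))
    return rows
```

Parsing also makes it obvious that the `256K` lower bound in the last row is the same boundary as 262144, one byte past the end of the multi-fragment eager range.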

How are the above values calculated? How do I choose the ranges that give the best data transfer rates on a single node (shared memory)? Or should I leave the tunable parameters as auto?

--Arun
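One hedged way to approach the question empirically is to sweep a few threshold values and benchmark each combination against the `auto` baseline. The sketch below only builds the environments. `UCX_BCOPY_THRESH` and `UCX_RNDV_THRESH` are taken from the list above; `sweep_envs` is a hypothetical helper, and whether each variable actually moves a boundary depends on the UCX version and protocol-selection path, which this thread does not settle.

```python
import itertools
import os


def sweep_envs(bcopy_values, rndv_values, base_env=None):
    """Yield one environment dict per (UCX_BCOPY_THRESH, UCX_RNDV_THRESH) pair.

    Values are passed through as strings, so 'auto', '64' and '8K' all work.
    """
    base = dict(os.environ if base_env is None else base_env)
    for bcopy, rndv in itertools.product(bcopy_values, rndv_values):
        env = dict(base)  # copy so every run starts from the same baseline
        env["UCX_BCOPY_THRESH"] = str(bcopy)
        env["UCX_RNDV_THRESH"] = str(rndv)
        yield env
```

Each dict can be handed to `subprocess.run(cmd, env=env)`, where `cmd` is whatever benchmark command you already use. Including `auto` in both lists keeps the default configuration in the comparison.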

yosefe commented 1 year ago

The 'auto' thresholds are calculated from an internal performance estimate of which protocol would be faster, based on fabric speed, estimated memory copy speed, etc. For MPI_Send, the rndv threshold is calculated according to UCX_RNDV_SEND_NBR_THRESH. In general, it's recommended to keep the thresholds as 'auto'.
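To make the idea of an estimated threshold concrete, here is a toy model, not UCX's actual estimator: each protocol is given a fixed latency and a bandwidth, and the threshold is the message size at which the higher-latency, higher-bandwidth protocol starts to win. The function names and all numbers are assumptions chosen for illustration.

```python
import math


def transfer_time(size, latency_s, bandwidth_bps):
    """Linear cost model: fixed latency plus size divided by bandwidth."""
    return latency_s + size / bandwidth_bps


def crossover_size(lat_a, bw_a, lat_b, bw_b):
    """Approximate size (bytes) from which protocol B beats protocol A.

    Assumes B has the higher start-up latency and the higher bandwidth,
    as a rendezvous-style protocol would relative to an eager copy.
    """
    if not (lat_b > lat_a and bw_b > bw_a):
        raise ValueError("model expects B to be slower to start but faster per byte")
    # Solve lat_a + s / bw_a == lat_b + s / bw_b for s.
    s = (lat_b - lat_a) / (1.0 / bw_a - 1.0 / bw_b)
    return math.ceil(s)


# Hypothetical figures: A = 1 us latency at 5 GB/s, B = 3 us latency at 10 GB/s.
threshold = crossover_size(1e-6, 5e9, 3e-6, 10e9)
```

In this model, changing either the estimated copy bandwidth or the start-up cost moves the threshold, which is the kind of dependence on fabric and memory-copy speed described above.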

arunedarath commented 1 year ago

@yosefe Thanks for the reply.

Values:

tag_send: 0..<egr/short>..93..<egr/bcopy>..8256....(inf)

tag_send_nbr: 0..<egr/short>..93..<egr/bcopy>..262144....(inf)

tag_send_sync: 0..<egr/short>..93..<egr/bcopy>..8256....(inf)

From the sources, I can see that the eager_short limit is determined at compile time (if the corresponding env values are 'auto'):

iface_attr->cap.am.max_short = iface->config.fifo_elem_size - sizeof(uct_mm_fifo_element_t);

I can only reduce this value by setting UCX_BCOPY_THRESH to a small value.

The value of max_bcopy is also determined at compile time (if the corresponding env values are 'auto'):

iface_attr->cap.am.max_bcopy = iface->config.seg_size;

These values are initialized in ucp_ep_config_init(), and they come from the table below:

`ucs_config_field_t uct_mm_iface_configtable[] = {
{"SM", "ALLOC=md,mmap,heap;BW=15360MBs", NULL,
 ucs_offsetof(uct_mm_iface_config_t, super), UCS_CONFIG_TYPE_TABLE(uct_sm_iface_config_table)},

{"FIFO_SIZE", "64",
 "Size of the receive FIFO in the memory-map UCTs.",
 ucs_offsetof(uct_mm_iface_config_t, fifo_size), UCS_CONFIG_TYPE_UINT},

{"SEG_SIZE", "8256",
 "Size of send/receive buffers for copy-out sends.",
 ucs_offsetof(uct_mm_iface_config_t, seg_size), UCS_CONFIG_TYPE_MEMUNITS},

{"FIFO_RELEASE_FACTOR", "0.5",      
 "Frequency of resource releasing on the receiver's side in the MM UCT.\n"
 "This value refers to the percentage of the FIFO size. (must be >= 0 and < 1).",
 ucs_offsetof(uct_mm_iface_config_t, release_fifo_factor), UCS_CONFIG_TYPE_DOUBLE},

UCT_IFACE_MPOOL_CONFIG_FIELDS("RX_", -1, 512, 128m, 1.0, "receive",
                              ucs_offsetof(uct_mm_iface_config_t, mp), ""),

{"FIFO_HUGETLB", "no",                
 "Enable using huge pages for internal shared memory buffers."
 "Possible values are:\n"             
 " y   - Allocate memory using huge pages only.\n"
 " n   - Allocate memory using regular pages only.\n"
 " try - Try to allocate memory using huge pages and if it fails, allocate regular pages.",
 ucs_offsetof(uct_mm_iface_config_t, hugetlb_mode), UCS_CONFIG_TYPE_TERNARY},

{"FIFO_ELEM_SIZE", "128",
 "Size of the FIFO element size (data + header) in the MM UCTs.",
 ucs_offsetof(uct_mm_iface_config_t, fifo_elem_size), UCS_CONFIG_TYPE_UINT},

{"FIFO_MAX_POLL", UCS_PP_MAKE_STRING(UCT_MM_IFACE_FIFO_MAX_POLL),
 "Maximal number of receive completions to pick during RX poll",
 ucs_offsetof(uct_mm_iface_config_t, fifo_max_poll), UCS_CONFIG_TYPE_ULUNITS},

{"ERROR_HANDLING", "n", "Expose error handling support capability",
 ucs_offsetof(uct_mm_iface_config_t, error_handling), UCS_CONFIG_TYPE_BOOL},

{NULL}

};
`

Please correct me if the above understanding is wrong. I am raising this point because I cannot find any code that dynamically calculates the above ranges based on the type of CPU (Milan, Genoa, etc.) or the memory copy speed.

--Arun
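A quick arithmetic check relates the defaults in the quoted table (SEG_SIZE 8256, FIFO_ELEM_SIZE 128) to the limits seen in the debug log (92 and 8248). The sketch below only computes the differences. Reading them as header overhead is an inference from the numbers, not something the thread confirms, and `implied_overhead` is a hypothetical helper.

```python
SEG_SIZE = 8256             # "SEG_SIZE" default in the quoted config table
FIFO_ELEM_SIZE = 128        # "FIFO_ELEM_SIZE" default in the quoted config table
OBSERVED_MAX_SHORT = 92     # upper end of "eager short" in the debug log
OBSERVED_MAX_BCOPY = 8248   # upper end of single-fragment "eager copy-in copy-out"


def implied_overhead(configured_size, observed_payload_limit):
    """Bytes of the configured buffer that are not available as payload."""
    return configured_size - observed_payload_limit


bcopy_overhead = implied_overhead(SEG_SIZE, OBSERVED_MAX_BCOPY)
short_overhead = implied_overhead(FIFO_ELEM_SIZE, OBSERVED_MAX_SHORT)

print("bcopy overhead:", bcopy_overhead, "bytes")
print("short overhead:", short_overhead, "bytes")
```

The bcopy difference is 8 bytes and the short difference is 36 bytes. The 36 would have to cover `sizeof(uct_mm_fifo_element_t)` plus any protocol header, but the split between the two is not stated anywhere in this thread.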

yosefe commented 1 year ago

@arunedarath we've just fixed the ucx_info report in the master branch; can you please check it again? The code for dynamic selection is in src/ucp/proto, mostly proto_select.c and proto_init.c.

arunedarath commented 1 year ago

@yosefe Yes, the output format has changed now; it is given below.

UCP context

 component 0  :  self
 component 1  :  tcp
 component 2  :  sysv
 component 3  :  posix
 component 4  :  cma
 component 5  :  xpmem

        md 0  :  component 2  sysv 
        md 1  :  component 3  posix 
        md 2  :  component 4  cma 
        md 3  :  component 5  xpmem 

  resource 0  :  md 0  dev 0  flags -- sysv/memory
  resource 1  :  md 1  dev 0  flags -- posix/memory
  resource 2  :  md 2  dev 0  flags -- cma/memory
  resource 3  :  md 3  dev 0  flags -- xpmem/memory

memory: 0.00MB, file descriptors: 6 create time: 1.254 ms

UCP worker 'lib-ssp-04:3366309'

             address: 180 bytes

memory: 0.00MB, file descriptors: 5 create time: 1.939 ms

UCP endpoint

           peer: lib-ssp-04:3366309
             lane[0]:  0:sysv/memory.0 md[0]           -> md[0]/sysv/sysdev[255] am am_bw#0
             lane[1]:  3:xpmem/memory.0 md[3]          -> md[3]/xpmem/sysdev[255] rkey_ptr
             lane[2]:  2:cma/memory.0 md[2]            -> md[2]/cma/sysdev[255] rma_bw#0

+---------------------+----------------------------------------------------------------------+
| ucx_info self cfg#0 | tagged message by ucp_tag_send*(fast-completion) from host memory    |
+---------------------+-------------------------------------------------------+--------------+
|               0..92 | eager short                                           | sysv/memory  |
|            93..8248 | eager copy-in copy-out                                | sysv/memory  |
|        8249..262143 | multi-frag eager copy-in copy-out                     | sysv/memory  |
|           256K..inf | (?) rendezvous copy from mapped remote memory         | xpmem/memory |
+---------------------+-------------------------------------------------------+--------------+

+---------------------+---------------------------------------------------------------+
| ucx_info self cfg#0 | tagged message by ucp_tag_send*(multi) from host memory       |
+---------------------+------------------------------------------------+--------------+
|               0..92 | eager short                                    | sysv/memory  |
|            93..6212 | eager copy-in copy-out                         | sysv/memory  |
|           6213..inf | (?) rendezvous copy from mapped remote memory  | xpmem/memory |
+---------------------+------------------------------------------------+--------------+

+---------------------+---------------------------------------------------------------+
| ucx_info self cfg#0 | tagged message by ucp_tag_send* from host memory              |
+---------------------+------------------------------------------------+--------------+
|               0..92 | eager short                                    | sysv/memory  |
|            93..3367 | eager copy-in copy-out                         | sysv/memory  |
|           3368..inf | (?) rendezvous copy from mapped remote memory  | xpmem/memory |
+---------------------+------------------------------------------------+--------------+
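The three tables differ mainly in where rendezvous takes over: 256K for the fast-completion variant, 6213 for multi, and 3368 for the unqualified one. A small lookup makes it easy to see which protocol a given message size would use in each case. The boundaries below are copied from the tables; the variant labels (in particular `"default"` for the unqualified `ucp_tag_send*` table) and the function name are my own.

```python
import bisect

RNDV = "rendezvous copy from mapped remote memory"

# (first size of the range, protocol), copied from the three tables above.
TABLES = {
    "fast-completion": [(0, "eager short"),
                        (93, "eager copy-in copy-out"),
                        (8249, "multi-frag eager copy-in copy-out"),
                        (262144, RNDV)],
    "multi": [(0, "eager short"),
              (93, "eager copy-in copy-out"),
              (6213, RNDV)],
    "default": [(0, "eager short"),
                (93, "eager copy-in copy-out"),
                (3368, RNDV)],
}


def protocol_for(variant, size):
    """Return the protocol listed for a message of `size` bytes."""
    if size < 0:
        raise ValueError("size must be non-negative")
    rows = TABLES[variant]
    starts = [start for start, _ in rows]
    # bisect_right finds the first range that starts after `size`;
    # the range just before it is the one containing `size`.
    return rows[bisect.bisect_right(starts, size) - 1][1]


for variant in TABLES:
    print(variant, "->", protocol_for(variant, 8192))
```

For an 8 KiB message, this reports eager copy-in copy-out for the fast-completion table and rendezvous for the other two.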