Instead of allocating a window for every call to ARMCI_Malloc, allocate a single slab sized according to the limit given to GA_Initialize_ltd and suballocate from it. This obviates the need for any GMR lookups, which reduces latency.
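A minimal sketch of what suballocating from one slab could look like, assuming a simple bump allocator. The names `slab_t`, `slab_init`, and `slab_alloc` are hypothetical and not part of ARMCI-MPI. The sketch ignores the collective nature of ARMCI_Malloc and does not handle ARMCI_Free, which would need a real suballocator rather than a bump pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical slab descriptor (not an ARMCI-MPI type). In the proposal,
 * the memory behind `base` would be exposed through a single MPI window. */
typedef struct {
    char  *base;      /* start of the slab */
    size_t capacity;  /* total bytes available */
    size_t used;      /* bytes handed out so far */
} slab_t;

static void slab_init(slab_t *s, void *mem, size_t capacity)
{
    assert(s != NULL && mem != NULL);
    s->base     = mem;
    s->capacity = capacity;
    s->used     = 0;
}

/* Bump-allocate `bytes` at an offset rounded up to `align` (a power of two).
 * Only the offset is aligned, so `base` is assumed to be suitably aligned.
 * Returns NULL when the request does not fit, so the caller can fall back
 * to a per-call window. */
static void *slab_alloc(slab_t *s, size_t bytes, size_t align)
{
    assert(s != NULL);
    assert(align != 0 && (align & (align - 1)) == 0);

    size_t start = (s->used + align - 1) & ~(align - 1);
    if (start > s->capacity || bytes > s->capacity - start)
        return NULL;

    s->used = start + bytes;
    return s->base + start;
}
```

Because every allocation is an offset into the same slab, a remote address can be translated to a window displacement by subtracting the slab base, with no per-allocation metadata.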
A more general approach would be to use a slab of some fixed capacity and then allocate a window per call for any allocation in excess of it. The GMR lookup would then start with a simple range check (fast) and only hit the O(n) linked-list traversal for allocations made after the slab capacity was exceeded.
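The two-tier lookup can be sketched as follows. This is an illustration under simplifying assumptions, not ARMCI-MPI's actual GMR code: `region_t`, `lookup_t`, and `lookup` are hypothetical names, and the sketch only considers addresses on a single process, whereas the real lookup is keyed on an address and a process rank.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one window-backed allocation
 * (a stand-in for a GMR list entry, not the real structure). */
typedef struct region {
    uintptr_t      base;  /* base address of the allocation */
    size_t         size;  /* size in bytes */
    struct region *next;  /* next record in the linked list */
} region_t;

/* Hypothetical lookup state: one slab plus the overflow list. */
typedef struct {
    uintptr_t slab_base;  /* start of the slab */
    size_t    slab_size;  /* slab capacity in bytes */
    region_t *overflow;   /* allocations made once the slab was full */
} lookup_t;

typedef enum { LOOKUP_NONE = 0, LOOKUP_SLAB, LOOKUP_OVERFLOW } lookup_kind_t;

/* Classify an address: O(1) range check first, O(n) list walk only on a miss.
 * On an overflow hit, *hit (if non-NULL) is set to the matching record. */
static lookup_kind_t lookup(const lookup_t *t, const void *ptr,
                            const region_t **hit)
{
    assert(t != NULL);
    uintptr_t p = (uintptr_t)ptr;
    if (hit != NULL)
        *hit = NULL;

    /* Fast path. Written as a subtraction so base + size cannot overflow. */
    if (p >= t->slab_base && p - t->slab_base < t->slab_size)
        return LOOKUP_SLAB;

    /* Slow path: walk the per-call allocations. */
    for (const region_t *r = t->overflow; r != NULL; r = r->next) {
        if (p >= r->base && p - r->base < r->size) {
            if (hit != NULL)
                *hit = r;
            return LOOKUP_OVERFLOW;
        }
    }
    return LOOKUP_NONE;
}
```

If the slab is sized to the GA memory limit, the overflow list stays empty and every lookup ends at the range check.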
The key to implementing this is ARMCI_Set_shm_limit, which tells ARMCI the upper bound on how much storage it will be asked to allocate:
Global Arrays, global/src/base.c, line 374:

    if(GA_memory_limited) ARMCI_Set_shm_limit(GA_total_memory);
    if (_ga_initialize_c) {
        if (_ga_initialize_args) {
            ARMCI_Init_args(_ga_argc, _ga_argv);
        }
        else {
            ARMCI_Init();
        }
    }

NWChem always uses ga_initialize_ltd and requires the user to specify the GA allocation size, so no changes to NWChem are required to activate this.
Migrated from https://github.com/jeffhammond/armci-mpi/issues/21