oneapi-src / unified-memory-framework

A library for constructing allocators and memory pools. It also contains broadly useful abstractions and utilities for memory management. UMF allows users to manage multiple memory pools characterized by different attributes, allowing certain allocation types to be isolated from others and allocated using different hardware resources as required.
https://oneapi-src.github.io/unified-memory-framework/
Other
41 stars 29 forks source link

Backward compatibility for configs of pools/providers #844

Open vinser52 opened 4 weeks ago

vinser52 commented 4 weeks ago

Rationale

UMF should provide backward compatible interfaces.

Description

Background

MPI team experienced the issue after the PR #692 extends level_zero_memory_provider_params_t config structure. The root cause of the issue relates to how MPI instantiate/initialize the L0 provider config. They did the following:

level_zero_memory_provider_params_t l0_config = {
    .level_zero_context_handle = hContext;
    .level_zero_device_handle = hDevice;
    .memory_type = UMF_MEMORY_TYPE_DEVICE;
};

umf_memory_provider_handle_t hProvider = NULL;
umf_memory_provider_ops_t *l0_ops = umfLevelZeroMemoryProviderOps();
umfMemoryProviderCreate(l0_ops, &l0_config, &hProvider);

After PR #692 the code above start crashing because two new fields were added to the level_zero_memory_provider_params_t data structure and the example above does not initialize these new fields:

typedef struct level_zero_memory_provider_params_t {
    ze_context_handle_t level_zero_context_handle;
    ze_device_handle_t level_zero_device_handle;
    umf_usm_memory_type_t memory_type;

    // New fields
    ze_device_handle_t *resident_device_handles;
    uint32_t resident_device_count;
} level_zero_memory_provider_params_t;

A quick fix for the issue was to init the config data structure with 0 and than assign the required fields:

level_zero_memory_provider_params_t l0_config = { 0 }; // zero-initialize
l0_config.level_zero_context_handle = hContext;
l0_config.level_zero_device_handle = hDevice;
l0_config.memory_type = UMF_MEMORY_TYPE_DEVICE;

Open Questions

The quick fix above works only because it is OK to init fields of the level_zero_memory_provider_params_t structure with zeroes. But there are 2 related major question we should address:

1. How to initialize the configs with default values?

In general case, not every field of the config data structure could/should be initialized with zeroes.

2. How to support backward compatibility?

Config data structures are defined in the headers. If application was built with old version of UMF but on the system there is a newer version than even the application initialize all config's fields properly it initializes only fields that exists in the old version of UMF. Consider an example. There is a provider_foo and corresponding foo_config_t structure that contains an int field in the 1st version:

// provider_foo.h:
typedef struct foo_params_t {
    int field 1;
} foo_config_t;

// provider_foo.c:
umf_result_t foo_provider_initialize(void *params, void **provider) {
    foo_params_t *foo_params = (foo_params_t *)params;
    // expected sizeof(foo_params_t) is equal sizeof(int) that is 4 bytes

    assert(foo_params->field1 == 0);
}

// application code:
int main() {
    foo_params_t foo_params = {0};

    umf_memory_provider_handle_t hProvider = NULL;
    umf_memory_provider_ops_t *foo_ops = umfFooMemoryProviderOps();
    umfMemoryProviderCreate(foo_ops, &foo_params, &hProvider);
}

Now in the 2nd version of UMF we extend the foo_params_t structure with additional field:

// provider_foo.h:
typedef struct foo_params_t {
    int field 1;
    int field2;
} foo_config_t;

// provider_foo.c:
umf_result_t foo_provider_initialize(void *params, void **provider) {
    foo_params_t *foo_params = (foo_params_t *)params;
    // expected sizeof(foo_params_t) is equal 2*sizeof(int) that is 8 bytes

    assert(foo_params->field1 == 0);
    assert(foo_params->field2 == 0); // ERROR: if application was compiled with 1st version of the UMF lib then the size of memory pointed by `params` is 4 bytes
}

Possible API Changes

Option 1: handle-based approach

Do not expose the config structure in interfaces. The config object is allocated inside libumf.so and handle to the config is returned to the client code. Setter/getter APIs are used to setup config parameters. For example:

foo_params_handle_t foo_params = umfCreateFooParams();
umfFooParamsSetField1(foo_params, 7);

umf_memory_provider_handle_t hProvider = NULL;
umf_memory_provider_ops_t *foo_ops = umfFooMemoryProviderOps();
umfMemoryProviderCreate(foo_ops, foo_params, &hProvider);

Client code does not depend on the layout of the foo_params_t structure only UMF knows it and can change in different versions.

Option 2: explicitly pass the size of params

Data structures that define configs for the pools/providers remains in headers (the same as today). The new fields can be added only to the end. Application explicitly passes the size of the config data structure. Provider/pool implementation determine the version of the config based on the size. To init the config data structure we need to introduce a special macros or header-based inline functions. For example:

// Version 1 provider_foo.h:
typedef struct foo_params_t {
    int field 1;
} foo_config_t;

inline void umfFooParamsInit(foo_params_t *params) {
     params->field1 = 0;
}

// Version 2 provider_foo.h:
typedef struct foo_params_t {
    int field 1;
    int field2;
} foo_config_t;

inline void umfFooParamsInit(foo_params_t *params) {
     params->field1 = 0;
     params->field2 = 0;
}

// Version 2 provider_foo.c:
umf_result_t foo_provider_initialize(void *params, size_t params_size, void **provider) {
    foo_params_t foo_params;
    umfFooParamsInit(& foo_params);

    if(sizeof(foo_params_t) == params_size) {
         // current version, just copy input params as is.
        foo_params = *(foo_params_t*)params;
    } else {
        // old version, copy only first `params_size` bytes and keep the rest default initialized
        memcpy(& foo_params, params, params_size);
    }

    assert(foo_params->field1 == 0);
    assert(foo_params->field2 == 0);
}

// application code:
int main() {
    foo_params_t foo_params;
    umfFooParamsInit(&params);

    umf_memory_provider_handle_t hProvider = NULL;
    umf_memory_provider_ops_t *foo_ops = umfFooMemoryProviderOps();
    umfMemoryProviderCreate(foo_ops, &foo_params, sizeof(foo_params_t), &hProvider);
}

Option 3: store version in the params data structure

Similar to Option 2, but instead of using the size of params data structure to determine the version store it explicitly as a first field in the params data structure.

// Version 1 provider_foo.h:
typedef struct foo_params_t {
    int version;
    int field 1;
} foo_config_t;

inline void umfFooParamsInit(foo_params_t *params) {
    version = 1;
    params->field1 = 0;
}

// Version 2 provider_foo.h:
typedef struct foo_params_t {
    int field 1;
    int field2;
} foo_config_t;

inline void umfFooParamsInit(foo_params_t *params) {
    version = 2;
    params->field1 = 0;
    params->field2 = 0;
}

// Version 2 provider_foo.c:
umf_result_t foo_provider_initialize(void *params, size_t params_size, void **provider) {
    foo_params_t foo_params;
    umfFooParamsInit(& foo_params);

    int params_version = *(int *)params;

    switch(params_version) {
        case 1:
            ...
            break;
        case 2:
            ...
            break;
        default:
            LOG_ERR("Wrong version");
    }

    assert(foo_params->field1 == 0);
    assert(foo_params->field2 == 0);
}

// application code:
int main() {
    foo_params_t foo_params;
    umfFooParamsInit(&params);

    umf_memory_provider_handle_t hProvider = NULL;
    umf_memory_provider_ops_t *foo_ops = umfFooMemoryProviderOps();
    umfMemoryProviderCreate(foo_ops, &foo_params, &hProvider);
}

Design Considerations

I prefer Option 1 over Option 2 and Option 3. I think it provides best ABI compatibility. Another advantage is that the config data structure always initialized when user calls umfCreateFooParams(). There is no way to get non-initialized fields in the config, while in case of Option 2 or 3 user might forget to call umfFooParamsInit(&params) function.

Between Options 2 and 3, I prefer Option 3 because of two things:

Implementation details

TBD

vinser52 commented 4 weeks ago

We must address this issue before the first stable release because all three options require non-backward compatible changes.

pbalcer commented 4 weeks ago

:+1: to option 1.

Another option is to do what UR does and create a "chain" (linked list) of parameters.

struct foo_param {
  param_type type;
  int foo;
  void *next;
}

struct bar_param {
  param_type type;
  int bar;
  void *next;
}

struct foo_param foo = { PARAM_FOO, 5, nullptr };
struct bar_param bar = { PARAM_BAR, 10, &foo };

someApi(&bar); // this would apply all options in the chain.

This is super verbose, but really flexible. One downside is that it's on the api function to validate the parameters, and users need to remember to correctly set param type.

igchor commented 4 weeks ago

I also vote for option 1. What UR does is also OK but from my experience it's quite easy to forget setting the type (and we even had a bug in UMF where we set the type incorrectly).

One question: how's responsible for destroying the params - user or UMF?

vinser52 commented 3 weeks ago

One question: how's responsible for destroying the params - user or UMF?

In the case of option 1, User calls umfCreateFooParams() function which internally allocates required data structure. After pool/provider is created user should call the appropriate umfDestroyFooParams function to tell UMF library that params data structure should be destroyed.

lplewa commented 3 weeks ago

Option 1 > Option 3 >>>> Option 2

~The same issue we have with other structures in public header. Also we have the same issue with other structures, like memprovider ops - in case of those structures probably option 3 (version field) is the way.~

Edit. I checked code and i found that we already have version in ops structure so it is fine.