jatinchowdhury18 / RTNeural

Real-time neural network inferencing
BSD 3-Clause "New" or "Revised" License

RFC: Allow use of abstract ModelT #88

falkTX closed this issue 1 year ago

falkTX commented 1 year ago

@KaisKermani and I have been testing RTNeural in the context of the https://github.com/AidaDSP/aidadsp-lv2 plugin, trying to find a way to load json files at runtime without losing performance compared to the statically constructed model types. One approach that proved to work is to make an abstract Model class and use it as the pointer type the plugin calls into, which allows something like this:

// in plugin header
AbstractModelT* dynamicmodel;

// in runtime, after loading a json file to figure out what architecture it needs
dynamicmodel = new RTNeural::ModelT<float, 1, 1, RTNeural::LSTMLayerT<float, 1, 40>, RTNeural::DenseT<float, 40, 1>>;

Compared to the more dynamic handling of the json namespace methods, this approach gives performance equal to creating the model with all parameters set in code (somewhat expected, as all template parameters are still passed statically). Our idea is to have a verbose/extended json parser that creates all known architectures in this "static" way, so no performance is lost.

This is a first test patch, meant for general feedback. A proper patch would at least need the abstract class to be templated on the float/double/etc. type. I am also not sure what to call the new forwardf method; we just need it not to be "forward", so as to avoid recursion and name clashes.
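
For illustration only, here is a rough sketch of the shape such an interface could take; the actual patch adds the virtual method to ModelT itself, and everything below (including the wrapper) is hypothetical:

// A minimal sketch, assuming ModelT<float, 1, 1, ...>::forward() takes a
// pointer to the input frame and returns the single output sample.
struct AbstractModelT
{
    virtual ~AbstractModelT() = default;
    virtual float forwardf (const float* input) = 0;
};

// Hypothetical wrapper owning a statically-typed model; the real patch
// instead makes ModelT derive from the abstract class directly.
template <typename ConcreteModelT>
struct WrappedModelT : AbstractModelT
{
    ConcreteModelT model;

    float forwardf (const float* input) override
    {
        return model.forward (input);
    }
};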

What do you think?

jatinchowdhury18 commented 1 year ago

So just to make sure I'm correctly understanding the proposed use case, the idea is that you want to have a system that can dynamically create/load a ModelT for any model that falls within a subset of allowed architectures? Of course it wouldn't be possible to create a system that can do this sort of thing for all network architectures that RTNeural supports, since there are infinitely many such architectures.

Given that this is the problem you're trying to solve, I don't think the proposed solution is the best way to go. Part of the design philosophy for ModelT (and for all of the SomethingT layers) is that they should be usable without requiring heap allocation or function calls through an abstract interface. It seems to me that the proposed solution goes against this philosophy, but again, please correct me if I'm misunderstanding.

The way I have done this sort of thing in the past is as follows:

using ModelType1 = RTNeural::ModelT<...>;
using ModelType2 = RTNeural::ModelT<...>;
using ModelVariant = std::variant<ModelType1, ModelType2, ...>;
ModelVariant model;

void custom_model_loader (const nlohmann::json& model_json, ModelVariant& model)
{
  if (is_model_type1 (model_json))
    model.emplace<ModelType1>();
  else if (is_model_type2 (model_json))
    model.emplace<ModelType2>();

  std::visit ([&model_json] (auto&& chosen_model) { chosen_model.parseJson(model_json); }, model);
}

void process_audio (float* data, int num_samples)
{
  std::visit ([&data, num_samples] (auto&& chosen_model)
    {
      for (int n = 0; n < num_samples; ++n)
        data[n] = chosen_model.forward (&data[n]);
    }, model);
}

This strategy allows the model to be created in local memory rather than heap memory, and avoids the need for an abstract function call in the innermost processing loop. If you don't care for std::variant or if you're unable to use C++17, then there are alternatives available, but the same strategy should still apply. I haven't gone this far myself, but I think it should be possible to programmatically generate some of the code for constructing the variant type and determining the model type to load.
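
For illustration, a minimal sketch of what one of those is_model_type helpers might look like (the "layers"/"type"/"shape" field names are an assumption about the json layout, not a documented format):

#include <nlohmann/json.hpp>

bool is_model_type1 (const nlohmann::json& model_json)
{
    // Hypothetical check for, e.g., LSTM(1 -> 40) followed by Dense(40 -> 1);
    // the json field names here are assumed, not taken from a spec.
    const auto& layers = model_json.at ("layers");
    if (layers.size() != 2)
        return false;

    return layers[0].at ("type") == "lstm"
        && layers[0].at ("shape").back() == 40
        && layers[1].at ("type") == "dense";
}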

Anyway, if you can think of a way to make some helper functions or classes to encapsulate this logic, I'd be happy to add that to RTNeural. At the very least, I think it would be a good idea for me to add an example and some documentation to let folks know about this strategy.

falkTX commented 1 year ago

So just to make sure I'm correctly understanding the proposed use case, the idea is that you want to have a system that can dynamically create/load a ModelT for any model that falls within a subset of allowed architectures?

Yes, supporting as many architectures as we can write out in verbose code that checks for known conditions. Basically we're trying to get the benefits of static initialization while allowing a wide range of json parameters.

Part of the design philosophy for ModelT (and for all of the SomethingT layers) is that they should be usable without requiring heap allocation or function calls through an abstract interface. It seems to me that the proposed solution goes against this philosophy, but again, please correct me if I'm misunderstanding.

For the plugin in question we actually want to do heap allocation, which is done in worker threads synchronized with the plugin host. We do a pointer swap when loading new models, the entire thing being thread-safe and lock-free as it is integrated with the host threads (on LV2 this can be done with the worker extension, CLAP has similar concepts too).

Basically:

  1. We have a single active model pointer instance running.
  2. The user selects a new file; the event arrives in LV2 via a message on the process thread, and we ask for a non-RT worker context.
  3. The new file is loaded in that non-RT worker context, and the resulting pointer is handed back to the host for swapping on the RT side.
  4. On the next RT process cycle we do the pointer swap, and hand the old instance pointer to yet another non-RT worker context for deletion.

For this to work we really need to be working with direct pointers.
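
To make that constraint concrete, here is a simplified sketch of the kind of POD message that travels through the host's worker queues in the sequence above (not the actual aidadsp-lv2 code; the helper names are made up):

struct SomeModel; // stand-in for whatever concrete model type is used

// The only thing crossing the RT/non-RT boundary is a plain pointer inside
// a POD struct, which the host's worker API can copy byte-for-byte.
struct SwapMessage
{
    SomeModel* model;
};

// non-RT worker (steps 2-3): build the model on the heap, hand the POD
// message back to the host:
//     SwapMessage msg { createModelFromFile (path) };   // hypothetical helper
//     respond_to_host (msg);                            // hypothetical helper
//
// RT thread (step 4): swap pointers, then schedule the old one for deletion
// in another non-RT worker context:
//     SomeModel* old = activeModel;
//     activeModel = msg.model;
//     schedule_deletion (SwapMessage { old });          // hypothetical helper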

I haven't gone this far myself, but I think it should be possible to programmatically generate some of the code for constructing the variant type and determining the model type to load.

From what we have tested, this approach incurs a significant performance hit and becomes unsuitable for the plugin (we are running it on a 1.3GHz ARM processor after all; there is not a whole lot of CPU to go around...)

Virtual methods were my initial thought for something that could work for a generic loader plugin. I am open to other suggestions that:

  1. Can be used as heap memory, so that a pointer swap is possible (data copied to/from RT and non-RT contexts needs to be plain-old-data; polymorphic C++ objects are incompatible with the LV2 and CLAP worker APIs)
  2. Have a generic interface that does not consume a lot of memory (e.g. allocating all possible architecture types is a no-go; it doesn't scale)

Using a union of pointers (or std::variant for a more modern C++ take on it) could be an option, but then every run would need to have a big switch case to find the right model instance type to call.

jatinchowdhury18 commented 1 year ago

Okay great, it seems like we are talking about essentially the same use-case :).

For the plugin in question we actually want to do heap allocation, which is done in worker threads synchronized with the plugin host. We do a pointer swap when loading new models, the entire thing being thread-safe and lock-free as it is integrated with the host threads (on LV2 this can be done with the worker extension, CLAP has similar concepts too).

Sure, so my point is that any solution built into RTNeural::ModelT should not require heap allocation. Of course, folks implementing their own systems are free to allocate ModelTs on the heap if/when it makes sense to do so.

That said, I think that in this case, the non-heap-allocated, variant-based solution still makes the most sense. For example, if you ignore host-managed threads, you could use ints and atomic_ints as shown below (untested). I imagine that with host-managed threads, things could become even simpler.

using ModelType1 = RTNeural::ModelT<...>;
using ModelType2 = RTNeural::ModelT<...>;
using ModelVariant = std::variant<ModelType1, ModelType2, ...>;
ModelVariant models [2];
std::atomic_int active_model_index { 0 };
int inactive_model_index { 1 };

// on some non-real-time thread...
void custom_model_loader (const nlohmann::json& model_json)
{
  auto& inactive_model = models[inactive_model_index];
  if (is_model_type1 (model_json))
    inactive_model.emplace<ModelType1>();
  else if (is_model_type2 (model_json))
    inactive_model.emplace<ModelType2>();

  std::visit ([&model_json] (auto&& chosen_model) { chosen_model.parseJson(model_json); }, inactive_model);

  // swap the active/inactive model indices here!
  inactive_model_index = active_model_index.exchange (inactive_model_index);
}

// on the real-time thread...
void process_audio (float* data, int num_samples)
{
  std::visit ([&data, num_samples] (auto&& chosen_model)
    {
      for (int n = 0; n < num_samples; ++n)
        data[n] = chosen_model.forward (&data[n]);
    }, models[active_model_index.load()]);
}

From what we have tested, this approach incurs a significant performance hit and becomes unsuitable for the plugin (we are running it on a 1.3GHz ARM processor after all; there is not a whole lot of CPU to go around...)

Would it be possible to explain a little bit more where this performance hit is coming from? From my own testing, I haven't been able to measure any performance difference between a plain ModelT and a std::variant<ModelT<...>, ...> with the same internal network architecture. That said, if you're referring to the performance hit of how long it takes to load a model, that's not something I've measured before. If you're referring to some performance hit associated with programmatically generating the code for the std::variant, that's not something I've actually tried before, but again I'd be curious to know where/why exactly the performance bottleneck is happening.

By contrast, I have found that constructing a ModelT on the heap can cause a measurable performance hit.

Virtual methods were my initial thought for something that could work for a generic loader plugin. I am open to other suggestions that:

  1. Can be used as heap memory, so that a pointer swap is possible (data copied to/from RT and non-RT contexts needs to be plain-old-data; polymorphic C++ objects are incompatible with the LV2 and CLAP worker APIs)
  2. Have a generic interface that does not consume a lot of memory (e.g. allocating all possible architecture types is a no-go; it doesn't scale)

Sure, so I think the variant-based solution that I've proposed satisfies both these requirements. For requirement 1 (as I've shown above), it's possible to set it up so that the only data that would need to be copied between the real-time and non-real-time threads is a single array index. Or if you prefer, I guess you could use a pointer to a std::variant<ModelT<...>, ...>.

For requirement 2, in my experience, std::variant only requires approximately as much memory as the largest thing in the variant (see this example), although I guess that could depend on the implementation.
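
As a quick standalone illustration of that memory point (not RTNeural-specific):

#include <array>
#include <cstdio>
#include <variant>

int main()
{
    using Small = std::array<float, 4>;
    using Large = std::array<float, 1024>;
    using Var   = std::variant<Small, Large>;

    // Typically sizeof(Var) is sizeof(Large) plus a small discriminator for
    // the active alternative; exact numbers are implementation-defined.
    std::printf ("small: %zu, large: %zu, variant: %zu\n",
                 sizeof (Small), sizeof (Large), sizeof (Var));
}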

Using a union of pointers (or std::variant for a more modern C++ take on it) could be an option, but then every run would need to have a big switch case to find the right model instance type to call.

Right, so that's basically what I'm proposing, although I would say it's better to use the direct ModelT types rather than pointers. The massive switch-case can be tidied up for you with std::visit. The real benefit here (from a performance standpoint) is that the switch-case happens per-block rather than per-sample; per-sample dispatch is essentially what you get with the polymorphic solution.


Anyway, I'm a bit busy this week, but I'll see if I can get a working example up and running sometime before the end of the week.

falkTX commented 1 year ago

that all looks very good and promising, thanks for the detailed info!

There is one crucial point though: the creation of the new model and its swap are not done in the same function. A non-RT worker context creates models; these are given to the host, which calls back into the plugin after it has finished processing (so the data structure needs to be POD-compatible). It is this sequence that allows lock-free, RT-safe model switching, because the "pointer swap" is triggered by the host after the plugin has finished processing audio.

From your example code I don't see how we would create a model and pass it through the host without relying on the plugin class holding it in some kind of variable for what to handle next. It is quite important that, when we create new models, we can pass the pointer/instance/whatever to the host directly without storing it ourselves (doing so leads to race conditions when multiple files are scheduled to be loaded quickly one after the other).

From what we have tested, this approach incurs a significant performance hit and becomes unsuitable for the plugin (we are running it on a 1.3GHz ARM processor after all; there is not a whole lot of CPU to go around...)

Would it be possible to explain a little bit more where this performance hit is coming from? From my own testing, I haven't been able to measure any performance difference between a plain ModelT and a std::variant<ModelT<...>, ...> with the same internal network architecture

I meant when creating the model in a dynamic way, defining its architecture while parsing the json file.

For pointers or std::variant, performance should be pretty similar if not the same.

By contrast, I have found that constructing a ModelT on the heap can cause a measurable performance hit.

Can you give more information about this? From our testing the performance is pretty much the same using stack or heap allocated models.

Right, so that's basically what I'm proposing, although I would say it's better to use the direct ModelT types rather than pointers. The massive switch-case can be tidied up for you with std::visit. The real benefit here (from a performance standpoint) is that the switch-case happens per-block rather than per-sample; per-sample dispatch is essentially what you get with the polymorphic solution.

Can you clarify this too? Do you mean that using my proposed abstract pointer solution would lead to per-sample processing?

jatinchowdhury18 commented 1 year ago

that all looks very good and promising, thanks for the detailed info!

No problem!

There is one crucial point though: the creation of the new model and its swap are not done in the same function. A non-RT worker context creates models; these are given to the host, which calls back into the plugin after it has finished processing (so the data structure needs to be POD-compatible). It is this sequence that allows lock-free, RT-safe model switching, because the "pointer swap" is triggered by the host after the plugin has finished processing audio.

From your example code I don't see how we would create a model and pass it through the host without relying on the plugin class holding it in some kind of variable for what to handle next. It is quite important that, when we create new models, we can pass the pointer/instance/whatever to the host directly without storing it ourselves (doing so leads to race conditions when multiple files are scheduled to be loaded quickly one after the other).

Right, so doing the swap inside the model-loading function was only part of my example because it assumed that host-managed worker threads don't exist. In cases where they do exist, I guess the correct solution would depend on how exactly they work, which I'm not immediately familiar with. From what I do know, I think my preferred solution would be:

But like I said, there are some details here that I'm not very familiar with, so I trust that you'll do the right thing for your use case. If you do end up using heap-allocated objects, I'd suggest using something like a ModelVariant* rather than something like an AbstractModelT*, for the other reasons we've been discussing.

For pointers or std::variant, performance should be pretty similar if not the same.

By contrast, I have found that constructing a ModelT on the heap can cause a measurable performance hit.

Can you give more information about this? From our testing the performance is pretty much the same using stack or heap allocated models.

Sure, so a heap-allocated ModelT will typically need to be fetched from some (often "cold") memory before it can be used. In a lot of cases, the cost of fetching the relevant data is negligible relative to the cost of processing a block of audio, but for smaller models running at smaller block sizes, the difference is measurable. The most obvious case is in software like VCV Rack where the model will only ever be asked to process one sample at a time.

Right, so that's basically what I'm proposing, although I would say it's better to use the direct ModelT types rather than pointers. The massive switch-case can be tidied up for you with std::visit. The real benefit here (from a performance standpoint) is that the switch-case happens per-block rather than per-sample; per-sample dispatch is essentially what you get with the polymorphic solution.

Can you clarify this too? Do you mean that using my proposed abstract pointer solution would lead to per-sample processing?

So one of the biggest performance differences between the Model and ModelT architectures comes from the fact that the compiler knows exactly what operations a ModelT::forward() method needs to do, which allows it to inline those operations and heavily optimize any per-sample loop that calls the function. By contrast, most of the operations in Model::forward() are hidden behind virtual function calls, so the compiler can't inline them and can't do much to optimize those inner loops. With something like an AbstractModelT::forwardf(), we would have the same issue. What I meant with my previous statement is that since a virtual function call is usually implemented via a v-table lookup, it's not too different from doing a per-sample switch-case (actually, on some platforms the switch-case might even be faster).
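
To make the comparison concrete, here is a hedged side-by-side sketch reusing the types from the earlier examples in this thread (AbstractModelT from the proposed patch, ModelVariant from the variant approach); the function names are made up:

// Per-sample dispatch: every call goes through the v-table, so the compiler
// cannot inline the layer math into this loop.
void process_virtual (AbstractModelT& model, float* data, int num_samples)
{
    for (int n = 0; n < num_samples; ++n)
        data[n] = model.forwardf (&data[n]);
}

// Per-block dispatch: std::visit resolves the concrete ModelT type once per
// block; inside the lambda the compiler sees the exact type and can inline
// and optimize the per-sample loop.
void process_variant (ModelVariant& model, float* data, int num_samples)
{
    std::visit ([data, num_samples] (auto&& chosen_model)
        {
            for (int n = 0; n < num_samples; ++n)
                data[n] = chosen_model.forward (&data[n]);
        }, model);
}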

jatinchowdhury18 commented 1 year ago

By the way, I just put together a little example project.

falkTX commented 1 year ago

Bringing some news...

I ran a few tests with different approaches, and basically:

The example project was extremely helpful, thanks a lot for that! Especially the Python script to generate a list of model types in a variant. Based on it I made https://github.com/moddevices/aidadsp-lv2/blob/fancy-gui%2Bdynamic-models/variant/generate_variant_hpp.py which expands it a bit.

It would be nice if we could extend ModelT to support a few more constexpr-compatible fields; the aidadsp plugin in particular makes use of input_skip to decide whether the output buffer should be processed in replacing mode vs adding mode. Not sure if this is a custom property or not, as I am not too familiar with the RTNeural json format.
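
Roughly, the distinction in the process loop looks like this (a simplified sketch, not the actual plugin code; run_model is a made-up helper):

template <typename Model>
void run_model (Model& model, bool input_skip, float* data, int num_samples)
{
    if (input_skip)
    {
        // "adding" mode: the network output is summed with the dry signal
        for (int n = 0; n < num_samples; ++n)
            data[n] += model.forward (&data[n]);
    }
    else
    {
        // "replacing" mode: the network output overwrites the buffer
        for (int n = 0; n < num_samples; ++n)
            data[n] = model.forward (&data[n]);
    }
}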

On how to deal with LV2 worker contexts: for now I still ended up going with the pointer-swap technique and allocating the model on the heap. Not by itself though, but alongside other fields that relate to it. Right now the type is

struct DynamicModel {
    ModelVariantType variant;
    char* path;
    bool input_skip;
};

which allows us to add more things to it as needed. So the final code for this is, in summary:

DynamicModel* RtNeuralGeneric::loadModel(..., const char* path)
{
    nlohmann::json model_json;
    /* (json loaded here) */

    std::unique_ptr<DynamicModel> model = std::make_unique<DynamicModel>();

    try {
        if (! custom_model_creator (model_json, model->variant))
            throw std::runtime_error ("Unable to identify a known model architecture!");

        std::visit (
            [&model_json] (auto&& custom_model)
            {
                using ModelType = std::decay_t<decltype (custom_model)>;
                if constexpr (! std::is_same_v<ModelType, NullModel>)
                {
                    custom_model.parseJson (model_json, true);
                    custom_model.reset();
                }
            },
            model->variant);
    }
    catch (const std::exception& e) {
        lv2_log_error(logger, "Error loading model: %s\n", e.what());
        return nullptr;
    }

    // save extra info
    model->path = strdup(path);
    /* etc */

    return model.release();
}

As this becomes a pointer, it is safe to pass around in the worker machinery. The char* path there might look weird, but we really only need to store the path to give back to the host when requested; no other string operations are needed, so it feels wasteful to replace it with std::string or the like.
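
For completeness, the matching cleanup also runs from a non-RT worker context and is basically just freeing the strdup'd path and deleting the struct. A simplified sketch (the function name is hypothetical):

#include <cstdlib>

void freeModel (DynamicModel* model)
{
    if (model == nullptr)
        return;

    std::free (model->path); // allocated with strdup() in loadModel
    delete model;
}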

Personally I am happy with the results. It also means this PR is not needed in the end, but it was great to start the discussion and see other (better) ways to achieve the same outcome.

PS: Just pushed Chow Centaur to the MOD plugin store <3 Discussion around the plugin is at https://forum.mod.audio/t/chow-centaur/8045. The custom modgui was actually done by the people from aidadsp.

falkTX commented 1 year ago

Closing this ticket; the variant approach has proven to work well. Recently I needed to add a new shape size, and it was as simple as adding a new value to the Python script and running it again to generate a new model_variant.hpp file. See https://github.com/AidaDSP/aidadsp-lv2/commit/d9c72e977d81f241320e4752eb2431716d02aef2
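
For readers who don't open the commit, the generated model_variant.hpp is conceptually along these lines, just with many more shapes (a hand-written illustration, not the actual generated file; the concrete type list here is invented):

// Illustration only; the real header is generated by generate_variant_hpp.py.
#include <variant>
#include <RTNeural/RTNeural.h>
#include <nlohmann/json.hpp>

struct NullModel {}; // placeholder alternative for "no model loaded"

using ModelType_LSTM_16 = RTNeural::ModelT<float, 1, 1,
    RTNeural::LSTMLayerT<float, 1, 16>, RTNeural::DenseT<float, 16, 1>>;
using ModelType_LSTM_40 = RTNeural::ModelT<float, 1, 1,
    RTNeural::LSTMLayerT<float, 1, 40>, RTNeural::DenseT<float, 40, 1>>;

using ModelVariantType = std::variant<NullModel, ModelType_LSTM_16, ModelType_LSTM_40>;

// Picks the matching alternative based on the architecture described in the
// json; returns false if it is not one of the generated types.
bool custom_model_creator (const nlohmann::json& model_json, ModelVariantType& model);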

Great stuff! Thanks a lot for your help and a great project! <3

jatinchowdhury18 commented 1 year ago

Very cool! Glad to hear that this approach is working well. And thanks to you for the useful discussion :)