o3de / sig-simulation

Special Interest Group for Simulation
Apache License 2.0

Proposed RFC Feature: AI Core Gem #86

Closed adamdbrw closed 4 months ago

adamdbrw commented 5 months ago

Summary:

With the recent rise of generative AI models such as GPT-4, tools are emerging to plug them into existing workflows and to create new workflows that they enable. This proposal brings forth the new AI Core Gem, which is meant to help O3DE developers utilize modern AI in games and simulations.

What is the relevance of this feature?

Given the new possibilities opened by recent advances in AI, game and simulation developers are looking to apply these capabilities in their creations. Many of the steps these developers need to take are common regardless of the type of application, for example:

The AI Core Gem is quite different from the Machine Learning Gem in that it focuses on generative AI rather than multi-layer perceptrons. Given the fast-changing nomenclature in this space and the versatility of the Gem, however, there are arguments against including "Generative" in the name.

The AI Core Gem is meant for O3DE Gem developers, and it is expected to be a dependency of future gems such as AI characters, assistants, and scene generators. Unlike these future gems, the AI Core Gem's value does not strictly depend on the current capabilities of AI models: it is meant as a tool to explore their limits, built in a flexible way so that it benefits from improvements in these capabilities over the years.

In the long run, from the perspective of game development, this feature can help to build smart characters that interact uniquely with the player, and assist in writing dialogue as well as creating 3D worlds.

From the perspective of simulation, it can help to create robots, humans (in roles such as pedestrians or warehouse workers), and animals that behave in desired ways with less scripting, build smartly randomized simulation scenes, and assist users in running validation scenarios as well as summarizing their results.

Feature design description:

Connectivity and communication with Generative AI services

Generative AIs can be used through third-party hosted services, such as Amazon Bedrock or OpenAI's GPT platform. These typically charge per token, depending on model type and modality. There are both proprietary and open-source models. It is also possible to host models locally, including through tools such as Ollama or vLLM.

AI services increasingly offer additional modalities (such as image prompting) as well as complex services, such as Assistants.

Since the pace of development and the emergence of new APIs is rapid, it is important for the AI Core Gem to be flexible and extensible in its implementation of connectivity and communication. As such, the approach is to be:

The communication layer will be abstracted, allowing future support for models running on the local network or on a local GPU, as well as streaming connections such as WebSockets. In the first release, the feature set will rely on the HttpRequestor Gem for communication.
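As a sketch of what this abstraction might look like (all names here are hypothetical and written in Python for brevity, not the actual Gem API), the transport can be hidden behind a small interface so that an HTTP backend, a locally hosted model, or a streaming connection can be swapped without touching callers:

```python
from abc import ABC, abstractmethod


class AIServiceRequester(ABC):
    """Hypothetical seam between the Gem and the transport used to reach a model."""

    @abstractmethod
    def send_prompt(self, prompt: str) -> str:
        ...


class HttpRequester(AIServiceRequester):
    """Placeholder for the HTTP path (in the real Gem this would go through
    the HttpRequestor Gem; here it only marks where that call would live)."""

    def __init__(self, uri: str):
        self.uri = uri

    def send_prompt(self, prompt: str) -> str:
        raise NotImplementedError("HTTP transport not wired up in this sketch")


class LocalMockRequester(AIServiceRequester):
    """Stands in for a locally hosted model (e.g. one served via Ollama)."""

    def send_prompt(self, prompt: str) -> str:
        return f"echo: {prompt}"


def ask(requester: AIServiceRequester, prompt: str) -> str:
    # Callers depend only on the interface, never on a concrete transport.
    return requester.send_prompt(prompt)
```

Swapping `LocalMockRequester` for `HttpRequester` (or a future streaming implementation) changes nothing on the caller's side, which is the point of the abstraction.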

Global settings

While it is easy to picture a use case where more than one vendor's AI is used within a single project, as different models can easily have different strengths, the first step is to keep things simple and have one global setting for AI features, much like the Physics settings.

These global settings will include the URI and other connectivity settings (such as authorization), usage limits, default models for each modality, and user preferences for things like visualization.

The first version will only include URI and default model selection.

Global settings will be accessible through the Editor menu and through Settings Registry keys.
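As an illustration only (the key names and paths below are assumptions, not the final schema), the first-version settings could live in a Settings Registry `.setreg` file along these lines:

```json
{
    "O3DE": {
        "AICore": {
            "ServiceURI": "https://example.invalid/v1/chat",
            "DefaultModel": "my-default-model"
        }
    }
}
```

The Editor menu would then read and write these same keys, so scripted and interactive configuration stay in sync.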

AI calling O3DE interfaces

The core value of this gem is to allow AI to perform work through O3DE interfaces. Examples include:

These APIs will be exposed through a kind of reflection mechanism, but full documentation also needs to be shared, either as URIs or as part of the initial prompt.

O3DE sharing data with AI

To interact with O3DE in an informed way, AI will require inputs such as:

Runtime interaction with characters and the environment is likely to be application-specific; for example, simulations are likely to expose ROS interfaces. For the first implementation, sharing a list of assets, generic text prompts, and callable methods is enough.

Future extensions

The gem can be extended with a voice interface, allowing users to prompt the AI and assign tasks by simply speaking. Note that text-to-speech is already part of some vendor APIs, so the AI can speak back.

RFCs for new AI feature gems will follow when this gem is implemented.

Technical design description:

Challenges of AI RPC interface

To relay O3DE APIs to an AI service, we need to supply the functions' signatures and document their purpose, semantics, parameters, and return values. The immediate issue is that this documentation is typically provided as code comments and is itself unavailable at runtime. Such documentation is also not provided through the current behavior reflection system.

Possible solutions include:

Each of these solutions has drawbacks, such as: ensuring a good workflow when custom gems, including proprietary ones, are involved; exposing code base headers to a third party (a potential licensing issue); the blast radius of changes in O3DE; avoiding noise for the AI, such as APIs that are not accessible, not whitelisted, or irrelevant; and, with the custom documentation approach, essentially duplicating information. There is also the issue of being able to register interfaces and assign callbacks dynamically. Another consideration is that the AI might benefit from a custom, iterative approach to return values, and the amount of feedback it needs to perform optimally can differ significantly from how current APIs are constructed.

One considered approach is to use, and possibly expand on, the behavior reflection system in O3DE: either by providing a way to generate AI-suitable reflection (including serializing to JSON or a similar format), at least for selected categories, or by creating another layer in the reflection system.

The other considered approach is a custom API registration system that supplies a function's name and signature, its documentation, and a callback. A custom approach can help simplify the types, dependencies, and amount of context the AI needs to succeed.

Community comments on RPC design are especially welcome.

AI -> O3DE interface

The API registration mechanism needs to be part of the AI Core Gem's developer interface, so that custom gems and their components can register new ways of interacting.

The AI Core Gem RPC System Component will relay the static RPC description (constructed through the reflection mechanism) to the AI service, and allow dynamic attachment of callbacks to existing registry entries (otherwise callbacks are considered empty, which should cause warnings). It will also allow API entries (including the callback) to be added dynamically. The method of relaying this API description to the AI service may be implementation-dependent: by default text prompts will be used, but some implementations might instead produce a file with a static API description and upload it to an Assistant-like service.
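A minimal sketch of how such a registry could behave (in Python for brevity; the names, signature format, and entry shape are illustrative assumptions, not the Gem's actual interface):

```python
import warnings


class AIApiRegistry:
    """Hypothetical registry: each entry carries a name, a signature string,
    documentation to relay to the AI service, and an optional callback."""

    def __init__(self):
        self._entries = {}

    def register(self, name, signature, doc, callback=None):
        # Entries may be registered statically (from reflection) without a
        # callback; the callback can then be attached dynamically later.
        self._entries[name] = {"signature": signature, "doc": doc, "callback": callback}

    def attach_callback(self, name, callback):
        self._entries[name]["callback"] = callback

    def describe(self):
        # Text description of all APIs, suitable for inclusion in a prompt.
        return "\n".join(
            f"{name}{e['signature']}: {e['doc']}" for name, e in self._entries.items()
        )

    def call(self, name, *args):
        entry = self._entries[name]
        if entry["callback"] is None:
            # Per the design above, empty callbacks should cause warnings.
            warnings.warn(f"API '{name}' has no callback attached")
            return None
        return entry["callback"](*args)
```

A gem component would register its entries on activation, attach callbacks bound to its own state, and let the RPC System Component hand `describe()` to the AI service.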

The AI service will be instructed (by internally captured, configurable prompts) to call APIs within a specified text block, making parsing of its response straightforward. Most likely, JSON will be used to structure the API calls in text, which is a common approach in other contexts; see the libjson-rpc-cpp library as an example.
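For illustration, assuming a hypothetical `<api>…</api>` delimiter and a JSON-RPC-like call shape (the actual block markers and schema are yet to be decided), extracting such calls from a model response could look like:

```python
import json


def extract_api_calls(response: str, start: str = "<api>", end: str = "</api>"):
    """Return the JSON payloads of every delimited API-call block in `response`."""
    calls = []
    pos = 0
    while True:
        i = response.find(start, pos)
        if i == -1:
            break
        j = response.find(end, i)
        if j == -1:  # unterminated block: stop rather than parse garbage
            break
        payload = response[i + len(start):j]
        calls.append(json.loads(payload))
        pos = j + len(end)
    return calls
```

Each parsed call could then be dispatched through the API registry, with anything outside the delimiters treated as free-form text for the user.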

O3DE -> AI interface

Providing data to the AI service will be implementation-dependent. The AI Core Gem will include text prompting, and might include image prompting once it is available in popular open-source models. Other modalities might be included if they are standardized and implemented by most vendors. Until then, modalities other than text will be left to vendor-specific gems.

In the context of robotics, modalities other than text will be especially important, for example images from robot camera sensors, or audio commands from its human co-workers.

What are the advantages of the feature?

Once this gem is released, O3DE developers will be empowered to build AI-based features on top of it. This will bring to O3DE users looking to explore and develop AI applications for academic or industrial use cases.

What are the disadvantages of the feature?

Given that the AI space is extremely dynamic, this gem needs to be supported and updated continuously. It needs to stay relevant as the space expands and AI-empowered tools become commonplace.

There is also considerable effort involved in deciding which interfaces to expose and in understanding what is possible with the technology.

Are there any alternatives to this feature?

The main alternative is to treat AI as a set of external tools and to focus on developing a rich API for O3DE to interact with them, as opposed to the integrated approach this proposal describes.

While the integrated approach involves writing some extra wrapper code and developing O3DE-side UI/UX, its advantages lie in a tailored approach to collaborative content creation and in working better with Editor workflows. These are the main reasons for preferring the integrated approach.

Another alternative is not to have the AI Core Gem, but instead one gem per vendor (including open-source ones). However, this has the disadvantages of repeating the common parts, and it does not provide a unified UX for AI users in O3DE, where a common use case will be comparing the performance of AI from several vendors.

How will users learn this feature?

The Gem will be part of the canonical set, documented and cross-referenced in the O3DE documentation. Publicity for the Gem is also planned, and a showcase demo will be released in 2024. The Gem will likely be presented alongside other AI gem(s), as it focuses on core functionality rather than user-facing features.

Are there any open questions?

adamdbrw commented 5 months ago

Comments expected by Feb 23rd.

adamdbrw commented 5 months ago

Figuring out the name is one of the requirements. Ideas include:

byrcolin commented 5 months ago

I like the idea. This should not be a core gem as core gems should be what is minimally needed to get the engine to compile and run. This should be in its own repo or part of extras.

kberg0 commented 5 months ago

LLM Gem seems more appropriate if you're going to stick to transformer based model integration. GenAI Gem seems appropriate if you have plans to integrate diffusion models and other model architectures, which might be really neat from an 'automatically generate textures and video clips' standpoint.

As mentioned in the TSC meeting; in terms of binding, having something dump out the available interfaces by iterating, say, all our script canvas nodes and plopping those on the command-prompt (with an appropriately large context window) would probably go a long way towards improving the usefulness and accuracy of the integration. Even if we fine-tuned a model, which I would really recommend, using some sort of RAG-like approach would really boost performance.

Sticking to json/xml based dumps of API's and scenegraphs, and then consuming LLM output as prefabs and script-canvas graphs seems promising. That approach would then scale to any other Gems the user happens to have installed, including the ROS2 Gems or even the Machine Learning Gem which currently offers a limited set of script canvas nodes. There's a lot of UI work ahead to make this well integrated from an end-user perspective, but that can luckily all be decoupled from this initial RFC.

I definitely love the idea! I'll be keeping an eye on this.

nick-l-o3de commented 5 months ago

I'll dump the notes from discussions about this that happened live in the TSC, before I add my own

nick-l-o3de commented 5 months ago

reading the RFC, my only concern is that it becomes way too broad initially. I understand we want to have a vision here, and that's fine at the RFC level. Once the RFC is generally accepted, it would be good to offer a small technical breakdown of what APIs would be in v1 (with one example gem that uses the APIs functionally) and then what would be in v2 (with an additional, different use case), proving that the v1 APIs will not have to be disrupted and reworked, only extended, by v2. v2 does not then have to actually be developed, just v1, until someone wants to add v2 or some other library; it would at least prove out the design.

This is a good place to mention that O3DE already has support for, and examples of, so-called Framework gems - that is, a gem which provides APIs, buses, and functionality that is only useful when a different gem uses or depends on it. Likely the actual technical structure of this would be such a framework gem, with at least one API module (though it could have as many as you want) that other gems depend on. The API modules could be kept nice and lightweight, header-only or very nearly header-only, to avoid dll bloat.

As for ways to expose the engine to AI, there are basically a number of routes, but the official, somewhat well-traveled one is through the behavior context, since its job is literally to offer the functionality of the tools and engine in a neutral way that can be exercised by anything, including new languages or interfaces. It's a complex path, but it is well-traveled, since there are already examples of mining the behavior context for Python, for Lua, and for Script Canvas, and someone has already adapted it for JavaScript (without releasing it, but it at least proves that it's flexible enough to bind to whatever you want to bind it to).

Directly generating things like prefabs may be okay too - but it depends on how good LLMs are at generating actually viable JSON with a bunch of tricky rules, without hallucinating things that come from other projects or similar situations. My experiments in this realm have not been very positive so far: it works sometimes, but sometimes it starts to spit out Unity- or Unreal-formatted documents, or just imagines types and APIs that simply don't exist at all.

adamdbrw commented 5 months ago

Based on comments above, would GenAIFramework be a fitting name for the Gem?

Huawei-CarlosCarbone commented 5 months ago

There are quite a few points that I would like to highlight

I know that most of these are big milestones that cannot be implemented in the first demo or in the near future, but missing this perspective, and not sharing it, might make it difficult for others to understand the intended end use of this gem. In other words, this means showing the difference between "this is how we expect users to use this gem" vs. "this gem includes features x, y, and z" - which translates to "this is an electric guitar and you can make rock music like this with it" vs. "this is an electric guitar and it can produce distorted sounds". I would not have highlighted all of these, but since I got the two pieces of feedback "this is a general implementation" BUT "we want to generate worlds in the first demo", it sounds like at least one specific usage by developers is expected of this gem.

If anything is unclear or you would like to discuss additional feedback do not hesitate to let me know

adamdbrw commented 4 months ago

Based on all the feedback, I will take the following steps: