openwallet-foundation / acapy

ACA-Py is a foundation for building decentralized identity applications and services running in non-mobile environments.
https://aca-py.org
Apache License 2.0
419 stars 512 forks source link

Serverless ACA-Py Agents Proposal #1648

Open rtatton opened 2 years ago

rtatton commented 2 years ago

Context

Serverless computing can provide cost-savings and scalability for many applications. One such offering in this paradigm is function-as-a-service (FaaS), which allows developers to focus on application logic, without needing to worry about server maintenance, scaling, load balancing, or cost inefficiencies. Of course, no lunch is free; these functions have limited compute capabilities and can typically only run for relatively short durations (i.e., 15 minutes). In spite of these drawbacks, deploying ACA-Py agents in a serverless environment could save development time and money, which makes it an interesting possibility.

Because serverless agents can time out, the current (as of writing this issue) developments of persistent queues using Redis and Kafka (see issue #1513 and references therein) are quite relevant and compatible with this design. In fact, the issue that persistent queues are meant to address, as stated in #1513, can more broadly be solved with serverless approaches. As Amazon SQS is a managed service, one notable advantage it has over the Redis and Kafka implementations is that it requires much less developer configuration and maintenance.

Architecture

Partly a result of my own development experience and bias, the following design uses Amazon Web Services (AWS) Lambda Function and Amazon Simple Service Queue (SQS); though, it is likely also possible to implement a similar design using Microsoft Azure Functions or Google Cloud Functions (an exercise left to the reader, I suppose).

In this design, the inbound queue of an agent is an SQS queue. AWS offers a standard implementation and a first-in, first-out (FIFO) implementation. The former provides "nearly infinite scalability" and throughput, at the cost of at-least-once processing semantics and only best-effort ordering of the messages. The latter, while less scalable, provides exactly-once processing and message ordering. Thus, using a FIFO makes more sense to use. In the special case that the receiving agent is also one of these serverless agents, then the endpoint to which the message is sent is the queue URL of the SQS queue. Thus, the outbound queue of the sending agent is the inbound queue of the receiving agent. This need not be the case for this architecture to work, but it is a convenient scenario since no delivery service is necessary (more on this later).

serverless-aries-agent

For AWS Lambda, there is a lot of concepts and details to learn, for someone not familiar with the service (the AWS developer documentation provides a lot of detail to get familiar with the service). However, for this proposal, we really only need to know about two things: a runtime and an extension.

An SQS queue can be configured to be an event source mapping for Lambda, i.e., the SQS service can invoke the Lambda function when it receives messages so that the Lambda function can process the messages. Getting back to the runtime and extension, the runtime is what receives the message, while an (external) extension runs alongside the runtime. In the context of ACA-Py, the runtime is the framework, while an extension is a controller.

For wallet storage, there are numerous possibilities on where the Lambda could store it. Perhaps one of the simplest is to use Amazon's Elastic File System (EFS), which can be configured and attached to the Lambda function.

Recall that Lambda functions can only run for 15 minutes at a time, so we need a mechanism to ensure that the agent attempts to shut down properly prior to that 15 minutes ending. Given that the framework is implemented in Python, one approach to trigger a shutdown is with the sched Python module.

Implementation and Deployment

This section won't cover nearly everything there is to know about AWS deployment with SQS and Lambda. Instead, this aims to highlight the main steps that would need to be taken to get a serverless agent up and running.

SQS Outbound Queue

Thankfully, the heavy-lifting has already been done to allow for a pluggable outbound queue for ACA-Py. Thus, SQS can be just another implementation of the BaseOutboundQueue class, such as

class SqsOutboundQueue(BaseOutboundQueue):

    CONFIG_KEY = "sqs_queue"

    def __init__(self, root_profile: Profile) -> None:
        super().__init__(root_profile)
        try:
            plugin_config = root_profile.settings["plugin_config"] or {}
            config = plugin_config[self.CONFIG_KEY]
            self.agent_uuid = config["agent_uuid"]
            self.sqs = boto3.client("sqs")  # AWS SDK Python package
        except KeyError:
            # Handle similarly to RedisOutboundQueue

    def enqueue_message(payload: Union[str, bytes], endpoint: str) -> None:
        self.sqs.send_message(
            QueueUrl=endpoint,
            MessageBody=encode_payload(payload),  # Similar to RedisOutboundQueue
            MessageDeduplicationId=str(uuid.uuid4()),
            MessageGroupId=str(hash((self.agent_uuid, endpoint)))

Two key things to point out is how the MessageDeduplicationId and the MessageGroupId are defined. The former ensures exactly-once message processing, so each message should have its own unique identifier. The MessageGroupId ensures within-group message ordering.

In the context of DIDComm, messages associated with a given connection ID should belong to the same message group. However, the connection ID is not available in the BaseOutboundQueue (as far as I could tell). To ensure that each agent-to-agent connection is still associated with a unique, we can configure each agent to have an agent_uuid and then hash that value with the endpoint (or maybe just concatenate them).

Lambda Function

As stated above, the framework would be deployed as the actual function, while a controller would be deployed as an extension. Conveniently, the runtime and extension are designed to communicate to each other through an HTTP interface, just like the framework and the controller. This bidirectional communication between runtime and extension can be seen in AWS AppConfig's Lambda extension.

Both the framework and controller could be deployed at the same time (in the simplest case); though it is possible to update Lambda functions (e.g., add extensions, update runtime code).

The following is one approach that may work in getting the agent started and properly shut down in the Lambda function:

from aries_cloudagent.commands import start, provision, upgrade

load_agent()
http_client = get_client()

# This is the Lambda function handler. The event is a SqsEvent that contains message(s).
def handle(event, context):
    message = get_message(event)
    send_message_to_agent(message, http_client)
    # The handler would eventually need to shutdown the agent when time is running low.
    # More messages could be processed, so long as there is enough time.
    shutdown_agent()

def load_agent():
    config = load_agent_config()
    try:
        start.execute(config)
    except Exception:
        provision.execute(config)
        start.execute(config)

Concluding Remarks

I am currently working on this implementation, along with a corresponding AWS CDK library that provides a programmatic way to create these serverless agents via an API. Feedback, questions, and (constructive) criticism are welcomed.

swcurran commented 2 years ago

This is pretty cool @rtatton -- nice work. Would you be interested in presenting this at an ACA-Pug meeting? The next meeting (next week) probably has too much content already, but perhaps the next one (3 weeks)?

rtatton commented 2 years ago

@swcurran That would be fantastic! I would love to get some feedback on the idea. How long should I plan to present?

swcurran commented 2 years ago

Are you on Hyperledger Discord so we can plan this? You can DM me there.

tankcdr commented 2 years ago

Agreed. This is really cool and inline with what I want my team to accomplish...but with Azure :) Have you thought of using Step Functions? I was thinking along the lines of using Durable Functions (Azure equivalent, kind of) to fetch config first and pass to next function to start agent and process messages.

rtatton commented 2 years ago

@tankcdr That's awesome! I had not thought about using Step Functions to invoke the Lambda functions. Right now, anything with the appropriate access control can send a message to an agent's queue. And since the SQS queue is an event source mapping, the invocation of the function is done automatically.

While conceptually straightforward, the idea of a one-to-one mapping between agent and Lambda function means that there is a limited number of Lambda functions that can be spun up in a given account. Being able to somehow ensure that account-level concurrency is not violated as you continue to add more agents to a given AWS account is still an aspect of this approach that needs to be resolved.

For reference, here is the repository I am working in to implement this functionality. I am using AWS CDK so that anyone with an AWS account can run a few CLI commands to get an instance of the service up and running. An API is exposed which allows authorized users to upload controller code, create nodes and, delete nodes, where a node is defined in the above diagram,

node = (Lambda function with framework and controller) + (SQS queue)
swcurran commented 2 years ago

@rtatton is going to talk about this at the next ACA-Pug meeting on April 5th, 8AM Pacific/5PM CET.

https://wiki.hyperledger.org/display/ARIES/2022-04-05+Aries+Cloud+Agent+-+Python+Users+Group+Community+Meeting

@tankcdr -- great if you could join as well and cover what you are doing with Azure as well.

Thanks!