jupyter-naas / site

Naas Official Documentation. Contact us for more information about our early beta: support@naas.ai
https://docs.naas.ai

WIP: Credit consumption monitoring for v2 #35

Closed Dr0p42 closed 10 months ago

Dr0p42 commented 1 year ago

Need for resource consumption monitoring

To offer the lowest possible prices to our customers, we need a reliable mechanism for tracking what our users are consuming.

Today, multiple products talk directly to credits.naas.ai as soon as consumption happens. For example, when a user interacts with a chat on workspace.naas.ai, as soon as we get the completion back from an LLM we create a transaction that is then stored in credits.naas.ai.

Handling that at the API level is fine for now, but in the future it would be better for those APIs to simply send the consumption event to, for example, a Kafka topic. A consumer would then take those events, link the resources consumed to a price sheet (which can be generic or specific to the customer), and create and store the transaction. Pricing changes would be easier to apply because the computation would live in a single consumer instead of being spread across multiple services.
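A minimal sketch of what such a consumer could do, assuming illustrative names (`ConsumptionEvent`, `PRICE_SHEET`, `to_transaction` are not an existing API):

```python
from dataclasses import dataclass

# Hypothetical generic price sheet: credits charged per unit of each resource.
PRICE_SHEET = {"llm_tokens": 0.001, "emails": 1.0}

@dataclass
class ConsumptionEvent:
    user_id: str
    resource: str
    quantity: float

def to_transaction(event: ConsumptionEvent) -> dict:
    """Link a consumption event to the price sheet and build a transaction."""
    rate = PRICE_SHEET[event.resource]
    return {
        "user_id": event.user_id,
        "resource": event.resource,
        "quantity": event.quantity,
        "credits": event.quantity * rate,
    }
```

The point of the design is that only this consumer knows the price sheet; the producing services just describe what was consumed.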

With the upcoming developments, we are also going to start deploying cloud provider resources:

What defines a consumption

To properly track user consumption, we first need to define each of its aspects.

The credits

First we have the "Credit", a value that can be bought with real-world money. It has an associated price conversion so we can define its price in dollars (or euros, etc.).

Users will buy credits through subscriptions that will then allow them to use our different services and features.

When a user reaches 0 credits, they should no longer be allowed to consume. At that point they should have a way to buy more credits, either by changing subscription type or through any other available means.
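The credit mechanics above can be sketched as a small wallet; the rates and names here are assumptions for illustration only:

```python
# Assumed price of 1 credit, per currency (illustrative numbers).
CREDIT_PRICE = {"usd": 0.01, "eur": 0.009}

class InsufficientCredits(Exception):
    """Raised when consumption is attempted beyond the remaining balance."""

class Wallet:
    def __init__(self, balance: float = 0.0):
        self.balance = balance

    def buy(self, amount_money: float, currency: str = "usd") -> None:
        # Convert real-world money into credits at the configured rate.
        self.balance += amount_money / CREDIT_PRICE[currency]

    def consume(self, credits: float) -> None:
        # Once the balance is exhausted, consumption is blocked.
        if credits > self.balance:
            raise InsufficientCredits("balance exhausted, please top up")
        self.balance -= credits
```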

The services

At Naas we have, and will have, multiple services; some are free, others are paid. Each service will consume different types of resources.

Space/Schedulers/Pipelines/Webhook will consume over time:

Registry will consume:

Notification will consume per request:

Naas Chat will consume per message:

Naas Storage will consume over time:

The prices

For each service we need a price conversion that tells us, based on the consumption data (CPU/Memory/Bandwidth/GPU/Storage/Email/LLM Tokens/...), how many credits should be billed to the user.

The user prices

We need a standard pricing table, but we also need user/organization-specific pricing tables, as we may have different deals.

So it's important to be able to link a user_id to a specific per-service credit rate.
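One way to resolve a rate, sketched with hypothetical tables (all keys and numbers are assumptions): look up the user-specific rate first, then fall back to the organization's deal, then to standard pricing.

```python
# Illustrative pricing tables, keyed by "service.resource".
STANDARD_RATES = {"space.cpu_second": 0.002, "chat.message": 5.0}
ORG_RATES = {"org-acme": {"chat.message": 4.0}}
USER_RATES = {"user-123": {"space.cpu_second": 0.0015}}

def get_rate(user_id: str, org_id: str, service_resource: str) -> float:
    """Resolve the credit rate, most specific table first."""
    for table in (USER_RATES.get(user_id, {}),
                  ORG_RATES.get(org_id, {}),
                  STANDARD_RATES):
        if service_resource in table:
            return table[service_resource]
    raise KeyError(service_resource)
```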

The events

Now that we have defined services, credits, prices (credits / service) and user prices (credits / service / user), we must define a way to effectively create transactions based on the user's usage of the platform.

The complexity we are facing is that we want our platform to be deployable on multiple cloud providers or even on premises, and each service we build may rely on multiple third-party technologies.

For example, Space/Schedulers/Pipelines/Webhook rely on Kubernetes, which is great because it's cloud agnostic. And as we saw, those services will consume over time:

But that's not all: 1 vCPU on an Intel Celeron and 1 vCPU on a high-end Xeon processor do not represent the same amount of CPU power, and the same goes for memory and GPU. All of these are tied to a "machine type", which also needs to be known to compute the right credit consumption.

To do that we need to emit an "event" whenever a user uses a paid service. On Kubernetes, for example, we can watch for pod creations, determine which service the pod comes from, what resources are allocated, and which instance it is running on, and send an event to a backend when we see that.

The event in itself won't be sufficient to know the real user consumption; in the case of a container's lifecycle, it will be:

So in fact, we need to know the sequence of events to be able to compute the right credit consumption.

For Naas Chat it's simpler because it's transactional:

It's also easy for Naas Notifications: we know the price of an email, so it's easy to track.

For Naas Registry it's more complex, as we need to know when a registry is created, follow the size it takes, look for security scans, and bill the user accordingly during the month. The registry today relies on AWS ECR but, as mentioned earlier, we might need to handle different container registry providers depending on the selected cloud provider.

For this we might need to emit events when something happens in the cloud provider. For AWS, we might want to listen for CloudTrail events, but that may not be sufficient, as we also need to emit periodic events to know the state of the registry (size, etc.).
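Assuming we do emit such a periodic "registry size" event, billing becomes an integration over the snapshots; a sketch (the snapshot format is an assumption):

```python
def gb_hours(snapshots: list[tuple[float, int]]) -> float:
    """Integrate periodic size snapshots into GB-hours.

    snapshots: (hours_since_epoch, size_bytes) pairs, ordered by time.
    Each interval is billed at the size observed at its start."""
    total = 0.0
    for (t0, size), (t1, _) in zip(snapshots, snapshots[1:]):
        total += (t1 - t0) * size / 1e9
    return total
```

A monthly bill would then just sum the GB-hours observed during the billing period and apply the storage credit rate.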

The consumptions

A consumption will be triggered when a sequence of events reaches a state where we have a way to actually convert the events into a transaction.

For Space/Schedulers/Pipelines/Webhook we can do that when we have:

At that point we will be able to store the actual consumption so the transaction can be computed.

The transactions

By listening to the consumptions, we will then be able to fetch user prices and create the appropriate transactions in the credits.naas.ai API.
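A sketch of that final step, assuming a consumption record carrying a duration and the requested resources, and a per-user credit rate (all names and numbers illustrative); the result is the payload that would be sent to the credits.naas.ai API:

```python
def make_transaction(consumption: dict, rate_per_cpu_second: float) -> dict:
    """Combine a consumption record with the user's rate into a transaction."""
    vcpus = float(consumption["resources"]["cpu"])
    credits = vcpus * consumption["duration_seconds"] * rate_per_cpu_second
    return {"user_id": consumption["user_id"], "credits": credits}
```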

Kubernetes

The first need for us will be to monitor consumption in Kubernetes; indeed, this is our primary source of user consumption (Space, Scheduler, Lab).

I think that we should have a tool running in the cluster and watching for Kubernetes API events.

The role of this software would be to:

What should be part of the compute of the consumption cost

Consumption in EKS should be measured based on time + kind of resource + amount of that resource + price. The kinds of resources we should monitor are:

It would also be interesting to know the type of instance the resource was deployed on; at some point we will definitely want to know what was over-provisioned in our infrastructure, as this will be a source of cost optimization.
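A worked example of that formula with assumed rates and an assumed machine-type factor (all numbers are illustrative, not real pricing): time × amount × price per resource kind, scaled by the instance type.

```python
# Illustrative rates on a baseline machine type.
CPU_RATE = 0.001    # credits per vCPU-second
MEM_RATE = 0.0005   # credits per GiB-second
# Assumed scaling factors per machine type (cf. Celeron vs Xeon above).
MACHINE_FACTOR = {"baseline": 1.0, "high-end-xeon": 1.5}

def pod_cost(vcpus: float, mem_gib: float, seconds: float,
             machine_type: str = "baseline") -> float:
    """Credits billed for a pod: duration x (cpu + memory) x machine factor."""
    factor = MACHINE_FACTOR[machine_type]
    return factor * seconds * (vcpus * CPU_RATE + mem_gib * MEM_RATE)
```

For a pod requesting 2 vCPU and 2 GiB that lives about 95 seconds, this gives 95 × (0.002 + 0.001) = 0.285 credits on the baseline machine type.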

Cloud provider

For cloud provider monitoring, I think it depends on the provider, but we need to be able to link a resource to a user and watch that resource periodically.

For resources that can only be deleted through our API, those APIs can be responsible for storing and updating the required information. If at some point we have resources that can be removed automatically by the cloud provider, then we also need to monitor what is happening on the cloud provider's APIs. For example, with AWS, we could use AWS CloudTrail to monitor the creation/update/deletion of resources that we know could be removed by the provider directly.

⚠️ To be continued.

Storage

We need to define a data format to store what a resource is, its link to a user, and its different states.

⚠️ To be continued.

jravenel commented 1 year ago

Good first piece @Dr0p42! So the end game would be to completely rely on Kubernetes API events for every Naas job:

Dr0p42 commented 1 year ago

I PoCed the way we could watch for events in Kubernetes and it seems quite straightforward 👍

from kubernetes import client, config, watch
import uuid
from datetime import datetime, timezone

worker_id = str(uuid.uuid4())

config.load_kube_config()

v1 = client.CoreV1Api()

watcher = watch.Watch()
print('✅ Starting to listen to Kubernetes events!')
for event in watcher.stream(v1.list_pod_for_all_namespaces):

    obj = event['object']
    labels = obj.metadata.labels or {}  # labels can be None on unlabeled pods

    # Looking for Pods scheduled by users.
    if obj.kind == 'Pod' and 'user_id' in labels:

        # If it's a pod belonging to the Space API, it carries the Knative service label.
        # We might want to store the naas_service that this container belongs to in a label.
        if 'serving.knative.dev/service' in labels:

            print(f"""Event:
                worker_id: {worker_id}
                worker_event_id: {str(uuid.uuid4())}
                worker_event_time: {datetime.now(timezone.utc)}
                type: {event['type']}
                kind: {obj.kind}
                naas_service: space
                uid:  {obj.metadata.uid}
                status: {obj.status.phase}
                user_id: {labels['user_id']}
                creation_timestamp: {obj.metadata.creation_timestamp}
                deletion_timestamp: {obj.metadata.deletion_timestamp}
                resources: {obj.spec.containers[0].resources.requests}
            """)

When going to the URL: https://mytest-vini.default.nebari.dev.naas.ai/

These are the events that we can see happening:

Starting phase

$ make
kubectl config use-context arn:aws:eks:us-west-2:903885477968:cluster/nebari-dev-naas-dev
Switched to context "arn:aws:eks:us-west-2:903885477968:cluster/nebari-dev-naas-dev".
✅ Starting to listen to Kubernetes events!
Event:
                worker_id: e8a0ee50-4dcc-4c60-938d-39b38d3cde6c
                worker_event_id: 7165609e-501d-4619-bf36-505b9e096956
                worker_event_time: 2023-10-02 06:06:25.399993+00:00
                type: ADDED
                kind: Pod
                naas_service: space
                uid:  bf8f6bb5-3f00-41f2-a865-aae6dd8ba6ea
                status: Pending
                user_id: ec764dd4-0c7a-42d5-ac29-a028f84ad3de
                creation_timestamp: 2023-10-02 06:06:25+00:00
                deletion_timestamp: None
                resources: {'cpu': '2', 'memory': '2Gi'}

Event:
                worker_id: e8a0ee50-4dcc-4c60-938d-39b38d3cde6c
                worker_event_id: abcab032-1092-4140-bf9e-397c1a53f7e8
                worker_event_time: 2023-10-02 06:06:25.618918+00:00
                type: MODIFIED
                kind: Pod
                naas_service: space
                uid:  bf8f6bb5-3f00-41f2-a865-aae6dd8ba6ea
                status: Pending
                user_id: ec764dd4-0c7a-42d5-ac29-a028f84ad3de
                creation_timestamp: 2023-10-02 06:06:25+00:00
                deletion_timestamp: None
                resources: {'cpu': '2', 'memory': '2Gi'}

Event:
                worker_id: e8a0ee50-4dcc-4c60-938d-39b38d3cde6c
                worker_event_id: 0db2b6af-cdb4-44b7-b248-dee70fab0d94
                worker_event_time: 2023-10-02 06:06:25.824911+00:00
                type: MODIFIED
                kind: Pod
                naas_service: space
                uid:  bf8f6bb5-3f00-41f2-a865-aae6dd8ba6ea
                status: Pending
                user_id: ec764dd4-0c7a-42d5-ac29-a028f84ad3de
                creation_timestamp: 2023-10-02 06:06:25+00:00
                deletion_timestamp: None
                resources: {'cpu': '2', 'memory': '2Gi'}

Event:
                worker_id: e8a0ee50-4dcc-4c60-938d-39b38d3cde6c
                worker_event_id: 8c76f9a1-3549-4151-ae91-b5457f9e1b85
                worker_event_time: 2023-10-02 06:06:27.276165+00:00
                type: MODIFIED
                kind: Pod
                naas_service: space
                uid:  bf8f6bb5-3f00-41f2-a865-aae6dd8ba6ea
                status: Running
                user_id: ec764dd4-0c7a-42d5-ac29-a028f84ad3de
                creation_timestamp: 2023-10-02 06:06:25+00:00
                deletion_timestamp: None
                resources: {'cpu': '2', 'memory': '2Gi'}

Event:
                worker_id: e8a0ee50-4dcc-4c60-938d-39b38d3cde6c
                worker_event_id: 9777a11c-4f3a-46d2-86fa-0351ab92e21e
                worker_event_time: 2023-10-02 06:06:27.283124+00:00
                type: MODIFIED
                kind: Pod
                naas_service: space
                uid:  bf8f6bb5-3f00-41f2-a865-aae6dd8ba6ea
                status: Running
                user_id: ec764dd4-0c7a-42d5-ac29-a028f84ad3de
                creation_timestamp: 2023-10-02 06:06:25+00:00
                deletion_timestamp: None
                resources: {'cpu': '2', 'memory': '2Gi'}

Stopping phase

Event:
                worker_id: e8a0ee50-4dcc-4c60-938d-39b38d3cde6c
                worker_event_id: 9750f450-8c78-41df-ad73-9a6b76f9070e
                worker_event_time: 2023-10-02 06:07:29.736124+00:00
                type: MODIFIED
                kind: Pod
                naas_service: space
                uid:  bf8f6bb5-3f00-41f2-a865-aae6dd8ba6ea
                status: Running
                user_id: ec764dd4-0c7a-42d5-ac29-a028f84ad3de
                creation_timestamp: 2023-10-02 06:06:25+00:00
                deletion_timestamp: 2023-10-02 06:12:29+00:00
                resources: {'cpu': '2', 'memory': '2Gi'}

Event:
                worker_id: e8a0ee50-4dcc-4c60-938d-39b38d3cde6c
                worker_event_id: d280092f-cbc8-4852-bc23-c705c9442f00
                worker_event_time: 2023-10-02 06:07:56.125713+00:00
                type: MODIFIED
                kind: Pod
                naas_service: space
                uid:  bf8f6bb5-3f00-41f2-a865-aae6dd8ba6ea
                status: Running
                user_id: ec764dd4-0c7a-42d5-ac29-a028f84ad3de
                creation_timestamp: 2023-10-02 06:06:25+00:00
                deletion_timestamp: 2023-10-02 06:12:29+00:00
                resources: {'cpu': '2', 'memory': '2Gi'}

Event:
                worker_id: e8a0ee50-4dcc-4c60-938d-39b38d3cde6c
                worker_event_id: 4713d32d-96c7-4cab-b524-6136d5700597
                worker_event_time: 2023-10-02 06:08:00.506142+00:00
                type: MODIFIED
                kind: Pod
                naas_service: space
                uid:  bf8f6bb5-3f00-41f2-a865-aae6dd8ba6ea
                status: Running
                user_id: ec764dd4-0c7a-42d5-ac29-a028f84ad3de
                creation_timestamp: 2023-10-02 06:06:25+00:00
                deletion_timestamp: 2023-10-02 06:12:29+00:00
                resources: {'cpu': '2', 'memory': '2Gi'}

Event:
                worker_id: e8a0ee50-4dcc-4c60-938d-39b38d3cde6c
                worker_event_id: 1361eb20-60e2-4c3a-8201-9e6d569afe38
                worker_event_time: 2023-10-02 06:08:00.513289+00:00
                type: MODIFIED
                kind: Pod
                naas_service: space
                uid:  bf8f6bb5-3f00-41f2-a865-aae6dd8ba6ea
                status: Running
                user_id: ec764dd4-0c7a-42d5-ac29-a028f84ad3de
                creation_timestamp: 2023-10-02 06:06:25+00:00
                deletion_timestamp: 2023-10-02 06:07:29+00:00
                resources: {'cpu': '2', 'memory': '2Gi'}

Event:
                worker_id: e8a0ee50-4dcc-4c60-938d-39b38d3cde6c
                worker_event_id: e723874e-721f-4b32-92ff-99f9c16cc093
                worker_event_time: 2023-10-02 06:08:00.812852+00:00
                type: DELETED
                kind: Pod
                naas_service: space
                uid:  bf8f6bb5-3f00-41f2-a865-aae6dd8ba6ea
                status: Running
                user_id: ec764dd4-0c7a-42d5-ac29-a028f84ad3de
                creation_timestamp: 2023-10-02 06:06:25+00:00
                deletion_timestamp: 2023-10-02 06:07:29+00:00
                resources: {'cpu': '2', 'memory': '2Gi'}