Billing and API Key Management

sekulicd commented 9 months ago

Project Overview

The goal is to develop a platform that integrates with an existing system to safeguard running services by requiring users to provide a valid API key. The UI/UX should draw inspiration from the ChatGPT API Key Management platform, focusing on credit-based payments without a subscription option.

Target Users

Admin: Will use the current prem-app as an admin dashboard. In this role, they should be able to create API keys (e.g., without constraints), view usage, etc.
Users/Developers: Will use a new frontend app to create API keys, top up credit balance, and view usage of API keys.

Core Features

Identity Management

Users should be able to register/login to a web app separate from the current prem-app dashboard.

API Key Management

Users should be able to generate one or more API keys.
API keys should carry several access constraints: requests per minute (rate limit), tokens per minute for each service (service path), and a credit limit, i.e. a cumulative number of tokens determined by the user's balance (available credit amount).
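
A minimal sketch of how these constraints could be modelled, assuming one set of limits per service path. The class and field names (`ServiceLimits`, `ApiKeyConstraints`, `credit_limit_tokens`) and the `/llm` path are hypothetical and not part of any existing prem code.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class ServiceLimits:
    """Per-service-path limits (hypothetical field names)."""
    requests_per_minute: int  # rate limit
    tokens_per_minute: int    # usage limit


@dataclass
class ApiKeyConstraints:
    """Constraints attached to a single API key (illustrative only)."""
    # Limits keyed by service path, e.g. "/llm" (an assumed path).
    service_limits: Dict[str, ServiceLimits] = field(default_factory=dict)
    # Cumulative token allowance derived from the user's credit balance.
    credit_limit_tokens: int = 0

    def limits_for(self, service_path: str) -> Optional[ServiceLimits]:
        """Return the limits for a service path, or None if none are set."""
        return self.service_limits.get(service_path)


constraints = ApiKeyConstraints(
    service_limits={"/llm": ServiceLimits(requests_per_minute=60,
                                          tokens_per_minute=10_000)},
    credit_limit_tokens=500_000,
)
```

Keeping the limits in a map keyed by service path leaves room for either outcome of the "one key per service vs. one key for all services" question raised later in this issue.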

Billing

Users should be able to top up their credit balance using various payment methods: credit card, BTC, etc.
Users should be able to view usage and remaining credit balance.

Usage/Analytics

Users should be able to view the usage of their API keys to understand consumption and remaining access.
Administrators should have access to analytics to monitor service usage, identify trends, and optimize offerings.

Integration

The platform should integrate with the existing system, which consists of services running behind a Traefik proxy.
The existing Admin frontend application (prem-app) should be enhanced to enable the admin to create API keys and view Billing/Usage analytics.
Integration should be achieved through Traefik forward auth middleware to ensure authentication and authorization.
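
As a rough illustration of the forward auth idea: Traefik's ForwardAuth middleware sends each incoming request to an external auth service, lets the request through if that service answers with a 2xx status, and otherwise returns the auth service's response to the client. The decision logic could look like the sketch below. The `Bearer` header scheme, the `KNOWN_KEYS` store, and the `/llm` path are assumptions made for the example.

```python
from typing import Dict, Mapping, Tuple

# Hypothetical in-memory store: API key -> path prefixes the key may access.
KNOWN_KEYS: Dict[str, Tuple[str, ...]] = {
    "demo-key": ("/llm",),
}


def forward_auth_status(headers: Mapping[str, str]) -> int:
    """Return the HTTP status the auth service would send back to Traefik.

    Assumes the API key arrives as "Authorization: Bearer <key>" and that
    header names are looked up with this exact casing.
    """
    auth = headers.get("Authorization", "")
    if not auth.startswith("Bearer "):
        return 401  # no key supplied
    api_key = auth[len("Bearer "):]
    allowed_prefixes = KNOWN_KEYS.get(api_key)
    if allowed_prefixes is None:
        return 401  # unknown key
    # Traefik passes the original request path in X-Forwarded-Uri.
    uri = headers.get("X-Forwarded-Uri", "/")
    if not any(uri.startswith(prefix) for prefix in allowed_prefixes):
        return 403  # key is valid but not allowed on this service path
    return 200  # any 2xx tells Traefik to forward the original request
```

A real service would wrap this function in an HTTP handler and replace the dictionary with a database lookup; balance and rate checks would slot in before the final `return 200`.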

Main Flow

Admin API Key constraints creation: Admin creates API key constraints per service and configures the price of 1K tokens. Users will access prem-service based on API Key constraints which include:
- Service Constraints: For each service, there will be a rate limit (number of requests in a minute) and usage limit (number of tokens in a minute).
- Balance Constraints: Each balance can be converted to a cumulative number of tokens (token credit), based on the price of 1K tokens. For each user, the usage (number of tokens used) will be tracked, and access will be denied once the user of the API key runs out of token credit.
User Registration and API Key Creation: User registers/logs in to the Identity Management platform, tops up credit balance with a payment method, and creates an API key. Open question: should users be able to create a key scoped to one or more services, or should every key cover all services?
Request Verification: User makes a request to prem-service using the API key. The platform checks if the API key exists, whether the related user has enough balance, and whether the rate and usage limits for the desired service path are adhered to.
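
The verification step above can be sketched roughly as follows. This is a simplified illustration, not a proposed implementation: it tracks a single service per key, uses an in-memory 60-second sliding window, assumes the token count of a request is known up front, and expresses money in integer cents. All names (`KeyState`, `verify_request`, `tokens_from_balance`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


def tokens_from_balance(balance_cents: int, price_cents_per_1k: int) -> int:
    """Convert a credit balance into a token credit via the 1K-token price."""
    return balance_cents * 1000 // price_cents_per_1k


@dataclass
class KeyState:
    token_credit: int         # remaining cumulative tokens
    requests_per_minute: int  # rate limit
    tokens_per_minute: int    # usage limit
    # (timestamp in seconds, tokens) for requests inside the current window
    window: List[Tuple[float, int]] = field(default_factory=list)


def verify_request(keys: Dict[str, KeyState], api_key: str,
                   tokens: int, now: float) -> str:
    """Check key existence, balance, rate limit and usage limit, in that order."""
    state = keys.get(api_key)
    if state is None:
        return "unknown_key"
    if state.token_credit < tokens:
        return "insufficient_balance"
    # Drop entries older than 60 seconds from the sliding window.
    state.window = [(t, n) for (t, n) in state.window if now - t < 60]
    if len(state.window) >= state.requests_per_minute:
        return "rate_limited"
    if sum(n for _, n in state.window) + tokens > state.tokens_per_minute:
        return "usage_limited"
    # Accept the request: record it and deduct from the token credit.
    state.window.append((now, tokens))
    state.token_credit -= tokens
    return "ok"


keys = {"k": KeyState(token_credit=tokens_from_balance(500, 2),
                      requests_per_minute=2, tokens_per_minute=1000)}
```

With a balance of 500 cents and a price of 2 cents per 1K tokens, the key starts with a credit of 250,000 tokens.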

@tiero @filopedraz

tiero commented 9 months ago

w.r.t tokens

The best way to represent the atomic resource is to represent the underlying compute, rather than the length of the prompt of the request and response.

It should be a combination of time (minutes of billing) and "weight" of the RAM on the total available, to run the cumulative transformer blocks that are served to consumers.
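
One possible reading of this, sketched under the assumption that the "combination" is a simple product of billed minutes and the share of total RAM in use. The function name and parameters are illustrative only; the exact formula is not specified in this thread.

```python
def compute_units(minutes: float, ram_used_gb: float, ram_total_gb: float) -> float:
    """Billable compute as time multiplied by the fraction of available RAM used.

    This product form is an assumption made for illustration.
    """
    if ram_total_gb <= 0:
        raise ValueError("ram_total_gb must be positive")
    ram_weight = ram_used_gb / ram_total_gb  # share of the total available RAM
    return minutes * ram_weight


# 10 minutes on a model occupying 20 GB of an 80 GB card.
units = compute_units(minutes=10, ram_used_gb=20, ram_total_gb=80)
```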

Think of Ethereum gas consumption as an example: you estimate the GPU compute in advance (with static analysis you can infer the cost of each operator, which maps to a collection of instructions for a model).

Ie.

sekulicd commented 9 months ago

w.r.t tokens

The best way to represent the atomic resource is to represent the underlying compute, rather than the length of the prompt of the request and response.

User cost (what a user pays for one interaction) is a sum of:

Computational cost: This relates to the GPU cost, which differs from model to model.
Service fee: A constant that accounts for our profit and overheads.

While our service fee remains static, the compute cost varies based on the 'weight' of the user's request. For instance:

For LLM models, the cost can be tied to the 'number of tokens'. For image models, it relates to the resolution of the image. For audio models, it depends on the duration or number of minutes.

To make this practical, we could:
- Calculate the GPU cost (a combination of time and "weight" of the RAM) for each prem-service.
- Clearly display the pricing for each category, whether that's per 1K tokens, per specific image resolution, or per minute of audio.

My argument is this: GPU costs can be approximated with a user's 'quota', and we need to simplify the computational costs so that users can easily estimate their charges based on their intended requests. This clarity will improve the experience of both admins and regular users, and I think this abstraction will also help in developing these features.

For example, in Ethereum the price is determined by two things: 'gas used', which is an abstraction similar to 'number of tokens / image resolution / minutes of audio', and the gas price (in our case this price is fixed and directly reflects the GPU computation cost).
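
To make the pricing idea concrete, here is a small sketch in which the user cost is the number of units consumed times a fixed per-unit price, plus the constant service fee. The prices, the fee, and the category names are made-up placeholders, not proposed values.

```python
from decimal import Decimal

# Hypothetical fixed prices per billing unit, one per service category.
UNIT_PRICES = {
    "llm": Decimal("0.002"),    # per 1K tokens
    "audio": Decimal("0.006"),  # per minute of audio
}

# Hypothetical constant service fee charged per interaction.
SERVICE_FEE = Decimal("0.001")


def user_cost(category: str, units: Decimal) -> Decimal:
    """Cost of one interaction: compute cost (units x unit price) + service fee."""
    compute_cost = UNIT_PRICES[category] * units
    return compute_cost + SERVICE_FEE


# 1,500 tokens on an LLM service is 1.5 units of "1K tokens".
llm_cost = user_cost("llm", Decimal("1.5"))
# 3 minutes of audio.
audio_cost = user_cost("audio", Decimal("3"))
```

`Decimal` is used instead of `float` so that small monetary amounts add up exactly. Image services would need a price table keyed by resolution rather than a single per-unit price.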

@tiero

tiero commented 9 months ago

vRAM consumed
FLOP/s consumed

premAI-io / prem-gateway