lablup / backend.ai

https://www.backend.ai
GNU Lesser General Public License v3.0

Backend.AI


Backend.AI is a streamlined, container-based computing cluster platform that hosts popular computing/ML frameworks and diverse programming languages, with pluggable heterogeneous accelerator support including CUDA GPU, ROCm GPU, TPU, IPU and other NPUs.

It allocates and isolates the underlying computing resources for multi-tenant computation sessions, either on demand or in batches, using customizable job schedulers and its own orchestrator. All its functions are exposed as REST/GraphQL/WebSocket APIs.
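To give a feel for what a customizable job scheduler means here, the following is a minimal sketch of two interchangeable scheduling policies. It is not Backend.AI's scheduler interface: `PendingSession`, `first_fit_scheduler`, and `smallest_first_scheduler` are hypothetical names, and resources are reduced to plain CPU and GPU counts for illustration.

```python
from dataclasses import dataclass


@dataclass
class PendingSession:
    # Hypothetical record of a queued session and the resources it requests.
    session_id: str
    cpu: int
    gpu: int


def first_fit_scheduler(pending, free_cpu, free_gpu):
    """Return the oldest queued session that fits the free resources."""
    for session in pending:
        if session.cpu <= free_cpu and session.gpu <= free_gpu:
            return session
    return None


def smallest_first_scheduler(pending, free_cpu, free_gpu):
    """Return the fitting session with the smallest (gpu, cpu) request."""
    fitting = [s for s in pending if s.cpu <= free_cpu and s.gpu <= free_gpu]
    return min(fitting, key=lambda s: (s.gpu, s.cpu), default=None)


queue = [
    PendingSession("s1", cpu=8, gpu=2),
    PendingSession("s2", cpu=2, gpu=0),
    PendingSession("s3", cpu=1, gpu=0),
]
# With 4 CPUs and 1 GPU free, "s1" does not fit under either policy.
picked_first_fit = first_fit_scheduler(queue, free_cpu=4, free_gpu=1)
picked_smallest = smallest_first_scheduler(queue, free_cpu=4, free_gpu=1)
print(picked_first_fit.session_id, picked_smallest.session_id)
```

Because both policies share one call signature, an orchestrator could swap one for the other without changing the code that dispatches sessions.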

Contents in This Repository

This repository contains all open-source server-side components and the client SDK for Python as a reference implementation of API clients.

Directory Structure

Server-side components are licensed under LGPLv3 to promote non-proprietary open innovation in the open-source community while other shared libraries and client SDKs are distributed under the MIT license.

There is no obligation to open-source your service or system code if you just run the server-side components as-is (e.g., run them as daemons or import the components into your code without modification). Please contact us (contact-at-lablup-com) for commercial consulting and more details on licensing options for individual use cases.

Getting Started

Installation for Single-node Development

Run scripts/install-dev.sh after cloning this repository.

This script checks that all required dependencies, such as Docker, are available and bootstraps a development setup. Note that it requires sudo and a modern Python installation on the host system, which must run Linux (Debian/RHEL-like distributions) or macOS.
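Before running the script, it can help to confirm the stated prerequisites by hand. The snippet below is a rough preflight sketch, not part of scripts/install-dev.sh and not a reproduction of its checks. The `preflight` function and the `MIN_PYTHON` threshold are assumptions: "modern Python" is read as 3.12 or newer, following the newest row of the compatibility table in this README.

```python
import platform
import shutil
import sys

# Assumption: treat Python 3.12+ as "modern"; adjust for older releases.
MIN_PYTHON = (3, 12)


def preflight(min_python=MIN_PYTHON):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # The README names Linux and macOS ("Darwin") as supported hosts.
    if platform.system() not in ("Linux", "Darwin"):
        problems.append(f"unsupported OS: {platform.system()}")
    if sys.version_info[:2] < min_python:
        problems.append("Python is older than %d.%d" % min_python)
    # Docker and sudo must be reachable through PATH.
    for tool in ("docker", "sudo"):
        if shutil.which(tool) is None:
            problems.append(f"'{tool}' not found on PATH")
    return problems


if __name__ == "__main__":
    for line in preflight() or ["all checks passed"]:
        print(line)
```

An empty result only means these basic conditions hold; the install script itself remains the authority on what is required.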

Installation for Multi-node Tests & Production

Please consult our documentation for community-supported materials. Contact the sales team (contact@lablup.com) for professional paid support and deployment options.

Accessing Compute Sessions (aka Kernels)

Backend.AI provides WebSocket tunneling into individual computation sessions (containers), so that users can securely access in-container applications directly from their browsers and the client CLI.

Working with Storage

Backend.AI provides an abstraction layer on top of existing network-based storage systems (e.g., NFS/SMB), called vfolders (virtual folders). Each vfolder works like a cloud storage folder that can be mounted into any computation session and shared between users and user groups with differentiated privileges.
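The sharing model can be pictured with a small data-structure sketch. This is an illustration only, not the actual vfolder schema or API: the `VFolder` and `Permission` classes, the two privilege levels, and the user and group names are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Permission(Enum):
    # Hypothetical privilege levels; the real set may differ.
    READ_ONLY = "ro"
    READ_WRITE = "rw"


@dataclass
class VFolder:
    name: str
    owner: str
    # Maps a user or group name to the privilege granted to it.
    shares: dict = field(default_factory=dict)

    def share_with(self, principal: str, permission: Permission) -> None:
        self.shares[principal] = permission

    def can_read(self, principal: str) -> bool:
        # Any granted privilege implies read access.
        return principal == self.owner or principal in self.shares

    def can_write(self, principal: str) -> bool:
        if principal == self.owner:
            return True
        return self.shares.get(principal) is Permission.READ_WRITE


datasets = VFolder(name="datasets", owner="alice")
datasets.share_with("bob", Permission.READ_ONLY)
datasets.share_with("research-team", Permission.READ_WRITE)
```

Here `bob` can read but not write, the `research-team` group can do both, and anyone without a share entry has no access.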

Major Components

Manager

It routes external API requests from front-end services to individual agents. It also monitors and scales the cluster of multiple agents (a few tens to hundreds).

Agent

It manages individual server instances and launches/destroys Docker containers where REPL daemons (kernels) run. Each agent on a new EC2 instance registers itself with the instance registry via heartbeats.
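Heartbeat-based registration can be sketched as a registry that remembers when each agent was last heard from and treats silent agents as gone. This is a simplified in-memory illustration, not the actual implementation: the `InstanceRegistry` class, the 30-second timeout, and the agent IDs are assumptions.

```python
import time


class InstanceRegistry:
    """Tracks agents that sent a heartbeat within `timeout` seconds."""

    def __init__(self, timeout=30.0, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock  # injectable so the example can fake time
        self._last_seen = {}

    def heartbeat(self, agent_id):
        # The first heartbeat registers the agent; later ones refresh it.
        self._last_seen[agent_id] = self.clock()

    def alive_agents(self):
        now = self.clock()
        return sorted(
            agent_id
            for agent_id, seen in self._last_seen.items()
            if now - seen <= self.timeout
        )


# Drive the registry with a fake clock instead of sleeping.
now = [0.0]
registry = InstanceRegistry(timeout=30.0, clock=lambda: now[0])
registry.heartbeat("agent-1")   # seen at t=0
now[0] = 20.0
registry.heartbeat("agent-2")   # seen at t=20
now[0] = 40.0
alive = registry.alive_agents()  # agent-1 is 40s stale, agent-2 is 20s old
print(alive)  # ['agent-2']
```

No explicit deregistration step is needed in this scheme: an agent that stops sending heartbeats simply ages out of the alive set.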

Storage Proxy

It provides a unified abstraction over multiple different network storage devices with vendor-specific enhancements such as real-time performance metrics and filesystem operation acceleration APIs.

Webserver

It hosts the SPA (single-page application) packaged from our web UI codebase for end-users and basic administration tasks.

Synchronizing the static Backend.AI WebUI version:

$ scripts/download-webui-release.sh <target version to download>

Kernels

Computing environment recipes (Dockerfiles) for building the container images that run on top of the Backend.AI platform.

Jail

A programmable sandbox, written in Rust, that uses ptrace-based system call filtering.

Hook

A set of libc overrides for resource control and web-based interactive stdin (paired with agents).

Client SDK Libraries

We offer client SDKs in popular programming languages. These SDKs are freely available under the MIT License to ease integration with both commercial and non-commercial software products and services.

Plugins

Legacy Components

These components still exist but are no longer actively maintained.

Media

The front-end support libraries for handling multimedia outputs (e.g., SVG plots, animated vector graphics).

IDE and Editor Extensions

We now recommend using in-kernel applications such as Jupyter Lab or Visual Studio Code Server, or native SSH connections to kernels, via our client SDK or desktop apps.

Python Version Compatibility

| Backend.AI Core Version | Python Version | Pantsbuild version |
| --- | --- | --- |
| 24.03.x / 24.09.x | 3.12.x | 2.21.x |
| 23.03.x / 23.09.x | 3.11.x | 2.19.x |
| 22.03.x / 22.09.x | 3.10.x | |
| 21.03.x / 21.09.x | 3.8.x | |
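For scripting, the table above can be turned into a small lookup. This is a convenience sketch rather than anything shipped in the repository: `COMPATIBILITY` and `required_versions` are hypothetical names, and `None` stands for a cell the table leaves empty.

```python
# Rows copied from the table: (core release prefixes, Python, Pantsbuild).
COMPATIBILITY = [
    (("24.03", "24.09"), "3.12", "2.21"),
    (("23.03", "23.09"), "3.11", "2.19"),
    (("22.03", "22.09"), "3.10", None),
    (("21.03", "21.09"), "3.8", None),
]


def required_versions(core_version):
    """Return (python, pantsbuild) for a core version such as '24.03.1'."""
    # Match on the "YY.MM" release prefix and ignore the patch component.
    prefix = ".".join(core_version.split(".")[:2])
    for releases, python, pants in COMPATIBILITY:
        if prefix in releases:
            return python, pants
    raise KeyError(f"no compatibility entry for {core_version}")


print(required_versions("24.03.1"))  # ('3.12', '2.21')
```

Releases not listed in the table raise `KeyError` instead of guessing a version.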

License

Refer to the LICENSE file.