jayvynl / django-clickhouse-backend

Django clickhouse database backend.
MIT License
130 stars 21 forks source link
clickhouse database django orm python

Django ClickHouse Database Backend

PyPI - Version PyPI - Python Version PyPI django version PyPI - Downloads GitHub licence GitHub Action: Test Coverage Status Code style: black

Django clickhouse backend is a django database backend for clickhouse database. This project allows using django ORM to interact with clickhouse, the goal of the project is to operate clickhouse like operating mysql, postgresql in django.

Thanks to clickhouse driver, django clickhouse backend use it as DBAPI. Thanks to clickhouse pool, it makes clickhouse connection pool.

Read Documentation for more.

Features:

Notes:

Requirements:

Get started

Installation

$ pip install django-clickhouse-backend

or

$ git clone https://github.com/jayvynl/django-clickhouse-backend
$ cd django-clickhouse-backend
$ python setup.py install

Configuration

Only ENGINE is required in database setting, other options have default values.

In the most cases, you may just use clickhouse to store some big events tables, and use some RDBMS to store other tables. Here I give an example setting for clickhouse and postgresql.

INSTALLED_APPS = [
    # ...
    "clickhouse_backend",
    # ...
]
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "HOST": "localhost",
        "USER": "postgres",
        "PASSWORD": "123456",
        "NAME": "postgres",
    },
    "clickhouse": {
        "ENGINE": "clickhouse_backend.backend",
        "NAME": "default",
        "HOST": "localhost",
        "USER": "DB_USER",
        "PASSWORD": "DB_PASSWORD",
    }
}
DATABASE_ROUTERS = ["dbrouters.ClickHouseRouter"]
# dbrouters.py
from clickhouse_backend.models import ClickhouseModel

def get_subclasses(class_):
    classes = class_.__subclasses__()

    index = 0
    while index < len(classes):
        classes.extend(classes[index].__subclasses__())
        index += 1

    return list(set(classes))

class ClickHouseRouter:
    def __init__(self):
        self.route_model_names = set()
        for model in get_subclasses(ClickhouseModel):
            if model._meta.abstract:
                continue
            self.route_model_names.add(model._meta.label_lower)

    def db_for_read(self, model, **hints):
        if (model._meta.label_lower in self.route_model_names
                or hints.get("clickhouse")):
            return "clickhouse"
        return None

    def db_for_write(self, model, **hints):
        if (model._meta.label_lower in self.route_model_names
                or hints.get("clickhouse")):
            return "clickhouse"
        return None

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        if (f"{app_label}.{model_name}" in self.route_model_names
                or hints.get("clickhouse")):
            return db == "clickhouse"
        elif db == "clickhouse":
            return False
        return None

You should use database router to automatically route your queries to the right database. In the preceding example, I write a database router which route all queries from subclasses of clickhouse_backend.models.ClickhouseModel or custom migrations with a clickhouse hint key to clickhouse. All other queries are routed to the default database (postgresql).

Model Definition

Clickhouse backend support django builtin fields and clickhouse specific fields.

Read fields documentation for more.

Notices about model definition:

from django.db.models import CheckConstraint, IntegerChoices, Q
from django.utils import timezone

from clickhouse_backend import models

class Event(models.ClickhouseModel):
    class Action(IntegerChoices):
        PASS = 1
        DROP = 2
        ALERT = 3
    ip = models.GenericIPAddressField(default="::")
    ipv4 = models.IPv4Field(default="127.0.0.1")
    ip_nullable = models.GenericIPAddressField(null=True)
    port = models.UInt16Field(default=0)
    protocol = models.StringField(default="", low_cardinality=True)
    content = models.JSONField(default=dict)
    timestamp = models.DateTime64Field(default=timezone.now)
    created_at = models.DateTime64Field(auto_now_add=True)
    action = models.EnumField(choices=Action.choices, default=Action.PASS)

    class Meta:
        ordering = ["-timestamp"]
        engine = models.MergeTree(
            primary_key="timestamp",
            order_by=("timestamp", "id"),
            partition_by=models.toYYYYMMDD("timestamp"),
            index_granularity=1024,
            index_granularity_bytes=1 << 20,
            enable_mixed_granularity_parts=1,
        )
        indexes = [
            models.Index(
                fields=["ip"],
                name="ip_set_idx",
                type=models.Set(1000),
                granularity=4
            ),
            models.Index(
                fields=["ipv4"],
                name="ipv4_bloom_idx",
                type=models.BloomFilter(0.001),
                granularity=1
            )
        ]
        constraints = (
            CheckConstraint(
                name="port_range",
                check=Q(port__gte=0, port__lte=65535),
            ),
        )

Migration

$ python manage.py makemigrations

this operation will generate migration file under apps/migrations/

then we mirgrate

$ python manage.py migrate --database clickhouse

for the first time run, this operation will generate django_migrations table with create table sql like this

> show create table django_migrations;

CREATE TABLE other.django_migrations
(
    `id` Int64,
    `app` FixedString(255),
    `name` FixedString(255),
    `applied` DateTime64(6, 'UTC')
)
ENGINE = MergeTree
ORDER BY id
SETTINGS index_granularity = 8192 

we can query it with results like this

> select * from django_migrations;

┌──────────────────id─┬─app─────┬─name─────────┬────────────────────applied─┐
│ 1626937818115211264 │ testapp │ 0001_initial │ 2023-02-18 13:32:57.538472 │
└─────────────────────┴─────────┴──────────────┴────────────────────────────┘

migrate will create a table with name event as we define in the models

> show create table event;

CREATE TABLE other.event
(
    `id` Int64,
    `ip` IPv6,
    `ipv4` IPv6,
    `ip_nullable` Nullable(IPv6),
    `port` UInt16,
    `protocol` LowCardinality(String),
    `content` String,
    `timestamp` DateTime64(6, 'UTC'),
    `created_at` DateTime64(6, 'UTC'),
    `action` Enum8('Pass' = 1, 'Drop' = 2, 'Alert' = 3),
    INDEX ip_set_idx ip TYPE set(1000) GRANULARITY 4,
    INDEX port_bloom_idx port TYPE bloom_filter(0.001) GRANULARITY 1,
    CONSTRAINT port_range CHECK (port >= 0) AND (port <= 65535)
)
ENGINE = ReplacingMergeTree
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY id
SETTINGS index_granularity = 8192

Operate Data

create

for i in range(10):
    Event.objects.create(ip_nullable=None, port=i,
                         protocol="HTTP", content="test",
                         action=Event.Action.PASS.value)
assert Event.objects.count() == 10

query

queryset = Event.objects.filter(content="test")
for i in queryset:
    print(i)

update

Event.objects.filter(port__in=[1, 2, 3]).update(protocol="TCP")
time.sleep(1)
assert Event.objects.filter(protocol="TCP").count() == 3

delete

Event.objects.filter(protocol="TCP").delete()
time.sleep(1)
assert not Event.objects.filter(protocol="TCP").exists()

Except for the model definition, all other operations are like operating relational databases such as mysql and postgresql

Testing

Writing testcase is all the same as normal django project. You can use django TestCase or pytest-django. Notice: clickhouse use mutations for deleting or updating. By default, data mutations is processed asynchronously. That is, when you update or delete a row, clickhouse will perform the action after a period of time. So you should change this default behavior in testing for deleting or updating. There are 2 ways to do that:

Sample test case.

from django.test import TestCase

class TestEvent(TestCase):
    def test_spam(self):
        assert Event.objects.count() == 0

Distributed table

This backend support distributed DDL queries (ON CLUSTER clause) and distributed table engine.

The following example assumes that a cluster defined by docker compose in this repository is used. This cluster name is cluster, it has 2 shards, every shard has 2 replica.

Configuration

DATABASES = {
    "default": {
        "ENGINE": "clickhouse_backend.backend",
        "OPTIONS": {
            "migration_cluster": "cluster",
            "settings": {
                "mutations_sync": 2,
                "insert_distributed_sync": 1,
                "insert_quorum": 2,
                "alter_sync": 2,
            },
        },
        "TEST": {"cluster": "cluster"},
    },
    "s1r2": {
        "ENGINE": "clickhouse_backend.backend",
        "PORT": 9001,
        "OPTIONS": {
            "migration_cluster": "cluster",
            "settings": {
                "mutations_sync": 2,
                "insert_distributed_sync": 1,
                "insert_quorum": 2,
                "alter_sync": 2,
            },
        },
        "TEST": {"cluster": "cluster", "managed": False, "DEPENDENCIES": ["default"]},
    },
    "s2r1": {
        "ENGINE": "clickhouse_backend.backend",
        "PORT": 9002,
        "OPTIONS": {
            "migration_cluster": "cluster",
            "settings": {
                "mutations_sync": 2,
                "insert_distributed_sync": 1,
                "insert_quorum": 2,
                "alter_sync": 2,
            },
        },
        "TEST": {"cluster": "cluster", "managed": False, "DEPENDENCIES": ["default"]},
    },
    "s2r2": {
        "ENGINE": "clickhouse_backend.backend",
        "PORT": 9003,
        "OPTIONS": {
            "migration_cluster": "cluster",
            "settings": {
                "mutations_sync": 2,
                "insert_distributed_sync": 1,
                "insert_quorum": 2,
                "alter_sync": 2,
            },
        },
        "TEST": {"cluster": "cluster", "managed": False, "DEPENDENCIES": ["default"]},
    },
}

Extra settings explanation:

Model

cluster in Meta class will make models being created on cluster.

from clickhouse_backend import models

class Student(models.ClickhouseModel):
    name = models.StringField()
    address = models.StringField()
    score = models.Int8Field()

    class Meta:
        engine = models.ReplicatedMergeTree(
            "/clickhouse/tables/{uuid}/{shard}",
            # Or if you want to use database name or table name, you should also use macro instead of hardcoded name.
            # "/clickhouse/tables/{database}/{table}/{shard}",
            "{replica}",
            order_by="id"
        )
        cluster = "cluster"

class DistributedStudent(models.ClickhouseModel):
    name = models.StringField()
    score = models.Int8Field()

    class Meta:
        engine = models.Distributed(
            "cluster", models.currentDatabase(), Student._meta.db_table, models.Rand()
        )
        cluster = "cluster"

CRUD

Just like normal table, you can do whatever you like to distributed table.

students = DistributedStudent.objects.bulk_create([DistributedStudent(name=f"Student{i}", score=i * 10) for i in range(10)])
assert DistributedStudent.objects.count() == 10
DistributedStudent.objects.filter(id__in=[s.id for s in students[5:]]).update(name="lol")
DistributedStudent.objects.filter(id__in=[s.id for s in students[:5]]).delete()

Migrate

If migration_cluster is not specified in database configuration. You should always run migrating on one specific cluster node. Because other nodes do not know whether migrations have been applied by any other node.

If migration_cluster is specified. Then migration table(named django_migrations) will be created on the specified cluster. When applied, migration operations of model with cluster defined in Meta class will be executed on cluster, other migration operations will be executed locally. This means distributed table will be created on all nodes as long as any node has applied the migrations. Other local table will only be created on node which has applied the migrations.

If you want to use local table in all nodes, you should apply migrations multiple times on all nodes. But remember, these local tables store data separately, currently this backend do not provide means to query data from other nodes.

python manage.py migrate
python manage.py migrate --database s1r2
python manage.py migrate --database s2r1
python manage.py migrate --database s2r2

Update

When updated from django clickhouse backend 1.1.0 or lower, you should not add cluster related settings to your existing project. Because:

If you really want to use cluster feature with existing project, you should manage schema changes yourself. These steps should be tested carefully in test environment. Clickhouse docs may be helpful.

  1. Apply all your existing migrations.
  2. Change your settings and model.
  3. Generate new migrations.
  4. Log into your clickhouse database and change table schemas to reflect your models.
  5. Apply migrations with fake flag.
python manage.py migrate
# Change your settings and model
python manage.py makemigrations
# Log into your clickhouse database and change table schemas to reflect your models.
python manage.py migrate --fake

Test

To run test for this project:

$ git clone https://github.com/jayvynl/django-clickhouse-backend
$ cd django-clickhouse-backend
# docker and docker-compose are required.
$ docker-compose up -d
$ python tests/runtests.py
# run test for every python version and django version
$ pip install tox
$ tox

Changelog

All changelogs.

Contributing

Read Contributing guide.

License

Django clickhouse backend is distributed under the MIT license.