ClickHouse migrations and extensibility points

Description

This issue proposes adding support for ClickHouse migrations in order to improve the development experience, make the database schema easier to evolve and maintain, and lay a foundation for ClickHouse customization within the OARS project.

Proposal

My proposed solution is to create an Open edX plugin that handles ClickHouse schema using Django models and migrations. This approach would simplify the extraction of data from the ClickHouse database by leveraging the existing Django infrastructure and migration capabilities. Alternatively, we can utilize the already available event-sink plugin, eliminating the need to establish a ClickHouse connection and use SQL statements directly.

It is crucial to provide a means of extending the ClickHouse tables through plugins, which would introduce extensibility to the data. This extension mechanism should ensure that each table possesses a unique ID, enabling developers to effectively establish relationships between custom tables or accommodate special use cases.

To achieve this, I propose leveraging the well-maintained django-clickhouse-backend library, which offers comprehensive support for Django's ORM (Object-Relational Mapping) with ClickHouse as the backend database.

Use cases:

An Open edX installation needs to rank users per course based on a custom metric. They create an Open edX plugin that saves the relevant data in ClickHouse (events, anonymous user data, interaction data) and creates relationships with the existing records. They also create an API to fetch the data and extend the learning micro-frontend to consume it.
An Open edX installation needs to show certain aggregated metrics in the LMS. They use the ClickHouse models to bring the data back into the LMS without integrating with the Superset API, which is faster and avoids network overhead.
...

Django is a fairly large project that does many things besides connecting to a database for its ORM: HTTP server, template rendering, form management, static assets, etc. While I agree that it does make sense to use an existing tool to manage migrations, I suggest that we use a tool that does only that. As an alternative to the Django ORM, I propose Alembic, which is "a lightweight database migration tool for usage with SQLAlchemy".

While I think that Alembic would be better than Django for the job, I'm not going to be super opinionated about it. However, I do have very strong opinions against the idea of tightly coupling edx-platform to ClickHouse. Once we go down that road there is no going back, as we have seen with MongoDB and Elasticsearch: these two data storage solutions are now intricately woven into edx-platform and removing them is a multi-year effort involving pliers and rusted crowbars.

If a plugin does need to interact with ClickHouse, then I expect that this plugin will be implemented similarly to a microfrontend: it's all going to be client-side code that interacts with a web API. And if we need to set up a web API, then it means that we will have to create a compatibility layer on top of ClickHouse. There is no reason to assume that this compatibility layer should reside in edx-platform, and I think it should not (see the pliers & crowbars argument above). This API could (should) be implemented in a separate web service. Unless I'm mistaken, this is exactly the purpose of a Learning Record Store (LRS), right?

For these reasons, and quite a few others, I strongly think that edx-platform should remain completely unaware of the very existence of ClickHouse. We should not introduce a dependency of edx-platform on ClickHouse, even via an optional plugin.

I agree that Django isn't well suited to this purpose, especially since the ClickHouse plugin doesn't seem to support some of the features we would definitely need, like materialized views. Alembic seems somewhat better in this regard, so I'd consider going that way first. I definitely don't think schema management should be tied to edx-platform; we could just run migrations (either Alembic or the Cairn style) from the Tutor plugin.
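For illustration, the idea of running migrations outside edx-platform can be sketched in a few lines of Python. This is only a minimal sketch of ordered, run-once migrations, not how Alembic or Cairn actually work; the names `MIGRATIONS`, `pending_migrations`, `run_migrations`, the file names, the table definitions, and the `execute` callable are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical migrations, keyed by file name. The numeric prefix defines the order.
MIGRATIONS: Dict[str, str] = {
    "0002_events_per_day.sql": (
        "CREATE MATERIALIZED VIEW IF NOT EXISTS events_per_day "
        "ENGINE = SummingMergeTree ORDER BY day "
        "AS SELECT toDate(ts) AS day, count() AS n FROM events GROUP BY day"
    ),
    "0001_create_events.sql": (
        "CREATE TABLE IF NOT EXISTS events (id UUID, ts DateTime) "
        "ENGINE = MergeTree ORDER BY ts"
    ),
}


def pending_migrations(available: Iterable[str], applied: Set[str]) -> List[str]:
    """Return the migrations that have not been applied yet, sorted by name."""
    return sorted(name for name in available if name not in applied)


def run_migrations(
    migrations: Dict[str, str],
    applied: Set[str],
    execute: Callable[[str], None],
) -> List[str]:
    """Apply each pending migration once, in order, and record it in `applied`.

    `execute` is an assumption: any callable that sends one SQL statement
    to ClickHouse (HTTP interface, CLI wrapper, client library, ...).
    """
    ran = []
    for name in pending_migrations(migrations, applied):
        execute(migrations[name])
        applied.add(name)
        ran.append(name)
    return ran
```

In a Tutor plugin, `execute` would presumably be called from an init job, and `applied` would be loaded from a bookkeeping table in ClickHouse rather than kept in memory.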

I believe that for your use cases you could create tracking log events in the frontend and/or backend to capture the data you want. You could either use the existing tracking-log-to-ClickHouse Vector pipeline, or create an edx-platform plugin that extends event-routing-backends to transform those events to xAPI and use the LRS or Vector pipelines to get them to ClickHouse. From there, a custom dbt package can be created to make performant downstream tables and views for querying back through Superset or whatever other custom mechanism you need. I will say that what you get by using the Superset API is baked-in security.
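As a rough sketch of the xAPI option, the transformation from a tracking log event to an xAPI statement might look like the following. It deliberately does not use the event-routing-backends API; the event name `myplugin.ranking.updated`, the `score` field, the activity URL layout, and the function `to_xapi_statement` are assumptions made up for the example, and a real transformer would anonymize the actor instead of sending the username.

```python
from typing import Any, Dict

# Hypothetical mapping from custom tracking event names to xAPI verb IRIs.
VERBS = {
    "myplugin.ranking.updated": "http://adlnet.gov/expapi/verbs/scored",
}


def to_xapi_statement(event: Dict[str, Any], lms_root: str) -> Dict[str, Any]:
    """Turn one tracking log event (already parsed to a dict) into an xAPI-shaped dict."""
    course_id = event["context"]["course_id"]
    return {
        "actor": {
            "objectType": "Agent",
            # Assumption: a real pipeline would use an anonymized ID here.
            "account": {"homePage": lms_root, "name": event["username"]},
        },
        "verb": {"id": VERBS[event["name"]]},
        "object": {
            "objectType": "Activity",
            "id": f"{lms_root}/course/{course_id}",
        },
        "result": {"score": {"raw": event["event"]["score"]}},
        "timestamp": event["time"],
    }


# Hypothetical event, shaped like an entry in the tracking logs.
example_event = {
    "name": "myplugin.ranking.updated",
    "username": "learner1",
    "time": "2023-06-01T12:00:00+00:00",
    "context": {"course_id": "course-v1:Org+C101+2023"},
    "event": {"score": 42},
}
statement = to_xapi_statement(example_event, "https://lms.example.com")
```

The resulting statement could then travel through the LRS or Vector pipeline like any other event, so the plugin itself never needs a ClickHouse connection.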

In general I agree that we shouldn't have a dependency on ClickHouse from the platform, but I think the ClickHouse event sink shows a path where we don't include it as a Python dependency and can still use it for the types of large-scale, real-time enrichment data that LRSs and tracking logs really aren't designed for. I'm definitely open to other options if there's something better we can do that doesn't blow up the complexity of the project.

openedx-unsupported / tutor-contrib-clickhouse