Model support for Citus Data

kirkbrauer commented 3 years ago

With Microsoft's recent acquisition of Citus Data and their rollout of Hyperscale (Citus) for Azure Database for PostgreSQL, we should expect more Azure users to want Citus integration for their EF Core PostgreSQL database models.

Adding an extension package (Npgsql.Citus?) would allow users to opt into Citus support and configure their distributed tables, reference tables, local tables, and set access methods using LINQ model configuration extension methods.

I believe this could be accomplished using custom entity annotations in migrations that generate the required SQL to configure Citus.

Any suggestions for this would be greatly appreciated!

roji commented 3 years ago

@kirkbrauer we can certainly implement model configuration for Citus that would affect generated migration DDL; there's a proposal to do that for PostgreSQL-XL, with #1697 for an implementation which can be used as inspiration.

Can you point to the Citus docs with the specific DDL features etc?

kirkbrauer commented 3 years ago

Here is the documentation for the Citus DDL Creating and Modifying Distributed Tables DDL.

Since Citus is just an extension, there is no custom DDL, configuring tables just requires SELECTing various Citus functions that come with the extension. These functions only need to be run after creating the table, dropping tables just uses the regular PostgreSQL DDL.

One thing we do need to make sure is that the the extension is enabled for the database by running CREATE EXTENSION citus. This might be an auto configuration or we could require the user to use the Npgsql extensions API to create it.

roji commented 3 years ago

@kirkbrauer sorry for the long silence.

The best way forward here, is for someone who's familiar with Citus (maybe yourself?) to submit an API proposal (in this issue, not code), with the new API to be added. We can iterate over that until we're happy, and then proceed to implementation.

One thing we do need to make sure is that the the extension is enabled for the database by running CREATE EXTENSION citus. This might be an auto configuration or we could require the user to use the Npgsql extensions API to create it.

The PG provider already allows you specify extensions to be enabled - this is being used for PostGIS and various other extensions. We could make that implicit based on the proposed Citus metadata API, but that's a pretty small point that shouldn't block us.

kirkbrauer commented 3 years ago

@roji Thank you for getting back about this issue!

I am a bit newer to Citus data as we are currently investigating using it on our project. That's why I preemptively created this issue in case we decide to go ahead with that in the future. However, I have been reviewing the docs, so I think I can propose a draft API and maybe somebody with a bit more experience can review it and suggest additions.

The API I'm currently envisioning looks like this:

// Distribute the 'Products' table by 'ShopId'
context.Entity<Product>().DistributeBy(x => x.ShopId);

// Distrubute the 'Variants' table by 'ShopId' and co-locate with 'Products'
context.Entity<Variant>().DistributeBy(x => x.ShopId).ColocateWith(context.Products);

// Make 'States' a reference table located on all nodes
context.Entity<State>().AsReferenceTable();

// Store 'Events' as append-only columnar data
context.Entity<Event>().HasAccessMethod(CitusAccessMethod.Columnar);

I'm open to suggestions on the naming conventions for these extension methods, for example, it may make more sense to follow the EF Core naming conventions:

context.Entity<Product>().HasDistributionProperty(x => x.ShopId);

roji commented 3 years ago

@kirkbrauer reading the docs, thing thing I'm noticing is that the Citus-specific stuff is implemented as function calls which are separate from the CREATE TABLE DDL, which itself is completely standard. This means that any provider support is largely optional, since users can already edit their migrations and include the Citus function calls via raw SQL. Of course, this doesn't mean we can't or shouldn't add provider support - just that it's a bit more optional.

Regarding your concrete API proposal, this looks like a good start! One thing I'd recommend, is to move all the Citus-specific API under a specific method and builder; this would make it clear that these methods are for Citus only (not at all obvious when looking at them), and would avoid polluting the Intellisense with multiple options which most users don't need to care about. So maybe something like the following:

context.Entity<Product>().ConfigureCitus(c =>
{
    // Distribute the 'Products' table by 'ShopId'
    c.DistributeBy(x => x.ShopId);
    ...
});

BTW I'm not seeing HasAccessMethod/Columnar in the linked docs - is there another doc page with that?

kirkbrauer commented 3 years ago

@roji That is a good point, thankfully Citus data is not a modification of the core Postgres DDL, which makes it much easier to deal with and does make support more optional. The main reason why we would want to add direct support would be due to Azure's adoption of Citus (Hyperscale) as one of their default Azure SQL Database tiers.

I like the idea to scope the Citus-specific configuration under a specific method and builder. I'm sure that would also make it much more readable overall, allowing anyone to immediately determine where those extension method come from. This would also allow us to add additional extension methods without worrying about polluting the base intellisense or causing conflicts in the future.

Here are the docs for the alter_table_set_access_method function defined by Citus Data: http://docs.citusdata.com/en/v10.0/develop/api_udf.html#alter-table-set-access-method.

roji commented 3 years ago

Great, thanks.

The overall approach seems good to me - we can always refine specific method names later. Is this something you'd like to work on and submit a PR?

kirkbrauer commented 3 years ago

I would, but first I would like to take a look at all the other methods in Citus Data, there are some specific configurations that we may need to be aware of via metadata so we can produce the correct migrations between them.

A good example of this would be un-distributing a table and verifying that there are no foreign keys that would need to be cascaded as part of the un-distribution process. The same applies for making a table to a reference table depending on if it's already distributed on a single shard or not.

roji commented 3 years ago

Yep. This could get complicated, with generating function calls when the distribution column changes, or various other things.

Since EF Core is optional here (not part of the CREATE TABLE DDL), I think that at some point it doesn't make that much sense to provide the EF sugar... in other words, the complexity of detecting all the possible changes and generating the correct function call may not be worth it, considering that users have the option to doing them in raw SQL.

kirkbrauer commented 3 years ago

It might be better to just settle for a basic declarative syntax, so users can at least configure their schemas for Citus. If they want to do a migration to modify the schema, we can handle the basic use cases, but they would have to provide the SQL to make any more complex modifications. We could always add additional scenarios based on user feedback.

roji commented 3 years ago

It's possible - we could make unsupported/advanced changes throw, forcing users to edit scaffolded migration files to replace the unsupported fragment with the corresponding raw SQL themselves. Though it's not an ideal scenario, and it would be surprising/unexpected for some things to be supported and not others.

I don't know the full scope of possible operations here. If it's large and complex, I'd lean towards leaving it to the user with raw SQL, rather than trying to model even a subset of it. This way we avoid lots of complexity which in the end is basically sugar, and users are very clear that Citus requires going raw SQL in migrations (which really isn't the end of the world).

RoystonS commented 7 months ago

I've just been attempting to get EFCore migrations generated for making tables distributed/reference, toying with ideas similar to those mentioned above. There are some difficulties:

One huge piece of complexity is that in order to make existing tables into distributed/reference tables, they basically can't have any foreign key constraints to or from other tables.

[This is because a) reference tables and local tables can't have foreign key constraints to distributed tables, and b) distributed tables can only have foreign key constraints to colocated hash distributed tables or reference tables.]

So if I want to make, say, one of my existing local tables a reference table and all the others distributed, I need to drop all foreign keys, make the tables reference/distributed, and then recreate all the foreign keys.

A Citus-specific raw SQL bit would be fine, to emit the custom SELECT statements, but the pain of dropping and recreating all the foreign keys would be very painful in raw SQL, not to mention duplicating a ton of logic from other migrations.

For CreateTable operations, the table needs to be created, then distributed, before foreign key constraints are applied to it (otherwise you have a local table potentially referencing a distributed table, which isn't allowed). For AlterTable operations, any table changing distribution state (e.g. going local->distributed, reference->local or whatever) needs to have any foreign keys to or from it dropped before making the state changes (to all related tables) and then recreating the foreign keys. If such an operation is happening in the same migration as a new foreign key being added, it's easy to end up generating SQL that adds that foreign key twice.

This issue has been around for a while now, so I'd be curious to hear what people are doing in practice?

roji commented 7 months ago

@RoystonS thanks for the additional detail on alter table... What you describe bears some similarity to some other migration scenarios where foreign keys need to be dropped and recreated; for example, some modifications of a SQL SERVER IDENTITY column require dropping and recreating the IDENTITY column, which requires recreating any foreign keys pointing to it.

So far, the EF team's thinking was that this shouldn't be attempted automatically via migrations, since the changes are far-reaching and potentially dangerous; similarly, changing a table to distributed in Citus sounds like quite a big change, requiring careful thinking/planning, and not to be automatically via a simple change in an EF annotation. So having users ultimately handle this via raw SQL in migrations still seems like the right thing - even though it's non-trivial.

RoystonS commented 6 months ago

@roji I've been thinking about this, and have a suggestion which may help unblock this:

You're completely right that dealing with the Citus distribution of an existing table is a serious business, requiring, as you say, careful planning.

However, it's not a very big deal at all if the distribution characteristics are known at the time of creation of a table. The table just needs marking as distributed between creating the table and applying indexes and foreign keys. I'm wondering if it might be worth considering making that case work, even if the general update case doesn't work. ("Perfect is the enemy of good".)

If we were to make some sort of statement that modifying the distribution state of an existing table isn't supported, and focus on the 'CREATE TABLE' case only, the problem becomes much more tractable: for the DDL it's just a case of inserting some SELECT statements between the table creation and index creation.

roji commented 6 months ago

@RoystonS doing only creation may make sense - but there's a larger point here. We frequently get people asking for EF migrations support for some corner of table creation. For one thing, it's simply unfeasible for EF to cover the entire scope of what all databases out there support; the range of possibilities generally pretty large even when considering a single database, but it expands even more as different databases have different possibilities.

Even more importantly, adding EF migration support really is little more than just .NET sugar over the user simply writing out the DDL themselves; so what lies behind these requests is frequently a mistaken conception that using SQL directly in migrations is somehow bad, although it's perfectly fine to do this.

So I think introducing nice EF migration modeling wrappers does make sense when the thing modelled is relatively easy/simple, and also when it's standard across databases; but I'm a bit reluctant to go off and do a lot of work - adding lots of code that needs to be maintained - just so that users can use .NET rather than SQL to specify some feature in migrations...

The above isn't really a "no" - this really is case-by-case. If you want to think through what nice modeling support for Citus CREATE TABLE would look like, definitely go ahead and post a proposal here if you think that makes sense - that would allow us to discuss a concrete thing.

npgsql / efcore.pg

Model support for Citus Data #1912