npgsql / efcore.pg

Entity Framework Core provider for PostgreSQL
PostgreSQL License
1.53k stars 223 forks source link

Switch to UUIDv7 for client-side GUID generation #2909

Closed tparvi closed 1 month ago

tparvi commented 11 months ago

In the past there have been few issues related to guid generation:

The issue https://github.com/npgsql/efcore.pg/issues/105 says We're dropping sequential Guid generation because PostgreSQL doesn't do clustered indices

The issue https://github.com/npgsql/efcore.pg/issues/2414 is closed because there really wasn't any proof/documentation provided that using sequential guids is beneficial

I ran into the following article that compared random uuids vs sequential and found an issue with random uuids: https://www.cybertec-postgresql.com/en/unexpected-downsides-of-uuid-keys-in-postgresql/

I guess my question has multiple parts

a) Does that article have any merit? I am not postgres expert so I have no way of knowing is this is a real issue or can I just keep using Guid.NewGuid() b) We are using Guid as primary key and generating it with Guid.NewGuid. There is also this GuidValueGenerator but documentation says it can only be used to generate non key values: https://www.npgsql.org/efcore/modeling/generated-properties.html?tabs=13%2Cefcore5#guiduuid-generation Is there a way to use that value generator directly like var id = GuidValueGenerator().Generate(). We need to generate the id client side. c) If sequential uuids would be beneficial are there any recommendations on how to generate them? There are of course loads of libraries that can generate e.g. v7 uuid like https://github.com/mareek/UUIDNext

roji commented 11 months ago

@tparvi thanks for posting this; this question actually came up several times in some other contexts as well, and I've been meaning to look into it.

I do think there's probably something here, and am reopening to consider re-introducing some form of client-side sequential GUID generation for better perf. I've done a very quick test using the built-in EF SequentialGuidValueGenerator, following the steps in the above blog post; it did not improve performance in any way (i.e. the number of buffers was the same). I don't remember the details right now, but SequentialGuidValueGenerator was specifically crafted to match SQL Server sequential GUID generation, and so probably doesn't make sense for PG. But we should try to use a .NET UUIDv7 generator instead (e.g. based on the above-suggested UUIDNext), and at that point we should be able to get the same performance benefits as in the post.

If anyone wants to take a stab at this, please do!

In the meantime, remember that you can simply do server-side UUID generation; this should work just as well, except for scenarios where you're inserting both principal and dependent in the same SaveChanges, at which point an extra roundtrip is needed to fetch the principal's UUID (in order to send it as the foreign key of the dependent).

IMPORTANT: https://github.com/dotnet/efcore/issues/30753 seems to indicate that for SQL Server, client-generation of GUIDs may not be ideal for perf in any case. I'm not sure if that would apply to PG as well (would need to investigate), but users always have the option of installing the pg_uuidv7 extension and setting up uuid_generate_v7() as the column default value (for server-side UUIDv7 generation).

PG test program with SequentialGuidValueGenerator ```c# await using var ctx = new BlogContext(); await ctx.Database.EnsureDeletedAsync(); await ctx.Database.EnsureCreatedAsync(); const int totalRows = 10_000_000; const int batchSize = 1000; for (var i = 0; i < totalRows / batchSize; i++) { Console.WriteLine($"Batch {i} of {totalRows / batchSize}"); for (var j = 0; j < batchSize; j++) ctx.Blogs.Add(new Blog()); await ctx.SaveChangesAsync(); ctx.ChangeTracker.Clear(); } public class BlogContext : DbContext { public DbSet Blogs { get; set; } protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder) => optionsBuilder.UseNpgsql("Host=localhost;Username=test;Password=test"); protected override void OnModelCreating(ModelBuilder modelBuilder) { modelBuilder.Entity().Property(b => b.SequentialGuid) .HasValueGenerator(typeof(SequentialGuidValueGenerator)); } } public class Blog { public int Id { get; set; } public Guid RegularGuid { get; set; } public Guid SequentialGuid { get; set; } } ```
ProTip commented 7 months ago

There has been some buzz around this in recent months.

Benchmarks: https://ardentperf.com/2024/02/03/uuid-benchmark-war/

Lingering v7 support patch: https://commitfest.postgresql.org/47/4388/

roji commented 7 months ago

Thanks @ProTip, that's useful!

We could take a dependency on https://github.com/stevesimmons/uuid7-csharp (thanks @Brar), or https://github.com/mareek/UUIDNext. https://github.com/dotnet/runtime/issues/88290 tracks adding UUIDv7 generation in .NET itself.

One possible thing to keep in mind is that UUIDv7 isn't yet a finalized standard (still in draft). But if the generation functions make it into PG itself, that should be good enough for us too.

In the meantime, anybody who wants UUIDv7 can easily wrap one of the above .NET libraries with an EF value generator.

cremor commented 2 months ago

dotnet/runtime#88290 tracks adding UUIDv7 generation in .NET itself.

This has been superseeded by https://github.com/dotnet/runtime/issues/103658 which was completed recently. .NET 9 will contain Guid.CreateVersion7().

roji commented 2 months ago

Thanks @cremor - that's indeed perfect for our needs. We should switch to this for 9.

cremor commented 2 months ago

Will Npgsql.EntityFrameworkCore.PostgreSQL v9 actually target/require .NET 9? It looks like EF Core 9 will only target .NET 8 (just like EF Core 7 only targeted .NET 6).

roji commented 1 month ago

EF 9 (and therefore EFCore.PG 9 as well) will indeed target net8.0, not net9.0; this means we can't yet use the new BCL API for generating UUIDv7. So we can either:

  1. Temporarily copy UUIDv7 generation code into the provider (I don't want us to add a full nuget dependency for this), and then remove it in 10. The implementation would need to be exactly the same as the net9.0 runtime one, since we don't want changes between 9 and 10; essentially I think we'd copy in the runtime implementation, assuming that's feasible.
  2. Defer this issue for 10 altogether.

Is someone interesting in tacking (1), possibly @Timovzl?

Timovzl commented 1 month ago

@roji How strong is the need to avoid changes between 9 and 10? The reason I ask is that it is very possible that .NET 10 will see improvements to the UUIDv7 generator, especially to make it monotonic intra-millisecond. In the last paragraph of his comment, Tanner Gooding states:

These options can help guarantee uniqueness for a given UUID generator, but they are more advanced and were considered out of scope for the BCL to provide in .NET 9 given the limited timeframe before we lock down for RC1.

Under the same issue, see also my own recommendation for the implementation. Personally, I'm especially concerned with intra-millisecond monotonicity and (limited) resilience to clock adjustments, both of which I'll be pushing for.

Does this affect the direction the EF team would like to go in for now?

roji commented 1 month ago

How strong is the need to avoid changes between 9 and 10?

Let's put it this way... If EFCore.PG targeted net9.0 today, I'd just use the new .NET API; if it then changed between 9 and 10, that would obviously affect the GUIDs produced by EFCore.PG. If I understand things correctly, adding intra-millisecond monotonicity in 10 wouldn't "invalidate" or "obsolete" the GUIDs produced by the 9 implementation - it would simply improve their quality.

Or put another way, it's probably better to start producing whatever .NET 9 produces in EFCore.PG 9, rather than continuing to produce random GUIDs as currently, no?

ChrisJollyAU commented 1 month ago

@roji I've just done something like this for EFCore.Jet. See CirrusRedOrg/EntityFrameworkCore.Jet#254

Used the original SequentialGuidValueGenerator together with the stuff they did for CreateVersion7 at https://github.com/tannergooding/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Guid.cs#L304

It looks right and I finally have the test Can_use_sequential_GUID_end_to_end_async working

roji commented 1 month ago

Thanks @ChrisJollyAU! Is this copy-pasted from the runtime code, or your own implementation?

ChrisJollyAU commented 1 month ago

A bit of both. If you look at the CreateVersion7 function it actually references some private readonly variables in an internal struct (_a,_b,_c,_d). Since I just have bytes (rather than the int, short etc that the internal variables they use).

You can see for _c and _d the code is basically the same. With _c the main difference is I query the 2 bytes from the byte array, do the modification like they do, then write it back to my array. Compared to them having _c in the struct already as a short than can be used directly

The unix timestamp part should end up the same but just done with a byte array (versus an int and short in a struct - same length 48 bits = 6 bytes)

The top section mostly deals with the counter to create the timestamp. Bascially comes from the original SequentialGuidValueGenerator in EFCore.

By the looks of it you do need the _counter. Its later used to create the timestamp needed. You don't get a sequential guid otherwise

I have tried setting up a value generator for the above mentioned test like

public class UUIDv7Generator : ValueGenerator
{
    protected override object NextValue(EntityEntry entry) => Guid.CreateVersion7(DateTimeOffset.UtcNow);

    public override bool GeneratesTemporaryValues { get; }
}

And adding this to the context (BronieContext)

protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    base.OnModelCreating(modelBuilder);
    modelBuilder.Entity<Pegasus>().Property(e => e.Id).HasValueGenerator<UUIDv7Generator>().ValueGeneratedOnAdd();
}

As the tests are running on .Net 9 this is available. Unfortunately it didn't produce fully sequential

Timovzl commented 1 month ago

Nice work, @ChrisJollyAU.

I just wanted to clarify that the need for the generator to be monotonic (intra-millisecond) is not being addressed by the initial .NET 9 implementation of UUIDv7. While it is possible to add logic on top of that, that's equivalent to rolling a custom implementation.

Do we prefer a custom implementation, or the .NET 9 implementation with its not-entirely-sequential nature?

roji commented 1 month ago

Do we prefer a custom implementation, or the .NET 9 implementation with its not-entirely-sequential nature?

@Timovzl as I wrote above, if we targeted net9.0 today, I think we'd almost certainly use the .NET API - unless you think intra-millisecond is particulaly important in the context of PG, justifying a diverging implementation; I'd just rather not have to maintain this kind of thing within an EF provider. Similarly, in 10 we're almost sure to target net10.0, at which point we'd switch to the runtime implementation for the same reason.

So I'm leaning towards just using whatever .NET 9 does, and then switching to the runtime API next year. But I'm open to a custom implementation if you really believe that's important, and if there's a good chance that would also go into the runtime soon, allowing us to once again remove our custom implementation at some point.

I guess I'll defer to you guys overall - a PR for this would also be very helpful, as there's tons of other stuff happening at the moment.

Timovzl commented 1 month ago

I'll have to reason out loud and get some input on this, as I'm not a user of the feature. I like to have my IDs on construction. :smile:

Starting from the ideal of not having to maintain an ID generator in an EF provider, would the .NET 9 implementation suffice? The latest discussions color it unlikely that .NET 10 will see built-in intra-timestamp monotonicity either, so we'd have to be prepared to live without it.

Performance

With timestamp-based IDs that are shuffled intra-millisecond, all insertions will generally go into the last or second-to-last page, because we're always inserting something at least close to the greatest existing ID. We would have most of our performance gains.

Points of attention:

Least astonishment

If you Add or AddRange 10 invoice lines, under the assumption of a sequential ID generator, you'd expect them to receive IDs that match the input order. This would be missing. How bad is that? To me, it is unacceptable, but I'm a stickler for this sort of least astonishment.

roji commented 1 month ago

If you Add or AddRange 10 invoice lines, under the assumption of a sequential ID generator, you'd expect them to receive IDs that match the input order. This would be missing. How bad is that? To me, it is unacceptable, but I'm a stickler for this sort of least astonishment.

I don't think that's a valid assumption; the sequential nature of the GUIDs we'd produce here is for optimizing indexing in PostgreSQL, and not any sort of guarantee as to the contents of the actual GUID contents as seen by the user. Users should generally not care about the GUID contents (and in any case, nobody has complained so far about about the totally random GUIDs we're currently producing).

Other than that, it sounds like we're fine going with the current .NET 9 implementation? If so, does either one of you want to submit a PR?

ChrisJollyAU commented 1 month ago

I've just been playing around with a couple of things. Wrote myself a test to see how my function compares to the .Net 9 function

public void mynet9guidtest()
{
    DateTimeOffset dtoNow = DateTimeOffset.UtcNow;
    Guid net9internal = Guid.CreateVersion7(dtoNow);
    Guid custom = Next(dtoNow);
    var bytenet9 = net9internal.ToByteArray().AsSpan(0, 6);
    var bytecustom = custom.ToByteArray().AsSpan(0,6);
    Assert.Equal(bytenet9,bytecustom);
    Assert.Equal(net9internal.Version,custom.Version);
    Assert.Equal(net9internal.Variant,custom.Variant);
}

Under the assumption that given the same timestamp, the timestamp part should be the same and the version and variant as well. Main thing was the order of bytes for the timestamp part was a bit off. Easy fix though. Version is correct. Variant is weird. Feels like I'm looking at Schrodingers Cat. Run the test normally as release. Passes every time. Run in Debug and it fails.

Any thoughts on going with the standard .Net 9 implementation as default but have an option to enable the intra-millisecond functionality?

While you can write tests like Can_use_sequential_GUID_end_to_end_async and have the above scenario with adding the 10 invoice lines, but outside of that and being somewhat pedantic about the order, is there much real world expectation (or wish) that efcore.pg produces perfectly sequential guid's that align with the order that it was created

Timovzl commented 1 month ago

@roji I'm looking into creating a PR.

The main challenge seems to be that there is a ton of repetition. NpgsqlModelBuilderExtensions, NpgsqlPropertyExtensions, NpgsqlModelExtensions, NpgsqlAnnotationNames, and NpgsqlPropertyBuilderExtensions all need to be changed. Moreover, unrelated methods like UseSerialColumns explicitly disable specific pieces of competing functionality:

property.SetHiLoSequenceName(null);
property.SetHiLoSequenceSchema(null);

All such places need to account for the existence of the new option, since there no abstraction was used to disable everything before enabling the intended thing. This will take some time, and require careful review. :wink:

ChrisJollyAU commented 1 month ago

@Timovzl Just pushed an update to my PR on EFCore.Jet which fixes the timestamp byte order

If you use the Next function in the equentialGuidEndToEndTest that is the one without the counter for intra-millisecond perfectly sequential Guid aka equivalent to the .Net 9 version.

Test above it passes. Worked out what was going on with the variant. The .Version property reads it as a 4 bit value and doesn't mask out the "don't care" bits. See table at https://datatracker.ietf.org/doc/html/rfc9562#variant_field . In the end it was only differeing in the "don't care" bits

Sounds like you have some other stuff disabled for Guids. All I was going to do was basically the same as EFCore.Jet: Add the new sequential guid value generator and pdate the FindForType in NpgsqlValueGeneratorSelector to use the new generator

roji commented 1 month ago

All such places need to account for the existence of the new option [...]

I think you may be going in the wrong direction here. Take a look at the SQL Server provider: it uses SequentialGuidValueGenerator by default (which generates SQL Server-specific sequential GUIDs, not UUIDv7) - is there any reason not to do the same, and add a Uuid7ValueGenerator which would be used by default? Otherwise what design do you have in mind, what kind of option?

ChrisJollyAU commented 1 month ago

@roji That's the same as where I was going. See the PR just created

Timovzl commented 1 month ago

@roji Ahh, you simply wanted to replace the default. Yeah, that's a lot easier. 😆

@ChrisJollyAU, let's swap notes. I did a pure copy of the .NET 9 code, with a minimal change where I reinterpret the Guid as a custom GuidDoppleganger struct in order to be able to reach the private fields. Needing the struct is mildly convoluted, but it virtually keeps us from having to change/rewrite the method, staying close to the source material instead. Interesting trade-off.

Gotta go now. Will share code next Monday.

roji commented 1 month ago

Thanks - @Timovzl can you please take a look at #3249?