paillave / Etl.Net

Mass processing data with a complete ETL for .net developers
https://paillave.github.io/Etl.Net/
MIT License
668 stars 92 forks source link

Reusing DB context issue with large collections #438

Open ctrlaltdan opened 1 year ago

ctrlaltdan commented 1 year ago

Hi all,

I've been using the EfCoreSave operator with SaveMode.EntityFrameworkCore option enabled as I'm planning to write my data to Postgres.

I've noticed when dealing with large streams, this operator creates a bottleneck for two reasons:

  1. A single TPT is used to write the data
  2. The DbContext is never disposed of, therefore creating a large quantity of unused entities in the change tracker (albeit in with "detached" state).

I've changed a local copy to look as follows:

var ret = args.SourceStream.Observable
    .Chunk(args.BatchSize)
    .Map(i => i.Select(j => (Input: j, Entity: args.GetEntity(j))).ToList())
    .Do(i =>
    {
        var dbContextFactory = args.KeyedConnection == null
            ? this.ExecutionContext.DependencyResolver.Resolve<IDbContextFactory<TMyContext>>()
            : this.ExecutionContext.DependencyResolver.Resolve<IDbContextFactory<TMyContext>>(args.KeyedConnection);
        using (var dbContext = dbContextFactory.CreateDbContext())
        {
            ProcessBatch(i, dbContext, args.BulkLoadMode);
        }
})

MS Docs on DbContextFactory

This allows multiple streams to write to the database at the same time, and crucially, disposes the DbContext after each operation (considerably improving performance/memory overhead in the change tracker).

On my sample of 500k+ entries the import is now silky smooth.

Was there a deliberate design decision behind this?

paillave commented 1 year ago

Hello, It is a very good suggestion indeed, even if the implementation you bring here has a flaw (in case of several operators working against EF core, InvokeInDedicatedThreadAsync ensures no call is done in parallel since it is not supported). I will work on a amendment on this regard. Thanks a lot for this! :smiley:

ctrlaltdan commented 1 year ago

Great to hear!

I eagerly removed the dbcontext instance from the thread lock as I assumed the node instance was threadsafe, and the dbcontext instance was local to the batch anyway.

My thinking that parallel queries are supported, just not within a single dbcontext. Similar to how the .net request pipeline doesn't enforce singleton access to a dbcontext.

Thanks for picking this up. I'm enjoying learning about your framework.

paillave commented 1 year ago

Operators are meant to work in parallel, (that's the purpose). But EF core raises an exception when a query is already being run on the same DbContext in a concurrent thread (this is for this very purpose I had to develop the method InvokeInDedicatedThreadAsync).

Just as an extra information, in the current implementation, the fact the dbContext is not disposed it is absolutely normal. It is not up to an ETL operator to decide that it must be trashed as it has been created in a much higher level context.

Of course, if you issue a different DbContext at each use, everything that is mentioned above is irrelevant, but in most of situations a DbContext is directly given to the ProcessRunner. So what you suggest will be another way to work: if a IDbContextFactory it will be used instead.

About proper release of entities, I will make a quick win: execute a dbContext.ChangeTracker.Clear() at every batch unless the option KeepChangeTracker is activated. This will permit entities to be released from memory if they are not used anywhere else.

paillave commented 1 year ago

a new version 2.1.6-beta is available. I didn't test it too well to speak franckly. You can give it a try. I should work with IDbContextFactory out of the box. If you want to specify the type of the DbContext, you must call .WithContextType<your-dbContext-type>() on the options builder. Let me know if it suits you.