Optimisation of SQL generation logic

luber commented 1 year ago

Whenever some method of PostgreSqlJobQueue is executed - a new instance of the String class containing formatted SQL is created. That puts some footprint on the memory:

B/c SQL strings depend on _storage.Options, it would be more optimal to create those only once - during PostgreSqlJobQueue class construction...

frankhommers commented 1 year ago

A PR for this is very welcome!

azygis commented 1 year ago

I'll close this.

It really needs more context. Job queue is only one place which has the formatted strings.

1-3mb of memory doesn't sound like a huge footprint for us to sacrifice the readability. If someone disagrees, I'm open for discussion. As it is right now, I do not see real value.

azygis commented 1 year ago

Reopening after further investigation. Will do this different from what was the suggested PR since it's not just the job queue that has many raw SQLs. The job queue, though, formats the string in every dequeue loop iteration, making the allocations worse.

The idea is right though - since we always format the query with schema name, it's fine to do that once, although I will use memcache to store the queries for a period of time, which allows cleaning up queries that are not used (e.g. if someone checks dashboard and then the queries are no longer needed).

azygis commented 1 year ago

Went through all SQL queries and started caching them with #300. It helps a little, but not much - allocations stopped only on our side. Reason being is that we're using Dapper. Since PostgreSQL does not support named parameters, Npgsql has to parse the strings internally and replace the placeholders with positional parameters, meaning extra allocations are necessary. Unfortunately, Dapper does not support positional parameters and they are still thinking about it.

I might be willing to try rewriting the library to drop Dapper and going pure ADO.NET (in case Dapper devs think about it for too long), but not sure what others would feel about that.

@frankhommers @vytautask @Plasma

frankhommers commented 1 year ago

Hmm, I like Dapper (a lot). Is the benefit so huge that it'll justify the move away from it?

azygis commented 1 year ago

For me - yes, anything that avoids unnecessary allocations is better, but that's only my personal opinion. "Huge" would be hiperbole - it's not that big of a difference, hence me asking for other points of view.

I might try rewriting one part which has the biggest cost - Dequeue method - and checking the difference, that might show a better picture. Locally, of course, I don't want to have mismatching stuff released. Consistency is key.

frankhommers commented 1 year ago

Or could we just write one or a few Dapper like extensions that do use the positional parameters? I guess I would just do the positions based on alphabetic sort order.

azygis commented 1 year ago

Well, that pretty much was my idea - just have a few extension methods (same as Dapper has) for our own use using plain ADO.NET provider, without a proxy like Dapper.

You don't really want to use a hard sort order or anything like that. Positional parameters are similar to string.Format - they are just placeholders in whatever order you specify. You see the query, and you see how you pass the parameters (either to params NpgsqlParameter[] @params or just simple array). Easy to correlate $1, $2 etc with your parameter args.

The only problem with this (in my opinion) is probably type "inference"/conversion. I guess you still want to give the extension methods whatever CLR types you want, and then getting them translated to NpgsqlParameter automagically. For me, from the library point of view, that's unnecessary and I'm fine just making instances of the parameters manually, but you might disagree that it then is not even close to Dapper.

dmitry-vychikov commented 1 year ago

I like performance improvements, but we need to set goals at least. The fact is – you can not avoid allocations completely in .NET. Yes, allocations are not good because of gc, so you want to reduce their amount.

GC is cheap for Gen0 and Gen1 (relatively). If the strings are short-lived, you might not see much difference after eliminating the allocations.

If you have enough time and willingness, then consider the following ideas:

Ideas: 1) Measure performance, not only memory. For instance, create a test which deques 10,000 jobs. Measure number of gen 0,1,2 allocations. Measure execution time, see CPU profiler results. You might find out there are other things affecting performance more than SQL strings. 2) Try applying a fix for one place, see what order of magnitude difference you get.

At first glance, I do agree SQL strings worth spending time on, but I can't say anyone will see major difference.

Regarding the actual solution – I would go for ADO.NET and positional arguments to fix this completely.

The only problem with this (in my opinion) is probably type "inference"/conversion. I guess you still want to give the extension methods whatever CLR types you want, and then getting them translated to NpgsqlParameter automagically

Boxing arguments for parameters should not make big difference at all, while strings are more important because of their size (10s of bytes for typical objects vs. 100s of bytes for SQL strings)

azygis commented 1 year ago

I definitely agree that some sort of profiling is needed, and not just memory, as you say. I just want to avoid spending time if decomissioning Dapper is considered a big no-no. I'm honestly more than willing to play around with all that. Time is a different beast, but eventually I should find some.

Boxing arguments for parameters should not make big difference at all, while strings are more important because of their size (10s of bytes for typical objects vs. 100s of bytes for SQL strings)

Oh, I'm more about setting the correct NpgsqlDbType based on what you pass to the methods, after unboxing the values. Although we don't really use that many types in this lib, therefore it might just be fine handling only what's really needed.

Plasma commented 1 year ago

Apologies for my delayed contribution, I've had a look based on the above discussion and have a few thoughts:

In order to avoid the string concat/allocation due to the schemaName being inserted for each query, can we update the connection's Search Path / default schema to be the Hangfire schema, and drop the need to insert the schemaName into the query? This would mean we shouldn't then have to be adjusting the query each time (or cache it) and it would be a constant string.
Perhaps instead of dropping dapper, could we contribute support for positional parameters to Dapper (as mentioned in https://github.com/DapperLib/Dapper/issues/1782 ) and then take advantage of that from this project?

azygis commented 1 year ago

Regarding 1, I'm not sure how I feel about altering the connection string without "consent". Sure, it's just a search path, but it's still altering the connection string (or the options of the connection, even, if someone passes the connection instead of string) that the user passes. I'm more on the "don't want to" side on this.

Regarding 2: I'm not against it, sure, but I would not have time for it, since making the changes in Dapper warrants checking a bunch of other providers that do (or don't) support the positional parameters. If someone is up for it, it would be awesome, since this would require way less changes here.

hangfire-postgres / Hangfire.PostgreSql

Optimisation of SQL generation logic #278