microsoft / mssql-django

The Microsoft Django backend for SQL Server provides a connectivity layer for Django on SQL Server or Azure SQL DB.

Problems with the 2100 parameter limit #156

Open marcperrinoptel opened 2 years ago

marcperrinoptel commented 2 years ago

That SQL Server limitation is painful indeed; the current workaround (the temporary table `#Temp_params`, `max_in_list_size`, and `split_parameter_list_as_sql`) breaks very easily with simple test cases.
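
For context, the workaround path is only taken per-`IN`-clause: here is an excerpt-style paraphrase of the dispatch in Django's `In.as_sql` (a sketch, not verbatim source):

```python
# Sketch of the check in Django's In lookup: the backend-specific
# split_parameter_list_as_sql only runs when a single literal list
# exceeds max_in_list_size; nothing counts parameters query-wide.
def as_sql(self, compiler, connection):
    max_in_list_size = connection.ops.max_in_list_size()
    if (
        self.rhs_is_direct_value()
        and max_in_list_size
        and len(self.rhs) > max_in_list_size
    ):
        return self.split_parameter_list_as_sql(compiler, connection)
    return super().as_sql(compiler, connection)
```

Each list is checked in isolation, which matters for test case 2 below.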

Here are 2 examples:

- **Test case 1** (where both `range_a` and `range_b` are long enough to exceed `max_in_list_size`)

```python
my_model.filter(a__in=range_a).filter(b__in=range_b)
```

The behavior here is that:

- the temporary table `#Temp_params` is created > filled with `range_a` > deleted > filled with `range_b`
- and the final query looks like:

```sql
SELECT (...) FROM (...) WHERE ([a] IN (SELECT params FROM #Temp_params) AND [b] IN (SELECT params FROM #Temp_params))
```

which is completely wrong, as `range_a` is entirely "lost"...

- **Test case 2**

Same query but this time:

```python
range_a = list(range(1100))
range_b = list(range(2000, 3100))
```


The behavior here is that neither `range_a` nor `range_b` is big enough on its own to exceed `max_in_list_size` (2048, a reasonable value below the 2100 limit), but together they do produce a query with more than 2100 parameters (1100 + 1100 = 2200).
i.e. `#Temp_params` is not involved at all, but we get an error (#07002).

- **Some thoughts**

The thing is, this `max_in_list_size` mechanism was implemented by Django with the Oracle case in mind (`max_in_list_size` = 1000, and the standard Django implementation of `split_parameter_list_as_sql` is used), where the limitation is somewhat nicer than in our SQL Server case (cf. https://stackoverflow.com/a/17844383/16344677 for example: the IN clauses just need to be split using OR + IN; see the sketch below).

It does not fit our SQL Server limitation well.
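
For illustration, a minimal sketch of that Oracle-style split (a hypothetical helper, not Django's actual implementation):

```python
def split_in_clause(column_sql, params, chunk_size=1000):
    # Oracle caps each IN list at 1000 expressions, so one long list is
    # split into several IN lists of at most chunk_size items, joined by OR.
    chunks = [params[i:i + chunk_size] for i in range(0, len(params), chunk_size)]
    sql = " OR ".join(
        "%s IN (%s)" % (column_sql, ", ".join(["%s"] * len(chunk)))
        for chunk in chunks
    )
    return "(" + sql + ")", list(params)
```

Note that the split leaves the total number of parameters unchanged, which is exactly why the same trick does nothing against SQL Server's 2100-parameter cap.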

Also note that, even in more favorable test cases than the ones above, I observed that the creation of the temp table / insertion into it can happen an unnecessary number of times; I also assume there are concurrency problems with this temp table.

At this point, I don't really have an idea for a fix, so I'm not _certain_ that we _cannot_ fit something into the current framework, but my guess is that at best it would be tough and ugly, and most probably it's outright impossible.

I guess we should devise a better mechanism, one that ideally fits both our limitation and the simpler Oracle case, suggest it to Django, and hope they don't reply that, from their POV, everything's fine as it is, since they don't support SQL Server...

jmah8 commented 2 years ago

Thanks for raising this issue and giving a detailed explanation. We are discussing this issue internally.

jmah8 commented 2 years ago

We agree with what you suggested and are planning to do some investigation. But if nothing comes of the investigation, we may just improve our current fix to at least address the issues you raised, as we don't encourage people to use too many parameters.

marcperrinoptel commented 2 years ago

Yeah, as far as I know, the main scenario this mechanism was designed for (prefetch) does not trigger those problems (though concurrency could still be an issue, in case insertion into + reading from the temp table is not atomic); and if you do run into those problems, you probably could have done better anyway (e.g. a subquery instead of evaluating the list). Still, it's a gap compared with other backends, and interesting to investigate.

federicoemartinez commented 7 months ago

This issue looks very ugly. Can we create a different temp table every time we are called? Something like:

```python
import uuid


def mssql_split_parameter_list_as_sql(self, compiler, connection):
    # Insert IN clause parameters 1000 at a time into a uniquely named temp table.
    lhs, _ = self.process_lhs(compiler, connection)
    _, rhs_params = self.batch_process_rhs(compiler, connection)
    # uuid4 makes the temp table name unique per call, avoiding the
    # cross-clause collisions described above.
    table_name = "#Temp_params" + uuid.uuid4().hex
    with connection.cursor() as cursor:
        parameter_data_type = self.lhs.field.db_type(connection)
        temp_table_collation = 'COLLATE DATABASE_DEFAULT' if 'char' in parameter_data_type else ''
        cursor.execute(f"CREATE TABLE {table_name} (params {parameter_data_type} {temp_table_collation})")
        for offset in range(0, len(rhs_params), 1000):
            sqls_params = rhs_params[offset: offset + 1000]
            cursor.execute(f"INSERT INTO {table_name} VALUES " + ",".join("(%s)" for _ in sqls_params), sqls_params)

    in_clause = f'{lhs} IN (SELECT params FROM {table_name})'
    return in_clause, ()
```

Aren't those tables dropped automatically?

federicoemartinez commented 6 months ago

@jmah8

Could we use table-valued parameters to handle this?

If we have a long list of ints, we can create a table type:


```sql
CREATE TYPE django_mssql_int_list AS TABLE
(
    [Id] INT NOT NULL
);
```

and then our query would be

```python
in_clause = lhs + ' IN ' + '(SELECT id from %s)'
# I'm not sure if this is how you pass a custom type table value
return in_clause, (tuple(["django_mssql_int_list"] + [[x] for x in rhs_params]),)
```
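
For reference, the server-side shape this would need to produce is roughly the following (a sketch in plain T-SQL wrapped in a Python constant; it assumes the `django_mssql_int_list` type above exists, and a real driver would bind the rows as one table-valued parameter rather than the literal INSERT shown):

```python
# Rough server-side shape of the TVP idea; table and column names are
# placeholders. The whole list travels as a single parameter, so the
# 2100 cap is never approached.
TVP_QUERY_SHAPE = """
DECLARE @ids django_mssql_int_list;
INSERT INTO @ids ([Id]) VALUES (1), (2), (3);
SELECT * FROM my_table WHERE [a] IN (SELECT [Id] FROM @ids);
"""
```
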
jmah8 commented 5 months ago

@federicoemartinez

I haven't worked on this project for a long time. I recommend checking with someone currently working on the project as they would be a better person to ask.

bobince commented 5 months ago

We just independently discovered this (a query with two different long `__in` filters applying the same param list to both). IMO this is a really serious problem, and quietly doing the wrong thing is much worse than exploding due to the 2100 limit. (One of those `IN` lists could very well be a row-permissions check, for example...)

Maybe having randomly-named temp tables is a plausible hack, but honestly the whole `#Temp_params` mechanism feels pretty doomed to me. It generally feels pretty bad to be doing database access inside `as_sql`, so that we're hitting the server every time we prepare the query string (even just for logging purposes, etc.).

`split_parameter_list_as_sql` was intended to solve a different, IN-specific problem in another DBMS, not to be a general workaround for parameter limits. Limits also affect other DBMSes and are not worked around in their backends; I think if this is to be tackled, it should be done consistently for all backends by Django, like it was for e.g. `bulk_create` (see the sketch below).
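
For comparison, Django already sizes `bulk_create` batches from the backend's limits (via the backend's `DatabaseOperations.bulk_batch_size()`); a minimal sketch of that batching idea, with illustrative names rather than Django's actual API:

```python
# Hypothetical helper in the spirit of bulk_create's batching: derive a
# batch size from the per-statement parameter cap and split the work so
# each statement stays under it.
MAX_QUERY_PARAMS = 2100  # SQL Server's documented per-statement limit

def param_batches(rows, params_per_row):
    batch_size = max(1, MAX_QUERY_PARAMS // params_per_row)
    for offset in range(0, len(rows), batch_size):
        yield rows[offset:offset + batch_size]
```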

gagtomm1 commented 3 months ago

I haven't investigated the performance implications, but I've noticed that Entity Framework uses OPENJSON subqueries to pass parameter lists, similar to what is described here: https://stackoverflow.com/a/60705570

Following that approach, the query from the first post could be made to use only 2 parameters and look something like:

```sql
SELECT (...)
FROM (...)
WHERE (
    [a] IN (
        SELECT [f0].[value]
        FROM OPENJSON(@param1) WITH ([value] int '$') AS [f0])
    AND
    [b] IN (
        SELECT [f1].[value]
        FROM OPENJSON(@param2) WITH ([value] int '$') AS [f1])
)
```
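
A minimal sketch of how this could be driven from Python (the table and column names here are placeholders, not code from the backend):

```python
import json

# The whole value list is serialized into ONE string parameter, so an
# arbitrarily long IN list consumes a single slot of the 2100 available.
range_a = list(range(1100))

sql = (
    "SELECT * FROM my_table "
    "WHERE [a] IN (SELECT [value] FROM OPENJSON(%s) WITH ([value] int '$'))"
)
params = [json.dumps(range_a)]
# cursor.execute(sql, params)  # OPENJSON requires SQL Server 2016+
```
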
bobince commented 3 months ago

That's an interesting approach that seems less immediately doomed.

There'd obviously be some restrictions, though, on what the types of the parameters can be if they have to go through JSON.

(Also it's only available from SQL Server 2016.)