Large IN filters cause trouble

beruic commented 6 years ago

I know that it seems there is probably more than one bug here, but please bear with me. This was all discovered during the same chain of events, ans aside from the number of elements, nothing has changed between requests. I suspect the solution to all issues might be the same, but hope that someone with more insight, might come through.

I am running some updates on a large amount of instances of a very simply model. The model is:

class Term(models.Model):
    term = TermField(max_length=64, primary_key=True)
    enabled = models.BooleanField(default=True)

    added = models.DateField(
        default=timezone.now,
        help_text=_('The time when this entry was added.')
    )
    verified = models.DateField(
        default=timezone.now,
        help_text=_('An entry is considered verified every time it is saved while enabled.'
                    ' This date will update automatically upon save.')
    )

I have a bunch of instances (about 127000) and sometimes I want to update the verified field on a bunch of them, so I do Term.objects.filter(term__in=term_set).update(verified=timezone.now()) where term_set is a set of terms (strings) to update.

Any time my term_set contains more than 2096 elements, I get an exception. I have verified that it is the number of elements that is the problem, by transposing the selection of the elements for the set, from a complete list of my terms.

The exception however varies a bit on the number of elements. At 2097 to 2099 elements I get:

Traceback (most recent call last):
  File "/home/jk/.virtualenvs/text_search_service/lib/python3.5/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/home/jk/.virtualenvs/text_search_service/lib/python3.5/site-packages/sql_server/pyodbc/base.py", line 548, in execute
    return self.cursor.execute(sql, params)
pyodbc.ProgrammingError: ('42000', '[42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]The incoming request has too many parameters. The server supports a maximum of 2100 parameters. Reduce the number of parameters and resend the request. (8003) (SQLExecDirectW)')

Now this suggests a solution. More on this later. At first I thought this was because I used an SQL Server 2017 Express, but then I tried on a database on our production server, which is 2012 Standard, with same results.

From 2100 to 32766 I get

Traceback (most recent call last):
  File "/home/jk/.virtualenvs/text_search_service/lib/python3.5/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/home/jk/.virtualenvs/text_search_service/lib/python3.5/site-packages/sql_server/pyodbc/base.py", line 548, in execute
    return self.cursor.execute(sql, params)
pyodbc.Error: ('07002', '[07002] [Microsoft][ODBC Driver 17 for SQL Server]COUNT field incorrect or syntax error (0) (SQLExecDirectW)')

And from 32767 and onwards I get

Traceback (most recent call last):
  File "/home/jk/.virtualenvs/text_search_service/lib/python3.5/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/home/jk/.virtualenvs/text_search_service/lib/python3.5/site-packages/sql_server/pyodbc/base.py", line 548, in execute
    return self.cursor.execute(sql, params)
pyodbc.ProgrammingError: ('The SQL contains -32768 parameter markers, but 32768 parameters were supplied', 'HY000')

This last one varies on the numbers. Here is for 50000 (Relevant line only)

pyodbc.ProgrammingError: ('The SQL contains -15535 parameter markers, but 50001 parameters were supplied', 'HY000')

and 100000

pyodbc.ProgrammingError: ('The SQL contains -31071 parameter markers, but 100001 parameters were supplied', 'HY000')

The N + 1 parameters fit with that the SQL that gets generated is in the form SET NOCOUNT OFF; UPDATE [core_givenname] SET [verified] = ? WHERE [core_givenname].[term] IN (?, ?, ..., ?).

It should be noted that all of these outputs have been shortened to the relevant parts, since all of them has a caused exception below the ones I have included. Here is the full output for 100000 elements:

Traceback (most recent call last):
  File "/home/jk/.virtualenvs/text_search_service/lib/python3.5/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/home/jk/.virtualenvs/text_search_service/lib/python3.5/site-packages/sql_server/pyodbc/base.py", line 548, in execute
    return self.cursor.execute(sql, params)
pyodbc.ProgrammingError: ('The SQL contains -31071 parameter markers, but 100001 parameters were supplied', 'HY000')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/jk/.virtualenvs/text_search_service/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2910, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-26-778f53769929>", line 7, in <module>
    GivenName.objects.filter(term__in=term_set).update(verified=timezone.now())
  File "/home/jk/.virtualenvs/text_search_service/lib/python3.5/site-packages/django/db/models/query.py", line 647, in update
    rows = query.get_compiler(self.db).execute_sql(CURSOR)
  File "/home/jk/.virtualenvs/text_search_service/lib/python3.5/site-packages/django/db/models/sql/compiler.py", line 1204, in execute_sql
    cursor = super(SQLUpdateCompiler, self).execute_sql(result_type)
  File "/home/jk/.virtualenvs/text_search_service/lib/python3.5/site-packages/django/db/models/sql/compiler.py", line 899, in execute_sql
    raise original_exception
  File "/home/jk/.virtualenvs/text_search_service/lib/python3.5/site-packages/django/db/models/sql/compiler.py", line 889, in execute_sql
    cursor.execute(sql, params)
  File "/home/jk/.virtualenvs/text_search_service/lib/python3.5/site-packages/django/db/backends/utils.py", line 79, in execute
    return super(CursorDebugWrapper, self).execute(sql, params)
  File "/home/jk/.virtualenvs/text_search_service/lib/python3.5/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/home/jk/.virtualenvs/text_search_service/lib/python3.5/site-packages/django/db/utils.py", line 94, in __exit__
    six.reraise(dj_exc_type, dj_exc_value, traceback)
  File "/home/jk/.virtualenvs/text_search_service/lib/python3.5/site-packages/django/utils/six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "/home/jk/.virtualenvs/text_search_service/lib/python3.5/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/home/jk/.virtualenvs/text_search_service/lib/python3.5/site-packages/sql_server/pyodbc/base.py", line 548, in execute
    return self.cursor.execute(sql, params)
django.db.utils.ProgrammingError: ('The SQL contains -31071 parameter markers, but 100001 parameters were supplied', 'HY000')

I use ODBC Driver 17 for SQL Server but I have tried with version 13.1 as well, with the same results. I have also confirmed the behaviour bug is present django-pyodbc-azure versions 2.0.1.0 with Django 2.0.2, but most of my testing and fixing effort has been done on django-pyodbc-azure 1.11.9.0 and Django 1.11.10, as those are the versions I primarily use. I have briefly touched FreeTDS, which in version 0.9.1 (Ubuntu default) seems to raise the number for how many elements that can successfully go through, but it was bothersome getting to work, and raise other issues that could not easily be fixed, since updating to a newer version of FreeTDS is out of scope for me at the moment.

One thing I did try to do to fix this, was to apply the parameters to the SQL directly in python, and only send the SQL, which actually seems to work reasonably, but at some number of elements, your query run out of internal memory. Besides, I'm pretty sure my modifications are bug-prone - at least under Python 2 (Which Django 1.11 still supports, but django-pyodbc-azure could of course drop support there anyway). In any case, my modified code for CursorWrapper in base.py is here:

class CursorWrapper(object):
    ...
    def format_params(self, params):
        fp = []
        if params is not None:
            for p in params:
                if isinstance(p, text_type):
                    p = p.replace('\'', '\'\'')  # Escape existing quotes
                    p = '\'' + p + '\''  # Enclose in quotes
                    if self.driver_charset:
                        # FreeTDS (and other ODBC drivers?) doesn't support Unicode
                        # yet, so we need to encode parameters in utf-8
                        fp.append(smart_str(p, self.driver_charset))
                    else:
                        fp.append(p)

                elif isinstance(p, binary_type):
                    fp.append(p)

                elif isinstance(p, type(True)):
                    if p:
                        fp.append(1)
                    else:
                        fp.append(0)

                else:
                    fp.append(p)

        return tuple(fp)

    def execute(self, sql, params=None):
        self.last_sql = sql
        # sql = self.format_sql(sql, params)
        params = self.format_params(params)
        sql = sql % params
        self.last_params = params
        try:
            return self.cursor.execute(sql)
        except Database.Error as e:
            self.connection._on_error(e)
            raise
    ...

Now back to the seeming limit of 2100 parameters. If this is really the culprit (and it is indicated from here, here, and here), I guess we have to break up the updates in multiple executions, by compiling a chunk sized SQL for most of the updates, and send it to CursorWrapper.executemany, and the rest to CursorWrapper.execute, and then sum the counts for updates. I cannot imagine this having super performance, compared to doing it in one go, but it probably beats running out of memory.

This solution involves fixing the SQL compiler, and I am not very much into that. I'd wish there was just a place you could tell it what the max parameter count was, to the default compiler implementation have it automatically split it up. I do not think this affect other areas than IN clauses, since I have not had any problems updating otherwise, and bulk inserting doesn't use parameters (and I have tried with over 200000 elements, which is why I wondered a lot over this issue until I figured this out).

A future optimization of the above, might involve sending the SQL alone, and in the SQL prepare the terms as a temporary in-memory table, but let's make it work first, and then optimize later.

Sorry if I have mistakes in my writing here. It's very late, and I am off to bed now :)

beruic commented 6 years ago

After a good night of sleep, I have been giving this some more thought.

I think the right solution to apply to this issue now, is to simply raise a meaningful exception when an in clause has more than 2000 items. This is by far the easiest solution for now, until something further is done to mitigate this, and then it is trivial to split up ones own code.

I still find it feasible to split up the request on the IN parameters, but this is hard, because other split-ups must be taken into consideration. Alternatively the parameters can be typed out for each chunk, as I did in my modified code above. this will make it much easier to handle other clauses, but will make it impossible to use executemany, and we will lose the potential optimizations from this one.

A third solution might be to split up the IN clause in the same SQL statement, as well as typing it out. I will try this out before I need to move on.

I still find it strange that this is an issue for MSSQL, and not any of the other database backends I am using.

beruic commented 6 years ago

Ok, I have applied my above hack for typing out the parameters, as well as added DatabaseOperations.max_in_list_size.

class DatabaseOperations(BaseDatabaseOperations):
    ...
    def max_in_list_size(self):
        return 2000
    ....

Defining DatabaseOperations.max_in_list_size to 2000 does that for every 2000 parameters a new IN clause will be OR'ed on the query, like IN (?, ?, ...[1997]..., ? OR IN ?, ?, ...[1997]..., ? OR IN ...).

Unfortunately this makes no difference, and the hack for typing out the parameters seems the most efficient. It runs rather fast too, but fails on more than 32314 elements on both my servers (2017 Express and 2012 Standard), so I suspect the limit to be a default setting in SQL Server, since memory, does not spike during the operation. The exception it throws on 32315 or more elements is:

Traceback (most recent call last):
  File "/home/jk/.virtualenvs/text_search_service/lib/python3.5/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/home/jk/.virtualenvs/text_search_service/lib/python3.5/site-packages/sql_server/pyodbc/base.py", line 551, in execute
    return self.cursor.execute(sql)
pyodbc.ProgrammingError: ('42000', '[42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]The query processor ran out of internal resources and could not produce a query plan. This is a rare event and only expected for extremely complex queries or queries that reference a very large number of tables or partitions. Please simplify the query. If you believe you have received this message in error, contact Customer Support Services for more information. (8623) (SQLExecDirectW)')

About the 2000 limit in my previous comment, I guess that is hacky as well, and cannot be trusted. I suspect however, that the first exception of the original three, that ends off with

pyodbc.ProgrammingError: ('42000', '[42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]The incoming request has too many parameters. The server supports a maximum of 2100 parameters. Reduce the number of parameters and resend the request. (8003) (SQLExecDirectW)')

Is the correct error message in any case for this issue, and work needs to be done to figure out why the others arise. For now I am splitting up the update in my own code, until we figure out how to fix this in django-pyodbc-azure. Personally I think the best approach is to figure out how to split up execution to executemany can be utilized.

michiya commented 6 years ago

Seems to be the same as #101.

charettes commented 5 years ago

I'm not sure how SQL Server treats query params but .features.max_query_params probably needs to be set to 2000 as well.

beruic commented 5 years ago

Can you elaborate @charettes, for those of us who are not completely sure what you refer to?

charettes commented 5 years ago

@beruic I'm referring to this flag

https://github.com/django/django/blob/41e73de39df31c4b13d65462bfeedde6924226d8/django/db/backends/base/features.py#L91-L92

scnerd commented 5 years ago

I've tried tweaking that flag, both as a hot-patch at the beginning of my code:

from sql_server.pyodbc import features
features.DatabaseFeatures.max_query_params = 2000

And by just appending that property to the end of the file:

# cat /opt/conda/lib/python3.6/site-packages/sql_server/pyodbc/features.py
from django.db.backends.base.features import BaseDatabaseFeatures

class DatabaseFeatures(BaseDatabaseFeatures):
    # ....
    max_query_params = 2000

And I'm still getting the errors described above:

ProgrammingError: ('The SQL contains 13095 parameter markers, but 144167 parameters were supplied', 'HY000')

charettes commented 5 years ago

Ok so here's the deal. From my understanding there's no way to perform a IN (id0, ..., idN) in a single query on SQL Server (or any backend with such limitations) where N > 2000. You'll have to either tweak your servers configuration if you want to allow more or live with it; the driver simply doesn't allow you to pass more than N parameters so there's no way idN+1 will make it through.

Now for @scnerd issue regarding the usage of prefetch_related which relies on IN queries under the hood. There's a Django ticket to make prefetch_related take max_query_params into consideration with an unmerged/incomplete patch. If this was to be adjusted and merged and max_query_params was appropriately defined for this backend then prefetch_related would work just fine. Until these two issues are addressed it will keep crashing.

etiennepouliot commented 4 years ago

See my comments for a solution : https://github.com/ESSolutions/django-mssql-backend/issues/9

techknowfile commented 3 years ago

See my comments for a solution : ESSolutions#9

The solution given here still doesn't work for prefetch_related. A complete solution has been implemented on the dev branch of the Microsoft fork. See my issue on that repo for details.

The solution uses the Microsoft recommended solution to large parameter lists by creating a TEMP TABLE and joining over that.

michiya / django-pyodbc-azure

Large IN filters cause trouble #143