Open beruic opened 6 years ago
After a good night of sleep, I have been giving this some more thought.
I think the right solution to apply to this issue now is to simply raise a meaningful exception when an IN clause has more than 2000 items. This is by far the easiest solution for now, until something further is done to mitigate the problem, and with a clear exception it becomes trivial to split up one's own code.
I still find it feasible to split up the request on the IN parameters, but this is hard, because other split-ups must be taken into consideration.
Alternatively, the parameters can be typed out for each chunk, as I did in my modified code above. This will make it much easier to handle other clauses, but will make it impossible to use executemany, and we will lose the potential optimizations from it.
A third solution might be to split up the IN clause within the same SQL statement, as well as typing it out. I will try this out before I need to move on.
I still find it strange that this is an issue for MSSQL, and not any of the other database backends I am using.
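As an illustration of the first option, a guard like the following could raise a clear error before the driver fails (the constant and function names are purely illustrative and do not exist in the backend):

MAX_IN_LIST_SIZE = 2000  # illustrative limit, just under SQL Server's 2100-parameter cap

def check_in_list_size(params):
    # Fail early with an explanatory message instead of the driver's cryptic one.
    if len(params) > MAX_IN_LIST_SIZE:
        raise ValueError(
            "IN clause has %d items, but the server supports at most 2100 query "
            "parameters per statement; split the lookup into chunks of at most "
            "%d items." % (len(params), MAX_IN_LIST_SIZE)
        )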
Ok, I have applied my above hack for typing out the parameters, as well as added DatabaseOperations.max_in_list_size.
class DatabaseOperations(BaseDatabaseOperations):
    ...

    def max_in_list_size(self):
        return 2000

    ...
Defining DatabaseOperations.max_in_list_size as 2000 means that for every 2000 parameters a new IN clause is OR'ed onto the query, like IN (?, ?, ...[1997]..., ?) OR ... IN (?, ?, ...[1997]..., ?).
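To make that concrete, here is a standalone illustration of the kind of WHERE fragment such splitting produces (this is not the backend's or Django's actual compiler code, just the shape of the output):

def split_in_clause(column, n_params, max_in_list_size=2000):
    # Build "col IN (?, ...) OR col IN (?, ...) OR ..." with at most
    # max_in_list_size placeholders per IN group.
    groups = []
    remaining = n_params
    while remaining > 0:
        take = min(remaining, max_in_list_size)
        groups.append("%s IN (%s)" % (column, ", ".join(["?"] * take)))
        remaining -= take
    return "(" + " OR ".join(groups) + ")"

# split_in_clause("[term]", 5, max_in_list_size=2)
# -> '([term] IN (?, ?) OR [term] IN (?, ?) OR [term] IN (?))'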
Unfortunately this makes no difference, and the hack for typing out the parameters seems the most efficient. It runs rather fast too, but fails on more than 32314 elements on both my servers (2017 Express and 2012 Standard), so I suspect the limit is a default setting in SQL Server, since memory does not spike during the operation. The exception it throws on 32315 or more elements is:
Traceback (most recent call last):
  File "/home/jk/.virtualenvs/text_search_service/lib/python3.5/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/home/jk/.virtualenvs/text_search_service/lib/python3.5/site-packages/sql_server/pyodbc/base.py", line 551, in execute
    return self.cursor.execute(sql)
pyodbc.ProgrammingError: ('42000', '[42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]The query processor ran out of internal resources and could not produce a query plan. This is a rare event and only expected for extremely complex queries or queries that reference a very large number of tables or partitions. Please simplify the query. If you believe you have received this message in error, contact Customer Support Services for more information. (8623) (SQLExecDirectW)')
About the 2000 limit in my previous comment, I guess that is hacky as well and cannot be trusted. I suspect, however, that the first exception of the original three, the one that ends with
pyodbc.ProgrammingError: ('42000', '[42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]The incoming request has too many parameters. The server supports a maximum of 2100 parameters. Reduce the number of parameters and resend the request. (8003) (SQLExecDirectW)')
is the correct error message for this issue in any case, and work needs to be done to figure out why the others arise. For now I am splitting up the update in my own code, until we figure out how to fix this in django-pyodbc-azure. Personally I think the best approach is to figure out how to split up execution so that executemany can be utilized.
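For reference, "splitting up the update in my own code" amounts to something like the following (a minimal sketch; Term is the model from the original issue, and the chunk size is just an arbitrary value that stays under the 2100-parameter limit):

from django.utils import timezone

CHUNK_SIZE = 2000  # arbitrary, but safely below the 2100-parameter limit

def chunked_update(term_set):
    # Term is the model described in the original issue; one UPDATE is issued
    # per chunk and the affected row counts are summed.
    terms = list(term_set)
    updated = 0
    for i in range(0, len(terms), CHUNK_SIZE):
        chunk = terms[i:i + CHUNK_SIZE]
        updated += Term.objects.filter(term__in=chunk).update(verified=timezone.now())
    return updated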
Seems to be the same as #101.
I'm not sure how SQL Server treats query params but .features.max_query_params probably needs to be set to 2000 as well.
Can you elaborate @charettes, for those of us who are not completely sure what you refer to?
@beruic I'm referring to this flag
I've tried tweaking that flag, both as a hot-patch at the beginning of my code:
from sql_server.pyodbc import features
features.DatabaseFeatures.max_query_params = 2000
And by just appending that property to the end of the file:
# cat /opt/conda/lib/python3.6/site-packages/sql_server/pyodbc/features.py
from django.db.backends.base.features import BaseDatabaseFeatures

class DatabaseFeatures(BaseDatabaseFeatures):
    # ....
    max_query_params = 2000
And I'm still getting the errors described above:
ProgrammingError: ('The SQL contains 13095 parameter markers, but 144167 parameters were supplied', 'HY000')
Ok so here's the deal. From my understanding there's no way to perform an IN (id0, ..., idN) lookup in a single query on SQL Server (or any backend with such limitations) where N > 2000. You'll have to either tweak your server's configuration if you want to allow more, or live with it; the driver simply doesn't allow you to pass more than N parameters, so there's no way idN+1 will make it through.
Now for @scnerd's issue regarding the usage of prefetch_related, which relies on IN queries under the hood. There's a Django ticket to make prefetch_related take max_query_params into consideration, with an unmerged/incomplete patch. If that patch were adjusted and merged, and max_query_params was appropriately defined for this backend, then prefetch_related would work just fine. Until these two issues are addressed it will keep crashing.
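Until both pieces land, one user-side stop-gap is to run the offending IN query in chunks manually, along these lines (helper name, model, and chunk size are all illustrative):

def filter_in_chunks(queryset, field, values, chunk_size=2000):
    # Issue one query per chunk so no single statement exceeds the parameter limit.
    values = list(values)
    for i in range(0, len(values), chunk_size):
        yield from queryset.filter(**{field + "__in": values[i:i + chunk_size]})

# e.g. books = list(filter_in_chunks(Book.objects.all(), "author_id", author_ids))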
See my comments for a solution: https://github.com/ESSolutions/django-mssql-backend/issues/9
The solution given here still doesn't work for prefetch_related. A complete solution has been implemented on the dev branch of the Microsoft fork. See my issue on that repo for details.
The solution uses Microsoft's recommended approach to large parameter lists: creating a TEMP TABLE and joining against it.
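As a rough illustration of that approach with plain pyodbc (table and column names are guesses based on this issue's model, not the fork's actual implementation):

def update_verified_via_temp_table(conn, terms, timestamp):
    # conn is an open pyodbc connection.
    cursor = conn.cursor()
    # Stage the large value list in a temp table instead of a huge IN (...) list.
    cursor.execute("CREATE TABLE #terms (term NVARCHAR(255) PRIMARY KEY)")
    cursor.executemany("INSERT INTO #terms (term) VALUES (?)", [(t,) for t in terms])
    # Join against the temp table; only a single real parameter is sent.
    cursor.execute(
        "UPDATE core_givenname SET verified = ? "
        "WHERE term IN (SELECT term FROM #terms)",
        timestamp,
    )
    cursor.execute("DROP TABLE #terms")
    conn.commit()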
I know that it seems there is probably more than one bug here, but please bear with me. This was all discovered during the same chain of events, and aside from the number of elements, nothing has changed between requests. I suspect the solution to all issues might be the same, but hope that someone with more insight might come through.
I am running some updates on a large number of instances of a very simple model. The model is:
I have a bunch of instances (about 127000) and sometimes I want to update the verified field on a bunch of them, so I do Term.objects.filter(term__in=term_set).update(verified=timezone.now()) where term_set is a set of terms (strings) to update.
Any time my term_set contains more than 2096 elements, I get an exception. I have verified that it is the number of elements that is the problem, by shifting which elements I select for the set from a complete list of my terms.
The exception, however, varies a bit with the number of elements. At 2097 to 2099 elements I get:
Now this suggests a solution. More on this later. At first I thought this was because I was using SQL Server 2017 Express, but then I tried on a database on our production server, which is 2012 Standard, with the same results.
From 2100 to 32766 I get
And from 32767 and onwards I get
This last one varies with the numbers. Here is for 50000 (relevant line only)
and 100000
The N + 1 parameters fit with the fact that the generated SQL is of the form SET NOCOUNT OFF; UPDATE [core_givenname] SET [verified] = ? WHERE [core_givenname].[term] IN (?, ?, ..., ?).
It should be noted that all of these outputs have been shortened to the relevant parts, since each of them has a caused-by exception below the part I have included. Here is the full output for 100000 elements:
I use ODBC Driver 17 for SQL Server, but I have tried with version 13.1 as well, with the same results. I have also confirmed the bug is present in django-pyodbc-azure version 2.0.1.0 with Django 2.0.2, but most of my testing and fixing effort has been done on django-pyodbc-azure 1.11.9.0 and Django 1.11.10, as those are the versions I primarily use. I have briefly touched FreeTDS, which in version 0.9.1 (the Ubuntu default) seems to raise the limit on how many elements can successfully go through, but it was bothersome to get working and raised other issues that could not easily be fixed, since updating to a newer version of FreeTDS is out of scope for me at the moment.
One thing I did try in order to fix this was to apply the parameters to the SQL directly in Python and only send the SQL, which actually seems to work reasonably well, but at some number of elements the query runs out of internal memory. Besides, I'm pretty sure my modifications are bug-prone, at least under Python 2 (which Django 1.11 still supports, but django-pyodbc-azure could of course drop support there anyway). In any case, my modified code for CursorWrapper in base.py is here:
Now back to the seeming limit of 2100 parameters. If this is really the culprit (and it is indicated from here, here, and here), I guess we have to break the updates up into multiple executions, by compiling a chunk-sized SQL statement for most of the updates and sending it to CursorWrapper.executemany, sending the rest to CursorWrapper.execute, and then summing the update counts. I cannot imagine this having great performance compared to doing it in one go, but it probably beats running out of memory.
This solution involves fixing the SQL compiler, and I am not very much into that. I wish there were just a place where you could tell the default compiler implementation what the maximum parameter count is and have it automatically split things up. I do not think this affects other areas than IN clauses, since I have not had any problems updating otherwise, and bulk inserting doesn't use parameters (and I have tried with over 200000 elements, which is why I wondered a lot about this issue until I figured that out).
A future optimization of the above might involve sending the SQL alone and, in the SQL, preparing the terms as a temporary in-memory table, but let's make it work first and optimize later.
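A loose sketch of that executemany/execute split at the cursor level (chunk size, helper name, and the fact that this lives in user code rather than the SQL compiler are all simplifications of the idea above):

CHUNK = 2000  # placeholders per IN list; plus one for SET keeps us under 2100

def chunked_update(cursor, verified_value, terms):
    terms = list(terms)
    sql_template = ("SET NOCOUNT OFF; UPDATE [core_givenname] SET [verified] = ? "
                    "WHERE [core_givenname].[term] IN (%s)")
    full_chunks = len(terms) // CHUNK
    if full_chunks:
        # Full-sized chunks share one SQL string, so executemany can be used.
        sql = sql_template % ", ".join(["?"] * CHUNK)
        param_rows = [[verified_value] + terms[i * CHUNK:(i + 1) * CHUNK]
                      for i in range(full_chunks)]
        cursor.executemany(sql, param_rows)
    rest = terms[full_chunks * CHUNK:]
    if rest:
        # The leftover chunk has a different placeholder count, so plain execute.
        sql = sql_template % ", ".join(["?"] * len(rest))
        cursor.execute(sql, [verified_value] + rest)
    # Summing the update counts, as described above, would still need per-chunk
    # rowcount handling, which executemany makes awkward.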
Sorry if I have mistakes in my writing here. It's very late, and I am off to bed now :)