googleapis / python-spanner-django

Cloud Spanner database backend for Django
BSD 3-Clause "New" or "Revised" License

New connection created on each request #739

Open antonstenaxel opened 2 years ago

antonstenaxel commented 2 years ago

Description

I am seeing very long response times running my Django app with the Spanner backend. For a simple indexed query that takes around 10 ms in the Spanner query editor, I am seeing around 2.5-3 seconds running the app locally. When running in production, or on a Compute Engine instance in the same region as the Spanner instance, these times go down a bit but still stay around 800 ms. For reference, the equivalent application using a MySQL instance running in Google Cloud SQL in the same region takes around 100 ms for the same requests.

As far as I can tell, this seems to be due to Django creating a new connection to Spanner for every request, which appears to be an expensive operation that introduces significant overhead. In Python scripts where I am able to keep the connection alive, only the first connection takes this long, and subsequent queries are much faster.
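As a rough illustration of that connection-reuse effect, here is a minimal sketch that times repeated queries over a single long-lived DB-API connection (the instance, database, and project ids are placeholders):

import time

from google.cloud.spanner_dbapi import connect

# Placeholder ids; substitute your own instance/database/project.
conn = connect("my-instance", "my-database", project="my-project")
conn.autocommit = True  # avoid leaving a read/write transaction open

for i in range(3):
    start = time.time()
    cur = conn.cursor()
    cur.execute("SELECT 1")
    cur.fetchall()
    # Only the first query pays the connection/session setup cost;
    # later ones re-use the same connection and are much faster.
    print(f"query {i}: {(time.time() - start) * 1000:.1f} ms")

conn.close()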

Environment details

Steps to reproduce

  1. Set up a minimal Django app with the Django REST framework, and create a Spanner instance as outlined in the README
  2. Monkey-patch some functions to track new database connections:
from google.cloud.spanner_v1.database import Database
from django_spanner.base import DatabaseWrapper
import time

# Log every time a new Database object is constructed.
old_init = Database.__init__
def new_init(self, *args, **kwargs):
    print("CREATING NEW DATABASE")
    return old_init(self, *args, **kwargs)

Database.__init__ = new_init

# Time every call that opens a new backend connection.
old_get_new_connection = DatabaseWrapper.get_new_connection
def new_get_new_connection(*args, **kwargs):
    start = time.time()
    result = old_get_new_connection(*args, **kwargs)
    elapsed = time.time() - start
    print("CALLING DatabaseWrapper.get_new_connection, took", elapsed * 1000, "ms")
    return result

DatabaseWrapper.get_new_connection = new_get_new_connection
  3. Run the server locally using python manage.py runserver
  4. Call an endpoint and observe output like the following:
    December 03, 2021 - 13:40:32
    Django version 3.2
    Starting development server at http://127.0.0.1:8000/
    Quit the server with CONTROL-C.
    [03/Dec/2021 13:40:34] "GET /articles?store_id=1000&uid=2317237300000 HTTP/1.1" 301 0
    CREATING NEW DATABASE
    CALLING DatabaseWrapper.get_new_connection, took 717.8 ms
    [03/Dec/2021 13:40:36] "GET /articles/?store_id=1000&uid=2317237300000 HTTP/1.1" 200 318
    [03/Dec/2021 13:40:59] "GET /articles?store_id=1000&uid=2317237300000 HTTP/1.1" 301 0
    CREATING NEW DATABASE
    CALLING DatabaseWrapper.get_new_connection, took 670.035 ms
    [03/Dec/2021 13:41:02] "GET /articles/?store_id=1000&uid=2317237300000 HTTP/1.1" 200 318

Additional info

For endpoints which hit the database multiple times (for instance, first for authentication and then for the data retrieval), it is only the first query which seems to suffer the overhead of creating the new sessions/connections.

I've tried reproducing this in a Python script using the following:

import time

from apps.articles.views import ArticlesView
from django.test import RequestFactory

factory = RequestFactory()

def time_view(view, request):
    # Time a single pass through the view, including any
    # connection setup it triggers.
    start = time.time()
    response = view(request)
    elapsed = time.time() - start
    print("Took ", elapsed * 1000, "ms")

request = factory.get('/articles?store_id=1000&uid=2317237300000')
view = ArticlesView.as_view()

time_view(view, request)
time_view(view, request)

Which outputs:

=========
CREATING NEW DATABASE
CALLING DatabaseWrapper.get_new_connection, took 1009.135 ms
Took  2729.355 ms
=========
Took  129.973 ms

So response times seem reasonable as long as a connection can be reused, but creating a new one is very expensive, pushing response times up to around 2-3 seconds.

rajjagesar commented 2 years ago

Hey antonstenaxel, I am curious whether the CONN_MAX_AGE setting (https://docs.djangoproject.com/en/3.2/ref/settings/#std:setting-CONN_MAX_AGE) has any influence on this behavior. Could the problem perhaps be remedied with this setting?

Note the following caveat from the documentation (https://docs.djangoproject.com/en/3.2/ref/databases/): The development server creates a new thread for each request it handles, negating the effect of persistent connections. Don’t enable them during development. Not sure if it applies here.
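For reference, a minimal sketch of what that could look like in settings.py, following the DATABASES layout from this project's README (the ids are placeholders and the 600-second value is arbitrary):

# settings.py (sketch; ids and the CONN_MAX_AGE value are illustrative)
DATABASES = {
    'default': {
        'ENGINE': 'django_spanner',
        'PROJECT': 'my-project',
        'INSTANCE': 'my-instance',
        'NAME': 'my-database',
        # Keep the connection open for up to 10 minutes instead of
        # closing it after every request (the default is 0).
        # None keeps connections open indefinitely.
        'CONN_MAX_AGE': 600,
    }
}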

antonstenaxel commented 2 years ago

Thanks for the reply!

I had tried CONN_MAX_AGE without success, but it appears I had missed the caveat you mentioned, so I tried deploying the app to Google App Engine, and it reduced the response time from the 3 seconds on my local machine to about 250 ms, which obviously is a lot better.

However, when running the same query through the standalone Django script shown in the original post, I get 100-130 ms response times, which is also what I get when running with the Cloud SQL MySQL backend, so there still seems to be an overhead of 100-150 ms for some reason. Profiling the request shows that these functions take up much of that time:

         501459 function calls (501184 primitive calls) in 2.816 seconds

   Ordered by: internal time
   List reduced from 1012 to 100 due to restriction <100>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    1.132    0.566    1.132    0.566 {method 'read' of '_io.BufferedReader' objects}
        1    1.114    1.114    1.115    1.115 {method 'next_event' of 'grpc._cython.cygrpc.SegregatedCall' objects}
    27937    0.190    0.000    0.190    0.000 {built-in method posix.lstat}
       17    0.100    0.006    0.100    0.006 {method 'acquire' of '_thread.lock' objects}
    27977    0.039    0.000    0.063    0.000 posixpath.py:71(join)
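For reference, a profile like the one above can be produced with something along these lines (a sketch; view and request are the objects from the earlier snippet):

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
view(request)  # view/request from the earlier RequestFactory snippet
profiler.disable()

# Sort by internal time and show the top 100 entries,
# matching the report format above.
pstats.Stats(profiler).sort_stats("tottime").print_stats(100)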

Timing the actual query with django_debug_toolbar shows around 100-150 ms here

vi3k6i5 commented 2 years ago

Thanks for bringing this to our notice. I was also looking at CONN_MAX_AGE while debugging this; now I know to start there and investigate more. I will update here once I make some progress on this.

IlyaFaer commented 2 years ago

This is related to Spanner session creation (see https://cloud.google.com/spanner/docs/sessions#performance_benefits_of_a_session_cache). To deal with it, we use session pools, which re-use these heavy objects instead of recreating them for every connection.

@vi3k6i5, did you have a chance to look at CONN_MAX_AGE? I'm pretty sure there is one more way to fix it: use a global pool object when creating new connections; see this argument of the connect() function: https://github.com/googleapis/python-spanner/blob/41015715c9465b8b7753b71350a4be3aaeb26704/google/cloud/spanner_dbapi/connection.py#L525-L527

That's how we fixed a couple of similar problems in the SQLAlchemy dialect.
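For illustration, a rough sketch of what passing a shared pool to connect() could look like (the ids and the pool size are placeholders):

from google.cloud.spanner_dbapi import connect
from google.cloud.spanner_v1.pool import FixedSizePool

# One module-level pool for the whole process, so sessions are
# created once and then re-used across connections.
GLOBAL_POOL = FixedSizePool(size=10)

def make_connection():
    # Ids are placeholders; each new connection draws sessions
    # from the shared pool instead of creating fresh ones.
    return connect(
        "my-instance",
        "my-database",
        project="my-project",
        pool=GLOBAL_POOL,
    )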