hugobessa / django-shared-schema-tenants

A lib to help in the creation of shared schema multi tenants applications without suffering
MIT License
20 stars 9 forks source link

Improve queries with related attributes to use tenant index #39

Open hugobessa opened 6 years ago

hugobessa commented 6 years ago

When we do filters with related fields, the database is making queries considering the whole related table, instead of only the portion of data from the current tenant. This makes queries very slow when there're many tenants.

One solution is implicitly changing the queryset to filter by the related table tenant too.

Eg.:

MyModel.objects.filter(my_related_field__field1=value1, my_related_field__field2=value2)

sql is something like:

SELECT "my_app_mymodel"."field1", "my_app_mymodel"."field2"
FROM "my_app_mymodel"
    INNER JOIN "my_app_myrelatedmodel"
    ON "my_app_myrelatedmodel"."mymodel_id" = "my_app_mymodel"."id"
WHERE
    "my_app_mymodel"."tenant_id" = tenant_id AND
    "my_app_myrelatedmodel"."field1" = value1 AND
    "my_app_myrelatedmodel"."field2" = value2

If we add the related field tenant's filter like this we solve the problem

MyModel.objects.filter(
    my_related_field__tenant_id=tenant_id, my_related_field__field1=value1, 
    my_related_field__field2=value2)

the sql with the fix should something like:

SELECT "my_app_mymodel"."field1", "my_app_mymodel"."field2"
FROM "my_app_mymodel"
    INNER JOIN "my_app_myrelatedmodel"
    ON "my_app_myrelatedmodel"."mymodel_id" = "my_app_mymodel"."id"
WHERE
    "my_app_mymodel"."tenant_id" = tenant_id AND
    "my_app_myrelatedmodel"."tenant_id" = tenant_id AND
    "my_app_myrelatedmodel"."field1" = value1 AND
    "my_app_myrelatedmodel"."field2" = value2

For excludes the situation is even worse because it always uses NOT IN operator and subqueries.

Eg.:

MyModel.objects.exclude(my_related_field__field1=value1, my_related_field__field2=value2)

sql is something like:

SELECT "my_app_mymodel"."field1", "my_app_mymodel"."field2"
FROM "my_app_mymodel"
WHERE
    "my_app_mymodel"."tenant_id" = tenant_id
    "my_app_mymodel"."my_related_field_id" NOT IN (
        SELECT "my_app_myrelatedmodel"."field1", "my_app_myrelatedmodel"."field2" 
        FROM "my_app_myrelatedmodel"
        WHERE "my_app_myrelatedmodel"."field1" = value1
    ) AND
    "my_app_mymodel"."my_related_field_id" NOT IN (
        SELECT "my_app_myrelatedmodel"."field1", "my_app_myrelatedmodel"."field2" 
        FROM "my_app_myrelatedmodel"
        WHERE "my_app_myrelatedmodel"."field2" = value2
    );

If we use the same strategy as in filter the result would be something like

SELECT "my_app_mymodel"."field1", "my_app_mymodel"."field2"
FROM "my_app_mymodel"
WHERE
    "my_app_mymodel"."tenant_id" = tenant_id AND
    "my_app_mymodel"."my_related_field_id" NOT IN (
        SELECT "my_app_myrelatedmodel"."field1", "my_app_myrelatedmodel"."field2" 
        FROM "my_app_myrelatedmodel"
        WHERE  
            "my_app_myrelatedmodel"."tenant_id" = tenant_id
    ) AND
    "my_app_mymodel"."my_related_field_id" NOT IN (
        SELECT "my_app_myrelatedmodel"."field1", "my_app_myrelatedmodel"."field2" 
        FROM "my_app_myrelatedmodel"
        WHERE 
            "my_app_myrelatedmodel"."field1" = value1
    ) AND
    "my_app_mymodel"."my_related_field_id" NOT IN (
        SELECT "my_app_myrelatedmodel"."field1", "my_app_myrelatedmodel"."field2" 
        FROM "my_app_myrelatedmodel"
        WHERE  
            "my_app_myrelatedmodel"."field2" = value2
    );

For it to be optimized, the generated code should be something like this:

SELECT "my_app_mymodel"."field1", "my_app_mymodel"."field2"
FROM "my_app_mymodel"
WHERE
    "my_app_mymodel"."tenant_id" = tenant_id AND
    "my_app_mymodel"."my_related_field_id" NOT IN (
        SELECT "my_app_myrelatedmodel"."field1", "my_app_myrelatedmodel"."field2" 
        FROM "my_app_myrelatedmodel"
        WHERE 
            "my_app_myrelatedmodel"."tenant_id" = tenant_id AND 
            "my_app_myrelatedmodel"."field1" = value1
    ) AND
    "my_app_mymodel"."my_related_field_id" NOT IN (
        SELECT "my_app_myrelatedmodel"."field1", "my_app_myrelatedmodel"."field2" 
        FROM "my_app_myrelatedmodel"
        WHERE  
            "my_app_myrelatedmodel"."tenant_id" = tenant_id AND 
            "my_app_myrelatedmodel"."field2" = value2
    );

The only way I could generate something that looks like the expected result was:

MyModel.objects.exclude(
    my_related__id__in=MyRelatedModel.objects.filter(field__field1=value1, field2=value2))

The resulting sql looks like this:

SELECT "my_app_mymodel"."field1", "my_app_mymodel"."field2"
FROM "my_app_mymodel"
WHERE
    "my_app_mymodel"."tenant_id" = tenant_id AND
    "my_app_mymodel"."my_related_field_id" NOT IN (
        SELECT "my_app_myrelatedmodel"."field1", "my_app_myrelatedmodel"."field2" 
        FROM "my_app_myrelatedmodel"
        WHERE 
            "my_app_myrelatedmodel"."tenant_id" = tenant_id AND 
            "my_app_myrelatedmodel"."field1" = value1
    ) AND
    "my_app_mymodel"."my_related_field_id" NOT IN (
        SELECT "my_app_myrelatedmodel"."field1", "my_app_myrelatedmodel"."field2" 
        FROM "my_app_myrelatedmodel"
        WHERE "my_app_myrelatedmodel"."id" IN (
            SELECT "my_app_myrelatedmodel"."id" 
            FROM "my_app_myrelatedmodel"
            WHERE  
                "my_app_myrelatedmodel"."tenant_id" = tenant_id AND 
                "my_app_myrelatedmodel"."field2" = value2
        )
    );
techdragon commented 6 years ago

I might have some ideas how to get these queries to come out more optimised, but I'm not 100% sure if I've followed your reasoning correctly.

This definitely seems like something where a concrete example case would be very helpful. If I had a concrete example I could import in a python REPL and work with I could spend some time in the REPL trying the various Query tools in Django (Aggregation functions, Q objects, Prefetch objects, FilteredRelation objects, etc...) to try and find a way to optimise the query. However if I just made up some model classes to try things I won't be sure if I've created an appropriate example and really found a way to fix these query optimisation issues, or if my example is too simplistic and my possible improvements don't fully solve the problems you're talking about here.

@hugobessa Could you add one or two examples that produce this kind of non-optimal queries from the ORM into the included exampleproject Django project, and/or into the test cases for future use in regression testing the optimisations?

techdragon commented 6 years ago

@hugobessa Its just occurred to me that any work to improve this may have conflicts with the work to make the Tennant model swappable and solve #37