npgsql / efcore.pg

Entity Framework Core provider for PostgreSQL
PostgreSQL License

Translating linq select of a column in a foreign key table to ARRAY subselect #712

Closed dpsenner closed 5 years ago

dpsenner commented 5 years ago

Hi,

we are trying to fetch a bunch of rows from one table that references another table via an n:m relation table, and to select one column from the related table as well. So far we have found no way to make the query generator produce a SELECT containing an ARRAY() subselect. Ideally, the generated query would look like:

SELECT user.id, user.name,
  ARRAY(SELECT role.name FROM role INNER JOIN user_role ON user_role.role_id = role.id WHERE user_role.user_id = user.id)
FROM user
WHERE ...

when writing the following linq query:

DbContext
  .User
  .Where(...)
  .Select(user => new {
    user.Id,
    user.Name,
    RoleNames = user
      .UserRoles
      .Select(userRole => userRole.Role.Name)
      .ToArray()
  })

The good thing is that the above LINQ query currently generates two SQL queries, which perform rather well. One selects the columns from the user table; the other selects the role names from the related table by joining against the previous user query:

SELECT user.id, user.name FROM user WHERE ...

SELECT t1.id, t3.name, t2.user_id
FROM user_role AS t2
INNER JOIN role AS t3 ON t2.role_id = t3.id
INNER JOIN (
    SELECT t0.id
    FROM ( SELECT * FROM user WHERE ... ) AS t0
) AS t1 ON t2.user_id = t1.id
ORDER BY t1.id

However, this also means that the database has to run through all the matching users twice: once for the users and once for their role names. The ideal query above would fetch the role names in the same single roundtrip. Each additional related table adds yet another roundtrip and further increases latency, even though the single column of that related table could be fetched with just another ARRAY() column.

Another bad thing we noticed is that rewriting the LINQ query so the role names read as RoleNames = new string[] { ... } worsens the situation, because a subquery is then executed for every returned user. To me this was an unexpected side effect that I consider a bug.
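I haven't captured the exact SQL EF Core emits in that case, but conceptually it degenerates to a per-user lookup along these lines (a sketch using the table names from above):

```sql
-- Hypothetical per-user query, executed once for every returned user (N+1):
SELECT role.name
FROM user_role
INNER JOIN role ON user_role.role_id = role.id
WHERE user_role.user_id = :user_id;
```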

roji commented 5 years ago

@dpsenner, the core parts of EF Core that generate the general structure of queries were built for relational databases in general, whereas arrays are pretty much a PostgreSQL-specific thing. Npgsql does support certain array features, but we definitely haven't gone into more advanced scenarios such as the one you're proposing.

However, your "ideal" query above actually uses a subquery, executed once per row to build the array:

                                       QUERY PLAN                                       
----------------------------------------------------------------------------------------
 Seq Scan on "Users" u  (cost=0.00..17380.30 rows=1270 width=68)
   SubPlan 1
     ->  Bitmap Heap Scan on "Roles" r  (cost=4.20..13.67 rows=6 width=32)
           Recheck Cond: ("UserId" = u."Id")
           ->  Bitmap Index Scan on "IX_Roles_UserId"  (cost=0.00..4.20 rows=6 width=0)
                 Index Cond: ("UserId" = u."Id")

The queries produced by EF Core seem to be much more efficient.

I haven't checked anything in depth and I'm not sure exactly what you're doing, but do use EXPLAIN carefully on your queries and analyze the costs, rather than assuming that two queries are necessarily worse than one.

dpsenner commented 5 years ago

For our use case, the query with an ARRAY() subselect per user is more efficient than two selects. Both selects return large result sets, and we have to wait for both queries to complete and for Entity Framework to join the records in memory. With two queries we must also hold all users and their roles in memory while processing the request. This is unfortunate because the queries run behind a gRPC streaming API, which would allow us to stream some users to the client while still fetching results from the database. Streaming greatly reduces the memory footprint on the server and improves request latency from the client's point of view.

roji commented 5 years ago

cc @ajcvickers @divega on streaming vs. buffering discussion, and also the other perf aspects.

I do get the streaming argument, although be careful not to make your application lighter by pushing complexity to PostgreSQL (by introducing subqueries) - overall performance could suffer. I'm still a little surprised that the subselect doesn't introduce a serious perf degradation in PostgreSQL, but this is going to depend on your specific use case.

In any case, implementing something like this would probably be a pretty major change, which we generally don't do in Npgsql (as it touches on the core of EF Core). Let's wait for feedback from the EF Core team on this.

dpsenner commented 5 years ago

The performance penalty we pay with ARRAY(SELECT) is respectably low. Somewhere we have to do the roundtrips to the database, and doing the roundtrip in ARRAY(SELECT) is effectively the fastest.

Note that if we had to transfer more than one column from the dependent role entity, we would have to find another solution to this problem. I have seen in another project that EF6 would, in this scenario, generate one large query containing the user columns, the role columns, and a discriminator column that lets the object mapper map the returned dataset back onto the hierarchical object model. Something along the lines of:

with users as (
    select
     t0.id, t0.code, t0.display_name
    from access_control.user t0
    where t0.display_name like '%bar%'
)

select
 0 AS ttype, t0.id, t0.code, t0.display_name, null, null
from users t0

union all

select
 1 AS ttype, t1.id, null, null, t3.id, t3.name
from
users t1
inner join access_control.user_role t2 on t1.id = t2.user_id
inner join access_control.role t3 on t3.id = t2.role_id

order by 2, 1, 4

This is obviously the best of the available options that Entity Framework could choose for us. :grin:

I'm looking forward to reading the comments of the EF Core team.

roji commented 5 years ago

Somewhere we have to do the roundtrips to the database, and doing the roundtrip in ARRAY(SELECT) is effectively the fastest.

Just to be terminologically clear, when we say database roundtrips, we're usually referring to network roundtrips, where Npgsql sends a query and the database responds. Embedding ARRAY(SELECT ...) is definitely not a roundtrip - it's just a subquery embedded in your query. Only one (network) roundtrip occurs. To further make this point clear, it's possible to send two queries in a single roundtrip; this is called batching and can greatly improve performance.
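For illustration, batching here would just mean sending both statements in one command text, so both result sets come back in a single network roundtrip (a sketch; the WHERE clauses are elided as above):

```sql
-- One network roundtrip, two result sets:
SELECT u.id, u.name FROM "user" u WHERE ...;

SELECT ur.user_id, r.name
FROM user_role ur
INNER JOIN role r ON ur.role_id = r.id
WHERE ...;
```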

I just want to be sure that you've actually benchmarked the different approaches on your actual database, and found that the single-query-with-subquery approach was indeed faster than EF Core's two-query approach - it's very important not to assume anything like this without testing it. Of course it's somewhat difficult to compare in this case, since EF Core simply does not support the first approach, but I'd definitely benchmark the two variants on PostgreSQL itself (e.g. with pgbench) before arriving at any conclusions. Again, something that seems efficient on a small dataset could quickly bring your database to its knees when executed on a larger one.

Note that if we had to transfer more than one column from that dependent entity role, we would have to find another solution to this problem.

Not necessarily - PostgreSQL definitely supports arrays of records (which are roughly weakly-typed tuples) or composites (strongly-typed tuples), so you could use the same technique to construct those. Consuming these on the client side is another matter, which indeed could create problems.
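As a sketch of the composite variant (the type and alias names here are illustrative, not from your schema):

```sql
-- Define a strongly-typed tuple once...
CREATE TYPE role_summary AS (id integer, name text);

-- ...then return an array of it per user, instead of weakly-typed ROW(...) records:
SELECT u.id, u.name,
       ARRAY(SELECT ROW(r.id, r.name)::role_summary
             FROM user_role ur
             INNER JOIN role r ON ur.role_id = r.id
             WHERE ur.user_id = u.id) AS roles
FROM "user" u;
```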

If we're discussing how to transfer join data, yet another option is a classical single query with a join, where each row contains the user details along with the details of a single role. The downside here is that the user details are duplicated for each role of that user (so data is duplicated on the network), but the result is streamable on the client side and contains no subquery.
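A sketch of that classical join, using the schema from earlier in this thread:

```sql
-- u.id and u.name are repeated once per role the user has:
SELECT u.id, u.name, r.name AS role_name
FROM "user" u
INNER JOIN user_role ur ON ur.user_id = u.id
INNER JOIN role r ON r.id = ur.role_id
ORDER BY u.id;
```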

I think all this shows that there are multiple approaches, each with its advantages and disadvantages. I can tell you with reasonable certainty that your proposal with ARRAY(SELECT) would be a very poor choice as a default joining mechanism in EF, because of the subquery it contains.

dpsenner commented 5 years ago

Just to be terminologically clear, when we say database roundtrips, we're usually referring to network roundtrips, where Npgsql sends a query and the database responds.

Above, by a database roundtrip I meant the operation of fetching a record from the underlying store (the data as stored on disk, ignoring caching techniques etc.) and returning it to the invoker; I did not mean a network roundtrip. For instance, an application running a SELECT query does a roundtrip to the database store over the network. A sub-SELECT within a SELECT does semantically the same roundtrip to the store, but without involving the network. Obviously the latency of a subselect inside the database is orders of magnitude lower than that of a network roundtrip from a remote application, not to mention the numerous ways in which the database can further speed up a subselect. That said, sending multiple SELECT statements from the application is always slower because it involves networking. It also adds load to the database, because the second statement has to re-evaluate the first query to fetch the correct subset of related records; essentially the database does the same work twice.

PostgreSQL definitely supports arrays of records

Arrays of records are an interesting thing; I did not know about them, thanks! Having those would definitely reduce the complexity of writing fancy queries with unions and whatnot. The following query returns the full role entities of a user in a single column:

select t0.id, t0.code, t0.display_name,
       array(select t2 from access_control.user_role t1 inner join access_control.role t2 on t1.role_id = t2.id where t1.user_id = t0.id)
from access_control.user t0

EXPLAIN (ANALYZE, FORMAT JSON) shows:
[
  {
    "Plan": {
      "Node Type": "Seq Scan",
      "Parallel Aware": false,
      "Relation Name": "user",
      "Schema": "access_control",
      "Alias": "t0",
      "Startup Cost": 0,
      "Total Cost": 233100.8,
      "Plan Rows": 11126,
      "Plan Width": 68,
      "Actual Startup Time": 0.059,
      "Actual Total Time": 160.521,
      "Actual Rows": 11126,
      "Actual Loops": 1,
      "Output": [
        "t0.id",
        "t0.code",
        "t0.display_name",
        "(SubPlan 1)"
      ],
      "Shared Hit Blocks": 82352,
      "Shared Read Blocks": 0,
      "Shared Dirtied Blocks": 0,
      "Shared Written Blocks": 0,
      "Local Hit Blocks": 0,
      "Local Read Blocks": 0,
      "Local Dirtied Blocks": 0,
      "Local Written Blocks": 0,
      "Temp Read Blocks": 0,
      "Temp Written Blocks": 0,
      "I/O Read Time": 0,
      "I/O Write Time": 0,
      "Plans": [
        {
          "Node Type": "Nested Loop",
          "Parent Relationship": "SubPlan",
          "Subplan Name": "SubPlan 1",
          "Parallel Aware": false,
          "Join Type": "Inner",
          "Startup Cost": 0.56,
          "Total Cost": 20.92,
          "Plan Rows": 2,
          "Plan Width": 472,
          "Actual Startup Time": 0.007,
          "Actual Total Time": 0.012,
          "Actual Rows": 2,
          "Actual Loops": 11126,
          "Output": [
            "t2.*"
          ],
          "Inner Unique": true,
          "Shared Hit Blocks": 82124,
          "Shared Read Blocks": 0,
          "Shared Dirtied Blocks": 0,
          "Shared Written Blocks": 0,
          "Local Hit Blocks": 0,
          "Local Read Blocks": 0,
          "Local Dirtied Blocks": 0,
          "Local Written Blocks": 0,
          "Temp Read Blocks": 0,
          "Temp Written Blocks": 0,
          "I/O Read Time": 0,
          "I/O Write Time": 0,
          "Plans": [
            {
              "Node Type": "Index Only Scan",
              "Parent Relationship": "Outer",
              "Parallel Aware": false,
              "Scan Direction": "Forward",
              "Index Name": "u_user_role",
              "Relation Name": "user_role",
              "Schema": "access_control",
              "Alias": "t1",
              "Startup Cost": 0.29,
              "Total Cost": 4.32,
              "Plan Rows": 2,
              "Plan Width": 16,
              "Actual Startup Time": 0.003,
              "Actual Total Time": 0.003,
              "Actual Rows": 2,
              "Actual Loops": 11126,
              "Output": [
                "t1.user_id",
                "t1.role_id"
              ],
              "Index Cond": "(t1.user_id = t0.id)",
              "Rows Removed by Index Recheck": 0,
              "Heap Fetches": 0,
              "Shared Hit Blocks": 22501,
              "Shared Read Blocks": 0,
              "Shared Dirtied Blocks": 0,
              "Shared Written Blocks": 0,
              "Local Hit Blocks": 0,
              "Local Read Blocks": 0,
              "Local Dirtied Blocks": 0,
              "Local Written Blocks": 0,
              "Temp Read Blocks": 0,
              "Temp Written Blocks": 0,
              "I/O Read Time": 0,
              "I/O Write Time": 0
            },
            {
              "Node Type": "Index Scan",
              "Parent Relationship": "Inner",
              "Parallel Aware": false,
              "Scan Direction": "Forward",
              "Index Name": "role_pkey",
              "Relation Name": "role",
              "Schema": "access_control",
              "Alias": "t2",
              "Startup Cost": 0.28,
              "Total Cost": 8.29,
              "Plan Rows": 1,
              "Plan Width": 488,
              "Actual Startup Time": 0.004,
              "Actual Total Time": 0.004,
              "Actual Rows": 1,
              "Actual Loops": 19844,
              "Output": [
                "t2.*",
                "t2.id"
              ],
              "Index Cond": "(t2.id = t1.role_id)",
              "Rows Removed by Index Recheck": 0,
              "Shared Hit Blocks": 59623,
              "Shared Read Blocks": 0,
              "Shared Dirtied Blocks": 0,
              "Shared Written Blocks": 0,
              "Local Hit Blocks": 0,
              "Local Read Blocks": 0,
              "Local Dirtied Blocks": 0,
              "Local Written Blocks": 0,
              "Temp Read Blocks": 0,
              "Temp Written Blocks": 0,
              "I/O Read Time": 0,
              "I/O Write Time": 0
            }
          ]
        }
      ]
    },
    "Planning Time": 0.767,
    "Triggers": [],
    "Execution Time": 161.336
  }
]

The following fetches a subset of the role columns:

select t0.id, t0.code, t0.display_name,
       array(select row(t2.id, t2.name) from access_control.user_role t1 inner join access_control.role t2 on t1.role_id = t2.id where t1.user_id = t0.id)
from access_control.user t0

EXPLAIN (ANALYZE, FORMAT JSON) shows:
[
  {
    "Plan": {
      "Node Type": "Seq Scan",
      "Parallel Aware": false,
      "Relation Name": "user",
      "Schema": "access_control",
      "Alias": "t0",
      "Startup Cost": 0,
      "Total Cost": 233100.8,
      "Plan Rows": 11126,
      "Plan Width": 68,
      "Actual Startup Time": 0.056,
      "Actual Total Time": 131.884,
      "Actual Rows": 11126,
      "Actual Loops": 1,
      "Output": [
        "t0.id",
        "t0.code",
        "t0.display_name",
        "(SubPlan 1)"
      ],
      "Shared Hit Blocks": 82352,
      "Shared Read Blocks": 0,
      "Shared Dirtied Blocks": 0,
      "Shared Written Blocks": 0,
      "Local Hit Blocks": 0,
      "Local Read Blocks": 0,
      "Local Dirtied Blocks": 0,
      "Local Written Blocks": 0,
      "Temp Read Blocks": 0,
      "Temp Written Blocks": 0,
      "I/O Read Time": 0,
      "I/O Write Time": 0,
      "Plans": [
        {
          "Node Type": "Nested Loop",
          "Parent Relationship": "SubPlan",
          "Subplan Name": "SubPlan 1",
          "Parallel Aware": false,
          "Join Type": "Inner",
          "Startup Cost": 0.56,
          "Total Cost": 20.92,
          "Plan Rows": 2,
          "Plan Width": 32,
          "Actual Startup Time": 0.006,
          "Actual Total Time": 0.009,
          "Actual Rows": 2,
          "Actual Loops": 11126,
          "Output": [
            "ROW(t2.id, t2.name)"
          ],
          "Inner Unique": true,
          "Shared Hit Blocks": 82124,
          "Shared Read Blocks": 0,
          "Shared Dirtied Blocks": 0,
          "Shared Written Blocks": 0,
          "Local Hit Blocks": 0,
          "Local Read Blocks": 0,
          "Local Dirtied Blocks": 0,
          "Local Written Blocks": 0,
          "Temp Read Blocks": 0,
          "Temp Written Blocks": 0,
          "I/O Read Time": 0,
          "I/O Write Time": 0,
          "Plans": [
            {
              "Node Type": "Index Only Scan",
              "Parent Relationship": "Outer",
              "Parallel Aware": false,
              "Scan Direction": "Forward",
              "Index Name": "u_user_role",
              "Relation Name": "user_role",
              "Schema": "access_control",
              "Alias": "t1",
              "Startup Cost": 0.29,
              "Total Cost": 4.32,
              "Plan Rows": 2,
              "Plan Width": 16,
              "Actual Startup Time": 0.003,
              "Actual Total Time": 0.003,
              "Actual Rows": 2,
              "Actual Loops": 11126,
              "Output": [
                "t1.user_id",
                "t1.role_id"
              ],
              "Index Cond": "(t1.user_id = t0.id)",
              "Rows Removed by Index Recheck": 0,
              "Heap Fetches": 0,
              "Shared Hit Blocks": 22501,
              "Shared Read Blocks": 0,
              "Shared Dirtied Blocks": 0,
              "Shared Written Blocks": 0,
              "Local Hit Blocks": 0,
              "Local Read Blocks": 0,
              "Local Dirtied Blocks": 0,
              "Local Written Blocks": 0,
              "Temp Read Blocks": 0,
              "Temp Written Blocks": 0,
              "I/O Read Time": 0,
              "I/O Write Time": 0
            },
            {
              "Node Type": "Index Scan",
              "Parent Relationship": "Inner",
              "Parallel Aware": false,
              "Scan Direction": "Forward",
              "Index Name": "role_pkey",
              "Relation Name": "role",
              "Schema": "access_control",
              "Alias": "t2",
              "Startup Cost": 0.28,
              "Total Cost": 8.29,
              "Plan Rows": 1,
              "Plan Width": 143,
              "Actual Startup Time": 0.003,
              "Actual Total Time": 0.003,
              "Actual Rows": 1,
              "Actual Loops": 19844,
              "Output": [
                "t2.id",
                "t2.name",
                "t2.note",
                "t2.created_on",
                "t2.created_by",
                "t2.last_changed_on",
                "t2.last_changed_by",
                "t2.enum_value"
              ],
              "Index Cond": "(t2.id = t1.role_id)",
              "Rows Removed by Index Recheck": 0,
              "Shared Hit Blocks": 59623,
              "Shared Read Blocks": 0,
              "Shared Dirtied Blocks": 0,
              "Shared Written Blocks": 0,
              "Local Hit Blocks": 0,
              "Local Read Blocks": 0,
              "Local Dirtied Blocks": 0,
              "Local Written Blocks": 0,
              "Temp Read Blocks": 0,
              "Temp Written Blocks": 0,
              "I/O Read Time": 0,
              "I/O Write Time": 0
            }
          ]
        }
      ]
    },
    "Planning Time": 0.691,
    "Triggers": [],
    "Execution Time": 132.606
  }
]

Both show fantastic numbers in EXPLAIN, so I'd love to see support for fetching rows this way instead of sending multiple SELECT statements to the database. I'm looking forward to hearing your opinion.

composites (strongly-typed tuples)

We have evaluated those, but they would be overkill to implement with Entity Framework just to fetch entities along with denormalized data from related entities.

roji commented 5 years ago

Just to be terminologically clear, when we say database roundtrips, we're usually referring to network roundtrips, where Npgsql sends a query and the database responds.

Above, by a database roundtrip I meant the operation of fetching a record from the underlying store (the data as stored on disk, ignoring caching techniques etc.) and returning it to the invoker; I did not mean a network roundtrip. [...]

Agreed, I was just trying to make sure we use the same terminology. I would be careful about assuming that a subquery (or "roundtrip" in your words) has some sort of fixed cost.

In general, aside from the advantage of streaming, I still haven't seen any evidence that your approach really is better in terms of PostgreSQL performance. So I coded up a quick many-to-many model (code at the bottom) and compared the EXPLAIN output of the EF Core queries against your approach. I preloaded 10000 user objects, each associated with the same 3 roles. Here's what I got:

EF Core generates two queries:

Query 1

test=# EXPLAIN SELECT u."Id", u."Name"
        FROM "Users" AS u
        ORDER BY u."Id";
                                    QUERY PLAN                                    
----------------------------------------------------------------------------------
 Index Scan using "PK_Users" on "Users" u  (cost=0.28..445.06 rows=6985 width=36)
(1 row)

Query 2

test=# EXPLAIN SELECT t."Id", "r.Role"."Name", "u.UserRoles"."UserId"                                                                             
test-#         FROM "UserRoles" AS "u.UserRoles"
test-#         INNER JOIN "Roles" AS "r.Role" ON "u.UserRoles"."RoleId" = "r.Role"."Id"
test-#         INNER JOIN (
test(#             SELECT u0."Id"
test(#             FROM "Users" AS u0
test(#             WHERE (u0."Name" <> '') OR u0."Name" IS NULL
test(#         ) AS t ON "u.UserRoles"."UserId" = t."Id"
test-#         ORDER BY t."Id";
                                           QUERY PLAN                                            
-------------------------------------------------------------------------------------------------
 Sort  (cost=3165.01..3240.00 rows=29997 width=40)
   Sort Key: u0."Id"
   ->  Hash Join  (cost=343.56..934.35 rows=29997 width=40)
         Hash Cond: ("u.UserRoles"."RoleId" = "r.Role"."Id")
         ->  Hash Join  (cost=304.99..816.77 rows=29997 width=12)
               Hash Cond: ("u.UserRoles"."UserId" = u0."Id")
               ->  Seq Scan on "UserRoles" "u.UserRoles"  (cost=0.00..433.00 rows=30000 width=8)
               ->  Hash  (cost=180.00..180.00 rows=9999 width=4)
                     ->  Seq Scan on "Users" u0  (cost=0.00..180.00 rows=9999 width=4)
                           Filter: (("Name" <> ''::text) OR ("Name" IS NULL))
         ->  Hash  (cost=22.70..22.70 rows=1270 width=36)
               ->  Seq Scan on "Roles" "r.Role"  (cost=0.00..22.70 rows=1270 width=36)
(12 rows)

Here's an EXPLAIN for what I think is your approach, with a subquery:

test=# EXPLAIN SELECT "Users"."Id", "Users"."Name", ARRAY(SELECT "Roles"."Name" FROM "Roles" INNER JOIN "UserRoles" ON "UserRoles"."RoleId" = "Roles"."Id" WHERE "UserRoles"."UserId" = "Users"."Id") FROM "Users" WHERE ("Users"."Name" <> '') OR "Users"."Name" IS NULL; 
                                              QUERY PLAN                                              
------------------------------------------------------------------------------------------------------
 Seq Scan on "Users"  (cost=0.00..323828.18 rows=9999 width=44)
   Filter: (("Name" <> ''::text) OR ("Name" IS NULL))
   SubPlan 1
     ->  Nested Loop  (cost=0.44..32.37 rows=3 width=32)
           ->  Index Only Scan using "PK_UserRoles" on "UserRoles"  (cost=0.29..11.85 rows=3 width=4)
                 Index Cond: ("UserId" = "Users"."Id")
           ->  Index Scan using "PK_Roles" on "Roles"  (cost=0.15..6.84 rows=1 width=36)
                 Index Cond: ("Id" = "UserRoles"."RoleId")
(8 rows)

Now, I'm really not a PostgreSQL performance expert, but the max cost of the 1st EF Core query is 445, and the max cost of the 2nd is 3240. The max cost for your query is reported to be 323828. In other words, your query seems to be a bit less than two orders of magnitude slower in PostgreSQL.

I'll be glad to learn that I'm missing something or have somehow botched my test. Also, this is not an actual benchmark of your application - which is what you really want to do. That is, in certain cases it may be acceptable to put more load on the database server in order to, say, have a streaming query or to reduce CPU usage. But as a general rule, you usually want to remove load from your PostgreSQL, shifting it to your application server(s), which can scale much more easily.

Let me know how this squares with what you're seeing.

dpsenner commented 5 years ago

Quickly looking at your sample model, I notice a difference: in my queries I did not specify a WHERE operator, and luckily the query planner recognized that the costs of the two queries differ. :-) At this point I finally grasped my not-so-conscious reasoning for opening this issue. :laughing: Sorry for stealing your valuable time by pointing at performance issues while the actual issue is something totally different.

The actual issue here is rather that Entity Framework makes an educated guess about what to do with the single LINQ query it got. The developer wrote one LINQ query, and the expectation is that it runs as one SQL query against the database. Entity Framework instead does something unexpected: it runs the one LINQ query as two SQL queries and joins the result sets in memory. To me this violates the principle of least astonishment. If the developer's intention was to join the records on the application server, he would have written two LINQ queries.

roji commented 5 years ago

Quickly looking at your sample model, I notice a difference: in my queries I did not specify a WHERE operator, and luckily the query planner recognized that the costs of the two queries differ. :-) At this point I finally grasped my not-so-conscious reasoning for opening this issue. :laughing: Sorry for stealing your valuable time by pointing at performance issues while the actual issue is something totally different.

OK, so if I understand correctly, you agree that the version with the subquery indeed performs badly (compared to the current EF Core 2-query behavior) and you wouldn't want EF Core to generate it, correct?

FYI I still don't think that a WHERE operator should have any significant impact here - the subquery-based query should perform badly as well.

The actual issue here is rather that Entity Framework makes an educated guess about what to do with the single LINQ query it got. The developer wrote one LINQ query, and the expectation is that it runs as one SQL query against the database. Entity Framework instead does something unexpected: it runs the one LINQ query as two SQL queries and joins the result sets in memory. To me this violates the principle of least astonishment. If the developer's intention was to join the records on the application server, he would have written two LINQ queries.

First, that's a question for the EF Core team rather than Npgsql - I just do the PostgreSQL provider.

Second, I don't think there's any pretense anywhere of translating a single LINQ query into a single SQL query, and therefore I don't think this violates anything. Your LINQ expression expresses something, which EF Core will later execute in whatever way it deems appropriate or efficient, whether it be with one SQL query or two.

Finally, IIRC EF6 actually generated a single INNER JOIN query for this kind of query. The team made a conscious choice in EF Core to not do that, but instead to split it into two. I think this is related to the duplication that a classical join entails - principal columns are duplicated for every dependent row, significantly increasing the data that gets sent over the wire. There are advantages and disadvantages to both methods, and the team made its decision...

I'm going to close this issue now as it doesn't make sense to implement the subquery-based query, but we can of course continue this conversation.

dpsenner commented 5 years ago

Thanks. For future reference, the issue in EF core for this same topic is #12098.

roji commented 5 years ago

@dpsenner thanks for linking to that... Yeah, https://github.com/aspnet/EntityFrameworkCore/issues/12098#issuecomment-391095895 explains the same thing.

dpsenner commented 5 years ago

I would like to talk a little further about returning related records (or columns of related records) as ARRAY columns, if you don't mind. You just mentioned that EF Core would have to generate the proper queries. But since ARRAY(SELECT entity FROM table entity) is PostgreSQL-specific, shouldn't that belong to the Entity Framework Core integration of Npgsql?

roji commented 5 years ago

@dpsenner of course - we can definitely continue discussing.

Yeah, anything containing arrays is going to be specific to the Npgsql provider - although since arrays are quite special/advanced we're occasionally blocked by EF Core internals when adding support. However, as we've seen this isn't an efficient way of executing a join, so I wouldn't expect EF Core to actually generate this kind of query?

Note that you always have the option of using raw SQL to write whatever query you want (including ARRAY), and still have EF Core materialize the entities for you. Of course this isn't ideal, but in some cases it's necessary.

dpsenner commented 5 years ago

I'm giving my subconscious a chance to work on this problem. Read you soon.

roji commented 5 years ago

Sure thing.

Emill commented 3 years ago

Ok so I've done some initial performance testing regarding PostgreSQL's query plans.

There are basically two strategies: a per-row subquery, and a join + group by + aggregate.

Subquery:

select *, array(select row(tbl2.*) from tbl2 where tbl1.id = tbl2.tbl1id) from tbl1

Join + group by + aggregate:

select tbl1.*, array_agg(tbl2) from tbl1 left join tbl2 on tbl1.id = tbl2.tbl1id group by tbl1.id

The first strategy includes a subquery that is executed once per row. This means O(n*m) if there is no foreign-key index, because a full sequential scan of tbl2 is performed for each row in tbl1, which is pretty bad. When an index on the foreign key exists, however, it can be used, resulting in optimal time complexity. With a many-to-many relationship, the extra inner join inside the subquery just follows the primary-key index on the third table, which is good.
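The index the first strategy relies on is just an ordinary index on the foreign key (column and index names here are illustrative):

```sql
-- Without this, the subquery falls back to a sequential scan of tbl2 per tbl1 row:
CREATE INDEX ix_tbl2_tbl1id ON tbl2 (tbl1id);
```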

The second strategy results in a query plan similar to that of a traditional left join, plus a sort (if no foreign-key index is available) plus an aggregation. So it basically trades network-traffic overhead (or a cartesian explosion in the case of two or more joins) for some simple aggregation work on the server side.

If there are nested includes (i.e. tbl1 [has-many] tbl2 [has-many] tbl3), then the first strategy still results in a simple and efficient query plan (assuming indexes exist!), while the second becomes more complex, since aggregation has to be performed after every join.
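A sketch of how the first strategy nests for tbl1 [has-many] tbl2 [has-many] tbl3 (the foreign-key column names are illustrative):

```sql
-- Each level contributes one correlated ARRAY() subquery; with foreign-key
-- indexes on tbl2.tbl1id and tbl3.tbl2id, each stays an index lookup:
SELECT tbl1.*,
       ARRAY(SELECT ROW(tbl2.*,
                        ARRAY(SELECT ROW(tbl3.*)
                              FROM tbl3
                              WHERE tbl3.tbl2id = tbl2.id))
             FROM tbl2
             WHERE tbl2.tbl1id = tbl1.id)
FROM tbl1;
```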

Depending on the specific data and query, a normal join could of course sometimes be more performant, especially if the data that is being duplicated is small.

To me this still looks very promising, and I guess it would improve performance over split-query by avoiding duplicate work on the backend. It could really be a killer feature, in my opinion, when the user wants to perform a complex query involving many navigations, or when the returned data looks more like a complex JSON structure than a 2D table. Not sure if this issue should be re-opened, or whether it's more of an EF Core issue, even though it's quite PostgreSQL-specific.

roji commented 3 years ago

@Emill thanks for putting the time into this... I agree this looks very promising and have had thoughts about it in the past as well.

Can you please open a new issue with the above, and we can continue the conversation there? This issue already contains quite a bit of not quite relevant discussion, would be good to start afresh.