groue / GRDB.swift

A toolkit for SQLite databases, with a focus on application development
MIT License
6.83k stars 705 forks source link

Wrong association aggregate count when using HasManyThrough, HasMany and Self Joins #777

Closed simensol closed 4 years ago

simensol commented 4 years ago

I have a Task record which has a HasManyThrough association (many-to-many connection) with a Tag record. The Task record also has a HasMany association (one-to-many connection) to other Task records using Self Joins. Thus, the Task record has two types: project and subtask:

struct Task: Identifiable {
    var id = UUID()
    var projectId: UUID?
    var title: String
    var status: TaskStatus
}

extension Task: TableRecord {
    // Projects and tasks.
    static let subtasks = hasMany(Task.self, key: "subtasks")
    static let project = belongsTo(Task.self, key: "project")
}

I want to select all projects and associated tags and subtasks. The following works as expected:

let request = Task
    .all()
    .including(optional: Task.project)
    .including(all: Task.subtasks)
    .including(all: Task.tags)
    .annotated(with: Task.tags.count)

I also want to include the total count of subtasks. Thus, I have added annotated() below:

let request = Task
    .all()
    .including(optional: Task.project)
    .including(all: Task.subtasks)
    .including(all: Task.tags)
    .annotated(with: Task.subtasks.count.forKey("totalSubtaskCount"))

totalSubtaskCount is correct. However, I also want to include the total count of active subtasks. To achieve this, I have tried to use sum():

let request = Task
    .all()
    .including(optional: Task.project)
    .including(all: Task.subtasks)
    .including(all: Task.tags)
    .annotated(with: Task.subtasks.count.forKey("totalSubtaskCount"))
    .annotated(with: Task.subtasks.sum(Task.Columns.status == TaskStatus.Active).forKey("remainingSubtaskCount"))

If the number of associated tags is equal to 1, totalSubtaskCount is correct. However, if the number of associated tags is > 1, the sum is multiplied by the number of associated tags. For example, if a project has 3 associated subtasks and 2 associated tags, totalSubtaskCount becomes 3*2=6 when the correct number is 3.

Environment

GRDB flavor(s): GRDB + GRDBCombine GRDB version: 5.0.0-beta.2 Installation method: SPM Xcode version: 11.5 beta 2 (11N605f) Swift version: 5.2.2 Platform(s) running GRDB: iOS **macOS version running Xcode: 10.15.4 (19E287)

groue commented 4 years ago

Hello @simensol,

Can you please check the request below?

// Two distinct populations of subtasks,
// identified by their association keys,
// that can be counted independently:
let subtasks = Task.subtasks // key "subtasks"
let remainingSubtasks = Task.subtasks
    .filter(Task.Columns.status == TaskStatus.Active)
    .forKey("remainingSubtasks")

let request = Task
    .all()
    .including(optional: Task.project)
    .including(all: Task.subtasks)
    .including(all: Task.tags)
    .annotated(with: subtasks.count.forKey("totalSubtaskCount"))
    .annotated(with: remainingSubtasks.count.forKey("remainingSubtaskCount"))

It should give the expected results. If it does, then I'll describe why it works when the first request does not.

groue commented 4 years ago

(BTW, thank you for trying the beta!)

groue commented 4 years ago

then I'll describe why it works when the first request does not.

Sorry for this hand waving. The topic is already addressed here: https://github.com/groue/GRDB.swift/blob/master/Documentation/AssociationsBasics.md#isolation-of-multiple-aggregates

I suggest you check this link, but I'm not very confident it is very clear. Here is an analysis focused on your app.

In the original request:

  1. The two aggregates are computed on the same population of subtasks (all subtasks).
  2. The sum of 1/0 booleans did not work well, because SQLite would join subtasks several times (I think this is a side effect of the self join, but we'd need the raw SQL to be sure).

The count aggregate avoids (2) by performing a COUNT(DISTINCT id).

But in order to use two count aggregates that are computed independently, we need two distinct subtasks associations, identified by their association keys ("subtasks" and "remainingSubtasks"):

// Two distinct populations of subtasks,
// identified by their association keys,
// that can be counted independently:
let subtasks = Task.subtasks // key "subtasks"
let remainingSubtasks = Task.subtasks
    .filter(Task.Columns.status == TaskStatus.Active)
    .forKey("remainingSubtasks")
simensol commented 4 years ago

Your suggested request works! The link you provided also makes more sense now. However, I don't think it is all clear (al least to me) how to apply "Isolation of Multiple Aggregates" when using self joins from the documentation. But this is maybe a very special use case. Anyway, thanks for providing a solution and explanation!

(And thanks for continuing the development of this fantastic library! It has made my sqlite life much more fun!)

groue commented 4 years ago

Thanks @simensol ! I hope that a few difficulties like this one are compensated by everything that works well πŸ˜…

The fact that requests can be arbitrarily complex and involve any number of associations and aggregates does not make it easy, on the contrary. It's impossible to write GRDB tests for all combinations. It's not always easy to interpret unexpected results. And the documentation is not able yet to help readers build a consistent mental model.

I'm quite aware of that.

My suggestions:

  1. Write tests for your requests. Feed them with various database contents, and check the output. The goal is not to test GRDB, it is to test that your database access methods behave as expected by the rest of your app.

  2. When you think the output of GRDB is wrong, open an issue (like you did, bravo). I usually can help, but... I'm never happy to answer "do it in another way": do we fix an application bug, or do we workaround a GRDB limitation? Hence the third, more involved bullet:

  3. When you think the output of GRDB is wrong (level 2), examine the generated SQL, and check if it is able to fulfill the intended and documented behavior according to your interpretation. If it does not, open an issue with a reproducible case, and suggest a fixed SQL query. We'll then start a very interesting conversation.

simensol commented 4 years ago

I have experienced much more trouble in my years using vanilla sqlite, so when I say that GRDB makes my sqlite life fun, its no understatement πŸ˜‰ When started with GRDB I often wrote my own queries. However, after learning more, most of these queries are now replaced by GRDB.

As for the mental model, I think the documentation does quite well. At least compared to other libraries I have worked with. Especially the figures in The Types of Associations section have been invaluable to me ❀️

I will keep your 3 suggestions in mind. Thanks, again!

groue commented 4 years ago

@simensol, I reopen this issue. The fact that the count aggregate is able to produce correct results, when sum is not, reveals a problem.

The GRDB 4 engine was not able to deal with subqueries, so aggregate annotations were implemented with joins and grouping. There lies the problem, I think.

Now that the GRDB 5 engine is able to deal with subqueries, we may be able to implement aggregate annotations on top of them.

simensol commented 4 years ago

That seems reasonable, @groue. If you make any changes to the beta, I can test it on my database. Here is a simplified version of the sql query produced by GRDB for my initial solution:

SELECT
   "task1"."id",
   "task1"."projectId",
   "task1"."title",
   "task1"."status",
   COUNT(DISTINCT "task3"."id") AS "totalSubtaskCount",
   SUM("task3"."status" = 'Active') AS "remainingSubtaskCount",
   COUNT(DISTINCT "tag"."id") AS "tagCount",
   "task2".* 
FROM
   "task" "task1" 
   LEFT JOIN
      "task" "task2" 
      ON "task2"."id" = "task1"."projectId" 
   LEFT JOIN
      "task" "task3" 
      ON "task3"."projectId" = "task1"."id" 
   LEFT JOIN
      "taskTag" 
      ON "taskTag"."taskId" = "task1"."id" 
   LEFT JOIN
      "tag" 
      ON "tag"."id" = "taskTag"."tagId" 
GROUP BY
   "task1"."id"

The above query produces remainingSubtaskCount which are multiplied by the number of associated tags (tagCount).

groue commented 4 years ago

Thanks @simensol. I admit that without the schema, it's not quite obvious to interpret the SQL. If I may ask for your help: I also wonder how you would craft the SQL manually. You may know SQL tricks I don't know.

groue commented 4 years ago

If you make any changes to the beta

Yes, I think this is worth solving. It will be more a bugfix than a breaking change. There is no ETA for the release of GRDB 5, so we can take the time we need. I'll have a look also, of course.

simensol commented 4 years ago

I have tried to craft a query which produce the expected result without any success. However, I'm no sqlite expert.

Further, my attempt to use sum to find the number of associated subtasks should be considered a workaround. I think count is more appropriate is this case, as you suggested. Maybe my results are actually the expected behaviour of sum when using it on associated tables?

If you still think this is a bug, you can download my database with synthetic data here to see the schema and test the query. If you run the query above on the database, you will see some remainingSubtaskCount which are >15. However, the maximum number of subtasks are no more than 10.

groue commented 4 years ago

Hello @simensol. Thanks for investigating a little more.

You are right, I think count better expresses the need of your application. It's easier to read than the sum of 0/1.

But this sum of 0/1 is equivalent to a filtered count. There is no good reason for this sum to be wrongly computed.

The way count is implemented (using the DISTINCT sql modifier) is a sign that something is wrong in the way aggregates are computed. I know that some tests will break if I remove that DISTINCT. But there is no such equivalent of DISTINCT for sum and other mathematical aggregates. Conclusion: we have a plain bug that needs to be fixed. #778 is an early stage pull request that fixes it.

simensol commented 4 years ago

Aha, you're right. hasMany associations should not have any effect on the aggregated sum.

groue commented 4 years ago

hasMany associations should not have any effect on the aggregated sum

Indeed, including(all:) are performed in distinct queries, completely isolated.

I've been looking at Django ORM, from which I've "stolen" the idea of the annotate function: https://docs.djangoproject.com/en/3.0/topics/db/aggregation/

I found evidence that they implement association aggregates with JOINs and GROUP BY, as GRDB currently does: https://markusholtermann.eu/2019/03/less-obvious-things-to-do-with-djangos-orm/

I'm happy because I was never quite confident about this implementation πŸ˜…

Now, Django can also count filtered associations: https://docs.djangoproject.com/en/2.2/topics/db/aggregation/#filtering-on-annotations. I don't quite know what SQL they generate in this case. This page may give a hint they use some specific SQL danse, or even database feature (that SQLite lacks). And their documentation only talks about count, not about sum. A few searches on Google and Stack Overflow did not provide any conclusive answer. And I'm too lazy to install virtualenv and pip now so that I can run some actual Django+SQLite tests πŸ˜… (I'm sorry I've lost my ease with Python).

Meanwhile, I found the GRDB test that fails if I replace COUNT(DISTINCT) with COUNT, and thus should reveal a problem with SUM. The same problem, I expect, than this very issue:

// A request that does not provide the correct results if we don't use COUNT(DISTINCT):
//
// SELECT
//     "team".*,
//     COUNT(DISTINCT "player1"."id") AS "lowPlayerCount",
//     COUNT(DISTINCT "player2"."id") AS "highPlayerCount"
// FROM "team"
// LEFT JOIN "player" "player1" ON ("player1"."teamId" = "team"."id") AND ("player1"."score" < 500)
// LEFT JOIN "player" "player2" ON ("player2"."teamId" = "team"."id") AND ("player2"."score" >= 500)
// GROUP BY "team"."id"
let request = Team
    .annotated(with: Team.players.filter(Column("score") < 500).forKey("lowPlayers").count)
    .annotated(with: Team.players.filter(Column("score") >= 500).forKey("highPlayers").count)

The two joins are messing with each other.

In #778, I have tried to implement association aggregates with subqueries. It kind of works, but... I fear that this is suboptimal SQL, and the Django precedent tells me that joins are the way to go.

Maybe GRDB should recognize queries where aggregates will be wrong, and simply fatalError with the "not implemented" message. This would already be an improvement. There are already plenty of such safe guards in GRDB.

And we can deal with count, min and max. Only sum and avg are subject to wrong results.

I'll sleep on this.

groue commented 4 years ago

@simensol, I have difficulties merging all the information you have provided, and build up a reproducible case. For example, https://github.com/groue/GRDB.swift/issues/777#issuecomment-629152830 does not match the initial post (what is this tagCount)?

Could you please give a hand, and provide a minimum reproducible case? I don't care about the data, but only about the schema and the request. Please remove anything that is not strictly needed to get a wrong sum, if you want to be extra helpful :-)

Thank you!

simensol commented 4 years ago

Sorry for the confusion, @groue! Here is the minimum schema:

Task.swift:

struct Task: Identifiable {
    var id = UUID()
    var projectId: UUID?
    var title: String
    var status: TaskStatus
}

extension Task: TableRecord {
    // Projects and tasks.
    static let subtasks = hasMany(Task.self, key: "subtasks")
    static let project = belongsTo(Task.self, key: "project")
}

extension Task: Codable, FetchableRecord, MutablePersistableRecord {
    // Define database columns from CodingKeys.
    enum Columns {
        static let id = Column(CodingKeys.id)
        static let projectId = Column(CodingKeys.projectId)
        static let title = Column(CodingKeys.title)
        static let status = Column(CodingKeys.status)
    }
}

Tag.swift:

struct Tag: Identifiable {
    var id = UUID()
    var name: String
}

extension Tag: TableRecord {
    static let taskTags = hasMany(TaskTag.self)
    static let tasks = hasMany(Task.self, through: taskTags, using: TaskTag.task)
}

extension Tag: Codable, FetchableRecord, MutablePersistableRecord {
    enum Columns {
        static let id = Column(CodingKeys.id)
        static let name = Column(CodingKeys.name)
    }
}

TaskTag.swift:

struct TaskTag: Codable, FetchableRecord, MutablePersistableRecord {
    var taskId: UUID
    var tagId: UUID
}

extension TaskTag: TableRecord {
    static let task = belongsTo(Task.self)
    static let tag = belongsTo(Tag.self)
}

extension TaskTag {
    enum Columns {
        static let taskId = Column(CodingKeys.taskId)
        static let tagId = Column(CodingKeys.tagId)
    }
}

And here is a a minimum request:

let request = Task
    .all()
    .including(optional: Task.project)
    .including(all: Task.subtasks)
    .including(all: Task.tags)
    .annotated(with: Task.subtasks.sum(Task.Columns.status == TaskStatus.Active).forKey("remainingSubtaskCount"))

My original code is very complex, so this is cut-and-paste (that is, not complied). Let me know if you need anything more!

groue commented 4 years ago

Let me know if you need anything more!

Actually, I do. I just can't reproduce wrong sum values. I need a full reproducible case.

I would really appreciate if the case were minimal, which means that all code that is not strictly necessary to exhibit a wrong sum is removed.

groue commented 4 years ago

Sorry if I bother you, @simensol, but I would greatly appreciate if you would spend a few minutes setting up a reproducible case. This issue is quite concerning, and I would not like if other GRDB users would be affected. Of course, there is no obligation. But is it possible that you give a hand?

simensol commented 4 years ago

Sorry for the delayed answer, @groue. I really would provided you with a reproducible case. However, unfortunately I don't have any commits from back then. Hence, I think my previous post is the best I can do. I'm very sorry for this.

groue commented 4 years ago

Thank you @simensol. Unfortunately this post does not contain enough information - for me at least. Maybe you can reproduce the failure setup from this previous post of yours, since you have a better recognition of the state of your app at this stage? As I said, this issue is concerning, but I am no wizard, and your help would be greatly appreciated.

groue commented 4 years ago

Closing due to lack of activity. Let's wait for the next bug report, and I won't forget to ask for a reproducible case ;-)

simensol commented 4 years ago

Sorry for being quiet for so long, @groue. Due to a new permanent job, I havn't had time to do any iOS development since May. Hopefully I will be able to continue on my project (where I use GRDB) later. I'll promise to give it a new try when I get back in business :)