Closed simensol closed 4 years ago
Hello @simensol,
Can you please check the request below?
// Two distinct populations of subtasks,
// identified by their association keys,
// that can be counted independently:
let subtasks = Task.subtasks // key "subtasks"
let remainingSubtasks = Task.subtasks
.filter(Task.Columns.status == TaskStatus.Active)
.forKey("remainingSubtasks")
let request = Task
.all()
.including(optional: Task.project)
.including(all: Task.subtasks)
.including(all: Task.tags)
.annotated(with: subtasks.count.forKey("totalSubtaskCount"))
.annotated(with: remainingSubtasks.count.forKey("remainingSubtaskCount"))
It should give the expected results. If it does, then I'll describe why it works when the first request does not.
(BTW, thank you for trying the beta!)
then I'll describe why it works when the first request does not.
Sorry for this hand waving. The topic is already addressed here: https://github.com/groue/GRDB.swift/blob/master/Documentation/AssociationsBasics.md#isolation-of-multiple-aggregates
I suggest you check this link, but I'm not very confident it is very clear. Here is an analysis focused on your app.
In the original request:
The count
aggregate avoids (2) by performing a COUNT(DISTINCT id)
.
But in order to use two count
aggregates that are computed independently, we need two distinct subtasks associations, identified by their association keys ("subtasks" and "remainingSubtasks"):
// Two distinct populations of subtasks,
// identified by their association keys,
// that can be counted independently:
let subtasks = Task.subtasks // key "subtasks"
let remainingSubtasks = Task.subtasks
.filter(Task.Columns.status == TaskStatus.Active)
.forKey("remainingSubtasks")
Your suggested request works! The link you provided also makes more sense now. However, I don't think it is all clear (al least to me) how to apply "Isolation of Multiple Aggregates" when using self joins from the documentation. But this is maybe a very special use case. Anyway, thanks for providing a solution and explanation!
(And thanks for continuing the development of this fantastic library! It has made my sqlite life much more fun!)
Thanks @simensol ! I hope that a few difficulties like this one are compensated by everything that works well π
The fact that requests can be arbitrarily complex and involve any number of associations and aggregates does not make it easy, on the contrary. It's impossible to write GRDB tests for all combinations. It's not always easy to interpret unexpected results. And the documentation is not able yet to help readers build a consistent mental model.
I'm quite aware of that.
My suggestions:
Write tests for your requests. Feed them with various database contents, and check the output. The goal is not to test GRDB, it is to test that your database access methods behave as expected by the rest of your app.
When you think the output of GRDB is wrong, open an issue (like you did, bravo). I usually can help, but... I'm never happy to answer "do it in another way": do we fix an application bug, or do we workaround a GRDB limitation? Hence the third, more involved bullet:
When you think the output of GRDB is wrong (level 2), examine the generated SQL, and check if it is able to fulfill the intended and documented behavior according to your interpretation. If it does not, open an issue with a reproducible case, and suggest a fixed SQL query. We'll then start a very interesting conversation.
I have experienced much more trouble in my years using vanilla sqlite, so when I say that GRDB makes my sqlite life fun, its no understatement π When started with GRDB I often wrote my own queries. However, after learning more, most of these queries are now replaced by GRDB.
As for the mental model, I think the documentation does quite well. At least compared to other libraries I have worked with. Especially the figures in The Types of Associations section have been invaluable to me β€οΈ
I will keep your 3 suggestions in mind. Thanks, again!
@simensol, I reopen this issue. The fact that the count
aggregate is able to produce correct results, when sum
is not, reveals a problem.
The GRDB 4 engine was not able to deal with subqueries, so aggregate annotations were implemented with joins and grouping. There lies the problem, I think.
Now that the GRDB 5 engine is able to deal with subqueries, we may be able to implement aggregate annotations on top of them.
That seems reasonable, @groue. If you make any changes to the beta, I can test it on my database. Here is a simplified version of the sql query produced by GRDB for my initial solution:
SELECT
"task1"."id",
"task1"."projectId",
"task1"."title",
"task1"."status",
COUNT(DISTINCT "task3"."id") AS "totalSubtaskCount",
SUM("task3"."status" = 'Active') AS "remainingSubtaskCount",
COUNT(DISTINCT "tag"."id") AS "tagCount",
"task2".*
FROM
"task" "task1"
LEFT JOIN
"task" "task2"
ON "task2"."id" = "task1"."projectId"
LEFT JOIN
"task" "task3"
ON "task3"."projectId" = "task1"."id"
LEFT JOIN
"taskTag"
ON "taskTag"."taskId" = "task1"."id"
LEFT JOIN
"tag"
ON "tag"."id" = "taskTag"."tagId"
GROUP BY
"task1"."id"
The above query produces remainingSubtaskCount
which are multiplied by the number of associated tags (tagCount
).
Thanks @simensol. I admit that without the schema, it's not quite obvious to interpret the SQL. If I may ask for your help: I also wonder how you would craft the SQL manually. You may know SQL tricks I don't know.
If you make any changes to the beta
Yes, I think this is worth solving. It will be more a bugfix than a breaking change. There is no ETA for the release of GRDB 5, so we can take the time we need. I'll have a look also, of course.
I have tried to craft a query which produce the expected result without any success. However, I'm no sqlite expert.
Further, my attempt to use sum
to find the number of associated subtasks should be considered a workaround. I think count
is more appropriate is this case, as you suggested. Maybe my results are actually the expected behaviour of sum
when using it on associated tables?
If you still think this is a bug, you can download my database with synthetic data here to see the schema and test the query. If you run the query above on the database, you will see some remainingSubtaskCount
which are >15
. However, the maximum number of subtasks are no more than 10
.
Hello @simensol. Thanks for investigating a little more.
You are right, I think count
better expresses the need of your application. It's easier to read than the sum of 0/1.
But this sum of 0/1 is equivalent to a filtered count. There is no good reason for this sum to be wrongly computed.
The way count
is implemented (using the DISTINCT sql modifier) is a sign that something is wrong in the way aggregates are computed. I know that some tests will break if I remove that DISTINCT
. But there is no such equivalent of DISTINCT
for sum
and other mathematical aggregates. Conclusion: we have a plain bug that needs to be fixed. #778 is an early stage pull request that fixes it.
Aha, you're right. hasMany
associations should not have any effect on the aggregated sum.
hasMany
associations should not have any effect on the aggregated sum
Indeed, including(all:)
are performed in distinct queries, completely isolated.
I've been looking at Django ORM, from which I've "stolen" the idea of the annotate
function: https://docs.djangoproject.com/en/3.0/topics/db/aggregation/
I found evidence that they implement association aggregates with JOINs and GROUP BY, as GRDB currently does: https://markusholtermann.eu/2019/03/less-obvious-things-to-do-with-djangos-orm/
I'm happy because I was never quite confident about this implementation π
Now, Django can also count filtered associations: https://docs.djangoproject.com/en/2.2/topics/db/aggregation/#filtering-on-annotations. I don't quite know what SQL they generate in this case. This page may give a hint they use some specific SQL danse, or even database feature (that SQLite lacks). And their documentation only talks about count
, not about sum
. A few searches on Google and Stack Overflow did not provide any conclusive answer. And I'm too lazy to install virtualenv
and pip
now so that I can run some actual Django+SQLite tests π
(I'm sorry I've lost my ease with Python).
Meanwhile, I found the GRDB test that fails if I replace COUNT(DISTINCT)
with COUNT
, and thus should reveal a problem with SUM
. The same problem, I expect, than this very issue:
// A request that does not provide the correct results if we don't use COUNT(DISTINCT):
//
// SELECT
// "team".*,
// COUNT(DISTINCT "player1"."id") AS "lowPlayerCount",
// COUNT(DISTINCT "player2"."id") AS "highPlayerCount"
// FROM "team"
// LEFT JOIN "player" "player1" ON ("player1"."teamId" = "team"."id") AND ("player1"."score" < 500)
// LEFT JOIN "player" "player2" ON ("player2"."teamId" = "team"."id") AND ("player2"."score" >= 500)
// GROUP BY "team"."id"
let request = Team
.annotated(with: Team.players.filter(Column("score") < 500).forKey("lowPlayers").count)
.annotated(with: Team.players.filter(Column("score") >= 500).forKey("highPlayers").count)
The two joins are messing with each other.
In #778, I have tried to implement association aggregates with subqueries. It kind of works, but... I fear that this is suboptimal SQL, and the Django precedent tells me that joins are the way to go.
Maybe GRDB should recognize queries where aggregates will be wrong, and simply fatalError
with the "not implemented" message. This would already be an improvement. There are already plenty of such safe guards in GRDB.
And we can deal with count
, min
and max
. Only sum
and avg
are subject to wrong results.
I'll sleep on this.
@simensol, I have difficulties merging all the information you have provided, and build up a reproducible case. For example, https://github.com/groue/GRDB.swift/issues/777#issuecomment-629152830 does not match the initial post (what is this tagCount)?
Could you please give a hand, and provide a minimum reproducible case? I don't care about the data, but only about the schema and the request. Please remove anything that is not strictly needed to get a wrong sum, if you want to be extra helpful :-)
Thank you!
Sorry for the confusion, @groue! Here is the minimum schema:
Task.swift:
struct Task: Identifiable {
var id = UUID()
var projectId: UUID?
var title: String
var status: TaskStatus
}
extension Task: TableRecord {
// Projects and tasks.
static let subtasks = hasMany(Task.self, key: "subtasks")
static let project = belongsTo(Task.self, key: "project")
}
extension Task: Codable, FetchableRecord, MutablePersistableRecord {
// Define database columns from CodingKeys.
enum Columns {
static let id = Column(CodingKeys.id)
static let projectId = Column(CodingKeys.projectId)
static let title = Column(CodingKeys.title)
static let status = Column(CodingKeys.status)
}
}
Tag.swift:
struct Tag: Identifiable {
var id = UUID()
var name: String
}
extension Tag: TableRecord {
static let taskTags = hasMany(TaskTag.self)
static let tasks = hasMany(Task.self, through: taskTags, using: TaskTag.task)
}
extension Tag: Codable, FetchableRecord, MutablePersistableRecord {
enum Columns {
static let id = Column(CodingKeys.id)
static let name = Column(CodingKeys.name)
}
}
TaskTag.swift:
struct TaskTag: Codable, FetchableRecord, MutablePersistableRecord {
var taskId: UUID
var tagId: UUID
}
extension TaskTag: TableRecord {
static let task = belongsTo(Task.self)
static let tag = belongsTo(Tag.self)
}
extension TaskTag {
enum Columns {
static let taskId = Column(CodingKeys.taskId)
static let tagId = Column(CodingKeys.tagId)
}
}
And here is a a minimum request:
let request = Task
.all()
.including(optional: Task.project)
.including(all: Task.subtasks)
.including(all: Task.tags)
.annotated(with: Task.subtasks.sum(Task.Columns.status == TaskStatus.Active).forKey("remainingSubtaskCount"))
My original code is very complex, so this is cut-and-paste (that is, not complied). Let me know if you need anything more!
Let me know if you need anything more!
Actually, I do. I just can't reproduce wrong sum
values. I need a full reproducible case.
I would really appreciate if the case were minimal, which means that all code that is not strictly necessary to exhibit a wrong sum
is removed.
Sorry if I bother you, @simensol, but I would greatly appreciate if you would spend a few minutes setting up a reproducible case. This issue is quite concerning, and I would not like if other GRDB users would be affected. Of course, there is no obligation. But is it possible that you give a hand?
Sorry for the delayed answer, @groue. I really would provided you with a reproducible case. However, unfortunately I don't have any commits from back then. Hence, I think my previous post is the best I can do. I'm very sorry for this.
Thank you @simensol. Unfortunately this post does not contain enough information - for me at least. Maybe you can reproduce the failure setup from this previous post of yours, since you have a better recognition of the state of your app at this stage? As I said, this issue is concerning, but I am no wizard, and your help would be greatly appreciated.
Closing due to lack of activity. Let's wait for the next bug report, and I won't forget to ask for a reproducible case ;-)
Sorry for being quiet for so long, @groue. Due to a new permanent job, I havn't had time to do any iOS development since May. Hopefully I will be able to continue on my project (where I use GRDB) later. I'll promise to give it a new try when I get back in business :)
I have a
Task
record which has aHasManyThrough
association (many-to-many connection) with aTag
record. TheTask
record also has aHasMany
association (one-to-many connection) to otherTask
records using Self Joins. Thus, theTask
record has two types:project
andsubtask
:I want to select all projects and associated tags and subtasks. The following works as expected:
I also want to include the total count of subtasks. Thus, I have added
annotated()
below:totalSubtaskCount
is correct. However, I also want to include the total count of active subtasks. To achieve this, I have tried to usesum()
:If the number of associated tags is equal to 1,
totalSubtaskCount
is correct. However, if the number of associated tags is > 1, the sum is multiplied by the number of associated tags. For example, if a project has 3 associated subtasks and 2 associated tags,totalSubtaskCount
becomes3*2=6
when the correct number is 3.Environment
GRDB flavor(s): GRDB + GRDBCombine GRDB version: 5.0.0-beta.2 Installation method: SPM Xcode version: 11.5 beta 2 (11N605f) Swift version: 5.2.2 Platform(s) running GRDB: iOS **macOS version running Xcode: 10.15.4 (19E287)