Closed: mattecapu closed this issue 8 years ago.
Recently @chandu0101 created a nice and relatively big example of a relay application which uses MongoDB and sangria (a scala GraphQL implementation).
If you don't mind looking at some scala code, then I would suggest checking out this function:
It creates a relay Connection which is based on a MongoDB query. It also properly handles relay-style pagination, and the result set that comes from the DB always contains only relevant documents. This particular example uses mongo's skip and limit to achieve this. But I can also imagine other strategies - for instance using ID-based cursors, which would be even more efficient.
Even though this example uses sangria-relay, conceptually it is very similar to the reference implementation.
From all the solutions I found on the internet (including this one), it seems this does not work if data is inserted in the database during the lifetime of a cursor.
Am I wrong?
Is there a generic solution?
I thought about a solution to handle data inserted during pagination:
I have a compliant implementation for Mongoose I can share if anyone is interested.
My only concern is how long I should keep the queries in my redis server.
@mattecapu this package relay-mongodb-connection creates relay connections from mongodb cursors; it also uses skip and limit.
Thank you @sibelius and @OlegIlyenko, your links were a good starting place for understanding what was going on in the creation of a ConnectionType response.
My solution was to not use the createConnection* utilities but to create a ConnectionType response by myself.
Essentially I encode information about the ordering field (i.e. the ID) in the cursor and then use that to query results with ID >= or <= the cursor. The first and last parameters are handled with some clever LIMITing and order reversing.
My actual use case was a little more complicated because I'm fetching from multiple tables. Thus after retrieving the minimum required records I'm doing an additional round of sorting and cutting.
I'd be happy to provide some code examples if someone needs them.
@mattecapu I think u should provide some code examples to help people understand better how to implement a cursor
Well the point is, if you implement a ConnectionType for an endpoint you just need to:

- accept `connectionArgs` parameters
- return a suitable response in the shape `{ edges: [{ cursor, node }], pageInfo: { hasNextPage, hasPreviousPage, startCursor, endCursor } }`

Cursors can be any string you want. Normally Relay defaults to `arrayconnection:<id>` converted to base64 to make it opaque (you shouldn't actually rely on the implementation details of the cursors).
I used `<type_name>:<id>` converted to base64 as cursors.
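A minimal sketch of such cursors in Node (illustrative helper names, not the author's actual code):

```javascript
// Encode `<type_name>:<id>` as an opaque base64 cursor.
function toCursor(typeName, id) {
  return Buffer.from(`${typeName}:${id}`, 'utf8').toString('base64');
}

// Decode a cursor back into its type name and numeric id.
function fromCursor(cursor) {
  const [typeName, id] = Buffer.from(cursor, 'base64')
    .toString('utf8')
    .split(':');
  return { typeName, id: Number(id) };
}
```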
Then a request to my endpoint looks like this:
endpoint(after: "sdflkjsdlfkjslkdf", first: 10) {
# stuff
}
Basically, any request described by the specification is supported.
Thus when I get the request I process it in the following way:

- Start from the greedy query: `SELECT * FROM table`
- If the `after` argument is provided, add `id > parsed_cursor` to the `WHERE` clause
- If the `before` argument is provided, add `id < parsed_cursor` to the `WHERE` clause
- If the `first` argument is provided, add `ORDER BY id DESC LIMIT first+1` to the query
- If the `last` argument is provided, add `ORDER BY id ASC LIMIT last+1` to the query
- If the `last` argument is provided, I reverse the order of the results
- If the `first` argument is provided then I set `hasPreviousPage: false` (see spec for a description of this behavior). If `first+1` results are returned, I set `hasNextPage: true`, otherwise I set it to `false`.
- If the `last` argument is provided then I set `hasNextPage: false` (see spec for a description of this behavior). If `last+1` results are returned, I set `hasPreviousPage: true`, otherwise I set it to `false`.

Using this "algorithm", only the needed data is fetched. While `after` and `before` can both be set, I make sure the `first` and `last` args are treated as mutually exclusive to avoid making a mess. The spec itself discourages making requests with both arguments set.
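The steps above can be sketched as a tiny query builder plus a post-processing step. This is a hedged sketch, not the author's actual code: `buildPageQuery` and `toConnection` are made-up names, and real code should use bind parameters instead of string interpolation.

```javascript
// Build the SQL string following the steps above.
// Assumes integer `id` cursors already decoded from base64.
function buildPageQuery({ first, last, after, before } = {}) {
  if (first != null && last != null) {
    throw new Error('first and last are mutually exclusive');
  }
  const where = [];
  if (after != null) where.push(`id > ${Number(after)}`);
  if (before != null) where.push(`id < ${Number(before)}`);
  let sql = 'SELECT * FROM table';
  if (where.length) sql += ' WHERE ' + where.join(' AND ');
  if (first != null) sql += ` ORDER BY id DESC LIMIT ${first + 1}`;
  if (last != null) sql += ` ORDER BY id ASC LIMIT ${last + 1}`;
  return sql;
}

// Trim the extra peeked row, reverse for `last`, derive pageInfo.
function toConnection(rows, { first, last }) {
  const limit = first != null ? first : last;
  const hasExtra = rows.length > limit;
  const page = hasExtra ? rows.slice(0, limit) : rows.slice();
  if (last != null) page.reverse(); // restore consistent order
  return {
    edges: page.map((node) => ({ cursor: String(node.id), node })),
    pageInfo: {
      hasNextPage: first != null ? hasExtra : false,
      hasPreviousPage: last != null ? hasExtra : false,
      startCursor: page.length ? String(page[0].id) : null,
      endCursor: page.length ? String(page[page.length - 1].id) : null,
    },
  };
}
```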
Notably, I return an object with the shape described above (and in the linked spec) and I don't use the `connectionFromArray` helper, which expects a raw collection to slice according to the args.
EDIT: The above algorithm produces arrays with different orders depending on which of `first` or `last` is used to paginate. This is against the specification, so there should be one more step:
- Re-order results in a consistent manner (e.g. by creation date)
Thanks to @Sytten for bringing this problem to my attention.
EDIT2: The above ordering reflects pagination, not chronological order. Hence `first` is interpreted as 'first to be displayed', not 'first chronologically'. Therefore, you might want to swap `ASC` and `DESC` in steps (4) and (5). Thanks to @Sytten for bringing this problem to my attention.
That's a really great write-up @mattecapu. I'm going to close this out now, but that is exactly the kind of thing that would work well in the documentation (could be something in a code comment, in the README, or we could start an examples folder). Would be happy to look at a PR adding something like that.
@mattecapu It seems the algorithm does not work if there was a deletion in the database between two paginated queries, and/or if we want to sort by anything other than IDs, right?
@mattecapu what happens when the ids are not ordered? i.e. if they are UUIDs, id > parsedValue will be useless.
Sorry @GuillaumeLeclerc, I lost your question in a notification mix-up with GitHub. I'm going to address your question together with @oexza's new comment.
As laid out in the comment above, my algorithm has both the limitations you guys noticed, but we can easily generalize it to overcome them.
My idea is to use creation timestamps as "order-providers", instead of abusing the id field, which, by the way, isn't guaranteed to be sequential anyway (in my specific use case, it was).
To support dynamic data getting deleted or added in between queries, paginating with creationTime works like a charm, especially with Relay's caching mechanism.
Eventually, we can provide a further generalization by not fixing an order-provider field but letting it be dynamic, effectively allowing a lot of different orders on the data, which can come in handy. This is pretty simple to implement too, but gets complicated once you have to support dynamic data. I suspect there's no silver bullet for this use case, but I might be wrong.
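For instance, a cursor could carry its order-provider field explicitly. A sketch under that idea (the helper names are made up, and real code should use bind parameters rather than interpolating the value):

```javascript
// Encode both the sort field and the boundary value into the cursor,
// so the endpoint can paginate by createdAt (or any sortable column).
function encodeOrderedCursor(field, value) {
  return Buffer.from(JSON.stringify({ field, value })).toString('base64');
}

function decodeOrderedCursor(cursor) {
  return JSON.parse(Buffer.from(cursor, 'base64').toString('utf8'));
}

// The decoded cursor tells the resolver which WHERE clause to emit.
function afterClause(cursor) {
  const { field, value } = decodeOrderedCursor(cursor);
  return `${field} > '${value}'`; // illustrative only; bind in real code
}
```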
@wincent I'll see what I can do! Thanks for the appreciation.
I came across this issue and I wanted to share a Node.js library that I created a while ago: https://github.com/joonhocho/graphql-relay-connection It helps you easily create custom relay connections. You can create connections for MongoDB, simple number arrays, etc. by providing a few functions. I hope this can help.
@mattecapu Great post! Just wanted to say it helped out a lot. We had some tweaking to do since we accept arguments other than the Relay args spec.
Everything was working great with @mattecapu's solution, but I ran into one rare situation and am in quite a pickle... wondering if any of you have run into this issue.
We have one query for events that returns data in a specific order.
This specific order, which is passed as part of an SQL statement, includes things like how many times that event's page has been viewed, whether it has a promotional image, etc. Basically about 5 `*** DESC`s in the ORDER BY.
We have an events query like this:
{
  viewer {
    events(first: 1, inCountry: Japan) {
      edges {
        node {
          id
          name
        }
      }
    }
  }
}
first and last can be dealt with as addressed above: flip all of the multiple ORDER BY ... DESC to ORDER BY ... ASC.
But when you introduce after & before, it is very difficult to serve the correct data that corresponds to the given Node ID, as due to the nature of our query, it is not as simple as saying event.id > 1 or event.createAt > xxxx when after & before are passed.
@Naoto-Ida if I understood correctly and your order is deterministic, a createAt attribute should do the trick.
How do you compute your cursors?
@mattecapu
Our ORDER statement in our SQL would include something like ORDER BY createAt. Sorry, maybe my example of using createAt was misleading.
What I was trying to get at was that your method works great in situations where ordering is based on a single factor, but not when there are multiple factors in play.
We compute the cursor by base64-decoding it and splitting it into object:ID. From there we get that one record and get the ones before/after it.
So in the end, due to there not being that many records, and time constraints, we cached all the event records in Redis. We fetch them when a query with the same arguments comes in, then, based on the supplied cursor, slice the total records and serve the ones before/after it.
Now I see the problem.
When I was writing my algorithm, my implicit assumption was that between two different queries, the only change that could happen to the database was addition or removal of records.
While this may be a good-enough assumption for most cases, it fails if, like in your case, there's a third way to mutate the state of your database: updating a record.
That messes up my algorithm because I cannot tell whether a given record was changed since the last time I queried the db.
Yet, this is not a problem specific to my algorithm. In general, there's no way to do this without storing information about updates.
Likewise, without storing a createdAt attribute there's no way to tell if a given record was there when I first ran the query (apart from using other time-related information stored in the record, such as a sequential ID).
So the only way to support such a constraint is to change the architecture of your db, adopting some kind of time-traveling structure.
Eventually, this is not as difficult as it sounds.
For the general case, you'll have to store every update you make to your db.
So your records will now look like this
object_id | updated_attribute | new_value | timestamp
Your data will be shattered into several of these atomic operations, which aggregated together will give you the most recent version of it. For example, if I want to retrieve attribute name of object 34, I'll ask my db:
SELECT new_value FROM mytable WHERE object_id='34' ORDER BY timestamp DESC LIMIT 1
But now if I want to know which state my db was in at a given time, I'll simply exclude all updates done after a specific timestamp:
SELECT new_value FROM mytable WHERE object_id='34' AND timestamp < GIVEN_TIMESTAMP
ORDER BY timestamp DESC LIMIT 1
Voilà, I can now run queries against any version of my DB. Obviously this is a simplification; you'll have a lot of issues to address (or better, operations to translate) to support this kind of architecture, but it's the only/best way to support such problem constraints.
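As an in-memory illustration of the update-log idea above (a sketch, not a production design; the names and sample data are made up):

```javascript
// Each row is an atomic update; the value of an attribute at a given
// time is the newest update strictly before that timestamp.
const updates = [
  { objectId: '34', attribute: 'name', newValue: 'Alice', timestamp: 100 },
  { objectId: '34', attribute: 'name', newValue: 'Alicia', timestamp: 200 },
];

function valueAt(log, objectId, attribute, asOf = Infinity) {
  const relevant = log
    .filter(
      (u) =>
        u.objectId === objectId &&
        u.attribute === attribute &&
        u.timestamp < asOf
    )
    .sort((a, b) => b.timestamp - a.timestamp); // newest first
  return relevant.length ? relevant[0].newValue : undefined;
}
```

Omitting `asOf` gives the current value; passing a timestamp replays the database as it was at that moment.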
Speaking for your specific case, @Naoto-Ida, I think you could get away with a far less disruptive change: create a table views(post_id, timestamp) and calculate amountOfViews at query-time by aggregating with the foreign key of your main table. That way you can filter by timestamp and retrieve the state (and thus the order) you had at that time.
- If the first argument is provided, add ORDER BY id DESC LIMIT first+1 to the query
@mattecapu why add +1 to the limit number? I think it should just equal the first argument.
@luckydrq You add +1 to get the boolean for pageInfo.hasNextPage
@luckydrq yeah it basically peeks at the next page to see if there is one.
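In code, the peek looks like this (a minimal illustration with a hypothetical helper name):

```javascript
// Fetch `first + 1` rows from the database, then trim the overflow
// row; its mere presence answers hasNextPage without a second query.
function peekPage(rows, first) {
  return {
    page: rows.slice(0, first),
    hasNextPage: rows.length > first,
  };
}
```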
Re-reading my last post, it occurs to me there's a less invasive way to support updates: just use an updatedAt attribute. Use it the same way you use the createdAt attribute, and update it on updates.
I implemented some helper functions (namely paginate) so that SQL support would be easy (ordering and filtering supported). I followed @mattecapu's suggested approach.
usage:
import {
GraphQLObjectType,
} from 'graphql'
import {
  connectionArgs,
  connectionDefinitions,
} from 'graphql-relay'
import * as helpers from 'the-gist-linked-above' // not an actual npm lib yet :P
// helper
const connectionType = nodeType => connectionDefinitions({nodeType}).connectionType
const Query = new GraphQLObjectType({
name: 'Query',
fields: () => ({
// ... other queries here
things: {
type: connectionType(Thing),
args: connectionArgs,
resolve: (_, paginationArgs) => {
// you could get `orderBy` from args, but just hard-coded here for simplicity
return helpers.paginate(models.Thing, paginationArgs, {orderBy: [['name', 'ASC']]})
}
},
})
})
It's currently coupled with Sequelize, but could pretty easily be decoupled.
https://gist.github.com/pcattori/2bb645d587e45c9fdbcabf5cef7a7106
We use this in production:
https://github.com/entria/graphql-mongoose-loader
It solves pagination and dataloader for mongo, using mongoose.
We have the same concept for other datasources, such as REST APIs and SQL (Oracle and Postgres); it's very easy to extend to any datasource.
@mattecapu I read your guide how to handle GraphQL connections with SQL. It's pretty much what I came up with when I was analyzing it myself (+ some details like how to handle hasNextPage).
The problem is that the SQL query gets really complicated really fast when I add an optional sorting argument to the GraphQL connection - especially if the sorting can be a combination of fields.
Do you have any tips how to handle that and how to do it efficiently?
Hi @enumag, what exactly do you mean by 'SQL complexity'? The length in chars? Execution time? Other metrics? I will assume you mean 'the query it writes is an ugly monster made up of a lot of chunks'. Unfortunately, my algorithm (feels weird to call it that...) is very straightforward, and I really can't think of any way to make it shorter or to improve its 'efficiency'. It really just translates between GraphQL and SQL, so I don't see room for improvement without pruning features from your endpoint. That said, using SQL builders means that you never deal with the raw, horrific string, and using an ORM means an even nicer interface (I use Sequelize). I would suggest this as a good practice, because it abstracts away much of the trouble.
Hi @enumag, what exactly do you mean by 'SQL complexity'? The length in chars? Execution time? Other metrics?
Primarily execution time and efficient usage of indexes on that table.
Secondary is the SQL query length and number of conditions in WHERE clause but I'm already using an SQL builder so I can deal with that myself.
Thank you @mattecapu. I had the same "algorithm" in my head, but I forgot to reverse the result when using last/before, which makes sense from the UI point of view.
If someone is interested, here is my current implementation with Node, Apollo, and Knex on top of MySQL. Constructive feedback is more than welcome!
I took some shortcuts that I could improve in the future:
const userSchema = gql`
type User implements Node {
id: ID!
email: String!
...
}
extend type Query {
...
usersPaginated(input: UserPaginatedInput!): UserPaginatedConnection
}
input UserPaginatedInput {
first: Int
after: ID
last: Int
before: ID
}
type UserPaginatedConnection {
pageInfo: PageInfo
edges: [UserPaginatedEdge]
}
type PageInfo {
hasNextPage: Boolean
endCursor: ID
hasPreviousPage: Boolean
startCursor: ID
}
type UserPaginatedEdge {
cursor: ID
node: User
}
...
`
const userResolvers = {
...
usersPaginated: async (parent, args, { db }, info) => {
try {
const { first, after, last, before } = args.input
let [hasNextPage, endCursor, hasPreviousPage, startCursor] = [false, null, false, null]
let res = []
if (!!first && !!last)
throw new ApolloError('Ambiguous query: first and last should not be used together')
if (!first && !last)
throw new ApolloError('Missing first or last argument')
const query = db.from('gql_user')
if (!!first) {
if (!!after)
query.where('id', '>', after)
query.limit(first + 1)
.orderBy('id', 'asc')
res = await query
if (res.length > first) {
hasNextPage = true
res = res.slice(0, first)
}
}
else if (!!last) {
if (!!before)
query.where('id', '<', before)
query.limit(last + 1)
.orderBy('id', 'desc')
res = await query
if (res.length > last) {
hasPreviousPage = true
res = res.slice(0, last)
}
res.reverse()
}
// guard against empty result sets
startCursor = res.length ? res[0].id : null
endCursor = res.length ? res[res.length - 1].id : null
const pageInfo = {
hasNextPage,
endCursor,
hasPreviousPage,
startCursor
}
const edges = res.map(row => ({
  cursor: row.id,
  node: row
}))
return {
pageInfo,
edges
}
} catch (err) {
throw new ApolloError(err.sqlMessage, err.code, err)
}
}
...
}
Sorry for spamming, but I'd like to thank @mattecapu for his explanation of the algorithm in one of his prior posts. I was misled by these cursors based on static arrays and noticed that there must be a different way to paginate dynamic data, such as data from a database where records get deleted between pages, causing anomalies when accessing different pages during pagination. The detailed description really helped me come up with a correct solution in my project.
@mattecapu unless I am mistaken, steps 4 and 5 are using the wrong ORDER BY (when using first, it should be ASC, and vice-versa). Can you fix it so people who find this thread are not more confused than necessary 😅
Hi @Sytten, re-reading the specification now, it seems that ordering is not dependent on the field, but should be decided by the endpoint.
The ORDER BY clause in my algorithm is there to select the right chunk of results, but a side-effect is that the order gets reversed when using last as opposed to first! So I'll add a mention of this in the original comment.
Thanks!
@mattecapu Good that you found that too, but that is not really my point :) Check the coded algo from https://github.com/graphql/graphql-relay-js/issues/94#issuecomment-613137054; you will see that the ORDER BY is the inverse of your algo (first -> ASC, last -> DESC), and for a good reason. If you take N+1 DESC it won't give you the right dataset, assuming an auto-increment Int, for first, and vice versa. Unless you consider first as being the highest IDs, then it would make sense, but I would specify it.
@Sytten I see. It's quite misleading, indeed. I was paginating in reverse chronological order (as most feeds do nowadays) so 'first' to me meant 'first to be displayed'. This explains my choice of ordering, I will edit to clarify.
Hi @mattecapu sorry for bringing up an old thread, but I need some help.
I am trying to implement relay pagination with postgres and I am facing an issue where I want to sort on multiple columns. Let us say I have a products table with an ID column, which is globally unique and auto-increments every time something is added. Sorting based on this ID is very easy using the algorithm mentioned above. But what if I want to sort on multiple columns? For example, I want to sort on the Name column (to get products in alphabetical order) and then start executing pagination queries. The ID cursor no longer makes sense, because we are sorting on a different field now. So do I make a name cursor and use that? What if I want to sort based on three or more columns? How would I create a cursor then?
I have created a Stack Overflow question as well: https://stackoverflow.com/questions/72011183/how-to-implement-graphql-relay-style-cursor-based-pagination-in-postgres-with-s . I don't need answers specific to Python, but a recommended general approach is appreciated.
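One common technique for the multi-column case (not spelled out in this thread, so treat this as a hedged sketch) is keyset pagination with a compound cursor: encode all sort-key values in the cursor, and compare them as a tuple. Postgres supports row-value comparison, e.g. `(name, id) > ('Apple', 7)`, which matches an all-ASC `ORDER BY name, id`. The helper names below are illustrative:

```javascript
// Encode every sort-key value of the boundary row into one cursor.
function encodeCompoundCursor(values) {
  return Buffer.from(JSON.stringify(values)).toString('base64');
}

function decodeCompoundCursor(cursor) {
  return JSON.parse(Buffer.from(cursor, 'base64').toString('utf8'));
}

// Emit a row-value comparison for the WHERE clause; the actual
// values should be supplied as bind parameters alongside the SQL.
function compoundAfterClause(columns) {
  const cols = columns.join(', ');
  const params = columns.map((c) => `:${c}`).join(', ');
  return `(${cols}) > (${params})`;
}
```

Including a unique column (like id) as the final tiebreaker keeps the ordering total, so the cursor always identifies a single position.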
This is what I do, using TypeORM and type-graphql.
@Query(returns => UserConnection)
async getUsers(@Args() conn: UserConnectionArgs): Promise<UserConnection> {
const result = new UserConnection()
let userQuery = this.userRepository.createQueryBuilder()
if (conn.search) {
  userQuery = userQuery.where('username LIKE :search', { search: `%${conn.search}%` })
}
if (conn.after) {
  // andWhere so an earlier search condition is not overwritten
  userQuery = userQuery.andWhere('id > :id', { id: conn.decodedAfter })
}
if (conn.before) {
  userQuery = userQuery.andWhere('id < :id', { id: conn.decodedBefore })
}
if (!conn.first && !conn.last) conn.first = 10
if (conn.first || !conn.last) {
userQuery = userQuery.orderBy('id', 'ASC')
conn.first ||= 10
const users = await userQuery.take(conn.first + 1).getMany()
result.edges = users.slice(0, conn.first).map(u => ({ node: u, cursor: btoa(u.id.toString()) }))
result.pageInfo.hasNextPage = users.length > conn.first
result.pageInfo.hasPreviousPage = false
result.pageInfo.startCursor = result.edges[0]?.cursor || null
result.pageInfo.endCursor = result.edges[result.edges.length - 1]?.cursor || null
}
if (conn.last) {
userQuery = userQuery.orderBy('id', 'DESC')
let users = await userQuery.take(conn.last + 1).getMany()
users = users.reverse()
result.edges = users.slice(0, conn.last).map(u => ({ node: u, cursor: btoa(u.id.toString()) }))
result.pageInfo.hasNextPage = false
result.pageInfo.hasPreviousPage = users.length > conn.last
result.pageInfo.startCursor = result.edges[0]?.cursor || null
result.pageInfo.endCursor = result.edges[result.edges.length - 1]?.cursor || null
}
return result
}
In all the examples I can find, paginated queries are made against a mock database which is just a JS array, and thus it is simply passed through connectionFromArray to return the correct paginated result (like the Star Wars example mentioned in the README). For a real-life database, querying all records and then passing them to connectionFromPromisedArray doesn't seem to be a good solution, as it will easily break your performance/crash your server as soon as you're doing anything at (even modest) scale.
So what solutions should you use to avoid insane database fetching?
(I'm using a SQL database, but I think a good solution to this problem applies to pretty much every not-a-js-array dbms)