graphql / graphql-relay-js

A library to help construct a graphql-js server supporting react-relay.

Pagination implementation for real-life database #94

Closed mattecapu closed 8 years ago

mattecapu commented 8 years ago

In all the examples I can find, paginated queries are made against a mock database that is just a JS array, which is simply passed through connectionFromArray to return the correct paginated result (like the Star Wars example mentioned in the README). For a real-life database, querying all records and then passing them to connectionFromPromisedArray doesn't seem to be a good solution, as it will easily break your performance or crash your server as soon as you're doing anything at even modest scale.

So what solutions should you use to avoid insane database fetching?

(I'm using a SQL database but I think a good solution to this problem applies to pretty much every not-a-js-array dbms)

OlegIlyenko commented 8 years ago

Recently @chandu0101 created a nice and relatively big example of a relay application which uses MongoDB and sangria (scala GraphQL implementation).

If you don't mind looking at some Scala code, then I would suggest checking out this function:

https://github.com/chandu0101/sri-sangria-example/blob/master/server/src/main/scala/sri/sangria/mongoserver/services/BaseService.scala#L52-L99

It creates a Relay Connection based on a MongoDB query. It also properly handles Relay-style pagination, and the result set that comes from the DB always contains only the relevant documents. This particular example uses Mongo's skip and limit to achieve this, but I can also imagine other strategies, for instance ID-based cursors, which would be even more efficient.

Even though this example uses sangria-relay, conceptually it is very similar to the reference implementation.
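
For a rough JavaScript equivalent of that skip/limit strategy, sketched against the official mongodb driver (the offset-based cursor format and helper name are only for illustration, not taken from the linked example):

// `collection` is a mongodb driver Collection; the cursor is the base64-encoded
// offset of the document within the (stable) result set
async function connectionFromMongoFind(collection, { first = 10, after }) {
  const offset = after
    ? parseInt(Buffer.from(after, 'base64').toString('utf8'), 10) + 1
    : 0;
  // fetch one extra document to learn whether a next page exists
  const docs = await collection.find({}).skip(offset).limit(first + 1).toArray();
  const edges = docs.slice(0, first).map((node, i) => ({
    node,
    cursor: Buffer.from(String(offset + i)).toString('base64'),
  }));
  return {
    edges,
    pageInfo: {
      hasNextPage: docs.length > first,
      hasPreviousPage: offset > 0,
      startCursor: edges.length ? edges[0].cursor : null,
      endCursor: edges.length ? edges[edges.length - 1].cursor : null,
    },
  };
}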

GuillaumeLeclerc commented 8 years ago

From all the solutions I found on the internet (including this one), it seems none of them work if data is inserted into the database during the lifetime of a cursor.

Am I wrong ?

Is there a generic solution ?

GuillaumeLeclerc commented 8 years ago

I thought about a solution to the data-insertion problem:

I have a compliant implementation for a Mongoose database that I can share if anyone is interested.

My only concern is how long I should keep the queries in my Redis server.

sibelius commented 8 years ago

@mattecapu this package, relay-mongodb-connection, creates Relay connections from MongoDB cursors; it also uses skip and limit.

mattecapu commented 8 years ago

Thank you @sibelius and @OlegIlyenko, your links were a good starting place for understanding what was going on in the creation of a ConnectionType response.

My solution was to not use the connectionFrom* utilities but to create the ConnectionType response myself.

Essentially I encode information about the ordering field (i.e. the ID) in the cursor and then use it to query results with an ID >= or <= the cursor. The first and last parameters are handled with some clever LIMITing and order reversing.

My actual use case was a little more complicated because I'm fetching from multiple tables. Thus after retrieving the minimum required records I'm doing an additional round of sorting and cutting.

I'd be happy to provide some code examples if someone needs them.

sibelius commented 8 years ago

@mattecapu I think u should provide some code examples to help people understand better how to implement a cursor

mattecapu commented 8 years ago

Well the point is, if you implement a ConnectionType for an endpoint you just need to

  • accept connectionArgs parameters
  • return a suitable response in the shape { edges: [{ cursor, node }], pageInfo: { hasNextPage, hasPreviousPage, startCursor, endCursor } }

Cursors can be any string you want. Normally Relay defaults to arrayconnection:<id> converted to base64 to make it opaque (you shouldn't actually rely on the implementation details of the cursors).

I used <type_name>:<id> converted to base64 as cursors.
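
In Node that encoding is just a couple of Buffer calls (a sketch; 'user' is only an example type name):

const toCursor = (typeName, id) =>
  Buffer.from(`${typeName}:${id}`).toString('base64');
const fromCursor = (cursor) =>
  Buffer.from(cursor, 'base64').toString('utf8').split(':')[1];

toCursor('user', 42);       // 'dXNlcjo0Mg=='
fromCursor('dXNlcjo0Mg=='); // '42'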

Then a request to my endpoint looks like this:

endpoint(after: "sdflkjsdlfkjslkdf", first: 10) {
   # stuff
}

Basically, any request described by the specification is supported.

Thus when I get the request I process it in the following way:

  1. Start from the greedy query: SELECT * FROM table
  2. If the after argument is provided, add id > parsed_cursor to the WHERE clause
  3. If the before argument is provided, add id < parsed_cursor to the WHERE clause
  4. If the first argument is provided, add ORDER BY id DESC LIMIT first+1 to the query
  5. If the last argument is provided, add ORDER BY id ASC LIMIT last+1 to the query
  6. If the last argument is provided, I reverse the order of the results
  7. If the first argument is provided then I set hasPreviousPage: false (see spec for a description of this behavior).
  8. If at least first+1 results are returned, I set hasNextPage: true; otherwise I set it to false.
  9. If the last argument is provided then I set hasNextPage: false (see spec for a description of this behavior).
  10. If at least last+1 results are returned, I set hasPreviousPage: true; otherwise I set it to false.

Using this "algorithm", only the needed data is fetched. While after and before can both be set, I make sure the first and last args are treated as mutually exclusive to avoid making a mess. The spec itself discourages making requests with both arguments set.

Notably, I return an object with the shape described above (and in the linked spec) and I don't use the connectionFromArray helper, which expects a raw collection to slice according to the args.

EDIT: The above algorithm produces arrays with different orders depending on whether first or last is used to paginate. This is against the specification, so there should be one more step:

  11. Re-order the results in a consistent manner (e.g. by creation date)

Thanks to @Sytten for bringing this problem to my attention.

EDIT2: The above ordering reflects pagination order, not chronological order. Hence first is interpreted as 'first to be displayed', not 'first chronologically'. Therefore, you might want to swap ASC and DESC in steps (4) and (5). Thanks to @Sytten for bringing this problem to my attention.
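
For concreteness, here is a minimal sketch of those steps (with the EDIT and EDIT2 amendments applied) as a single resolver. This is not my production code: it assumes a node-postgres style db.query(sql, params) client, a table items with a sequential integer id, and the base64 <type_name>:<id> cursors described above.

const decodeCursor = (cursor) =>
  parseInt(Buffer.from(cursor, 'base64').toString('utf8').split(':')[1], 10);
const encodeCursor = (id) => Buffer.from(`item:${id}`).toString('base64');

async function resolveItemsConnection(db, { first, last, after, before }) {
  // first and last are treated as mutually exclusive, as discussed above
  if (first != null && last != null) throw new Error('use either first or last');
  if (first == null && last == null) first = 10; // arbitrary default page size

  const where = [];
  const params = [];
  if (after)  { params.push(decodeCursor(after));  where.push(`id > $${params.length}`); } // step 2
  if (before) { params.push(decodeCursor(before)); where.push(`id < $${params.length}`); } // step 3

  // steps 4-5, with ASC/DESC swapped as per EDIT2; the +1 fetches one extra
  // row so we can peek at the neighbouring page (steps 8 and 10)
  const pageSize = first != null ? first : last;
  params.push(pageSize + 1);
  const sql =
    'SELECT * FROM items' +
    (where.length ? ` WHERE ${where.join(' AND ')}` : '') +
    ` ORDER BY id ${first != null ? 'ASC' : 'DESC'} LIMIT $${params.length}`;

  let { rows } = await db.query(sql, params);
  const peeked = rows.length > pageSize; // did the extra row come back?
  rows = rows.slice(0, pageSize);
  if (last != null) rows.reverse(); // step 6 + EDIT: consistent output order

  const edges = rows.map((row) => ({ node: row, cursor: encodeCursor(row.id) }));
  return {
    edges,
    pageInfo: {
      hasNextPage: first != null ? peeked : false,    // steps 8 and 9
      hasPreviousPage: last != null ? peeked : false, // steps 7 and 10
      startCursor: edges.length ? edges[0].cursor : null,
      endCursor: edges.length ? edges[edges.length - 1].cursor : null,
    },
  };
}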

wincent commented 8 years ago

That's a really great write-up @mattecapu. I'm going to close this out now but that is exactly the kind of thing that would work well in the documentation (could be something in a code comment, in the README, or we could start an examples folder). Would be happy to look at a PR adding something like that.

GuillaumeLeclerc commented 8 years ago

@mattecapu It seems the algorithm does not work if there was a deletion in the database between two paginated queries and/or if we want to sort by anything other than IDs, right?

oexza commented 8 years ago

@mattecapu what happens when the IDs are not ordered, i.e. they are UUIDs? id > parsed_cursor will be useless.

mattecapu commented 8 years ago

Sorry @GuillaumeLeclerc, I lost your question in a notification misunderstanding with GitHub. I'm going to address it together with @oexza's new comment.

As laid out in the comment above, my algorithm has both the limitations you noticed, but we can easily generalize it to overcome them. My idea is to use creation timestamps as "order-providers" instead of abusing the id field, which isn't guaranteed to be sequential anyway (in my specific use case, it was).

To support dynamic data being deleted or added in between queries, paginating with creationTime works like a charm, especially with Relay's caching mechanism.

Eventually, we can generalize further by not fixing an order-provider field but letting it be dynamic, effectively allowing many different orders on the data, which can come in handy. This is pretty simple to implement too, but it gets complicated once you have to support dynamic data. I suspect there's no silver bullet for this use case, but I might be wrong.
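
For instance, such a dynamic order-provider cursor could be sketched like this (illustrative only; the whitelist matters because the field name ends up in your SQL):

// the cursor carries its own order-provider field, so the same endpoint can
// paginate by createdAt, name, and so on
const ORDERABLE_FIELDS = ['id', 'createdAt']; // whitelist: never trust the cursor

const encodeOrderCursor = (field, value, id) =>
  Buffer.from(JSON.stringify({ field, value, id })).toString('base64');

const decodeOrderCursor = (cursor) => {
  const { field, value, id } = JSON.parse(Buffer.from(cursor, 'base64').toString('utf8'));
  if (!ORDERABLE_FIELDS.includes(field)) throw new Error('bad cursor');
  // id breaks ties between rows sharing the same order-provider value, i.e.
  // WHERE field > value OR (field = value AND id > id_from_cursor)
  return { field, value, id };
};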

@wincent I'll see what I can do! Thanks for the appreciation.

joonhocho commented 8 years ago

I came across this issue and wanted to share a Node.js library that I created a while ago: https://github.com/joonhocho/graphql-relay-connection. It helps you easily create custom Relay connections. You can create connections for MongoDB, simple number arrays, etc. by providing a few functions. I hope this can help.

Naoto-Ida commented 7 years ago

@mattecapu Great post! Just wanted to say it helped out a lot. We had some tweaking to do since we accept arguments beyond those in the Relay args spec.

Naoto-Ida commented 7 years ago

Everything was working great with @mattecapu's solution, but I ran into one rare situation and am in quite a pickle... wondering if any of you have run into this issue.

We have one query for events that returns data in a specific order. This order, which is passed as part of an SQL statement, includes things like how many times the event's page has been viewed, whether it has a promotional image, etc. Basically about five *** DESCs in the ORDER BY.

We have an events query like this:

{
  viewer {
    events(first: 1, inCountry: Japan) {
      edges {
        node {
          id
          name
        }
      }
    }
  }
}

first and last can be dealt with as addressed above; flip all of the multiple ORDER BY DESC to ORDER BY ASC.

But when you introduce after & before, it is very difficult to serve the correct data corresponding to the given node ID, because, due to the nature of our query, it is not as simple as saying event.id > 1 or event.createdAt > xxxx when after & before are passed.

mattecapu commented 7 years ago

@Naoto-Ida if I understood correctly and your order is deterministic, a createdAt attribute should do the trick. How do you compute your cursors?

Naoto-Ida commented 7 years ago

@mattecapu

Our ORDER BY statement in SQL would include something like:

  1. closest to user location
  2. amount of views
  3. whether it's free, etc.

Or something like that. It would differ drastically depending on the user and "when" you query it, so it does not work off createdAt. Sorry, maybe my example of using createdAt was misleading. What I was trying to get at was that your method works great in situations where ordering is based on a single factor, but not when there are multiple factors in play.

We compute it by base64-decoding it and splitting it into object:ID. From there we get that one record and fetch the ones before/after it.

So in the end, due to there not being that many records, and because of time constraints, we cached all the event records in Redis. We fetch them when a query with the same arguments comes in, then, based on the supplied cursor, slice the total records and serve the ones before/after it.

mattecapu commented 7 years ago

Now I see the problem. When I was writing my algorithm, my implicit assumption was that between two different queries, the only change that could happen to the database was the addition or removal of records. While this may be a good-enough assumption for most cases, it fails if, like in your case, there's a third way to mutate the state of your database: updating a record. That messes up my algorithm because I cannot tell whether a given record was changed since the last time I queried the db.

Yet this is not a problem specific to my algorithm. In general, there's no way to do this without storing information about updates. Likewise, without storing a createdAt attribute there's no way to tell if a given record was there when I first ran the query (apart from using other time-related information stored in the record, such as a sequential ID). So the only way to support such a constraint is to change the architecture of your db, adopting some kind of time-traveling structure.

Eventually, this is not as difficult as it sounds. For the general case, you'll have to store every update you make to your db. So your records will now look like this:

object_id | updated_attribute | new_value | timestamp

Your data will be shattered into several of these atomic operations, which aggregated together will give you the most recent version of it. So for example, if I want to retrieve attribute name of object 34, I'll ask my db:

SELECT new_value FROM mytable WHERE object_id='34' ORDER BY timestamp DESC LIMIT 1

But now if I want to know which state my db was in at a given time, I'll simply exclude all updates done after a specific timestamp:

SELECT new_value FROM mytable WHERE object_id='34' AND timestamp < GIVEN_TIMESTAMP ORDER BY timestamp DESC LIMIT 1

Voilà, I can now run queries against any version of my DB. Obviously this is a simplification and you'll have a lot of issues to address (or better, operations to translate) to support this kind of architecture, but it's the only/best way to support such problem constraints.

Speaking of your specific case, @Naoto-Ida, I think you could get away with a far less disruptive change: create a table views(post_id, timestamp) and calculate amountOfViews at query time by aggregating over the foreign key to your main table. That way you can filter by timestamp and retrieve the state (and thus the order) you had at that time.
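
A sketch of that aggregation (table and column names are illustrative; grouping by the primary key lets Postgres select the whole events row):

SELECT e.*, COUNT(v.post_id) AS amount_of_views
FROM events e
LEFT JOIN views v
  ON v.post_id = e.id
 AND v.timestamp < GIVEN_TIMESTAMP  -- only count views that existed at that time
GROUP BY e.id
ORDER BY amount_of_views DESC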

luckydrq commented 6 years ago

  4. If the first argument is provided, add ORDER BY id DESC LIMIT first+1 to the query

@mattecapu why +1 to the limit number? I think it should just equal the first argument.

Naoto-Ida commented 6 years ago

@luckydrq You add +1 to get the boolean for pageInfo.hasNextPage

mattecapu commented 6 years ago

@luckydrq yeah, it basically peeks at the next page to see if there is one.

Re-reading my last post, it occurs to me that there's a less invasive way to support updates: just use an updatedAt attribute. Use it the same way you use the createdAt attribute, and update it on updates.

pcattori commented 6 years ago

I implemented some helper functions (namely paginate) so that SQL support would be easy (ordering and filtering supported). I followed @mattecapu's suggested approach.

usage:

import {
    GraphQLObjectType,
} from 'graphql'
import {
    connectionArgs,
    connectionDefinitions, // used by the connectionType helper below
} from 'graphql-relay'

import * as helpers from 'the-gist-linked-above' // not an actual npm lib yet :P

// helper
const connectionType = nodeType => connectionDefinitions({nodeType}).connectionType

const Query = new GraphQLObjectType({
    name: 'Query',
    fields: () => ({
        // ... other queries here
        things: {
            type: connectionType(Thing),
            args: connectionArgs,
            resolve: (_, paginationArgs) => {
                // you could get `orderBy` from args, but just hard-coded here for simplicity
                return helpers.paginate(models.Thing, paginationArgs, {orderBy: [['name', 'ASC']]})
            }
        },
    })
})

It's currently coupled with Sequelize, but could pretty easily be decoupled.

https://gist.github.com/pcattori/2bb645d587e45c9fdbcabf5cef7a7106

sibelius commented 6 years ago

We use this in production

https://github.com/entria/graphql-mongoose-loader

It solves pagination and DataLoader batching for Mongo, using Mongoose.

We have the same concept for other data sources, such as REST APIs and SQL (Oracle and Postgres); it is very easy to extend to any data source.

crisu83 commented 6 years ago

@mattecapu Thanks for posting your solution here; it was very helpful.

I implemented the same logic in our project on top of our GraphQL and Relay implementations in PHP.

enumag commented 6 years ago

@mattecapu I read your guide on how to handle GraphQL connections with SQL. It's pretty much what I came up with when I was analyzing it myself (plus some details like how to handle hasNextPage).

The problem is that the SQL query gets really complicated really fast when I add an optional sorting argument to the GraphQL connection - especially if the sorting can be a combination of fields.

Do you have any tips on how to handle that, and how to do it efficiently?

mattecapu commented 6 years ago

Hi @enumag, what exactly do you mean by 'SQL complexity'? The length in chars? Execution time? Other metrics? I will assume you mean 'the query it writes is an ugly monster made up of a lot of chunks'.

Unfortunately, my algorithm (feels weird to call it that...) is very straightforward, and I really can't think of any way to make it shorter or to improve its 'efficiency'. It really just translates between GraphQL and SQL, so I don't see room for improvement without pruning features from your endpoint.

That said, using an SQL builder means that you never deal with the raw, horrific string, and using an ORM gives you an even nicer interface (I use Sequelize). I would suggest this as a good practice, because it abstracts away much of the trouble.
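
For illustration, the ORM route can look roughly like this with Sequelize (model and helper names are made up for the example):

const { Op } = require('sequelize');

const decodeCursor = (c) => Buffer.from(c, 'base64').toString('utf8').split(':')[1];

// Sequelize assembles the WHERE / ORDER BY / LIMIT chunks, so the resolver
// never concatenates raw SQL strings
async function paginatedThings(Thing, { first = 10, after }) {
  const where = after ? { id: { [Op.gt]: decodeCursor(after) } } : {};
  const rows = await Thing.findAll({
    where,
    order: [['id', 'ASC']],
    limit: first + 1, // peek one row for hasNextPage
  });
  return { rows: rows.slice(0, first), hasNextPage: rows.length > first };
}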

enumag commented 6 years ago

Hi @enumag, what exactly do you mean by 'SQL complexity'? The length in chars? Execution time? Other metrics?

Primarily execution time and efficient usage of indexes on that table.

Secondary is the SQL query length and the number of conditions in the WHERE clause, but I'm already using an SQL builder so I can deal with that myself.

binajmen commented 4 years ago

Thank you @mattecapu. I had the same "algorithm" in my head, but I forgot to reverse the result when using last / before, which makes sense from the UI point of view.

If someone is interested, here is my current implementation with Node, Apollo, and Knex on top of MySQL. Constructive feedback is more than welcome!

I took some shortcuts that I could improve in the future:

const userSchema = gql`

    type User implements Node {
        id: ID!
        email: String!
        ...
    }

    extend type Query {
        ...
        usersPaginated(input: UserPaginatedInput!): UserPaginatedConnection
    }

    input UserPaginatedInput {
        first: Int
        after: ID
        last: Int
        before: ID
    }

    type UserPaginatedConnection {
        pageInfo: PageInfo
        edges: [UserPaginatedEdge]
    }

    type PageInfo {
        hasNextPage: Boolean
        endCursor: ID
        hasPreviousPage: Boolean
        startCursor: ID
    }

    type UserPaginatedEdge {
        cursor: ID
        node: User
    }
    ...
`

const userResolvers = {
    ...
    usersPaginated: async (parent, args, { db }, info) => {
        try {
            const { first, after, last, before } = args.input
            let [hasNextPage, endCursor, hasPreviousPage, startCursor] = [false, null, false, null]
            let res = []

            if (!!first && !!last)
                throw new ApolloError('Ambiguous query: first and last should not be used together')
            if (!first && !last)
                throw new ApolloError('Missing first or last argument')

            const query = db.from('gql_user')

            if (!!first) {
                if (!!after)
                    query.where('id', '>', after)

                query.limit(first + 1)
                    .orderBy('id', 'asc')

                res = await query

                if (res.length > first) {
                    hasNextPage = true
                    res = res.slice(0, first)
                }
            }

            else if (!!last) {
                if (!!before)
                    query.where('id', '<', before)

                query.limit(last + 1)
                    .orderBy('id', 'desc')

                res = await query

                if (res.length > last) {
                    hasPreviousPage = true
                    res = res.slice(0, last)
                }

                res.reverse()
            }

            startCursor = res[0]?.id ?? null // guard against an empty result set
            endCursor = res[res.length - 1]?.id ?? null

            const pageInfo = {
                hasNextPage,
                endCursor,
                hasPreviousPage,
                startCursor
            }

            const edges = res.map(row => ({
                cursor: row.id,
                node: row
            }))

            return {
                pageInfo,
                edges
            }
        } catch (err) {
            throw new ApolloError(err.sqlMessage, err.code, err)
        }
    }
    ...
}

ChristianIvicevic commented 4 years ago

Sorry for spamming, but I'd like to thank @mattecapu for his explanation of the algorithm in one of his earlier posts. I was misled by the cursors based on static arrays and noticed that there must be a different way to paginate dynamic data, such as data from a database where records get deleted in between pages, or where accessing different pages during pagination causes anomalies. The detailed description really helped me come up with a correct solution in my project.

Sytten commented 3 years ago

@mattecapu unless I am mistaken, steps 4 and 5 use the wrong ORDER BY (when using first, it should be ASC, and vice versa). Can you fix it so people who find this thread are not more confused than necessary 😅

mattecapu commented 3 years ago

Hi @Sytten, re-reading the specification now, it seems that ordering is not dependent on the field but should be decided by the endpoint. The ORDER BY clause in my algorithm is there to select the right chunk of results, but a side effect is that the order gets reversed when using last as opposed to first! I'll add a mention of this in the original comment. Thanks!

Sytten commented 3 years ago

@mattecapu Good that you found that too, but that is not really my point :) Check the coded algo from https://github.com/graphql/graphql-relay-js/issues/94#issuecomment-613137054 and you will see that its ORDER BY is the inverse of your algo (first -> ASC, last -> DESC), and for good reason: if you take N+1 DESC, it won't give you the right dataset for first (assuming an auto-incrementing int ID), and vice versa. Unless you consider first to mean the highest IDs; then it would make sense, but I would specify it.

mattecapu commented 3 years ago

@Sytten I see. It's quite misleading, indeed. I was paginating in reverse chronological order (as most feeds do nowadays), so 'first' to me meant 'first to be displayed'. This explains my choice of ordering; I will edit to clarify.

vamshiaruru-virgodesigns commented 2 years ago

Hi @mattecapu sorry for bringing up an old thread, but I need some help.

I am trying to implement Relay pagination with Postgres and I am facing an issue where I want to sort on multiple columns. Let's say I have a products table with an ID column that is globally unique and auto-increments every time something is added. Sorting on this ID is very easy using the algorithm mentioned above. But what if I want to sort on multiple columns? For example, I want to sort on the Name column (to get products in alphabetical order) and then start executing pagination queries. The ID cursor no longer makes sense, because we are sorting on a different field now. So do I make a name cursor and use that? What if I want to sort on three or more columns? How would I create a cursor then?

I have created a Stack Overflow question as well: https://stackoverflow.com/questions/72011183/how-to-implement-graphql-relay-style-cursor-based-pagination-in-postgres-with-s . I don't need answers specific to Python, but a recommended general approach is appreciated.
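
One general technique that keeps coming up for this (not from this thread, just the usual keyset-pagination answer) is a composite cursor: encode every sort key plus the unique id, and compare whole row tuples. A sketch in Postgres syntax, with illustrative table and parameter names:

-- keyset pagination over (name, id): the cursor encodes both values of the
-- last row on the previous page; id breaks ties between equal names.
-- Postgres row-value comparison keeps this terse; for mixed ASC/DESC
-- orderings, expand it into OR conditions instead:
--   WHERE name > :n OR (name = :n AND id > :i)
SELECT *
FROM products
WHERE (name, id) > (:cursor_name, :cursor_id)
ORDER BY name ASC, id ASC
LIMIT :first_plus_one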

irvanherz commented 2 months ago

(quoting @mattecapu's algorithm comment above in full)

This is what I do, using TypeORM and type-graphql.

@Query(returns => UserConnection)
  async getUsers(@Args() conn: UserConnectionArgs): Promise<UserConnection> {
    const result = new UserConnection()
    let userQuery = this.userRepository.createQueryBuilder()
    if (conn.search) {
      userQuery = userQuery.andWhere('username LIKE :search', { search: `%${conn.search}%` })
    }
    if (conn.after) {
      // use andWhere so conditions accumulate instead of replacing each other,
      // and distinct parameter names so after and before can coexist
      userQuery = userQuery.andWhere('id > :idAfter', { idAfter: conn.decodedAfter })
    }
    if (conn.before) {
      userQuery = userQuery.andWhere('id < :idBefore', { idBefore: conn.decodedBefore })
    }
    if (!conn.first && !conn.last) conn.first = 10

    if (conn.first || !conn.last) {
      userQuery = userQuery.orderBy('id', 'ASC')
      conn.first ||= 10
      const users = await userQuery.take(conn.first + 1).getMany()
      result.edges = users.slice(0, conn.first).map(u => ({ node: u, cursor: btoa(u.id.toString()) }))
      result.pageInfo.hasNextPage = users.length > conn.first
      result.pageInfo.hasPreviousPage = false
      result.pageInfo.startCursor = result.edges[0]?.cursor || null
      result.pageInfo.endCursor = result.edges[result.edges.length - 1]?.cursor || null
    }
    if (conn.last) {
      userQuery = userQuery.orderBy('id', 'DESC')
      let users = await userQuery.take(conn.last + 1).getMany()
      result.pageInfo.hasPreviousPage = users.length > conn.last
      // slice off the extra peeked row *before* reversing; reversing first
      // would drop the page's newest row instead of the peeked one
      users = users.slice(0, conn.last).reverse()
      result.edges = users.map(u => ({ node: u, cursor: btoa(u.id.toString()) }))
      result.pageInfo.hasNextPage = false
      result.pageInfo.startCursor = result.edges[0]?.cursor || null
      result.pageInfo.endCursor = result.edges[result.edges.length - 1]?.cursor || null
    }

    return result
  }