tonyghita / graphql-go-example

Example use of github.com/graph-gophers/graphql-go
265 stars · 77 forks

Explanation of DataLoader? #3

Closed cody1024d closed 6 years ago

cody1024d commented 6 years ago

Hey @tonyghita, I'm trying to digest GoLang (with GraphQL), but the one part of this example I'm having a hard time with, is the flow of the loader.

I have a small pet project running (inefficiently) using neelance's go-graphql library, but I have yet to integrate dataloader (albeit I know it's very necessary). Can you point me in the direction of a good tutorial to help me get my head wrapped around how the loader works, specifically with the architecture you have in place?

My questions mostly surround how the interactions between the resolver layer and the loaders work. I see that the loaders are stored in the context, as a request-wide cache, but I guess I'm not seeing where the actual data-layer access is happening? Does that happen in each of the batch functions? I'm new to both the GraphQL environment and GoLang, so I do apologize for something that is a fairly basic question.

tonyghita commented 6 years ago

Great question! Dataloader is a fairly simple utility once you grok it, but it's a dense topic. I'll make sure to explain it fully soon.

but I guess I'm not seeing where the actual data-layer access is happening? Does that happen in each of the batch functions?

Yup! The dataloader instances that live on the request context collect requests for data (batches), then on some interval (I think 16ms by default) execute the batch function. The result of that execution is then available to anyone who has asked, or will ask, for the same data for the duration of the request.

cody1024d commented 6 years ago

Thanks @tonyghita, I appreciate the response. My last questions on the subject, I promise! I'm (slowly but surely) getting my head around the topic, I think!

Looking through the code, I'm seeing Resolvers call the NewX functions on the loaders, which in turn call to the loaders (which in turn call to the batch functions on that interval).

The one piece of the puzzle that looks out of place is the Prime functions in the loader files. Looking at Facebook's DataLoader documentation, the prime functions are there to essentially seed the cache with values. I see the Resolver calling it (for example, from NewPlanets), but I'm unsure why this is necessary. Is this essentially taking the results from a list-resolver call and making them available in case of a singular-resolver call? (So, for example, a list of planets is needed, and then somewhere else in the GraphQL query a singular planet is requested that was already fetched through that list.)

And also, seemingly, you have a bit of abstraction occurring through the planetGetter and planetLoader structs?

tonyghita commented 6 years ago

Looking through the code, I'm seeing Resolvers call the NewX functions on the loaders, which in turn call to the loaders (which in turn call to the batch functions on that interval).

Yup, the New*() functions are used to validate any inputs and load data needed to resolve the type.

I see the Resolver calling [the loader Prime*() function] (for example, from NewPlanets), but am unsure of why this is necessary?

It's not strictly necessary, but I think it's a good practice to provide the data on the cache if you have it, just in case it's asked for again somewhere else in the query. The goal is to do expensive work (i.e. a network request) once in any given request.

This becomes more powerful when clients batch queries together (another layer of batching!) in a single HTTP request. Since the loaders have a cache that is shared throughout the lifecycle of the HTTP request, each query in the batch can benefit from that one cache.

And also, seemingly, you have a bit of abstraction occurring through the planetGetter and planetLoader structs?

Yeah, the thought is that this will make it easier to mock service calls in the loader unit tests, since we only have to implement a single mock function rather than every method the swapi client provides.

The benefit of this is each loader is restricted to knowing only the minimum it needs in order to load its own data.

An alternative implementation would pass the client instance to each loader instead of the interface.

Good questions!

cody1024d commented 6 years ago

Oh man, it finally just clicked @tonyghita. So your data-layer access actually happens through the getter interface, which happens (for example, inside the batch call in the planet loader file) here, by going through the planetLoader struct to something that implements the planetGetter interface:

for i, url := range urls {
    go func(i int, url dataloader.Key) {
        defer wg.Done()

        data, err := ldr.get.Planet(ctx, url.String())
        results[i] = &dataloader.Result{Data: data, Error: err}
    }(i, url)
}

Am I on track with that? If so, is there an example of the planetGetter anywhere in this project, or is that left for implementation (obviously dependent on what persistence storage you're using)? I just want to see the example end-to-end, for my own sake.

EDIT: Ahah! Found it, inside the swapi files:

func (c *Client) Planet(ctx context.Context, url string) (Planet, error) {
    // TODO: implement
    return Planet{}, nil
}

From an architecture perspective that totally makes sense. I think I was just battling the new syntax/semantics that I'm not used to, so I didn't grok it at first.

@tonyghita I really appreciate you taking the time to make this project, and help me through it!

cody1024d commented 6 years ago

@tonyghita Ok, another question, although this isn't one directly covered by this example. Have you given any thought on how you would architect an interface/implementing-class relationship with resolvers and loaders and all?

Something like the Character->Human/Droid relationship in the golang example, or the classic Animal->Dog/Cat. Looking at the graphql-go (formerly from neelance), there's a resolver method:

func (r *Resolver) Character(args struct{ ID graphql.ID }) *characterResolver {
    if h := humanData[args.ID]; h != nil {
        return &characterResolver{&humanResolver{h}}
    }
    if d := droidData[args.ID]; d != nil {
        return &characterResolver{&droidResolver{d}}
    }
    return nil
}

This method, without the in-memory maps, I think, needs to determine the type of the character (calling a method in the loader package to do this, maybe?), and then call an additional function based on what type of character it is. My only concern with this, though, is that I'm unsure of how to successfully batch the initial type-check call.

I hope I'm explaining my question well enough, let me know if it at all doesn't make sense. Thanks again for all the help, and if you're ever in the Dallas area, I owe you a beer at this point :)

tonyghita commented 6 years ago

This really depends on your backend implementation, but it sounds like you'll need:

  1. some way of loading data to determine the type you need to resolve
  2. depending on your implementation, maybe another fetch to get the actual data (if it wasn't part of 1).
cody1024d commented 6 years ago

Yeah, that's kind of the workflow I had in mind. I guess I'm confused about how to leverage the DataLoader framework to make that first step happen, though. Would it simply be creating a new batch function, associated with a loader for, let's say, CharacterType, for the sake of the example? I guess what's got me is that it's technically an attribute on an object, as opposed to an object itself.

Unless this all would happen in the loader functionality for the interface, so that it gets batched properly? Hmmmm

As of now I have the Character resolver function above checking the type and then going through the right sub-type's loader. It seems like almost a deal breaker, as it would put me back into the scenario of querying for every object if there's no way to batch this type-check query.

Note: The more I think about it, in my particular case this is moot, as I'm going to just use a NoSQL DB and will store all Characters in the same table. So I don't need separate loading methods and can always pull from the same table, then unmarshal (or marshal? I forget the right nomenclature) into the correct type based on a type attribute on the Characters. Although if I were going with a relational DB, I think the above is still an interesting issue.

cody1024d commented 6 years ago

Also @tonyghita, sorry for the multiple questions, but on a second look at the batch function in this example (correct me if I'm wrong), you're not actually "batching" the fetch of the objects? You're looping through the keys and fetching each one.

For your use case, I'm assuming that's because it's what the swapi supports? However, when hitting a DB, for example, wouldn't looping over the keys inside the batch function be counter-productive to the problem DataLoader is trying to solve?

(The below is an example of the load batch that I'm talking about)

func (ldr StarshipLoader) loadBatch(ctx context.Context, urls dataloader.Keys) []*dataloader.Result {
    var (
        n       = len(urls)
        results = make([]*dataloader.Result, n)
        wg      sync.WaitGroup
    )

    wg.Add(n)

    for i, url := range urls {
        go func(i int, url dataloader.Key) {
            defer wg.Done()

            data, err := ldr.get.Starship(ctx, url.String())
            results[i] = &dataloader.Result{Data: data, Error: err}
        }(i, url)
    }

    wg.Wait()

    return results
}
pmrt commented 6 years ago

I know this thread is from several months ago, but since @tonyghita seems pretty busy and there is no documentation about dataloaders (which are quite hard to understand if you are new to GraphQL), I think this could be helpful for newbies. This is how I came to understand the importance of dataloader.Prime() as well as how dataloaders are connected to GraphQL. So, if anyone is struggling to understand dataloaders and Prime(): read this.

Things I logged:

The results were very enlightening for me when I was struggling to understand how this graphql-dataloader-golang thing works. I tried removing PrimePeople from NewPeople():

func NewPeople(ctx context.Context, args NewPeopleArgs) (*[]*PersonResolver, error) {
    //err := loader.PrimePeople(ctx, args.Page)
    //if err != nil {
    //  return nil, err
    //}

    results, err := loader.LoadPeople(ctx, append(args.URLs, args.Page.URLs()...))
    if err != nil {
        return nil, err
    }
    // [...]

And I tested it with a really simple query:

{
    people {
        name
    }
}

And this is what I got:

[screenshot: request log without Prime]

What's happening on server-side?:

  1. GraphQL resolves the people field (resolver/query.go:PersonQueryArgs:People).
  2. PersonQueryArgs() will make a request to get the data from https://swapi.co/api/people/?search= (since I didn't provide a name for people(name:<name>)). For me, with no GraphQL or Golang experience, this one was really hard to figure out.
  3. The query resolver (query.go:PersonQueryArgs) will skip PrimePeople, because I commented it out, and it will call the Person resolver (resolver/person.go:NewPeople), which ultimately calls dataloader.Load(), so the loader:
    • Adds the key (the endpoint URL) to the cache, but with no value yet, because it is not resolved at this point, so it saves the thunk.
    • Defers the execution of the batchFn (loader/person.go:loadBatch), which will actually make the request.
    • Repeats for each person endpoint (https://swapi.co/api/people/6.. 9, 8, 4, etc.)
  4. When 16ms have elapsed, or the dataloader buffer is full, or whenever it decides it has enough for the current batch, it executes the batchFn for the current batch, making a bunch of requests. (Yeah @cody1024d, this won't resolve the n+1 problem, but only because we can't: we're retrieving the data from another API, so there's a specific endpoint for each thing we want to request. Consequently, we need to make all these requests. If this were a DB, we could just make a SELECT with IN (1, 2, 3, 4, etc.) for the IDs, but that is not the case here.)
  5. Now that we have the value(s) for each key, the cache will resolve/finish the thunk and save the value in the cache.

This cache is only available during this request, so here we are NOT really taking advantage of the dataloader and its cache, because by the time the data could be reused, the GraphQL operation and the HTTP request have already finished.

So, what happens on the server side when we do NOT omit the Prime method, that is, with the code as it is now, with the call to PrimePeople within resolver/person.go:NewPeople?

func NewPeople(ctx context.Context, args NewPeopleArgs) (*[]*PersonResolver, error) {
    err := loader.PrimePeople(ctx, args.Page)
    if err != nil {
        return nil, err
    }
        // [...]

This:

[screenshot: request log with Prime]

Exactly what @tonyghita has explained here:

It's not strictly necessary, but I think it's a good practice to provide the data on the cache if you have it, just in case it's asked for again somewhere else in the query. The goal is to do expensive work (i.e. a network request) once in any given request.

  1. GraphQL resolves the people field (resolver/query.go:PersonQueryArgs:People).
  2. PersonQueryArgs() will make a request to get the data from SWAPI.
  3. The query resolver (query.go:PersonQueryArgs) will call the Person resolver (resolver/person.go:NewPeople); then NewPeople will call PrimePeople() with the results of (2), caching these results in the dataloader cache.
  4. resolver/person.go:NewPeople will ultimately try to dataloader.Load() the given ids, but the dataloader already has this info in the cache, so it will return it instead of deferring a batchFn execution.

So thanks to this PrimePeople, which is an abstraction over dataloader.Prime(), we have saved 10 requests to SWAPI.

That's how PrimePeople works, and that's how the GraphQL-dataloader communication is structured in this project. Aside from caching, the dataloader can't do much else for us here because we're retrieving data from a REST API: even though we're batching things and we have several ids/urls in a single batch, we still have to make one request per endpoint. If you were connecting to a database, you would be able to SELECT multiple ids in a single query.

Also, I strongly recommend reading this post: https://medium.com/@gajus/using-dataloader-to-batch-requests-c345f4b23433. It's Node.js, but you get the idea of the n+1 problem if you are a newbie to GraphQL.

Hope this helps someone. Sorry for my non-native English!