rmosolgo / graphql-ruby

Ruby implementation of GraphQL
http://graphql-ruby.org
MIT License

Performance on large datasets with nested arrays #2352

Closed raysuelzer closed 5 years ago

raysuelzer commented 5 years ago

Moving the conversation from #861 over here.

I modified the run.rb file slightly; the changes and results are in the gists below.

Sorry for the gists...

Modified run.rb: https://gist.github.com/raysuelzer/684ec5fe911c215137c6c055d0fbcb05

Results: https://gist.github.com/raysuelzer/c826695174f886eb328af1ede9e5f266

Summary:

I am testing 1000 Foo objects, each of which has an array of the same three Bar class instances.

It looks like a large amount of time is spent evaluating the Bar items: 1000 foos with 3 bars each results in 3000 calls, and although I am only reading two properties off the bars, that generates another 6000 calls. I'm not exactly sure what is going on in the code base, or whether this is even where the time is being spent, but it seems like a lot of CPU and memory overhead to re-evaluate these items. Perhaps there is no other way to do it, but my gut tells me there are gains to be had here. If the interpreter is evaluating the same in-memory item over and over again, it seems like this could be avoided, since the same item should always return the same result.
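The shape of that benchmark can be sketched in plain Ruby (Foo/Bar are the names from the thread; the counters are illustrative of what the profile shows, not graphql-ruby's internals):

```ruby
# Illustrative sketch: 1000 Foo objects that all share the same three Bar
# instances. The counters mimic the call counts seen in the profile.
Bar = Struct.new(:name, :value)

shared_bars = Array.new(3) { |i| Bar.new("bar#{i}", i) }
foos = Array.new(1000) { { bars: shared_bars } }

object_visits = 0 # one per Bar encountered while walking the response
field_visits  = 0 # one per selected field per Bar visit

foos.each do |foo|
  foo[:bars].each do |bar|
    object_visits += 1
    [bar.name, bar.value].each { field_visits += 1 }
  end
end

object_visits # => 3000 evaluations of a Bar
field_visits  # => 6000 calls for the two selected properties
```

Even though only 3 distinct Bar objects exist in memory, each is visited 1000 times, once per parent Foo.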


Again, I am sorry for my ignorance of the terminology used in this gem and in Ruby. I am going to continue to play with this for a while and hopefully have some better understanding of what is going on internally.

Thanks so much.

rmosolgo commented 5 years ago

Hi, thanks for the thoughtful writeup!

I think you raise an interesting point, something that has come up in my work as well. For example, if we query GitHub like this:

user(login: "dhh") {
  repositories(first: 50) {
    nodes {
      owner {
        login 
      }
    }
  }
}

We know that owner.login will always be dhh, but we re-evaluate it every time, just like the example you've given above.

So far, I have ignored this optimization because I didn't think (just a hunch) that it would have much real-world impact. But I think it would be possible to cache already-resolved fields and re-use those values as needed.
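A minimal sketch of that idea, as an assumption about how it might look rather than an existing graphql-ruby API, could memoize resolved values per (object, field) pair for the duration of one execution:

```ruby
# Hypothetical per-execution cache: the first resolution of a given
# (object, field) pair does real work; repeats re-use the stored value.
# All names here are illustrative, not graphql-ruby's.
class ResolutionCache
  attr_reader :hits

  def initialize
    @store = {}
    @hits = 0
  end

  def fetch(object, field_name)
    key = [object.object_id, field_name]
    if @store.key?(key)
      @hits += 1
      @store[key]
    else
      @store[key] = yield
    end
  end
end

cache = ResolutionCache.new
owner = Struct.new(:login).new("dhh")

# 50 repository nodes all pointing at the same owner object:
logins = Array.new(50) { cache.fetch(owner, :login) { owner.login } }

logins.uniq # => ["dhh"]
cache.hits  # => 49 (only the first resolution ran the block)
```

Keying on `object_id` scopes the cache to identical in-memory objects within a single execution, which sidesteps cross-request invalidation questions.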

Can you share a bit about your real-world use case? I'd like to be convinced that the work is worth it!

raysuelzer commented 5 years ago

On mobile right now, so forgive typos.

Our use case is that we have a large number of items, each of which is associated with a list of "campaigns" on a campaign field. So for maybe 100 items, each might have a list of 15 campaigns associated with it. Most of the time these campaigns are going to be the same. So, in the worst-case scenario in terms of the GraphQL code doing extra work, all 100 items would have the same 15 campaigns, and the campaigns would be resolved 1,500 times instead of 15 times.
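That worst case, as plain arithmetic (illustrative numbers from the comment above):

```ruby
# Every item shares the same campaign list, so per-item resolution
# does 100x the necessary work.
items = 100
campaigns_per_item = 15

resolutions_without_reuse = items * campaigns_per_item # => 1500
resolutions_with_reuse    = campaigns_per_item         # => 15
```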

Unfortunately, we use this pattern frequently in our app and it tends to be a lot more than 15 campaigns, so it quickly starts to slow down...

jturkel commented 5 years ago

We have a similar problem in the product content management domain. We have Products that have collections of PropertyValues (e.g. Brand = 'Sony', Color = 'Red', SKU = 'ABC123', Description = '...'), since each of our customers can choose the "schema" for how they model their products. Here's a simplified version of our GraphQL schema and a query that illustrates the repetition of common objects:

type Property {
  id: ID!
  name: String!
  helpText: String
}

type PropertyValue {
  property: Property!
  value: String!
}

type Product {
  id: ID!
  thumbnailUrl: URL
  propertyValues: [PropertyValue!]!
}

query FindProducts {
  products(filter: "...") {
    nodes {
      id
      thumbnailUrl
      propertyValues {
        property {
          id
          name
          helpText
        }
        value
      }
    }
  }
}

Notice that the property information is repeated for each product in our list, e.g. we retrieve the definition of the 'Brand' property multiple times. Sideloading this information worked well in our REST API since it meant we only serialized the information once. Unfortunately we get a lot of duplication in our GraphQL responses, which takes a while to compute and increases the payload size. At some point we want to investigate using a response encoding like graphql-crunch to reduce the payload size, but it would be awesome to also have optimizations that make the evaluation of repeated objects more efficient.
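The payload-size side of this can be sketched roughly as follows. This is a toy illustration of the deduplication concept, not graphql-crunch's actual wire format:

```ruby
# Toy deduplication: emit each distinct value once, then back-reference
# repeats by the index of the first occurrence.
def dedupe(values)
  first_seen = {}
  values.map do |v|
    if first_seen.key?(v)
      first_seen[v] # back-reference to the first occurrence
    else
      first_seen[v] = first_seen.size
      v
    end
  end
end

brand = { "id" => "prop1", "name" => "Brand", "helpText" => nil }
rows  = [brand, brand, brand] # same property repeated for three products

dedupe(rows) # => [{"id"=>"prop1", "name"=>"Brand", "helpText"=>nil}, 0, 0]
```

The repeated Property hashes collapse to small references, which is the kind of saving the REST sideloading approach provided.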

rmosolgo commented 5 years ago

Thanks for sharing info about your use cases. I can see how some good caching would really help there!

I can imagine adding a kind of cache that says "for a given object, if it's rendered into an equivalent query fragment, re-use the already-rendered response". Some concerns here:

With those concerns in mind, I can imagine something like:

What do you think of something like that?

jturkel commented 5 years ago

That sounds like an awesome improvement. A field-level option would be great for optimizing well-known access patterns, but it might become unwieldy as the graph grows and the number of paths to reach the same object increases. Also supporting a type-level option could make this more manageable. In DDD terms, we'd ideally enable this option on all GraphQL types that correspond to an aggregate, so we wouldn't have to optimize individual paths through the graph. Hopefully the caching overhead would be minimal in those cases, so we wouldn't have to further tune for only the hot paths between aggregates.

Calling the option runtime_cache might create some confusion as to the scope of the caching. Perhaps something more explicit like execution_scoped_cache would be less error prone? Alternatively if you see this as the beginnings of a more general purpose caching framework like Apollo has, then perhaps a cache_control option that can have values of :none (the default), :execution_scoped and someday a more advanced nested hash of options to support cross execution caching e.g. TTL, visibility scope, cache key, etc.

raysuelzer commented 5 years ago

This all sounds great to me. I guess it is safe to assume that the code re-evaluating the same object multiple times is actually a slowdown?

I also wonder how the cache keys would be implemented. I assume they would be based on the GraphQL query string along with the object being processed. I definitely have use cases where I select different properties on the same object within the same query. For example, when getting a User object with an updatedByUser field that is also a User object, many times this will be the same user. So, if the cache isn't looking at the fields selected, it would not only return more (or less) data than needed, it could also result in infinite recursion.

getUser(id: "1") {
  id
  firstName
  lastName
  email
  updatedAt
  updatedByUser {
    id
    email
  }
}
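One way to key such a cache, sketched under the assumption that both the object's identity and the selected field set go into the key (this is not graphql-ruby's actual scheme):

```ruby
require 'digest'

# Hypothetical cache key: include the selected field names so the same
# in-memory User under two different selections gets two cache entries,
# avoiding over/under-fetching from the cache.
def cache_key(object, selected_fields)
  Digest::SHA256.hexdigest(
    [object.object_id, selected_fields.sort.join(",")].join("|")
  )
end

user = Object.new # the same in-memory user appears at both positions

outer = cache_key(user, %w[id firstName lastName email updatedAt updatedByUser])
inner = cache_key(user, %w[id email])

outer == inner # => false, so the narrower selection is cached separately
```

Keying on the selection set also breaks the recursive case: the inner `updatedByUser` selection produces a different key than the outer one, so a lookup can never loop back into itself.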

I also favor doing the caching at an object level, for the reasons that @jturkel mentioned, but could be convinced otherwise.

jturkel commented 5 years ago

I did a quick and very dirty prototype that showed a 20% latency improvement for one of our queries that results in lots of duplication. The cache hit rate was nearly 40%!

raysuelzer commented 5 years ago

I did a quick and very dirty prototype that showed a 20% latency improvement for one of our queries that results in lots of duplication. The cache hit rate was nearly 40%!

@jturkel Do you have a sample of the code changes you made? This would be a very big win...

jturkel commented 5 years ago

The changes are here. It's definitely not production ready :) A few known issues:

raysuelzer commented 5 years ago

The changes are here. It's definitely not production ready :) A few known issues:

  • I hard-coded the types I wanted to be cached rather than making it configurable
  • The caching doesn't take selection arguments into account
  • The caching doesn't work if parts of the cached subtree are lazily evaluated

Thank you very much! Unfortunately, we can't use the new interpreter because of the lack of error handling; hopefully that issue gets resolved so I can do something similar to what you did.

rmosolgo commented 5 years ago

New error handling for the interpreter 💁‍♂ https://graphql-ruby.org/errors/error_handling.html

Hope that helps! I don't have more work planned here, but if you come up with some runtime caching, I'd love to review a PR or take a look at an external project!