urql-graphql / urql

The highly customizable and versatile GraphQL client with which you add on features like normalized caching as you grow.
https://urql.dev/goto/docs
MIT License

graphcache poor performance for 1MB of data #2548

Closed: sarink closed this issue 2 years ago

sarink commented 2 years ago

Switching pages in our mobile app takes around 400ms. Profiling revealed that the bulk of this time is spent in useQuery. Note that this is with a completely warm cache; no network requests are fired.

I was in a little bit of disbelief, so I performed a separate test where I converted const [{ data }] = useQuery() to const data = require('./data.json'). This resulted in the view rendering in around 100ms.

Now, using the Chrome profiler instead of the React profiler, I have discovered that it is indeed urql's graphcache that takes 300ms to resolve a local query from the cache:

[screenshot: Chrome profiler trace]

[attachment: Profile-20220720T172604.json.zip]

Any help would be appreciated

ivan commented 2 years ago

Include a screenshot or a reproduction to illustrate the bug and how to reproduce it.

Please also leave a note on what version of urql you're using and include a list of exchanges (if any).


JoviDeCroock commented 2 years ago

Hmm, we would probably need a bit more information about the query, as most of our benchmarks indicate that we're significantly faster 😅 I can only encourage you to follow the above advice, or at least to help us help you. It looks like a lot of time is spent in wonka and resolving links, so this must be an immensely deep query.

JoviDeCroock commented 2 years ago

Something that stands out to me is that you aren't running your application in production mode, which puts Graphcache, React, and urql in a slower mode, as they dispatch errors and try to detect misuse. In production, getField/isAvailable/... won't be called, so.... 😅

sarink commented 2 years ago

That is an interesting point about wonka and resolving links. This query in particular is quite deep, and it's also a combination of several fragments. I decided to write the query out on its own, not using any fragments, and that cut the time in half. Does this make sense to you?

The issue originally surfaced because it "feels" slow, even in prod. I'm not quite sure how I could build a sandbox for it; I thought uploading the profiling export might give a hint at where I could start debugging.

I was also wondering if it's possible to memoize this? Urql should know that nothing in the cache has changed, and the reference to the parsed graphql query document is stable, so why does this need to rerun in the first place?

Thanks for your fast response!

kitten commented 2 years ago

There are some automatic optimisers that can be applied with GraphQL Code Generator afaik, but that's a bit beside the point. The bigger question is how much data you're loading at a time, because if I read this right, even static JSON is only a 4x speedup.

So, it's a bit of a question of what's "noticeable" to you 😅 100ms is still noticeable. Suppose in production it shrinks to 20ms vs 80ms; I'd still call both a little questionable. Hence, I'm wondering how much data it is, since I'm actually quite happy with being only 4x slower than static JSON, if the numbers check out, heh 😅

So, ultimately, I haven't looked into the profile in depth yet, but from the screenshot, getNode and getField are linked, and all they do is reach into the data structure. Meanwhile, readSelection is what scans the query. There's a theoretical speed gain here, if it's possible at all, but it'd ultimately be costly in terms of development time and bundle size. If we suppose we didn't have to pay this cost, we'd double the speed. That'd bring you to a 150ms overhead in development. Still not good enough. Suppose we found a 10x speedup (which is a bit unlikely, as we previously deemed the "dependencies" tracking the remaining optimisable bottleneck, and it has an almost immeasurable impact here). Even with 10x, we're above a "frame time", i.e. 30ms. So the question is whether that'll hold, or whether with even more data it'd become unacceptable again.

In other words, I'm approaching this with a bit of scepticism. My advice would be to get a profile in production. It's harder to trace but if we can compare those times it'll give us more data.

If this is caused by pagination, then our advice is to not let the list become "unboundedly long" and to maybe split data up into pages at a component level (we've got simple code samples for that).
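For illustration, a minimal sketch of that page-per-component split (ItemsQuery, ItemsPage, and the pagination scheme are hypothetical): each page mounts its own bounded query, so no single cache read has to reconstruct the whole list.

import * as React from 'react';
import { gql, useQuery } from 'urql';

// Hypothetical paginated query; field and variable names are illustrative.
const ItemsQuery = gql`
  query Items($page: Int!) {
    items(page: $page) {
      id
      name
    }
  }
`;

// One component per page, each with its own small query result.
const ItemsPage = ({ page }: { page: number }) => {
  const [{ data, fetching }] = useQuery({ query: ItemsQuery, variables: { page } });
  if (fetching || !data) return null;
  return (
    <ul>
      {data.items.map((item: { id: string; name: string }) => (
        <li key={item.id}>{item.name}</li>
      ))}
    </ul>
  );
};

// The list renders one ItemsPage per loaded page, instead of one query
// accumulating every page into a single, ever-growing result.
const ItemsList = ({ pages }: { pages: number }) => (
  <>
    {Array.from({ length: pages }, (_, i) => (
      <ItemsPage key={i} page={i + 1} />
    ))}
  </>
);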

Other than that though, the disclaimer is: I'm not sure how much faster we can go. We tried several different data structures, and this one is the fastest by a lot. So, unless we have a breakthrough, I can see at the very most a 2x speedup in the future 😇 more likely something like 1.2x. Hard to tell without knowing where to look.

kitten commented 2 years ago

forgot to respond to this ✌️

I was also wondering if it's possible to memoize this? Urql should know that nothing in the cache has changed, and the reference to the parsed graphql query document is stable, so why does this need to rerun in the first place?

It partially is, but not as often and in the way one would expect 😅 There are multiple layers of memoisation actually; in fact, in theory you could even "stack" cache exchanges in urql. The layers as you probably have them are:

However, as for the usual question of why every result isn't memoised: the answer is that doing so hides issues like these, defeats benchmarks, and limits use-cases where stale results aren't acceptable or where changes would go undetected.

In the broadest sense, Graphcache keeps track of whenever a query has been potentially affected by a change. This is currently accurate, but we're also hypothetically thinking about making it less accurate. The worst cost you can pay is for your query to be reconstructed from the normalised cache. That cost has to be paid at some point, at the very least:

- when the result is first fetched; or,
- when the result has potentially changed

Even if we memoised this (and to an extent you can basically do this yourself with React.memo and co) you'd pay this price eventually. Usually, this is supposed to be a trivial cost, i.e. one that can run hundreds of times a second. However, if it runs at slow speeds like these, it's simply an amount of data that will take a long time to store, retrieve, and render, no matter how we turn it 😅

The only two things we can do to fix it are:

- speed it up by another order of magnitude; or
- avoid querying this much data at a single point in time in the first place

sarink commented 2 years ago

Hence, I'm wondering how much data it is, since I'm actually quite happy with being only 4x slower than static JSON, if the numbers check out, heh

The speedup is a lot more than 4x with a static JSON file, because in that case zero time is spent in graphcache. The remaining time I was referring to is how long the React components take to render the data in the UI (which is the original problem I set out to solve, or so I thought, until I realized that 75% of the rendering time is actually graphcache retrieving the data). I'm very confident in being able to reduce the rendering time, but unless we also reduce graphcache's time, it's pointless.

You are correct that it is a fair bit of data (155kb over the network, which is just under 2MB uncompressed). However, if this were not going through graphcache and we were just passing this data around as a prop, it wouldn't be a problem.

We've constructed our app in a way where most components are "smart" and prefer to query data rather than have it be passed in as a prop (even if the data is almost certainly already available). Is this unwise? It appears that there can be significant overhead with retrieving data from the normalized cache, an overhead that would simply not exist if the already-available data was just passed in as a prop. We have operated under the assumption that calls to useQuery would effectively be a no-op if the data was already cached.

The only two things we can do to fix it are: speed it up by another order of magnitude; or avoid querying this much data at a single point in time in the first place

If we did succeed in speeding it up by an order of magnitude, so that the 300ms to read the cached data became 30ms, I am confident that the remaining time (~100ms, which is React rendering the UI) could also be sped up (perhaps by another order of magnitude with some useMemos, bringing the total time to 40ms). That would be acceptable.

That cost has to be paid at some point, at the very least: when the result is first fetched; or, when the result has potentially changed

I completely agree, and I'm totally fine paying that cost when the result is first fetched or when the result has changed. But continuing to pay that cost on every subsequent render (with 100% cache hits) is a bit of a surprise. We can easily solve the "my React component takes 100ms to render" problem; what we don't know how to solve is the "graphcache takes 300ms to retrieve data that's already in memory" problem. Is some kind of caching or memoization possible?

kitten commented 2 years ago

Just to clarify 😅 it's not a cost you incur when rendering, but most likely when mounting, i.e. when your query becomes active. The Architecture docs page has a longer explanation of that.

The point isn't that an order-of-magnitude speed-up wouldn't help you; rather, while it might very easily help you right now, it won't in the future. You could always query a larger payload, and if it's 10x more complex, we're back where we started.

We've benchmarked Graphcache to be fast, but what I was trying to convey is that there's a limit to how fast these things can be 🫠 Trust me, we checked thoroughly.

Personally, my approach and advice is to actually compose fragments into one query per "initial mount". However, that's a little beside the point, since I'll need some info on the structure of the data you're asking for. Ignoring the fragments: how deep is it, does it use skip/include, how many objects does it query in total once it's run, how many properties are we dealing with, is there embedded data, etc.?
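As a sketch of that fragment-composition approach (the type and field names here are illustrative): each child component owns a fragment, and the screen spreads them all into the one query that runs on initial mount, so the cache is read once per screen rather than once per component.

import { gql } from 'urql';

// Each component declares the fields it needs as a fragment...
const AuthorFields = gql`
  fragment AuthorFields on Author {
    id
    name
  }
`;

const PostFields = gql`
  fragment PostFields on Post {
    id
    title
    author {
      ...AuthorFields
    }
  }
  ${AuthorFields}
`;

// ...and the screen composes them into a single query document.
const PostsScreenQuery = gql`
  query PostsScreen {
    posts {
      ...PostFields
    }
  }
  ${PostFields}
`;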

It's not unlikely that we're dealing with a deopt in V8; it could also just be the amount of data. Either way, we'll have to get to a point where we can start comparing it to our benchmarks, adapt the numbers from the benchmarks, and then see whether our synthetic times match your real-world times, and if not, why they wouldn't.

Hope that makes sense ✌️

frederikhors commented 2 years ago

@sarink can you please post a reproduction? So we can inspect the issue and help find a solution concretely.

kitten commented 2 years ago

@frederikhors A reproduction doesn't necessarily apply here, hence I've asked more specific questions around the characteristics of the data so we can compare that with synthetic benchmarks. Please don't triage on our behalf when an issue is already being triaged, cheers!

kitten commented 2 years ago

@sarink: I just took a look at the trace, but it isn't making much sense to me. Most timings seem to indeed originate from the schema predicates, which, again, aren't present in production mode. In other places, I see that readSelection's self-time is high, without the trace giving me the granularity I'd need to understand why. But again, the trace unfortunately also isn't worth much in development mode.

The schema itself can be optimised, by the way. We've got @urql/introspection for that, which also has an integration with GraphQL Code Generator. Apart from that, we could be dealing with a deopt here, or a long list, or something else, but it's impossible to tell, so again, that leaves me with the questions above 😅
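For reference, a minimal sketch of that schema optimisation (file paths are illustrative), following the @urql/introspection docs: the minified introspection result strips everything Graphcache doesn't need for its schema awareness.

// e.g. in a build script:
import { readFileSync, writeFileSync } from 'fs';
import { getIntrospectedSchema, minifyIntrospectionQuery } from '@urql/introspection';

const minified = minifyIntrospectionQuery(
  getIntrospectedSchema(readFileSync('./schema.graphql', 'utf-8'))
);
writeFileSync('./schema.min.json', JSON.stringify(minified));

// ...then pass it to Graphcache at runtime: cacheExchange({ schema: minified })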

I also noticed that you could avoid using maskTypename here. I don't know how fast it'd be in an optimised build, but it's giving me an indication that it's taking a bit longer than we'd like. Optimally, you wouldn't need to use it at all, I suppose.

sarink commented 2 years ago

I've made a sample repo which demonstrates the issue in an environment as close to ours as I can reasonably replicate: https://github.com/sarink/urql-perf

I should also have noted that I've been running with a 4x CPU slowdown to simulate a mid-tier mobile device; on desktop the numbers we see are much better (though I would still consider them slow).

Thank you for your time and help! I look forward to getting to the bottom of this 🕵️‍♂️

Update: After reading your most recent suggestions and changing the client config, I was able to cut the time in half, which is a nice improvement but still far from where I'd like it to be (devtools was understandably the biggest slowdown; we should have removed that before profiling in the first place):

- base: 210ms
- no maskTypename: 205ms (-5ms)
- no schema: 180ms (-30ms)
- no devtools: 120ms (-90ms)

All together, the new base is 100ms, and in a production build with the above disabled, I'm seeing 60ms.

Considering that this is only reading already-loaded data from the cache, I think it should be more like < 5ms. It also always renders twice (excluding the loading state); any idea why?

kitten commented 2 years ago

I'm frankly running out of ways to explain that a speed-up of normalised data can't be conjured from nothing. The assumption that "it should be more like < 5ms" simply does not apply. You're asking a normalised cache to reconstruct 1MB of data with thousands of items. The 60ms of production performance is simply a hard limit.

I actually took a look, and no matter how I turn it, this is the speed limit. It's already 10x faster than Apollo's in-memory cache (although I haven't tested it against a newer version in a while). All that to say: it's fast.

We tested various data structures and changes, and this is the fastest. In fact, today I spent another hour trying to replace some of them, and that yielded no measurable change in performance. No speed-up, no slowdown, no matter which hot path I touched.

So again, even if we find a way to speed this up in the future, it will likely not help the worst-case performance by more than a few percentage points. There's a limit to how fast JS engines / V8 can go. There's a limit to how much GC work can be avoided. And so on.

I have several pages of material though that can help you understand the internals better:

So, in short: if this were "Who Wants To Be A Millionaire" and you were in the seat, take the 60ms and cash out. I'm personally amazed that we can reconstruct 4000 x 2 entries in just 60ms.

We can try to speed some of this up, but ultimately there are hard limits. We've put in an immense amount of time, which likely makes Graphcache the fastest normalised cache for GraphQL. So, to set expectations again, and I stress this: don't plan on another 10x speedup. I don't like spelling it out this plainly, but it's almost 10x faster than Apollo's cache already.

So, to address the questions one by one:

Can't this be memoised? No, it cannot be memoised. I explained this above, but to put it simply: there's a limited number of queries (typically, let's say, just one) on screen, and at any point in time this may change. When this query is affected by a normalised cache change, it requeries. It can't tell what changed; it has to scan the cache for that.

Can't that be optimised to only look at what changed? Not really. Because of resolvers and other factors, that has different drawbacks. And it wouldn't really be much faster, because it still doesn't change the worst-case performance.

What about the double render? That's a React quirk basically. The React team is, in short, fighting tearing (when the data that's shown isn't what it should be, or doesn't match other parts of the app). As part of that, we have to get our initial state from the cache, then subscribe in a separate effect and ask for the query result again (which by that point could've changed).

Could we avoid more work by tracking granularly what changed? Maybe. So far, we've found that it's cheaper to scan the cache and check for any changes as they happen.


To address the elephant in the room though: I can't see a reason why you're loading this much data. Are you actually displaying every single piece of information in your app? Are you basically trying to display 3,000 rows to the user in a table, for example? Because if not, there is an alternative here.

GraphQL is a great abstraction for an app to ask for the data it needs. That's often misinterpreted as "only the fields it needs"; more importantly, if, say, you're using the rows to calculate a value or statistic, that's work I'd let the API do, and I'd query the result instead.

In other words, I'd avoid at all costs loading 1MB of raw data into the browser in the first place. However, if you must do that, avoid normalising the data too far. For instance, say you have a nested object that never needs to be normalised (i.e. it has an id, but you never query it in two places and it's never affected by mutations); then you can tell Graphcache not to normalise it, e.g.:

cacheExchange({
  keys: {
    // a null key embeds Image in its parent instead of normalising it
    Image: () => null,
  },
});

This can save you a lot of unnecessary work, since the data becomes "embedded". Potentially, you can even turn some of your data into scalars without selection sets. I've done that in the past, for instance, to encode geolocations as a raw { lat, lng } object without having it be a GraphQL Object type.
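A short sketch of that scalar idea (the schema and field names are made up): when location is a custom scalar instead of an object type, it's queried without a selection set and stored as one opaque value that the cache never has to walk.

import { gql } from 'urql';

const StoreQuery = gql`
  query Store($id: ID!) {
    store(id: $id) {
      id
      # a custom scalar (e.g. JSON): the raw { lat, lng } object comes
      # through as-is, with no selection set and no per-field cache work
      location
    }
  }
`;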

I hope you can take this with a bit of humour and a grain of salt, since the situation is quite complex and confusing. But to sum it up again in some key points:

Hope that helps. Full transparency: I'm closing this without a resolution. We've spent countless hours on optimising performance and have seen great results, but there's a limit, and while I spent another hour or two on this today, this problem is basically a white whale. We may see small incremental performance improvements here and there, but we won't just find a 10x in a dusty corner. Even a 2x seems unlikely at this point, no matter how much data or how many approaches we throw at it.

Edit: btw, I'm also updating the title to put it a bit into perspective, since it really draws attention to a problem that a lot of people simply won't have 😅

sarink commented 2 years ago

First of all, thank you for all of this info. It's really helped to understand the problem better 🙇

When this query is affected by a normalised cache change, it requeries.

If no queries have been executed that resulted in a network request, and no mutations executed at all, then why can't a query be memoized? In this scenario, isn't graphcache 100% confident that no changes have been made to the cache, and therefore that no query could possibly have been affected, so a requery shouldn't be necessary... right? I can imagine a naive approach of keeping a dictionary of every query's results and invalidating the whole thing whenever any change is made to the graphcache (any network request or any mutation).
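To illustrate the naive approach I mean (a sketch of the idea only, not urql's API):

// Cache fully-built results per query, and throw everything away on any
// write, since any write could have affected any query.
type QueryKey = string; // e.g. a hash of the query document plus variables

class NaiveResultCache<Result> {
  private results = new Map<QueryKey, Result>();

  get(key: QueryKey): Result | undefined {
    return this.results.get(key);
  }

  set(key: QueryKey, result: Result): void {
    this.results.set(key, result);
  }

  // called on every network result or mutation
  invalidateAll(): void {
    this.results.clear();
  }
}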

What about the double render? That's a React quirk basically

Every component that uses useQuery absolutely must render twice? And since references to the data also change, all of its children (that ingest data) will render twice as well. Is there no way around this? If I'm understanding your description of the problem correctly, I'm not sure what could be done about it, but this seems significant to me. How can everyone accept that all their "pages" (generally where most queries are, somewhere high up in the tree) will render all (most) components twice?

To address the elephant in the room here though, I can't see a reason why you're loading this much data.

We are loading this much data because our app is intended to work 100% offline. Therefore, we decided to make such actions (like searching through 3k items, computing totals, or, yes, displaying them all in a table) just work client-side. We could construct new queries that return subsets of the data, but the total size of the cache will not change, and for the offline experience we'd want to be able to do all of this without the API anyway (which I believe means that reconstructing any query in a cache of this size will have a meaningful cost associated with it, and there's really nothing we can do). With that being said, do you have any suggestions for an alternative approach?

Thanks again!

kitten commented 2 years ago

If no queries have been executed that resulted in a network request, and no mutations executed at all, then why can't a query be memoized?

That's a trickier question than it may seem. We simply cannot know whether something has changed; we don't exclude the possibility of side effects in our exchange pipeline. We recommend that anyone writing their own cache call reexecuteOperation on all affected queries, but we don't assume that they will actually do that.

That said, I reckon we could probably see if there's a way for the React bindings to cache for longer. The problem is that we'd then intentionally create a memory leak, which is a problem if people have Node use-cases and aren't expecting this behaviour. We have a long list of these kinds of use-cases, and they're tricky to navigate. Generally though, as I explained, memoisation doesn't help you once you're already hitting performance problems; it helps keep you further away from performance problems in targeted worst-case scenarios.

And since references to the data also change, all of its children (that ingest data) will render twice as well.

No, the references from Graphcache remain stable.

How can everyone accept that all their "pages" (generally where most queries are, somewhere high up in the tree) will render all (most) components twice?

Because the React team's mantra is to just let renders happen, and I agree with that for the most part. You shouldn't prevent renders at the highest points of your app, presuming that you actually take care to hoist queries. Instead, it's much easier to disrupt renders further down, e.g. with React.memo wrappers around components that you know are expensive to rerender. React Concurrent really plays into this as an advantage.
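A small sketch of that pattern (the component and prop names are illustrative); it works here because, as noted above, the references Graphcache hands out remain stable:

import * as React from 'react';

type Row = { id: string; name: string };

// React.memo skips re-rendering while the `rows` reference is unchanged,
// so a page-level re-render (including the double render on mount) never
// reaches the expensive subtree.
const ExpensiveTable = React.memo(({ rows }: { rows: Row[] }) => (
  <table>
    <tbody>
      {rows.map((row) => (
        <tr key={row.id}>
          <td>{row.name}</td>
        </tr>
      ))}
    </tbody>
  </table>
));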

We are loading this much data because our app is intended to work 100% offline

This may be a bold suggestion, but if you really need all of this functionality offline, then you may be treating the GraphQL schema more as a delivery mechanism than as an actual schema, which may not be playing to GraphQL's strengths. There isn't a lot of prior art around this. Graphcache's offline code obviously does a lot of the heavy lifting with little time investment, but if you're loading what's essentially a database into the browser, you may want to consider offloading it into a service worker.

I've heard of people embedding a layered GraphQL schema into a service worker that queries the real API, and then going more granular in their apps. I haven't played around with this idea myself yet, though.

but the total size of the cache will not change

It's important to note that the total size of the cache isn't the most important aspect here. It's the total payload size that's queried.

That's a huuuuge distinction to make, because it allows you to potentially circumvent the issue by separating the queries for data you'll need offline from the queries you'll actually make for your UI.
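A sketch of that separation (the query names and fields are hypothetical): one large query warms the cache in the background for offline use, while the UI only ever mounts small per-screen queries that stay cheap to reconstruct.

import { gql, Client } from 'urql';

// Run once, e.g. after login or on a sync schedule, purely to populate
// the normalised cache with everything the app needs offline:
const WarmOfflineCacheQuery = gql`
  query WarmOfflineCache {
    allItems {
      id
      name
      details {
        id
        description
      }
    }
  }
`;

// Mounted by screens: a small payload, so the cache read stays fast.
const ItemScreenQuery = gql`
  query ItemScreen($id: ID!) {
    item(id: $id) {
      id
      name
    }
  }
`;

declare const client: Client; // your app's urql client

// Warm the cache in the background; components then use ItemScreenQuery
// with useQuery and read from the already-populated cache.
client.query(WarmOfflineCacheQuery, {}).toPromise();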

I reckon that if we had an API that allowed you to enumerate entities by type (and we unfortunately don't have that just yet), it could be a good compromise for improving performance by querying less at a time.

That's because writing to the cache is generally faster, and if you're querying this much data, I expect you'd also be rather careful about when you query it, since it's offline-first.