sysown / proxysql

High-performance MySQL proxy with a GPL license.
http://www.proxysql.com
GNU General Public License v3.0

Parallel requests to backend on query cache expires #1486

Open konstantinosss opened 6 years ago

konstantinosss commented 6 years ago

Assume there are a lot of clients that query the same big table. Proxysql is configured to cache results from this table.

While the result is cached, no queries are executed against the backend server. But as soon as the cached result expires, ProxySQL sends many identical SELECTs to the same table. Example protocol:

client1 queries "select * from t1", proxysql sends the query to backend
client2 queries "select * from t1", proxysql sends the query to backend
client1 gets result for the query, result is cached
client2 gets result for the query, result is cached

Such behavior leads to a sawtooth pattern in the backend load. It would be useful to have an option controlling how the cache is repopulated. If such an option were set, the example protocol above would look like this:

client1 queries "select * from t1", proxysql sends the query to backend
client2 queries "select * from t1", proxysql waits for currently executing query
client1 gets result for the query, result is cached
client2 gets result from the cache
Version: 1.4.8-32-g669c149 
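
The desired behavior above is essentially request coalescing (often called "singleflight"): the first client to miss the cache runs the query, and concurrent clients for the same key wait for that one result. A minimal sketch of the idea in Python, independent of ProxySQL's actual C++ internals:

```python
import threading

class SingleFlight:
    """Coalesce concurrent identical requests: only the first caller
    executes fn; the other callers block and share its result."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (event, result holder)

    def do(self, key, fn):
        with self._lock:
            entry = self._inflight.get(key)
            if entry is None:
                # First caller for this key: it becomes the leader.
                entry = (threading.Event(), {})
                self._inflight[key] = entry
                leader = True
            else:
                leader = False
        event, holder = entry
        if leader:
            try:
                holder["result"] = fn()   # one backend query for all waiters
            finally:
                event.set()
                with self._lock:
                    del self._inflight[key]
            return holder["result"]
        event.wait()                      # follower: wait for the leader
        return holder["result"]
```

As Rene points out below, blocking followers is only a win when the leader's backend is reasonably fast; the sketch makes the trade-off explicit.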
renecannao commented 6 years ago

I am not sure that waiting is the best approach here. ProxySQL cannot know how long the query will take to complete. Nor can it know whether the backend where the query is currently running is much slower than another server; in that case, it would be more efficient to run queries in parallel instead of waiting/blocking.

Maybe a more efficient algorithm would be to configure QC to pretend that the entry is not in the cache for some of the requests, if executed close to the expiration time. For example:

In this way, some queries will be slower while the resultset is still cached (because they bypass the QC), but no request will be blocked when the resultset expires.
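
This suggestion resembles what is sometimes called probabilistic early expiration: within a short window before expiry, a small fraction of requests is told the entry is missing, so the refresh load is spread out instead of arriving all at once. A hedged sketch; `window_s` and `bypass_pct` are hypothetical tunables for illustration, not real ProxySQL variables:

```python
import random
import time

def should_bypass_cache(entry_expires_at, window_s=5.0, bypass_pct=0.10, now=None):
    """Pretend the entry is not cached for a fraction of the requests
    that arrive within `window_s` seconds of expiry, so the entry is
    refreshed gradually instead of all clients stampeding the backend."""
    now = time.time() if now is None else now
    if now >= entry_expires_at:
        return True                           # expired: must go to the backend
    if entry_expires_at - now <= window_s:
        return random.random() < bypass_pct   # near expiry: small chance to refresh
    return False                              # fresh: always serve from cache
```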

konstantinosss commented 6 years ago

Other possible solutions:

  1. Update the cache in the background from ProxySQL itself, while clients still get results from the cache
  2. The cache update is performed by one client connection, while the other clients get results from the cache. This could be configured via a new option fast_cache_ttl <= cache_ttl
renecannao commented 6 years ago

I did think about the two other solutions you suggested, but they share an issue: ProxySQL cannot know whether the backend where the query is currently running is much slower than another server. This can happen for many reasons, for example a specific server being overloaded. This is why I was suggesting to allow a certain percentage of traffic to bypass the QC, instead of just one connection. The parallelism allows queries to run on multiple backends (assuming you have multiple backends). Of course, all thresholds need to be configurable.

konstantinosss commented 6 years ago

But solution 2 has no such limits. For example, suppose we have a query rule that caches a SELECT result for 60 seconds (cache_ttl=60 seconds). We could optionally configure the rule with fast_cache_ttl=40 seconds. That means all client SELECTs are served from the cache for the full 60 seconds, but between seconds 40 and 60 of the TTL the cache entry can be refreshed by a single client connection. If the entry is not refreshed before the 60 seconds elapse, it expires and all client requests go to the backend server.
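
The fast_cache_ttl proposal can be sketched like this. It is a toy model, not ProxySQL code; the option names mirror the proposal above, and `fetch` stands in for the backend query:

```python
import threading
import time

class SoftTTLCache:
    """Sketch of the proposed fast_cache_ttl (soft TTL) behavior:
    between fast_cache_ttl and cache_ttl, exactly one client refreshes
    the entry while everyone else keeps reading the cached value."""

    def __init__(self, fast_cache_ttl, cache_ttl):
        assert fast_cache_ttl <= cache_ttl
        self.fast_ttl = fast_cache_ttl
        self.hard_ttl = cache_ttl
        self._lock = threading.Lock()
        self._store = {}        # key -> (value, stored_at)
        self._refreshing = set()

    def get(self, key, fetch, now=None):
        now = time.time() if now is None else now
        with self._lock:
            hit = self._store.get(key)
            age = None if hit is None else now - hit[1]
            if hit is not None and age < self.fast_ttl:
                return hit[0]             # fresh: serve from cache
            if hit is not None and age < self.hard_ttl:
                if key in self._refreshing:
                    return hit[0]         # another client is already refreshing
                self._refreshing.add(key) # this client will refresh
            # Cache miss or hard-expired: fall through to the backend.
        try:
            value = fetch()
            with self._lock:
                self._store[key] = (value, now)
        finally:
            with self._lock:
                self._refreshing.discard(key)
        return value
```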

renecannao commented 6 years ago

I think we are saying almost the same thing, just with slightly different implementations. You are suggesting a new field fast_cache_ttl that is lower than cache_ttl, while I am suggesting a percentage value. The implementation difference is that adding a new column in mysql_query_rules requires a lot of changes (in Query Processor, Admin, Cluster, and Query Cache), while adding a new global variable results in changes mostly in Query Cache only. About the number of connections running the query: you are suggesting 1, while I was suggesting a percentage. If instead we define a concurrency value, a value of 1 leads to the implementation you suggest, while a larger value allows parallelism.
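
The concurrency value described here could be sketched as a small gate in front of the backend. `max_concurrency` is a hypothetical knob used for illustration, not an actual ProxySQL variable:

```python
import threading

class RefreshGate:
    """At most `max_concurrency` clients may bypass the cache to refresh
    an entry; the rest keep reading the cached resultset.
    max_concurrency=1 reproduces the single-connection proposal; a larger
    value lets several refresh queries run in parallel, possibly against
    different backends."""

    def __init__(self, max_concurrency):
        self.max_concurrency = max_concurrency
        self._lock = threading.Lock()
        self._active = {}   # key -> number of in-flight refreshes

    def try_acquire(self, key):
        with self._lock:
            if self._active.get(key, 0) >= self.max_concurrency:
                return False          # enough refreshes in flight: use cache
            self._active[key] = self._active.get(key, 0) + 1
            return True               # caller should query the backend

    def release(self, key):
        with self._lock:
            self._active[key] -= 1
            if self._active[key] == 0:
                del self._active[key]
```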

konstantinosss commented 6 years ago

Ok, you know better :) In any case, a per-rule option is more flexible than a global variable. For example, increasing mysql-threshold_resultset_size makes sense only for some query rules with caching, while for ordinary queries it is more efficient to stream results. Hmm, maybe another global variable is needed: mysql-threshold_resultset_size_for_cache?

konstantinosss commented 6 years ago

MaxScale has such an option: hard_ttl :)

krishmanoh commented 4 years ago

Is it feasible to implement this? Both the percentage and the concurrency value?

anuragagarwal561994 commented 1 year ago

Hello @renecannao, I read the thread and realised that there are two different features being discussed here. I will explain with some examples.

So the original question asks for a soft TTL and a hard TTL, as present in MaxScale. Here is why this is useful: say I want to load some data or configuration from the database at application startup. Assume the data lives in another data center, and I would like to load it very fast so that it doesn't impact auto-scaling. Since I have an instance of ProxySQL in every data center, and it is caching the data, I can rely on it. But when the cache TTL expires, one or a few boxes will have to wait for the data to load again, which defeats the purpose.

If, on the other hand, I have a hard TTL of 6 hours and a soft TTL of 5 hours, then every 5 hours a query hits the database, even when there are no calls from the data consumers, and refreshes the query results with the latest data. The cache is then always available to the application, and no pod behaves differently from the others.

In this case I don't think there is any other way to do this without making even one pod suffer.

I guess even the QC way of doing things is okay, but instead of giving those requests the latest data, they could just be used to trigger the refresh queries, while we keep serving them cached results where possible.

I would love to discuss any other floating ideas and how to implement this; if you need a helping hand contributing this feature, I would love to contribute. We are evaluating ProxySQL and find it a good fit for our use case, except for the missing soft TTL, which we use in all kinds of our caches. In other use cases (not related to a database proxy), having a soft TTL has saved us many times from cascading upstream failures.

renecannao commented 1 year ago

With several code improvements in various modules (Admin, Cluster, etc.), the implementation of this is perhaps now a lot easier than in the past. Let me get back to this next week and evaluate the effort. Thanks

anuragagarwal561994 commented 1 year ago

Thanks @renecannao