sysown / proxysql

High-performance MySQL proxy with a GPL license.
http://www.proxysql.com
GNU General Public License v3.0

Implementation of Query Cache related to Schemas #890

Open leeparayno opened 7 years ago

leeparayno commented 7 years ago

From what I can glean from other issues and the Query_Cache.cpp code, ProxySQL implements the Query Cache as a number of hashes, for all the global queries sent through ProxySQL to the MySQL backends.

A number of issues with MySQL's implementation are related to its global nature: all the blocking involved in acquiring mutexes, and the various queuing operations required to access the MySQL Query Cache, before a query can proceed through the rest of the execution steps.

This has been documented numerous times and called out by several notable MySQL experts.

Peter Zaitsev listed out some possible improvements in an old 2011 Percona Database Performance Blog post here:

https://www.percona.com/blog/2011/04/10/should-we-give-a-mysqlquery-cache-a-second-chance/

The problems with increasing the size of the MySQL query cache, mostly how the global mutexes and the access requirements needed to maintain cache integrity reduce concurrency, are also documented:

https://www.percona.com/blog/2007/03/23/beware-large-query_cache-sizes/

One possible improvement to the accessibility of the Query Cache in ProxySQL in multi-schema environments would be to create separate hash sets for each schema. Except in cases where queries cross multiple schemas, this would allow Query Cache hash management to be logically separated down to the schema level, and would reduce contention on a single global cache for the entire ProxySQL instance.
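
As a rough illustration of the idea (all names are hypothetical; this is not ProxySQL code), a schema-partitioned cache might look like:

```python
from threading import Lock

class SchemaPartitionedCache:
    """Sketch of a query cache split into one hash table per schema.
    Queries spanning multiple schemas fall back to a shared global partition."""

    GLOBAL = "__global__"

    def __init__(self):
        self._parts = {}          # schema -> {query digest: resultset}
        self._locks = {}          # schema -> Lock guarding that partition
        self._meta = Lock()       # guards creation of new partitions

    def _partition(self, schema):
        with self._meta:
            if schema not in self._parts:
                self._parts[schema] = {}
                self._locks[schema] = Lock()
        return self._parts[schema], self._locks[schema]

    def set(self, schema, digest, resultset, cross_schema=False):
        part, lock = self._partition(self.GLOBAL if cross_schema else schema)
        with lock:
            part[digest] = resultset

    def get(self, schema, digest, cross_schema=False):
        part, lock = self._partition(self.GLOBAL if cross_schema else schema)
        with lock:
            return part.get(digest)
```

Contention on any one schema's lock is then limited to threads serving that schema, while cross-schema queries pay the cost of the shared partition.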

It could be an extension of the configurability described here:

https://github.com/sysown/proxysql/issues/64

Since mysql_query_rules already has match_digest and match_pattern, directing access to a query cache hash based on the schema name could be implemented.
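
For context, caching is already enabled per rule via cache_ttl (in milliseconds) on the admin interface; a sketch with illustrative rule values:

```sql
-- Illustrative only: cache matching SELECTs for 5 seconds,
-- restricted to one schema via the existing schemaname column.
INSERT INTO mysql_query_rules (rule_id, active, schemaname, match_digest, cache_ttl, apply)
VALUES (10, 1, 'sales', '^SELECT .* FROM orders', 5000, 1);
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;
```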

This should probably be controlled by a separate config variable, so that queries would not have to be parsed for multiple databases used in a single query (or, if that parsing is performed, those queries would be cached in the global cache rather than in schema-specific hashes).

Lee

renecannao commented 7 years ago

Lee, thank you for the feedback. My first thought about this is that the documentation currently available about ProxySQL's Query Cache only describes what it is and how to use it; I should also document its internals.

I will summarize them here, and keep this issue as a reminder to write a more detailed document. The first important note is that ProxySQL's Query Cache is very different in nature from MySQL's Query Cache. The second note is that the micro-benchmarks I ran in the past never showed ProxySQL's Query Cache as a bottleneck. Maybe it is time to run new micro-benchmarks and publish the results!

Here are some implementation details, in brief:

  1. PQC already has multiple caches, currently hardcoded to 32. 32 is an arbitrary large number, and is not yet configurable as proposed in #64
  2. In MySQL Server it is possible that hundreds (or thousands) of threads (one-thread-per-connection) try to access MQC: this creates a lot of contention! In ProxySQL, the maximum number of threads that can try to access PQC is mysql-threads only (+1, more details below)
  3. MQC invalidates resultsets on execution of DML against the referenced tables: this complex mechanism, often the real cause of MQC being a bottleneck, doesn't exist in PQC at all
  4. PQC invalidates resultsets on TTL, exactly as a KV storage like memcached does
  5. MQC performs the search based on the query, and the invalidation based on a list of tables; PQC performs the search only based on a key. PQC is a KV storage!
  6. when a thread accesses MQC, it does all the required housekeeping: invalidation, memory management, etc
  7. when a thread accesses PQC, the only housekeeping it can potentially perform is to mark the found resultset as expired and remove it from the hash table. It doesn't even bother to deallocate the resultset from memory!
  8. all the housekeeping is performed by a background thread, the Query Cache Purge Thread
  9. for all the reasons listed above, the computation performed by a MySQL Thread while accessing PQC is minimal. Because the computation performed inside PQC is extremely small compared to the computation performed outside it (for example, the Query Processor is one of the most CPU-intensive modules), it is extremely unlikely that multiple MySQL Threads will try to access the same PQC hash/partition at the same time.
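
The design in points 1, 4, 7 and 8 can be sketched as a hash-partitioned key/value store with TTL expiry and a background purge thread (a minimal illustration; class and method names are mine, not ProxySQL's):

```python
import threading
import time

NUM_PARTITIONS = 32  # hardcoded in ProxySQL per point 1; arbitrary here too

class TTLPartitionedCache:
    """Sketch: a pure key/value store split into hash partitions, TTL-based
    expiry, and all heavy housekeeping deferred to a background purge thread."""

    def __init__(self, default_ttl=1.0, purge_interval=0.05):
        self._parts = [dict() for _ in range(NUM_PARTITIONS)]
        self._locks = [threading.Lock() for _ in range(NUM_PARTITIONS)]
        self._default_ttl = default_ttl
        self._stop = threading.Event()
        # Analogue of the Query Cache Purge Thread (point 8).
        self._purger = threading.Thread(
            target=self._purge_loop, args=(purge_interval,), daemon=True)
        self._purger.start()

    def _slot(self, key):
        return hash(key) % NUM_PARTITIONS

    def set(self, key, value, ttl=None):
        i = self._slot(key)
        with self._locks[i]:
            self._parts[i][key] = (value, time.monotonic() + (ttl or self._default_ttl))

    def get(self, key):
        i = self._slot(key)
        with self._locks[i]:
            entry = self._parts[i].get(key)
            if entry is None:
                return None
            value, expires = entry
            if time.monotonic() >= expires:
                # Foreground thread only unlinks the expired entry; no other
                # housekeeping, not even deallocation (point 7).
                del self._parts[i][key]
                return None
            return value

    def _purge_loop(self, interval):
        # All remaining housekeeping happens here, off the query path.
        while not self._stop.wait(interval):
            now = time.monotonic()
            for i in range(NUM_PARTITIONS):
                with self._locks[i]:
                    dead = [k for k, (_, exp) in self._parts[i].items() if now >= exp]
                    for k in dead:
                        del self._parts[i][k]

    def close(self):
        self._stop.set()
        self._purger.join()
```

A get either returns a live value or unlinks an expired one; it never scans, never invalidates by table, and only ever holds one partition's lock.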

About the idea of creating separate hash sets for each schema: I think it is a good idea for MQC. Because of how PQC is designed, I think it would only be an unnecessary complication there.

leeparayno commented 7 years ago

I think a lot of the thrashing with MQC also occurs with the purging of queries when tables are updated.

A potential benefit would be some way of marking queries as invalid/dirty via an API, rather than relying strictly on TTL for cache validity.

A common thread among caching mechanisms is the freshness of the cache relative to the underlying data. While probably an unnecessary complication as the PQC is currently designed, I can see how separate hash sets for each schema, possibly with additional metadata, would help bring ProxySQL to the point of being able to purge (or mark for purging) any cache entries that are out of date with the underlying data.

If this is worth investigating, possibly by making the caching implementation pluggable, it could be easy to switch and compare implementations.

renecannao commented 7 years ago

Lee, do you have a specific case for cache invalidation? If yes, we can pursue that specific case, instead of being generic.

I mean, adding metadata to resultsets is possible; the question is which metadata. Adding a schema and invalidating all the resultsets for that schema is possible: if that's what you need, I can look into implementing it. I was also thinking of storing the rule_id of the mysql_query_rules entry that triggered the caching, and having an API to purge all the resultsets from that specific rule_id. The possibilities are many, some complex and some easy, but I would like to implement something that is both useful and required, not just theoretically possible.

leeparayno commented 7 years ago

One of my concerns is that our application has a good number of SELECT queries that output data in a paginated format via web pages. This pagination of data is fairly expensive depending on the size of the underlying data set. For the core data, there are also separate edit operations that update individual records for any given data model.

Using ProxySQL's Query Cache (PQC) with a TTL, stale data could potentially be returned to the client for the duration of the TTL.

If there were a way to purge items from the cache, or mark them as stale, by table, then at the very least some mechanism could be built to read the backends' binlogs and initiate purge events.

MaxScale's data streaming appears to be able to read and convert binlog events to Avro. Other projects like Maxwell and Debezium could be the connective software to help maintain an accurate cache.
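
As a sketch of how such a binlog-driven purge could plug in on the cache side (the consumer itself, e.g. Maxwell or Debezium, is abstracted away; all names are illustrative):

```python
from collections import defaultdict

class TableIndexedCache:
    """Sketch: alongside the key/value store, keep a reverse index of
    (schema, table) -> cached keys so a binlog consumer can purge every
    entry touching a modified table."""

    def __init__(self):
        self._store = {}
        self._by_table = defaultdict(set)

    def set(self, key, value, tables):
        self._store[key] = value
        for t in tables:              # t is a (schema, table) tuple
            self._by_table[t].add(key)

    def get(self, key):
        return self._store.get(key)

    def on_binlog_event(self, schema, table):
        """Called for each DML event the binlog consumer emits."""
        for key in self._by_table.pop((schema, table), set()):
            self._store.pop(key, None)
```

The trade-off is that the cache reintroduces a (much lighter) form of the table-based bookkeeping that MQC does, but the purge work is driven by an external event stream rather than by every DML on the query path.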

renecannao commented 7 years ago

Thinking out loud about a possible implementation: what about adding a numeric identifier to each cache entry, and having a command to purge all the entries with that id? For example, mysql_query_rules could have a new column cache_id, and it would be possible to add a command like PURGE CACHE ID xyz.
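
A minimal sketch of how the proposed cache_id tagging and PURGE CACHE ID command could behave (illustrative Python, not ProxySQL code):

```python
from collections import defaultdict

class IdTaggedCache:
    """Sketch: each entry carries the numeric cache_id of the query rule
    that cached it; one command purges every entry with that id."""

    def __init__(self):
        self._store = {}                  # key -> value
        self._by_id = defaultdict(set)    # cache_id -> keys

    def set(self, key, value, cache_id):
        self._store[key] = value
        self._by_id[cache_id].add(key)

    def get(self, key):
        return self._store.get(key)

    def purge_cache_id(self, cache_id):
        """Analogue of the proposed 'PURGE CACHE ID <n>' admin command."""
        purged = self._by_id.pop(cache_id, set())
        for key in purged:
            self._store.pop(key, None)
        return len(purged)
```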

leeparayno commented 7 years ago

Years ago, I worked with a Java project called EHcache that used a string ID for the cache key. This allowed us to have a hierarchical structure for cached items, based on our naming scheme:

producttype-product-model-modelnumber

Using this, we were able to intelligently purge whole sets of cached items when different levels of cache purging were necessary.
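
A small sketch of that hierarchical-key purging (illustrative; not EHcache's actual API):

```python
class HierarchicalKeyCache:
    """Sketch: entries are named 'level1-level2-...' and can be purged
    by any prefix, so one call drops a whole subtree of cached items."""

    def __init__(self, sep="-"):
        self._sep = sep
        self._store = {}

    def set(self, key, value):
        self._store[key] = value

    def get(self, key):
        return self._store.get(key)

    def purge_prefix(self, prefix):
        """Drop every entry whose key is the prefix or nests under it."""
        doomed = [k for k in self._store
                  if k == prefix or k.startswith(prefix + self._sep)]
        for k in doomed:
            del self._store[k]
        return len(doomed)
```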

This helps with the scenario often seen in MySQL's query cache where you might have a table with 10M records, and a single update to a "long tail" record that isn't frequently hit invalidates all the cached queries for the entire table.

leeparayno commented 7 years ago

Extending from that, if you wanted to "seed" the id with database metadata like:

schema-table-userattributes1....

You would be able to build some additional scenarios for purging into the system.

renecannao commented 7 years ago

I like the idea! Thank you for the feedback!