leeparayno opened this issue 7 years ago
Lee, thank you for the feedback. My first thought is that the documentation currently available about ProxySQL's Query Cache only describes what it is and how to use it: I should also document its internals.
I will summarize them here, and keep this issue open as a reminder to write a more detailed document. The first important note is that the nature of ProxySQL's Query Cache is very different from MySQL's Query Cache. The second note is that the micro-benchmarks I ran in the past never showed ProxySQL's Query Cache as a bottleneck. Maybe it is time to run new micro-benchmarks and publish the results!
For brevity: MQC = MySQL's Query Cache, PQC = ProxySQL's Query Cache.
Here are some implementation details:
- `mysql-threads` only (+1, more details below)

About the idea of creating separate hash sets for each schema: I think it is a good idea for MQC. Because of how PQC is designed, I think that would be only an unnecessary complication.
I think a lot of the thrashing with MQC also occurs with the purging of queries when tables are updated.
That could also be a potential benefit if there were some way of marking queries as invalid/dirty via an API, rather than relying strictly on TTL for cache validity.
A common thread among caching mechanisms is the freshness of the cache relative to the underlying data. While probably an unnecessary complication as the PQC is currently designed, I can see how separate hash sets for each schema, possibly with additional metadata, would help bring ProxySQL to the point of being able to purge (or mark for purging) any cache entries that are out of date with the underlying data.
If this is worth investigating, possibly by making the caching implementation pluggable, it could become easy to switch and compare implementations.
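As a thought experiment, a pluggable design could start from a minimal interface that any backing store implements. The sketch below is illustrative Python, not ProxySQL's actual (C++) internals, and every name in it is hypothetical:

```python
import time
from abc import ABC, abstractmethod


class QueryCache(ABC):
    """Hypothetical pluggable interface: any implementation can be swapped in."""

    @abstractmethod
    def get(self, key):
        ...

    @abstractmethod
    def set(self, key, resultset, ttl_ms):
        ...


class TTLQueryCache(QueryCache):
    """Simplest concrete implementation: one dict with per-entry expiry."""

    def __init__(self):
        self._store = {}  # key -> (resultset, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        resultset, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # lazily drop expired entries on access
            return None
        return resultset

    def set(self, key, resultset, ttl_ms):
        self._store[key] = (resultset, time.monotonic() + ttl_ms / 1000.0)
```

Swapping implementations to compare them would then only mean changing which concrete class gets constructed.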
Lee, do you have a specific use case for cache invalidation? If yes, we can pursue that specific case, instead of being generic.
I mean, adding metadata to resultsets is possible; the question is which metadata.
Adding the schema and invalidating all the resultsets for a schema is possible: if that's what you need, I can look into implementing it.
I was also thinking of storing the `rule_id` of `mysql_query_rules` that triggers the caching, and of having an API to purge all the resultsets from that specific `rule_id`.
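To illustrate the `rule_id` idea, here is a hedged sketch (hypothetical names, not ProxySQL's real API) that keeps a secondary index from `rule_id` to cache keys, so a purge touches only the entries created by that rule instead of scanning the whole cache:

```python
from collections import defaultdict


class RuleTaggedCache:
    """Illustration only: entries are tagged with the mysql_query_rules
    rule_id that cached them, so purge_rule(rule_id) drops every resultset
    produced by that rule without touching the rest of the cache."""

    def __init__(self):
        self._store = {}                  # key -> resultset
        self._by_rule = defaultdict(set)  # rule_id -> set of keys

    def set(self, key, resultset, rule_id):
        self._store[key] = resultset
        self._by_rule[rule_id].add(key)

    def get(self, key):
        return self._store.get(key)

    def purge_rule(self, rule_id):
        # Drop the index entry and every key it pointed at.
        for key in self._by_rule.pop(rule_id, set()):
            self._store.pop(key, None)
```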
The possibilities are many, some complex some easy, but I would like to implement something that is both useful and required, not just theoretically possible.
One of my concerns is that our application has a good number of select queries that output data in a paginated format via web pages. This pagination is fairly expensive depending on the size of the underlying data set. For the core data, there are also separate edit operations that update individual records for any given data model.
Using ProxySQL's cache (PQC) via a TTL, stale data could potentially be returned to the client for the entire duration of the TTL.
If there were a way to purge items from the cache, or mark them as such, by table, then at the very least some mechanism could be built to read the backends' binlogs and initiate purge events.
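The cache side of such a mechanism could look like the sketch below, where each entry is tagged with the tables it reads and a binlog consumer would call the purge hook on write events. Every name here is hypothetical, and the binlog plumbing is reduced to a plain function call:

```python
from collections import defaultdict


class TableAwareCache:
    """Illustration only: entries are indexed by the tables they depend on,
    so a write event on one table invalidates exactly the affected
    resultsets and nothing else."""

    def __init__(self):
        self._store = {}                   # key -> resultset
        self._by_table = defaultdict(set)  # table -> set of keys

    def set(self, key, resultset, tables):
        self._store[key] = resultset
        for table in tables:
            self._by_table[table].add(key)

    def get(self, key):
        return self._store.get(key)

    def on_write_event(self, table):
        """Hypothetical hook a binlog consumer would invoke on
        INSERT/UPDATE/DELETE events for the given table."""
        for key in self._by_table.pop(table, set()):
            self._store.pop(key, None)
```

In this sketch the binlog reader (Maxwell, Debezium, or similar) would simply call `on_write_event("orders")` whenever it sees a write to `orders`.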
MaxScale's data streaming appears to be able to read binlog events and convert them to Avro. Other projects like Maxwell and Debezium could serve as the connective software to help maintain an accurate cache.
Thinking out loud about a possible implementation:
What about adding a numeric identifier to each cache entry, and having a command to purge all the entries with that id?
For example, `mysql_query_rules` could have a new column `cache_id`, and it would be possible to add a command like `PURGE CACHE ID xyz`.
Years ago, I worked on a Java project that used EHcache with a string ID for the cache key. This allowed us to have a hierarchical structure for cached items, based on our naming scheme:
producttype-product-model-modelnumber
Using this, we were able to intelligently purge whole sets of cached items when different levels of cache purging were necessary.
This helps with the scenario often seen in MySQL's query cache, where a table with 10M records sees one single update to a "long tail" record that isn't frequently hit, and that one update invalidates all the cached queries for the entire table.
Extending from that, if you wanted to "seed" the id with database metadata like:
schema-table-userattributes1....
You would be able to build some additional scenarios for purging into the system.
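The EHcache-style scheme above reduces to purging by key prefix. A minimal sketch, assuming keys are built as `schema-table-...` strings (the class and method names are invented for illustration):

```python
class PrefixPurgeCache:
    """Illustration only: hierarchical string keys (e.g. 'schema-table-attr')
    allow purging a whole subtree of entries by prefix."""

    def __init__(self):
        self._store = {}  # key -> resultset

    def set(self, key, resultset):
        self._store[key] = resultset

    def get(self, key):
        return self._store.get(key)

    def purge_prefix(self, prefix):
        """Drop every entry whose key starts with the prefix; return the count."""
        doomed = [k for k in self._store if k.startswith(prefix)]
        for k in doomed:
            del self._store[k]
        return len(doomed)
```

The linear scan is fine for a sketch; a real implementation would want a trie or a sorted index so the prefix range can be found cheaply.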
I like the idea! Thank you for the feedback!
From what I can glean from other issues and the Query_Cache.cpp code, ProxySQL implements the Query Cache as a number of hashes, shared globally across all the queries sent through ProxySQL to the MySQL backends.
A number of issues with MySQL's implementation are related to its global nature: all the blocking involved in acquiring the mutexes, and the various queuing operations for access to the MySQL Query Cache, before the query proceeds through the rest of the query-execution steps.
This has been documented numerous times and called out by several notable MySQL experts.
Peter Zaitsev listed out some possible improvements in an old 2011 Percona Database Performance Blog post here:
https://www.percona.com/blog/2011/04/10/should-we-give-a-mysqlquery-cache-a-second-chance/
The downsides of increasing the size of the MySQL query cache (mostly that the global mutexes and the access requirements for maintaining cache integrity reduce concurrency) are also documented:
https://www.percona.com/blog/2007/03/23/beware-large-query_cache-sizes/
One possible improvement to the accessibility of the Query Cache in ProxySQL in multi-schema environments would be to create separate hash sets for each schema. Except in cases where queries cross multiple schemas, this would allow a logical separation of Query Cache hash management down to the schema level, and would reduce contention on a single global cache for the entire ProxySQL instance.
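Concretely, per-schema hash sets amount to routing each lookup to a schema-local hash guarded by its own lock, with a shared shard kept for cross-schema queries. A hedged Python sketch of the idea (not ProxySQL's implementation; all names are invented):

```python
import threading
from collections import defaultdict


class SchemaShardedCache:
    """Illustration only: one hash (and one lock) per schema, plus a shared
    global shard for cross-schema queries, so lookups against different
    schemas do not contend on a single mutex."""

    GLOBAL = "__global__"

    def __init__(self):
        self._shards = defaultdict(dict)           # schema -> {key: resultset}
        self._locks = defaultdict(threading.Lock)  # schema -> its own lock

    def _shard(self, schema):
        # Queries spanning multiple schemas (schema=None) use the shared shard.
        return schema if schema is not None else self.GLOBAL

    def set(self, schema, key, resultset):
        name = self._shard(schema)
        with self._locks[name]:
            self._shards[name][key] = resultset

    def get(self, schema, key):
        name = self._shard(schema)
        with self._locks[name]:
            return self._shards[name].get(key)

    def purge_schema(self, schema):
        name = self._shard(schema)
        with self._locks[name]:
            self._shards[name].clear()
```

A purge for one schema only takes that schema's lock, leaving lookups against every other schema untouched.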
It could be an extension of the configurability described here:
https://github.com/sysown/proxysql/issues/64
With `mysql_query_rules` having a `match_digest` and a `match_pattern`, directing access to the query cache hash based on the schema name could be implemented.
This should probably be gated by a separate config variable, so that queries would not have to be parsed for multiple databases used in a single query (or, if that parsing is performed, so that those queries would be cached in the global cache rather than in schema-specific hashes).
Lee