Closed: ajnavarro closed this issue 5 years ago.
Keeping track of how much memory each cache uses is potentially painful and hard (perhaps even costly). One way of inspecting the memory usage of a cache object would be using this method (https://stackoverflow.com/a/26123442) with reflection, but that may be expensive, especially if caches are big.

Instead of keeping track of used memory, we could do something similar to what in-memory joins do (specify a maximum amount of memory for the application) and generalise that so that all caches can benefit from it.
### go-mysql-server

- `MAX_MEMORY` environment variable and `max_memory` session variable. This needs to have a sensible default that's not too low or too high, and enough for the in-memory joins and everything else using caches.
- `HasAvailableMemory` helper function in the `sql` package reporting whether there is any memory available, that is, whether the process takes no more than the max memory.

```go
func HasAvailableMemory(ctx *sql.Context) bool {
	// needs to check `max_memory` session var, then `MAX_MEMORY` env var
	// then compare it with the current memory usage
}
```
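The pseudocode above could be fleshed out roughly as follows. This is only a sketch under assumed names: the session variable lookup is replaced by a plain parameter, `MAX_MEMORY` is read here as mebibytes, and current usage comes from `runtime.ReadMemStats`; none of these choices are specified in the proposal.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
)

// maxMemoryBytes resolves the configured limit: the session value takes
// precedence, then the MAX_MEMORY environment variable (interpreted here
// as mebibytes), then 0 meaning "no limit". The unit is an assumption.
func maxMemoryBytes(sessionMax uint64) uint64 {
	if sessionMax > 0 {
		return sessionMax
	}
	if s := os.Getenv("MAX_MEMORY"); s != "" {
		if n, err := strconv.ParseUint(s, 10, 64); err == nil {
			return n * 1024 * 1024
		}
	}
	return 0
}

// hasAvailableMemory reports whether the process heap is still under the limit.
func hasAvailableMemory(sessionMax uint64) bool {
	limit := maxMemoryBytes(sessionMax)
	if limit == 0 {
		return true // no limit configured
	}
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc < limit
}

func main() {
	fmt.Println(hasAvailableMemory(1 << 30)) // with a 1 GiB session limit
}
```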
### gitbase
When I say "use the new API" in all these components I'm referring to the following:
WDYT @src-d/data-processing?
A couple of things:
@ajnavarro we can pass the group by, sort, etc. iterators a cleanup callback to erase everything in the cache when they're either closed or done.
What kind of LRU do you want to move to GMS? We use a library for the LRU caches in gitbase.
@erizocosmico we can expose that LRU library through a wrapper in go-mysql-server implementing the needed memory-handling methods, avoiding repeating that in every project using go-mysql-server. We can do that if we see the need in future iterations.
Could you specify the final API, please? Thanks!
### go-mysql-server

- `MAX_MEMORY` environment variable and `max_memory` session variable. This needs to have a sensible default that's not too low or too high, and enough for the in-memory joins and everything else using caches.
- `HasAvailableMemory` helper function in the `sql` package reporting whether there is any memory available, that is, whether the process takes no more than the max memory. This function will be publicly available, but ideally clients only need to use the caches provided in the `cache` package and not this directly.

```go
func HasAvailableMemory(ctx *sql.Context) bool {
	// needs to check `max_memory` session var, then `MAX_MEMORY` env var
	// then compare it with the current memory usage
}
```
- `cache` package that will contain all possible caches required by go-mysql-server and any other app extending from it.
- `cache.LRU` an LRU cache that will put items as long as there's available memory for them, and will never empty itself when there's no memory left. By definition, it is not meant to hold every value added with `Put`.
```go
type LRU struct {
	// fields
}

func (l *LRU) Put(x uint64) error { /* code */ }
func (l *LRU) Get(x uint64) (interface{}, error) { /* code */ }
```
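To make the intended behaviour concrete, here is a toy memory-aware LRU keyed by `uint64`, as in the proposed API. The `memFree` callback stands in for `HasAvailableMemory`; the capacity bound, eviction policy details, and all helper names are assumptions, not the issue's implementation.

```go
package main

import (
	"container/list"
	"fmt"
)

type entry struct {
	key uint64
	val interface{}
}

// lru skips inserts when memFree reports no memory, but never erases
// its existing contents, matching the description of cache.LRU above.
type lru struct {
	cap     int
	memFree func() bool
	order   *list.List               // front = most recently used
	items   map[uint64]*list.Element // key -> element holding an *entry
}

func newLRU(cap int, memFree func() bool) *lru {
	return &lru{cap: cap, memFree: memFree, order: list.New(), items: map[uint64]*list.Element{}}
}

func (l *lru) Put(key uint64, val interface{}) error {
	if !l.memFree() {
		return nil // no memory: drop the insert, keep existing entries
	}
	if el, ok := l.items[key]; ok {
		el.Value.(*entry).val = val
		l.order.MoveToFront(el)
		return nil
	}
	l.items[key] = l.order.PushFront(&entry{key, val})
	if l.order.Len() > l.cap { // evict the least recently used entry
		oldest := l.order.Back()
		l.order.Remove(oldest)
		delete(l.items, oldest.Value.(*entry).key)
	}
	return nil
}

func (l *lru) Get(key uint64) (interface{}, bool) {
	el, ok := l.items[key]
	if !ok {
		return nil, false
	}
	l.order.MoveToFront(el)
	return el.Value.(*entry).val, true
}

func main() {
	c := newLRU(2, func() bool { return true })
	c.Put(1, "a")
	c.Put(2, "b")
	c.Put(3, "c") // evicts key 1
	_, ok := c.Get(1)
	fmt.Println(ok) // false
}
```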
- `cache.Rows` a cache of rows. When there is no more memory, it returns an error and empties the cache.
```go
type Rows struct {
// fields
}
func (c *Rows) Add(row sql.Row) error { /* code */ }
func (c *Rows) Get() []sql.Row { /* code */ }
```
- `cache.History` a cache that preserves all the items set using `Put`. If there is no more memory available, it will fail and erase the cache.

```go
type History struct {
	// fields
}

func (h *History) Put(x uint64) error { /* code */ }
func (h *History) Get(x uint64) (interface{}, error) { /* code */ }
```
- `cache.Hash`, which will hash whatever you pass and return a `uint64` to use as its key.
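A minimal sketch of what such a hashing helper could look like. The name `cacheKey` and the use of FNV-1a over a `%#v` formatting of the value are assumptions; a real implementation would need a stable serialisation of rows.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// cacheKey formats the value and runs FNV-1a over the bytes to
// produce a uint64 cache key. %#v is a simplification for the sketch.
func cacheKey(v interface{}) uint64 {
	h := fnv.New64a()
	fmt.Fprintf(h, "%#v", v)
	return h.Sum64()
}

func main() {
	fmt.Println(cacheKey([]interface{}{1, "foo"}) == cacheKey([]interface{}{1, "foo"})) // true
	fmt.Println(cacheKey("a") == cacheKey("b"))                                        // false
}
```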
- Make in-memory joins use `cache.Rows`.
- Make group by use `cache.Rows`.
- Make Sort use `cache.Rows`.
- Make Distinct use `cache.History`.
- Add docs about `MAX_MEMORY`.
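As an illustration of the Distinct item above, a Distinct iterator could hash each row to a `uint64` key and skip keys it has already seen. In this toy sketch, plain strings stand in for `sql.Row` and an in-memory map stands in for `cache.History`; none of this is the actual go-mysql-server code.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// distinct keeps the first occurrence of each value, using hashed
// keys as a seen-set the way a history cache would be used.
func distinct(rows []string) []string {
	seen := map[uint64]struct{}{}
	var out []string
	for _, r := range rows {
		h := fnv.New64a()
		h.Write([]byte(r))
		key := h.Sum64()
		if _, ok := seen[key]; ok {
			continue // duplicate row: already in the history
		}
		seen[key] = struct{}{}
		out = append(out, r)
	}
	return out
}

func main() {
	fmt.Println(distinct([]string{"a", "b", "a"})) // [a b]
}
```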
### gitbase
- Upgrade go-mysql-server
- Make bblfsh and any other components use `cache.LRU`.
- Add docs about `MAX_MEMORY`.
WDYT @src-d/data-processing?
@erizocosmico really nice!
One last question: it's not clear to me how caches request a given amount of memory. As far as I can see, they only check whether there is any memory left, and if there is, they can take whatever amount they need, or am I missing something? If that's the case, maybe we should change `HasAvailableMemory` to something like:
```go
func HasAvailableMemory(ctx *sql.Context, requested int) (obtained int) {
	// needs to check `max_memory` session var, then `MAX_MEMORY` env var
	// then compare it with the current memory usage
}
```
where `requested` is the amount of memory the specific cache needs, and `obtained` is the amount of memory the memory manager can give to that process. WDYT?
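The granting behaviour being suggested could be sketched like this. The fixed `budget` and `used` parameters are stand-ins for the `MAX_MEMORY`/`max_memory` lookup and the current-usage measurement, which this simplified version does not perform.

```go
package main

import "fmt"

// hasAvailableMemory grants the requested bytes, clamped to
// whatever remains of the budget; 0 means nothing is available.
func hasAvailableMemory(budget, used, requested int) (obtained int) {
	free := budget - used
	if free <= 0 {
		return 0
	}
	if requested < free {
		return requested
	}
	return free
}

func main() {
	fmt.Println(hasAvailableMemory(100, 60, 30)) // 30: fully granted
	fmt.Println(hasAvailableMemory(100, 90, 30)) // 10: clamped to what's left
}
```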
@ajnavarro if we want to do that, we need to measure how much memory whatever we want to put into the cache takes (which might not be accurate, because we don't know what other operations the underlying cache implementation could be doing) using reflection.
The easiest approach is just:
Updated as recently discussed with the memory manager.
- `MAX_MEMORY` environment variable and `max_memory` session variable. This needs to have a sensible default that's not too low or too high, and enough for the in-memory joins and everything else using caches.
- `memory` package that will contain all components required to manage memory.
- `HasAvailable` helper function reporting whether there is any memory available, that is, whether the process takes no more than the max memory. This function will be publicly available, but ideally clients only need to use the caches provided by the memory manager.

```go
func HasAvailable(ctx *sql.Context) bool {
	// needs to check `max_memory` session var, then `MAX_MEMORY` env var
	// then compare it with the current memory usage
}
```
- `Cache` interface common to all caches.

```go
type Cache interface {
	Dispose() error
}
```

- `Freeable` interface for caches that can be freed.

```go
type Freeable interface {
	Free() error
}
```
- `Manager`, which will be created in the SQL engine and passed down to all nodes from `*sql.Context`.

```go
// Manager is in charge of keeping track and managing all the components that operate
// in memory. There should only be one instance of a memory manager running at the
// same time in each process.
type Manager struct {
	// fields
}

// DisposeFunc is a function to completely erase a cache and remove it from the manager.
type DisposeFunc func() error

type KeyValueCache interface {
	Cache
	// Put a new value in the cache. If there is no more memory and the cache is
	// not Freeable, it will try to free some memory from other freeable caches.
	// If there's still no memory available, it will fail and erase all the
	// contents of the cache.
	// If it's Freeable, it will be freed, then the new value will be inserted.
	Put(uint64, interface{}) error
	// Get the value with the given key.
	Get(uint64) (interface{}, error)
}

// RowsCache is a cache of rows.
type RowsCache interface {
	Cache
	// Add a new row to the cache. If there is no memory available, it will try to
	// free some memory. If after that there is still no memory available, it
	// will return an error and erase all the content of the cache.
	Add(sql.Row) error
	// Get all rows.
	Get() []sql.Row
}

// NewLRUCache returns an empty LRU cache and a function to dispose of it when
// it's no longer needed.
func (m *Manager) NewLRUCache() (KeyValueCache, DisposeFunc) { /* impl */ }

// NewHistoryCache returns an empty history cache and a function to dispose of it
// when it's no longer needed.
func (m *Manager) NewHistoryCache() (KeyValueCache, DisposeFunc) { /* impl */ }

// NewRowsCache returns an empty rows cache and a function to dispose of it when
// it's no longer needed.
func (m *Manager) NewRowsCache() (RowsCache, DisposeFunc) { /* impl */ }

// Free the memory of all freeable caches.
func (m *Manager) Free() error { /* impl */ }
```
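A minimal sketch of how this wiring could work: caches register themselves with the manager on creation, the returned `DisposeFunc` unregisters them, and `Free` walks every freeable cache. The registration mechanism and all names beyond the issue's API sketch are invented for illustration.

```go
package main

import "fmt"

// Freeable is the proposed interface for caches whose memory can be reclaimed.
type Freeable interface{ Free() error }

// DisposeFunc erases a cache and removes it from the manager.
type DisposeFunc func() error

// Manager tracks all registered freeable caches.
type Manager struct {
	caches map[int]Freeable
	next   int
}

func NewManager() *Manager { return &Manager{caches: map[int]Freeable{}} }

// lruCache is a trivial stand-in for the real cache implementation.
type lruCache struct{ items map[uint64]interface{} }

func (l *lruCache) Free() error {
	l.items = map[uint64]interface{}{}
	return nil
}

// NewLRUCache registers a new cache and returns it with its DisposeFunc.
func (m *Manager) NewLRUCache() (*lruCache, DisposeFunc) {
	c := &lruCache{items: map[uint64]interface{}{}}
	id := m.next
	m.next++
	m.caches[id] = c
	return c, func() error {
		delete(m.caches, id)
		return nil
	}
}

// Free frees the memory of every registered freeable cache.
func (m *Manager) Free() error {
	for _, c := range m.caches {
		if err := c.Free(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	m := NewManager()
	c, dispose := m.NewLRUCache()
	c.items[1] = "x"
	m.Free()
	fmt.Println(len(c.items)) // 0: Free emptied the cache
	dispose()
	fmt.Println(len(m.caches)) // 0: disposed caches are unregistered
}
```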
- `lruCache` an LRU cache that will put items as long as there's available memory for them, and will never empty itself when there's no memory left. By definition, it is not meant to hold every value added with `Put`.
```go
type lruCache struct {
memory Freeable // in case it needs to free memory from the manager
// fields
}
func (l *lruCache) Put(x uint64) error { /* code */ }
func (l *lruCache) Get(x uint64) (interface{}, error) { /* code */ }
```
- `rowsCache` a cache of rows. When there is no more memory, it returns an error and empties the cache.

```go
type rowsCache struct {
	memory Freeable // in case it needs to free memory from the manager
	// fields
}

func (c *rowsCache) Add(row sql.Row) error { /* code */ }
func (c *rowsCache) Get() []sql.Row { /* code */ }
```
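A toy version of the described `rowsCache` behaviour, with a callback standing in for the manager's memory check and a slice of `interface{}` values standing in for `sql.Row`; the names are assumptions, not the merged code.

```go
package main

import (
	"errors"
	"fmt"
)

type row []interface{} // stand-in for sql.Row

var errNoMemory = errors.New("no memory available")

// rowsCache appends rows while hasMemory allows it; when memory runs
// out it erases all its contents and returns an error, as described.
type rowsCache struct {
	hasMemory func() bool
	rows      []row
}

func (c *rowsCache) Add(r row) error {
	if !c.hasMemory() {
		c.rows = nil // erase all contents of the cache
		return errNoMemory
	}
	c.rows = append(c.rows, r)
	return nil
}

func (c *rowsCache) Get() []row { return c.rows }

func main() {
	calls := 0
	c := &rowsCache{hasMemory: func() bool { calls++; return calls <= 2 }}
	c.Add(row{1})
	c.Add(row{2})
	err := c.Add(row{3}) // third call: memory exhausted
	fmt.Println(err, len(c.Get()))
}
```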
- `historyCache` a cache that preserves all the items set using Put. If there is no more memory available, it will fail and erase the cache.
```go
type historyCache struct {
memory Freeable // in case it needs to free memory from the manager
// fields
}
func (h *historyCache) Put(x uint64) error { /* code */ }
func (h *historyCache) Get(x uint64) (interface{}, error) { /* code */ }
```
- `memory.CacheKey`, which will hash whatever you pass and return a `uint64` to use as its key.
- Add docs about `MAX_MEMORY`.

WDYT @src-d/data-processing?
This has already been implemented and merged.
We have several ways to improve performance using different kinds of caches. For example, we have the git object cache, an LRU cache for the language functions, an LRU cache for the UAST functions, in-memory joins, and so on.
The only one whose behaviour depends on its size is the git object cache. The others work on the number of elements, leading to problems when memory is limited: src-d/gitbase#922
This issue proposes a better solution: making a global total cache memory configuration possible and, depending on the use case, allowing fine-grained usage of that memory by all the caches.