Closed: ajnavarro closed this issue 5 years ago.
Keeping track of how much memory each cache uses is potentially painful and hard (perhaps even costly). One way of inspecting the memory usage of a cache object would be using this method (https://stackoverflow.com/a/26123442) with reflection, but that may be expensive, especially if caches are big.

Instead of keeping track of used memory, we could do something similar to what in-memory joins do (specify a maximum amount of memory for the application) and generalise that so that all caches can benefit from it.
### go-mysql-server

- `MAX_MEMORY` environment variable and `max_memory` session variable. This needs to have a sensible default that's not too low or too high, and enough for the in-memory joins and everything else using caches.
- `HasAvailableMemory` helper function in the `sql` package reporting whether there is any memory available, that is, whether the process takes no more than the max memory.

```go
func HasAvailableMemory(ctx *sql.Context) bool {
	// needs to check `max_memory` session var, then `MAX_MEMORY` env var
	// then compare it with the current memory usage
}
```
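The pseudocode above could be fleshed out roughly as follows. This is only a sketch under assumed names: the session variable lookup is replaced by a plain parameter, `MAX_MEMORY` is read here as mebibytes, and current usage comes from `runtime.ReadMemStats`; none of these choices are specified in the proposal.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
)

// maxMemoryBytes resolves the configured limit: the session value takes
// precedence, then the MAX_MEMORY environment variable (interpreted here
// as mebibytes), then 0 meaning "no limit". The unit is an assumption.
func maxMemoryBytes(sessionMax uint64) uint64 {
	if sessionMax > 0 {
		return sessionMax
	}
	if s := os.Getenv("MAX_MEMORY"); s != "" {
		if n, err := strconv.ParseUint(s, 10, 64); err == nil {
			return n * 1024 * 1024
		}
	}
	return 0
}

// hasAvailableMemory reports whether the process heap is still under the limit.
func hasAvailableMemory(sessionMax uint64) bool {
	limit := maxMemoryBytes(sessionMax)
	if limit == 0 {
		return true // no limit configured
	}
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc < limit
}

func main() {
	fmt.Println(hasAvailableMemory(1 << 30)) // with a 1 GiB session limit
}
```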
### gitbase
When I say "use the new API" in all these components I'm referring to the following:
WDYT @src-d/data-processing?
A couple of things:
@ajnavarro we can pass the group by, sort, etc. iterators a cleanup callback to erase everything in the cache when they're either closed or done.
What kind of LRU do you want to move to GMS? We use a library for the LRU caches in gitbase.
@erizocosmico we can expose that LRU library through a wrapper in go-mysql-server implementing the needed memory-handling methods, avoiding repeating that in every project using go-mysql-server. We can do that if we see the need in future iterations.
Could you specify the final API, please? Thanks!
### go-mysql-server

- `MAX_MEMORY` environment variable and `max_memory` session variable. This needs to have a sensible default that's not too low or too high, and enough for the in-memory joins and everything else using caches.
- `HasAvailableMemory` helper function in the `sql` package reporting whether there is any memory available, that is, whether the process takes no more than the max memory. This function will be publicly available, but ideally clients only need to use the caches provided in the `cache` package and not this directly.

```go
func HasAvailableMemory(ctx *sql.Context) bool {
	// needs to check `max_memory` session var, then `MAX_MEMORY` env var
	// then compare it with the current memory usage
}
```
- `cache` package that will contain all possible caches required by go-mysql-server and any other app extending from it.
- `cache.LRU` an LRU cache that will put items as long as there's available memory for them, and will never empty itself when there's no memory left. By definition, it is not meant to hold every value added with `Put`.
```go
type LRU struct {
	// fields
}

func (l *LRU) Put(x uint64) error { /* code */ }
func (l *LRU) Get(x uint64) (interface{}, error) { /* code */ }
```
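To make the intended behaviour concrete, here is a toy memory-aware LRU keyed by `uint64`, as in the proposed API. The `memFree` callback stands in for `HasAvailableMemory`; the capacity bound, eviction policy details, and all helper names are assumptions, not the issue's implementation.

```go
package main

import (
	"container/list"
	"fmt"
)

type entry struct {
	key uint64
	val interface{}
}

// lru skips inserts when memFree reports no memory, but never erases
// its existing contents, matching the description of cache.LRU above.
type lru struct {
	cap     int
	memFree func() bool
	order   *list.List               // front = most recently used
	items   map[uint64]*list.Element // key -> element holding an *entry
}

func newLRU(cap int, memFree func() bool) *lru {
	return &lru{cap: cap, memFree: memFree, order: list.New(), items: map[uint64]*list.Element{}}
}

func (l *lru) Put(key uint64, val interface{}) error {
	if !l.memFree() {
		return nil // no memory: drop the insert, keep existing entries
	}
	if el, ok := l.items[key]; ok {
		el.Value.(*entry).val = val
		l.order.MoveToFront(el)
		return nil
	}
	l.items[key] = l.order.PushFront(&entry{key, val})
	if l.order.Len() > l.cap { // evict the least recently used entry
		oldest := l.order.Back()
		l.order.Remove(oldest)
		delete(l.items, oldest.Value.(*entry).key)
	}
	return nil
}

func (l *lru) Get(key uint64) (interface{}, bool) {
	el, ok := l.items[key]
	if !ok {
		return nil, false
	}
	l.order.MoveToFront(el)
	return el.Value.(*entry).val, true
}

func main() {
	c := newLRU(2, func() bool { return true })
	c.Put(1, "a")
	c.Put(2, "b")
	c.Put(3, "c") // evicts key 1
	_, ok := c.Get(1)
	fmt.Println(ok) // false
}
```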
- `cache.Rows` a cache of rows. When there is no more memory, it returns an error and empties the cache.
```go
type Rows struct {
// fields
}
func (c *Rows) Add(row sql.Row) error { /* code */ }
func (c *Rows) Get() []sql.Row { /* code */ }
```
- `cache.History` a cache that preserves all the items set using `Put`. If there is no more memory available, it will fail and erase the cache.

```go
type History struct {
	// fields
}

func (h *History) Put(x uint64) error { /* code */ }
func (h *History) Get(x uint64) (interface{}, error) { /* code */ }
```
- `cache.Hash`, which will hash whatever you pass and return a `uint64` to use as its key.
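A minimal sketch of what such a hashing helper could look like. The name `cacheKey` and the use of FNV-1a over a `%#v` formatting of the value are assumptions; a real implementation would need a stable serialisation of rows.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// cacheKey formats the value and runs FNV-1a over the bytes to
// produce a uint64 cache key. %#v is a simplification for the sketch.
func cacheKey(v interface{}) uint64 {
	h := fnv.New64a()
	fmt.Fprintf(h, "%#v", v)
	return h.Sum64()
}

func main() {
	fmt.Println(cacheKey([]interface{}{1, "foo"}) == cacheKey([]interface{}{1, "foo"})) // true
	fmt.Println(cacheKey("a") == cacheKey("b"))                                        // false
}
```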
- Make in-memory joins use `cache.Rows`.
- Make group by use `cache.Rows`.
- Make Sort use `cache.Rows`.
- Make Distinct use `cache.History`.
- Add docs about `MAX_MEMORY`.
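As an illustration of the Distinct item above, a Distinct iterator could hash each row to a `uint64` key and skip keys it has already seen. In this toy sketch, plain strings stand in for `sql.Row` and an in-memory map stands in for `cache.History`; none of this is the actual go-mysql-server code.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// distinct keeps the first occurrence of each value, using hashed
// keys as a seen-set the way a history cache would be used.
func distinct(rows []string) []string {
	seen := map[uint64]struct{}{}
	var out []string
	for _, r := range rows {
		h := fnv.New64a()
		h.Write([]byte(r))
		key := h.Sum64()
		if _, ok := seen[key]; ok {
			continue // duplicate row: already in the history
		}
		seen[key] = struct{}{}
		out = append(out, r)
	}
	return out
}

func main() {
	fmt.Println(distinct([]string{"a", "b", "a"})) // [a b]
}
```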
### gitbase
- Upgrade go-mysql-server
- Make bblfsh and any other components use `cache.LRU`.
- Add docs about `MAX_MEMORY`.
WDYT @src-d/data-processing?
@erizocosmico really nice!
One last question: it's not clear to me how caches request a given amount of memory. As far as I can see, they only check whether there is any memory left, and if there is, they can take whatever amount they need, or am I missing something? If that's the case, maybe we should change `HasAvailableMemory` to something like:
```go
func HasAvailableMemory(ctx *sql.Context, requested int) (obtained int) {
	// needs to check `max_memory` session var, then `MAX_MEMORY` env var
	// then compare it with the current memory usage
}
```
where `requested` is the amount of memory the specific cache needs, and `obtained` is the amount of memory the memory manager can give to that process. WDYT?
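The granting behaviour being suggested could be sketched like this. The fixed `budget` and `used` parameters are stand-ins for the `MAX_MEMORY`/`max_memory` lookup and the current-usage measurement, which this simplified version does not perform.

```go
package main

import "fmt"

// hasAvailableMemory grants the requested bytes, clamped to
// whatever remains of the budget; 0 means nothing is available.
func hasAvailableMemory(budget, used, requested int) (obtained int) {
	free := budget - used
	if free <= 0 {
		return 0
	}
	if requested < free {
		return requested
	}
	return free
}

func main() {
	fmt.Println(hasAvailableMemory(100, 60, 30)) // 30: fully granted
	fmt.Println(hasAvailableMemory(100, 90, 30)) // 10: clamped to what's left
}
```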
@ajnavarro if we want to do that, we need to measure how much memory whatever we want to put into the cache takes (which might not be accurate, because we don't know what other operations the underlying cache implementation could be doing) using reflection.
The easiest approach is just:
Updated as recently discussed with the memory manager.
- `MAX_MEMORY` environment variable and `max_memory` session variable. This needs to have a sensible default that's not too low or too high, and enough for the in-memory joins and everything else using caches.
- `memory` package that will contain all components required to manage memory.
- `HasAvailable` helper function reporting whether there is any memory available, that is, whether the process takes no more than the max memory. This function will be publicly available, but ideally clients only need to use the caches provided by the memory manager.

```go
func HasAvailable(ctx *sql.Context) bool {
	// needs to check `max_memory` session var, then `MAX_MEMORY` env var
	// then compare it with the current memory usage
}
```
- `Cache` interface common to all caches.

```go
type Cache interface {
	Dispose() error
}
```

- `Freeable` interface for caches that can be freed.

```go
type Freeable interface {
	Free() error
}
```
- `Manager`, which will be created in the SQL engine and passed down to all nodes from `*sql.Context`.

```go
// Manager is in charge of keeping track and managing all the components that operate
// in memory. There should only be one instance of a memory manager running at the
// same time in each process.
type Manager struct {
	// fields
}

// DisposeFunc is a function to completely erase a cache and remove it from the manager.
type DisposeFunc func() error

type KeyValueCache interface {
	Cache
	// Put a new value in the cache. If there is no more memory and the cache is
	// not Freeable, it will try to free some memory from other freeable caches.
	// If there's still no memory available, it will fail and erase all the
	// contents of the cache.
	// If it's Freeable, it will be freed, then the new value will be inserted.
	Put(uint64, interface{}) error
	// Get the value with the given key.
	Get(uint64) (interface{}, error)
}

// RowsCache is a cache of rows.
type RowsCache interface {
	Cache
	// Add a new row to the cache. If there is no memory available, it will try to
	// free some memory. If after that there is still no memory available, it
	// will return an error and erase all the content of the cache.
	Add(sql.Row) error
	// Get all rows.
	Get() []sql.Row
}

// NewLRUCache returns an empty LRU cache and a function to dispose of it when
// it's no longer needed.
func (m *Manager) NewLRUCache() (KeyValueCache, DisposeFunc) { /* impl */ }

// NewHistoryCache returns an empty history cache and a function to dispose of it
// when it's no longer needed.
func (m *Manager) NewHistoryCache() (KeyValueCache, DisposeFunc) { /* impl */ }

// NewRowsCache returns an empty rows cache and a function to dispose of it when
// it's no longer needed.
func (m *Manager) NewRowsCache() (RowsCache, DisposeFunc) { /* impl */ }

// Free the memory of all freeable caches.
func (m *Manager) Free() error { /* impl */ }
```
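A minimal sketch of how this wiring could work: caches register themselves with the manager on creation, the returned `DisposeFunc` unregisters them, and `Free` walks every freeable cache. The registration mechanism and all names beyond the issue's API sketch are invented for illustration.

```go
package main

import "fmt"

// Freeable is the proposed interface for caches whose memory can be reclaimed.
type Freeable interface{ Free() error }

// DisposeFunc erases a cache and removes it from the manager.
type DisposeFunc func() error

// Manager tracks all registered freeable caches.
type Manager struct {
	caches map[int]Freeable
	next   int
}

func NewManager() *Manager { return &Manager{caches: map[int]Freeable{}} }

// lruCache is a trivial stand-in for the real cache implementation.
type lruCache struct{ items map[uint64]interface{} }

func (l *lruCache) Free() error {
	l.items = map[uint64]interface{}{}
	return nil
}

// NewLRUCache registers a new cache and returns it with its DisposeFunc.
func (m *Manager) NewLRUCache() (*lruCache, DisposeFunc) {
	c := &lruCache{items: map[uint64]interface{}{}}
	id := m.next
	m.next++
	m.caches[id] = c
	return c, func() error {
		delete(m.caches, id)
		return nil
	}
}

// Free frees the memory of every registered freeable cache.
func (m *Manager) Free() error {
	for _, c := range m.caches {
		if err := c.Free(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	m := NewManager()
	c, dispose := m.NewLRUCache()
	c.items[1] = "x"
	m.Free()
	fmt.Println(len(c.items)) // 0: Free emptied the cache
	dispose()
	fmt.Println(len(m.caches)) // 0: disposed caches are unregistered
}
```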
- `lruCache` an LRU cache that will put items as long as there's available memory for them, and will never empty itself when there's no memory left. By definition, it is not meant to hold every value added with `Put`.
```go
type lruCache struct {
memory Freeable // in case it needs to free memory from the manager
// fields
}
func (l *lruCache) Put(x uint64) error { /* code */ }
func (l *lruCache) Get(x uint64) (interface{}, error) { /* code */ }
```
- `rowsCache` a cache of rows. When there is no more memory, it returns an error and empties the cache.

```go
type rowsCache struct {
	memory Freeable // in case it needs to free memory from the manager
	// fields
}

func (c *rowsCache) Add(row sql.Row) error { /* code */ }
func (c *rowsCache) Get() []sql.Row { /* code */ }
```
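A toy version of the described `rowsCache` behaviour, with a callback standing in for the manager's memory check and a slice of `interface{}` values standing in for `sql.Row`; the names are assumptions, not the merged code.

```go
package main

import (
	"errors"
	"fmt"
)

type row []interface{} // stand-in for sql.Row

var errNoMemory = errors.New("no memory available")

// rowsCache appends rows while hasMemory allows it; when memory runs
// out it erases all its contents and returns an error, as described.
type rowsCache struct {
	hasMemory func() bool
	rows      []row
}

func (c *rowsCache) Add(r row) error {
	if !c.hasMemory() {
		c.rows = nil // erase all contents of the cache
		return errNoMemory
	}
	c.rows = append(c.rows, r)
	return nil
}

func (c *rowsCache) Get() []row { return c.rows }

func main() {
	calls := 0
	c := &rowsCache{hasMemory: func() bool { calls++; return calls <= 2 }}
	c.Add(row{1})
	c.Add(row{2})
	err := c.Add(row{3}) // third call: memory exhausted
	fmt.Println(err, len(c.Get()))
}
```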
- `historyCache` a cache that preserves all the items set using Put. If there is no more memory available, it will fail and erase the cache.
```go
type historyCache struct {
memory Freeable // in case it needs to free memory from the manager
// fields
}
func (h *historyCache) Put(x uint64) error { /* code */ }
func (h *historyCache) Get(x uint64) (interface{}, error) { /* code */ }
```
- `memory.CacheKey`, which will hash whatever you pass and return a `uint64` to use as its key.
- Add docs about `MAX_MEMORY`.

WDYT @src-d/data-processing?
This has already been implemented and merged.
We have several ways to improve performance using different kinds of caches. For example, we have the git object cache, an LRU cache for the language functions, an LRU cache for the UAST functions, in-memory joins, and so on.
The only one whose behaviour depends on its size is the git object cache. The others work on the number of elements, leading to problems when memory is limited: src-d/gitbase#922
This issue proposes a better solution: making a global total cache memory configuration possible and, depending on the use case, allowing fine-grained usage of that memory by all the caches.