butonic opened 1 month ago
There are some things I've noticed around the `GetTags` request that could be causing a lot of load.

I'm not sure what the expected scenario is, but I don't think `GetTags` (https://github.com/owncloud/ocis/blob/master/services/graph/pkg/service/v0/tags.go#L21) will scale properly in large installations.
If we assume 1 million files, with 500k of them tagged and only 30 different tags (which seem like reasonable numbers), we'd need to gather the information from 500k files in the search service, send it to the graph service, and extract the 30 different tags from those files. This means we need to transfer a lot of data from one service to another (wasting time in networking), and we need to traverse all of those results to get just a few tags (too much work for very little reward). Note that memory consumption might also be a problem, because all of that data will be in memory. It gets worse if we consider multiple requests from different clients made in parallel. If this is the main scenario we want to tackle, I think we need to find a different approach. Caching data might solve the problem if we manage to keep the cache up to date.
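To make the cost concrete, here is a minimal sketch of the shape of that work (the types and names are illustrative, not the actual ocis code): every tagged file's metadata has to travel to the graph service just so a handful of distinct tag names can be extracted from it.

```go
package main

import "fmt"

// fileInfo stands in for whatever per-file metadata the search service
// returns; only the Tags field matters for GetTags.
type fileInfo struct {
	ID   string
	Tags []string
}

// distinctTags walks every result to deduplicate tag names. With 500k
// tagged files this loops over 500k entries (all held in memory at once)
// to end up with ~30 unique strings.
func distinctTags(files []fileInfo) []string {
	seen := make(map[string]struct{})
	tags := make([]string, 0, 32)
	for _, f := range files {
		for _, t := range f.Tags {
			if _, ok := seen[t]; !ok {
				seen[t] = struct{}{}
				tags = append(tags, t)
			}
		}
	}
	return tags
}

func main() {
	files := []fileInfo{
		{ID: "a", Tags: []string{"invoice", "2023"}},
		{ID: "b", Tags: []string{"invoice"}},
	}
	fmt.Println(distinctTags(files)) // [invoice 2023]
}
```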
On the other hand, if we expect to have only 100-200 files tagged, the current approach might be good enough: much less data transferred from one service to another, less memory usage, and less processing needed. If this is a limitation, I think we should document it to make people aware that performance will degrade the more files are tagged.
Going deeper into the search service, I see some potential problems in the `Search` method (https://github.com/owncloud/ocis/blob/master/services/search/pkg/search/service.go#L85).
The search is divided into up to 20 workers, each one doing a piece of the search. This seems fine on paper, but I'm not sure how well it works in practice, mostly because of the additional work we're doing: we spawn up to 20 search requests and wait for the results, but then we need to merge those results, sort them, and keep only the first X.
If we assume 5 workers and each worker returns 100k results, we're copying those results (I assume we're copying pointers, so probably not so bad, but still 500k of them) and then sorting all 500k results (likely expensive).
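A minimal sketch of that merge step, assuming it boils down to concatenate-then-sort (the types are illustrative, not the actual ocis code):

```go
package main

import (
	"fmt"
	"sort"
)

type match struct {
	ID    string
	Score float64
}

// mergeAndRank concatenates every worker's result slice, sorts the whole
// union, and keeps only the first `limit` entries.
func mergeAndRank(perWorker [][]*match, limit int) []*match {
	var merged []*match
	for _, ms := range perWorker {
		// 5 workers x 100k matches = 500k pointer copies before any ranking.
		merged = append(merged, ms...)
	}
	// O(n log n) over all 500k entries, even if the caller wants only 100.
	sort.Slice(merged, func(i, j int) bool {
		return merged[i].Score > merged[j].Score
	})
	if len(merged) > limit {
		merged = merged[:limit]
	}
	return merged
}

func main() {
	a := []*match{{"a", 0.9}, {"b", 0.1}}
	b := []*match{{"c", 0.5}}
	for _, m := range mergeAndRank([][]*match{a, b}, 2) {
		fmt.Println(m.ID, m.Score)
	}
}
```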
It's unclear to me if splitting the search work into multiple requests is better than letting bleve handle one big request. I'm not sure, but it seems bleve will go through the same data several times (once per request), and that could be slower than going through the data only once, despite not doing parallel requests. I'd vote for letting bleve do its work unless there is a good reason not to.
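For comparison, a sketch of the single-request alternative (the index path, field name, and page size are assumptions): bleve ranks and paginates internally, so only the requested page crosses the API boundary.

```go
package main

import (
	"fmt"

	"github.com/blevesearch/bleve/v2"
)

func search(idx bleve.Index) error {
	// One query over the whole index instead of a fan-out per space.
	query := bleve.NewQueryStringQuery("Tags:invoice")
	req := bleve.NewSearchRequest(query)
	req.Size = 100 // only this page is materialized for the caller
	req.From = 0
	res, err := idx.Search(req)
	if err != nil {
		return err
	}
	for _, hit := range res.Hits {
		fmt.Println(hit.ID, hit.Score)
	}
	return nil
}

func main() {
	idx, err := bleve.Open("/tmp/search.bleve") // assumed existing index
	if err != nil {
		panic(err)
	}
	defer idx.Close()
	if err := search(idx); err != nil {
		panic(err)
	}
}
```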
In any case, there is another problem with the workers: the maximum of 20 workers applies only to the specific request; the worker pool isn't shared among requests. This means that, in the worst case, each request spawns 20 workers, so 20 parallel requests could mean 400 goroutines just for searching.
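One way to make the cap global rather than per request (an assumption about how it could be done, not the current code) is a semaphore shared by all requests, for example a buffered channel: goroutines are still spawned per space, but at most 20 searches run concurrently across the whole service.

```go
package main

import (
	"fmt"
	"sync"
)

// searchSlots is shared by every request, so 20 is a service-wide ceiling
// on concurrently running searches, not a per-request one.
var searchSlots = make(chan struct{}, 20)

func searchSpace(space string) string {
	searchSlots <- struct{}{}        // acquire a slot (blocks while 20 are busy)
	defer func() { <-searchSlots }() // release it when done
	return "results for " + space    // stand-in for the actual bleve search
}

func handleRequest(spaces []string) []string {
	results := make([]string, len(spaces))
	var wg sync.WaitGroup
	for i, s := range spaces {
		wg.Add(1)
		go func(i int, space string) {
			defer wg.Done()
			results[i] = searchSpace(space)
		}(i, s)
	}
	wg.Wait()
	return results
}

func main() {
	fmt.Println(handleRequest([]string{"personal", "project-x"}))
}
```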
Some ideas to improve the situation here:

Taking into account both previous points, I only see a couple of options to reduce the load caused by the `GetTags` request:

- Treat `GetTags` as a heavy operation and set up (and maintain) a cache for it (a sketch follows below).
- Use another index for the tags (discussed further down).

For the add and remove operations for tags, I'll need a deeper look. I assume that bleve has some write locks somewhere and there could be delays on some operations, but the ones I've found seem to be unlocked fast (I don't think the operations inside the write locks should take a lot of time), so maybe they collide with some read locks from the `GetTags` operation?
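On the caching option: a minimal sketch of what such a cache could look like, assuming a TTL plus explicit invalidation from the tag add/remove handlers is enough to keep it up to date (all names here are hypothetical):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// tagCache memoizes the expensive GetTags lookup. The hard part called out
// above (keeping it up to date) is handled with a TTL plus explicit
// invalidation from the code paths that change tags.
type tagCache struct {
	mu      sync.RWMutex
	tags    []string
	ttl     time.Duration
	expires time.Time
}

// Get returns the cached tags, calling load (the expensive search-service
// round trip) only when the cached value has expired.
func (c *tagCache) Get(load func() ([]string, error)) ([]string, error) {
	c.mu.RLock()
	if time.Now().Before(c.expires) {
		tags := c.tags
		c.mu.RUnlock()
		return tags, nil
	}
	c.mu.RUnlock()

	c.mu.Lock()
	defer c.mu.Unlock()
	if time.Now().Before(c.expires) { // another goroutine refreshed it already
		return c.tags, nil
	}
	tags, err := load()
	if err != nil {
		return nil, err
	}
	c.tags, c.expires = tags, time.Now().Add(c.ttl)
	return tags, nil
}

// Invalidate would be called from the tag add/remove handlers.
func (c *tagCache) Invalidate() {
	c.mu.Lock()
	c.expires = time.Time{}
	c.mu.Unlock()
}

func main() {
	c := &tagCache{ttl: 30 * time.Second}
	tags, _ := c.Get(func() ([]string, error) {
		return []string{"invoice", "2023"}, nil // stand-in for the real lookup
	})
	fmt.Println(tags)
}
```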
We might use another index for the tags. While it could improve performance, it also has some drawbacks, and we might need to reach a compromise (a sketch of the idea follows at the end of this section).
Advantages of a new index for the tags:
Disadvantages:
A reasonable compromise could be to only let the admin completely remove tags. Anyone can create a new tag, but that tag will remain available forever, even if no file has it. Anyone can add or remove that tag from any file (assuming the user has permission to do so). However, only the admin can completely remove the tag, which also removes it from all files.
Step 4 could take some time to complete, but after that users won't be able to see the "company-only" tag as part of the available tags (although they could still create it again).
Note that the admin flow could be implemented through the command line.
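To make the separate-index idea more tangible, here is a sketch of what a dedicated bleve index for tags might look like (paths, names, and the update strategy are assumptions): the tags themselves become the documents, writes happen when tags are assigned, and `GetTags` turns into a match-all over ~30 documents instead of a scan over 500k file entries.

```go
package main

import (
	"fmt"

	"github.com/blevesearch/bleve/v2"
)

// tagDoc is the tiny document stored per tag.
type tagDoc struct {
	Name string `json:"name"`
}

func openTagsIndex(path string) (bleve.Index, error) {
	idx, err := bleve.Open(path)
	if err != nil {
		// Simplified: real code should distinguish "index does not exist"
		// from other errors before creating a fresh one.
		return bleve.New(path, bleve.NewIndexMapping())
	}
	return idx, nil
}

// listTags is the whole GetTags replacement: a match-all over ~30 docs.
func listTags(idx bleve.Index) ([]string, error) {
	req := bleve.NewSearchRequest(bleve.NewMatchAllQuery())
	req.Size = 1000 // comfortably above the expected ~30 tags
	res, err := idx.Search(req)
	if err != nil {
		return nil, err
	}
	tags := make([]string, 0, len(res.Hits))
	for _, hit := range res.Hits {
		tags = append(tags, hit.ID) // the tag name doubles as document ID
	}
	return tags, nil
}

func main() {
	idx, err := openTagsIndex("/tmp/tags.bleve")
	if err != nil {
		panic(err)
	}
	defer idx.Close()
	// Called from the "assign tag" path. Removal is the trickier side:
	// deleting a tag document is only safe once no file references it,
	// or, per the compromise above, only when an admin forces it.
	if err := idx.Index("invoice", tagDoc{Name: "invoice"}); err != nil {
		panic(err)
	}
	tags, err := listTags(idx)
	if err != nil {
		panic(err)
	}
	fmt.Println(tags)
}
```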
When running k6 tests against a Kubernetes deployment, I see tons of requests failing in the add-remove-tag scenario: