owncloud / ocis


load testing ocis tag handling causes 90% failed requests #9821

Open butonic opened 1 month ago

butonic commented 1 month ago

When running k6 tests against a Kubernetes deployment, I see tons of requests failing in the add-remove-tag scenario:

# VUS=10 k6 run ~/cdperf/packages/k6-tests/artifacts/koko-platform-100-add-remove-tag-simple-k6.js --vus 10 --duration 30s

          /\      |‾‾| /‾‾/   /‾‾/   
     /\  /  \     |  |/  /   /  /    
    /  \/    \    |     (   /   ‾‾\  
   /          \   |  |\  \ |  (‾)  | 
  / __________ \  |__| \__\ \_____/ .io

WARN[0000] Couldn't load source map for file:///[...]/cdperf/packages/k6-tests/artifacts/koko-platform-100-add-remove-tag-simple-k6.js  error="sourcemap: mappings are empty"
  execution: local
     script: [...]/cdperf/packages/k6-tests/artifacts/koko-platform-100-add-remove-tag-simple-k6.js
     output: -

  scenarios: (100.00%) 1 scenario, 10 max VUs, 1m0s max duration (incl. graceful stop):
           * default: 10 looping VUs for 30s (gracefulStop: 30s)

     ✓ authn -> loginPageResponse - status
     ✓ authn -> authorizationResponse - status
     ✓ authn -> accessTokenResponse - status
     ✓ client -> role.getMyDrives - status
     ✓ client -> resource.getResourceProperties - status
     ✗ client -> tag.addTagToResource - status
      ↳  9% — ✓ 2 / ✗ 20
     ✓ client -> tag.getTagForResource - status
     ✗ test -> resource.getTags - name - match
      ↳  9% — ✓ 2 / ✗ 20
     ✗ client -> tag.removeTagToResource - status
      ↳  9% — ✓ 2 / ✗ 20

     checks.........................: 60.00% ✓ 90       ✗ 60  
     data_received..................: 387 kB 10 kB/s
     data_sent......................: 91 kB  2.4 kB/s
     http_req_blocked...............: avg=7.79ms  min=281ns   med=441ns    max=86.87ms  p(90)=52.03ms  p(95)=68.94ms 
     http_req_connecting............: avg=2.99ms  min=0s      med=0s       max=33.69ms  p(90)=16.62ms  p(95)=29.69ms 
     http_req_duration..............: avg=1.97s   min=13.38ms med=377.49ms max=13.07s   p(90)=7.81s    p(95)=11.39s  
       { expected_response:true }...: avg=2.49s   min=13.38ms med=303.56ms max=13.07s   p(90)=7.94s    p(95)=12.37s  
     http_req_failed................: 25.31% ✓ 40       ✗ 118 
     http_req_receiving.............: avg=93.43µs min=27.76µs med=79.2µs   max=386.42µs p(90)=153.87µs p(95)=201.4µs 
     http_req_sending...............: avg=85.78µs min=30.49µs med=73.65µs  max=285.45µs p(90)=145.71µs p(95)=182.22µs
     http_req_tls_handshaking.......: avg=3.78ms  min=0s      med=0s       max=42.71ms  p(90)=20.96ms  p(95)=36.86ms 
     http_req_waiting...............: avg=1.97s   min=13.26ms med=377.34ms max=13.07s   p(90)=7.81s    p(95)=11.39s  
     http_reqs......................: 158    4.131202/s
     iteration_duration.............: avg=16.73s  min=6.16s   med=11.31s   max=26.69s   p(90)=25.91s   p(95)=25.95s  
     iterations.....................: 22     0.575231/s
     vus............................: 1      min=1      max=10
     vus_max........................: 10     min=10     max=10

running (0m38.2s), 00/10 VUs, 22 complete and 0 interrupted iterations
default ✓ [======================================] 10 VUs  30s
jvillafanez commented 1 month ago

There are some things I've noticed around the GetTags request that could be causing a lot of load.

I'm not sure what the expected scenario is, but I don't think GetTags (https://github.com/owncloud/ocis/blob/master/services/graph/pkg/service/v0/tags.go#L21) will scale properly in large installations.

If we assume 1 million files, with 500k of them tagged and only 30 different tags (which seem like reasonable numbers), we'd need to gather the information for those 500k files in the search service, send it to the graph service, and extract the 30 distinct tags from the results. This means we transfer a lot of data from one service to another (wasting time on networking) and we traverse all of those results to get just a few tags (a lot of work for very little reward). Memory consumption might also be a problem, because all of that data will be held in memory, and it gets worse when multiple clients send requests in parallel. If this is the main scenario we want to tackle, I think we need to find a different approach. Caching the data might solve the problem if we manage to keep the cache up to date.
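
For illustration, here is a minimal sketch of that caching idea. The `TagCache` type and its methods are invented for this sketch (they don't exist in ocis): the set of known tags is kept in memory and adjusted on every add/remove, so listing tags never has to scan tagged files. Keeping it consistent across restarts and multiple service instances is the hard part and is not solved here.

```go
package tags

import "sync"

// TagCache keeps the distinct tags in memory so listing them does not
// require querying the search service for every tagged file.
type TagCache struct {
	mu     sync.RWMutex
	counts map[string]int // tag name -> number of resources using it
}

func NewTagCache() *TagCache {
	return &TagCache{counts: make(map[string]int)}
}

// OnTagAdded would be called from the add-tag handler.
func (c *TagCache) OnTagAdded(tag string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[tag]++
}

// OnTagRemoved would be called from the remove-tag handler.
func (c *TagCache) OnTagRemoved(tag string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.counts[tag] > 1 {
		c.counts[tag]--
	} else {
		delete(c.counts, tag)
	}
}

// List returns the distinct tags without touching the search service.
func (c *TagCache) List() []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	out := make([]string, 0, len(c.counts))
	for t := range c.counts {
		out = append(out, t)
	}
	return out
}
```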

On the other hand, if we expect only 100-200 files to be tagged, the current approach might be good enough: much less data transferred between the services, less memory usage, and less processing. If that is the intended limit, I think we should document it so people are aware that performance will degrade the more files are tagged.


Going deeper into the search service, I see some potential problems in the Search method (https://github.com/owncloud/ocis/blob/master/services/search/pkg/search/service.go#L85).

The search is split across up to 20 workers, each one handling a piece of the search. This seems fine on paper, but I'm not sure how well it works in practice, mostly because of the additional work it creates: we spawn up to 20 search requests and wait for the results, and then we need to merge those results, sort them, and take the first X.

If we assume 5 workers and each worker returns 100k results, we're copying those results (I assume we're copying pointers, so probably not so bad, but still 500k pointers) and sorting those 500k results (likely expensive).
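
As a rough sketch of that merge step (the types and function name below are hypothetical, not the actual service code), the pattern looks roughly like this; a bounded heap (`container/heap`) or a k-way merge of already-sorted worker results would avoid sorting all 500k entries just to return the top few:

```go
package search

import "sort"

// hit is a stand-in for whatever result type the search workers return per match.
type hit struct {
	ID    string
	Score float64
}

// mergeNaive mirrors the pattern described above: append every worker's full
// result set into one slice, sort it, and keep only the first `limit` entries.
// With 5 workers x 100k hits that means sorting 500k entries to return e.g. 100.
func mergeNaive(perWorker [][]*hit, limit int) []*hit {
	var all []*hit
	for _, hits := range perWorker {
		all = append(all, hits...) // only pointers are copied, but still 500k of them
	}
	sort.Slice(all, func(i, j int) bool { return all[i].Score > all[j].Score })
	if len(all) > limit {
		all = all[:limit]
	}
	return all
}
```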

It's unclear to me whether splitting the search into multiple requests is better than letting bleve handle one big request. I'm not sure, but it seems bleve will go through the same data several times (once per request), and that could be slower than going through the data only once, even without the parallel requests. I'd vote for letting bleve do its work unless there is a good reason not to.
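
For comparison, a minimal sketch of the "one big request" approach with bleve v2, assuming an index that already contains the relevant fields; this is not the actual ocis query, just the shape of letting bleve paginate and sort by itself:

```go
package search

import (
	"github.com/blevesearch/bleve/v2"
)

// searchOnce issues a single request and lets bleve return only the requested
// page, already sorted, instead of fanning the query out to several workers.
func searchOnce(indexPath, queryString string, size, from int) (*bleve.SearchResult, error) {
	idx, err := bleve.Open(indexPath)
	if err != nil {
		return nil, err
	}
	defer idx.Close()

	q := bleve.NewQueryStringQuery(queryString)
	req := bleve.NewSearchRequestOptions(q, size, from, false)
	req.SortBy([]string{"-_score"})
	return idx.Search(req)
}
```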

In any case, there is another problem with the workers: the maximum of 20 workers applies only to a single request; the worker pool isn't shared among requests. This means that, in the worst case, each request spawns 20 workers, so 20 requests in parallel could mean 400 goroutines just for searching.
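
A possible mitigation would be to make the limit process-wide instead of per request. A minimal sketch using a buffered channel as a shared semaphore (the limit of 20 is just the current per-request value reused for illustration):

```go
package search

// searchSlots is shared by all requests in the process, so 20 parallel
// requests can no longer add up to 400 concurrent search goroutines.
var searchSlots = make(chan struct{}, 20)

// runWorker blocks until one of the global slots is free and releases it
// when the work is done.
func runWorker(work func()) {
	searchSlots <- struct{}{} // acquire
	go func() {
		defer func() { <-searchSlots }() // release
		work()
	}()
}
```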

Some ideas to improve the situation here:


Taking into account both previous points, I only see a couple of options to reduce the load caused by the GetTags request:

For the add and remove operations for tags, I'll need a deeper look. I assume that bleve has some write locks somewhere and there could be delays on some operations, but the ones I've found seem to be released quickly (I don't think the operations inside the write locks should take a lot of time), so maybe they collide with some read locks from the GetTags operation?
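
To make that hypothesis concrete, here is a toy example of generic `sync.RWMutex` behaviour (not code taken from bleve or ocis): while a long GetTags-style read holds the read lock, a writer such as an add/remove operation has to wait, and new readers queue up behind the waiting writer as well:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex
	var wg sync.WaitGroup

	// Long-running "GetTags"-style read.
	wg.Add(1)
	go func() {
		defer wg.Done()
		mu.RLock()
		time.Sleep(2 * time.Second) // e.g. scanning a large result set
		mu.RUnlock()
	}()

	time.Sleep(100 * time.Millisecond)

	// An "add tag"-style write has to wait until the read finishes.
	wg.Add(1)
	go func() {
		defer wg.Done()
		start := time.Now()
		mu.Lock()
		fmt.Printf("write lock acquired after %v\n", time.Since(start))
		mu.Unlock()
	}()

	wg.Wait()
}
```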

jvillafanez commented 1 month ago

We might use another index for the tags. While it could improve performance, it also has some drawbacks, and we might need to reach a compromise.

Advantages of a new index for the tags:

Disadvantages:

A reasonable compromise could be to only let the admin completely remove tags. Anyone can create a new tag, but that tag stays available forever, even if no file carries it any more. Anyone can add or remove that tag on any file (assuming the user has permission to do so). However, only the admin can completely remove the tag, which also removes it from all files.

  1. Admin wants to remove the "company-only" tag because it won't be used any longer.
  2. Admin selects the tag from the list.
  3. Admin sees that there are X files still tagged with the "company-only" tag (maybe 0 files, or maybe 1500).
  4. Admin decides to remove the tag.

Step 4 could take some time to complete, but after that users won't be able to see the "company-only" tag as part of the available tags (although they could still create it again).

Note that the admin flow could be implemented through the command line.
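
A hypothetical shape of such a command (the name and flags are invented for illustration, not an existing ocis command), using urfave/cli/v2 like the rest of the ocis CLI:

```go
package command

import (
	"fmt"

	"github.com/urfave/cli/v2"
)

// RemoveTagCommand would let an admin delete a tag globally: strip it from every
// resource that still carries it and drop it from the list of known tags.
func RemoveTagCommand() *cli.Command {
	return &cli.Command{
		Name:  "remove-tag",
		Usage: "remove a tag from all resources and from the list of available tags",
		Flags: []cli.Flag{
			&cli.StringFlag{Name: "name", Usage: "tag to remove", Required: true},
			&cli.BoolFlag{Name: "dry-run", Usage: "only report how many resources still use the tag"},
		},
		Action: func(c *cli.Context) error {
			tag := c.String("name")
			// 1. query the search index for resources tagged with `tag`
			// 2. report the count (step 3 of the flow above)
			// 3. unless --dry-run, remove the tag from each resource and from the index
			fmt.Printf("would remove tag %q\n", tag)
			return nil
		},
	}
}
```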