owncloud / ocis


load testing ocis tag handling causes 90% failed requests #9821

Open butonic opened 1 month ago

butonic commented 1 month ago

When running k6 tests against a Kubernetes deployment, I see tons of requests failing in the add-remove-tag scenario:

# VUS=10 k6 run ~/cdperf/packages/k6-tests/artifacts/koko-platform-100-add-remove-tag-simple-k6.js --vus 10 --duration 30s

          /\      |‾‾| /‾‾/   /‾‾/   
     /\  /  \     |  |/  /   /  /    
    /  \/    \    |     (   /   ‾‾\  
   /          \   |  |\  \ |  (‾)  | 
  / __________ \  |__| \__\ \_____/ .io

WARN[0000] Couldn't load source map for file:///[...]/cdperf/packages/k6-tests/artifacts/koko-platform-100-add-remove-tag-simple-k6.js  error="sourcemap: mappings are empty"
  execution: local
     script: [...]/cdperf/packages/k6-tests/artifacts/koko-platform-100-add-remove-tag-simple-k6.js
     output: -

  scenarios: (100.00%) 1 scenario, 10 max VUs, 1m0s max duration (incl. graceful stop):
           * default: 10 looping VUs for 30s (gracefulStop: 30s)

     ✓ authn -> loginPageResponse - status
     ✓ authn -> authorizationResponse - status
     ✓ authn -> accessTokenResponse - status
     ✓ client -> role.getMyDrives - status
     ✓ client -> resource.getResourceProperties - status
     ✗ client -> tag.addTagToResource - status
      ↳  9% — ✓ 2 / ✗ 20
     ✓ client -> tag.getTagForResource - status
     ✗ test -> resource.getTags - name - match
      ↳  9% — ✓ 2 / ✗ 20
     ✗ client -> tag.removeTagToResource - status
      ↳  9% — ✓ 2 / ✗ 20

     checks.........................: 60.00% ✓ 90       ✗ 60  
     data_received..................: 387 kB 10 kB/s
     data_sent......................: 91 kB  2.4 kB/s
     http_req_blocked...............: avg=7.79ms  min=281ns   med=441ns    max=86.87ms  p(90)=52.03ms  p(95)=68.94ms 
     http_req_connecting............: avg=2.99ms  min=0s      med=0s       max=33.69ms  p(90)=16.62ms  p(95)=29.69ms 
     http_req_duration..............: avg=1.97s   min=13.38ms med=377.49ms max=13.07s   p(90)=7.81s    p(95)=11.39s  
       { expected_response:true }...: avg=2.49s   min=13.38ms med=303.56ms max=13.07s   p(90)=7.94s    p(95)=12.37s  
     http_req_failed................: 25.31% ✓ 40       ✗ 118 
     http_req_receiving.............: avg=93.43µs min=27.76µs med=79.2µs   max=386.42µs p(90)=153.87µs p(95)=201.4µs 
     http_req_sending...............: avg=85.78µs min=30.49µs med=73.65µs  max=285.45µs p(90)=145.71µs p(95)=182.22µs
     http_req_tls_handshaking.......: avg=3.78ms  min=0s      med=0s       max=42.71ms  p(90)=20.96ms  p(95)=36.86ms 
     http_req_waiting...............: avg=1.97s   min=13.26ms med=377.34ms max=13.07s   p(90)=7.81s    p(95)=11.39s  
     http_reqs......................: 158    4.131202/s
     iteration_duration.............: avg=16.73s  min=6.16s   med=11.31s   max=26.69s   p(90)=25.91s   p(95)=25.95s  
     iterations.....................: 22     0.575231/s
     vus............................: 1      min=1      max=10
     vus_max........................: 10     min=10     max=10

running (0m38.2s), 00/10 VUs, 22 complete and 0 interrupted iterations
default ✓ [======================================] 10 VUs  30s
jvillafanez commented 1 month ago

There are some things I've noticed around the GetTags request that could be causing a lot of load.

I'm not sure what the expected scenario is, but I don't think GetTags (https://github.com/owncloud/ocis/blob/master/services/graph/pkg/service/v0/tags.go#L21) will scale properly in large installations.

If we assume 1 million files, with 500k of them tagged and only 30 different tags (which seem like reasonable numbers), we'd need to gather the information for those 500k files in the search service, send it to the graph service, and extract the 30 distinct tags from the results. This means we transfer a lot of data from one service to another (wasting time on networking) and we traverse all of those results to get just a few tags (a lot of work for very little reward). Memory consumption might also be a problem, because all of that data will be held in memory, and it gets worse when multiple clients send requests in parallel. If this is the main scenario we want to tackle, I think we need to find a different approach. Caching the data might solve the problem if we manage to keep the cache up to date.
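
For illustration, here is a minimal sketch of that caching idea. The `TagCache` type and its methods are invented for this sketch (they don't exist in ocis): the set of known tags is kept in memory and adjusted on every add/remove, so listing tags never has to scan tagged files. Keeping it consistent across restarts and multiple service instances is the hard part and is not solved here.

```go
package tags

import "sync"

// TagCache keeps the distinct tags in memory so listing them does not
// require querying the search service for every tagged file.
type TagCache struct {
	mu     sync.RWMutex
	counts map[string]int // tag name -> number of resources using it
}

func NewTagCache() *TagCache {
	return &TagCache{counts: make(map[string]int)}
}

// OnTagAdded would be called from the add-tag handler.
func (c *TagCache) OnTagAdded(tag string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[tag]++
}

// OnTagRemoved would be called from the remove-tag handler.
func (c *TagCache) OnTagRemoved(tag string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.counts[tag] > 1 {
		c.counts[tag]--
	} else {
		delete(c.counts, tag)
	}
}

// List returns the distinct tags without touching the search service.
func (c *TagCache) List() []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	out := make([]string, 0, len(c.counts))
	for t := range c.counts {
		out = append(out, t)
	}
	return out
}
```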

On the other hand, if we expect only 100-200 files to be tagged, the current approach might be good enough: much less data transferred between the services, less memory usage, and less processing. If that is the intended limit, I think we should document it so people are aware that performance will degrade the more files are tagged.


Going deeper into the search service, I see some potential problems in the Search method (https://github.com/owncloud/ocis/blob/master/services/search/pkg/search/service.go#L85).

The search is split across up to 20 workers, each one handling a piece of the search. This seems fine on paper, but I'm not sure how well it works in practice, mostly because of the additional work it creates: we spawn up to 20 search requests and wait for the results, and then we need to merge those results, sort them, and take the first X.

If we assume 5 workers and each worker returns 100k results, we're copying those results (I assume we're copying pointers, so probably not so bad, but still 500k pointers) and sorting those 500k results (likely expensive).
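
As a rough sketch of that merge step (the types and function name below are hypothetical, not the actual service code), the pattern looks roughly like this; a bounded heap (`container/heap`) or a k-way merge of already-sorted worker results would avoid sorting all 500k entries just to return the top few:

```go
package search

import "sort"

// hit is a stand-in for whatever result type the search workers return per match.
type hit struct {
	ID    string
	Score float64
}

// mergeNaive mirrors the pattern described above: append every worker's full
// result set into one slice, sort it, and keep only the first `limit` entries.
// With 5 workers x 100k hits that means sorting 500k entries to return e.g. 100.
func mergeNaive(perWorker [][]*hit, limit int) []*hit {
	var all []*hit
	for _, hits := range perWorker {
		all = append(all, hits...) // only pointers are copied, but still 500k of them
	}
	sort.Slice(all, func(i, j int) bool { return all[i].Score > all[j].Score })
	if len(all) > limit {
		all = all[:limit]
	}
	return all
}
```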

It's unclear to me whether splitting the search into multiple requests is better than letting bleve handle one big request. I'm not sure, but it seems bleve will go through the same data several times (once per request), and that could be slower than going through the data only once, even without the parallel requests. I'd vote for letting bleve do its work unless there is a good reason not to.
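
For comparison, a minimal sketch of the "one big request" approach with bleve v2, assuming an index that already contains the relevant fields; this is not the actual ocis query, just the shape of letting bleve paginate and sort by itself:

```go
package search

import (
	"github.com/blevesearch/bleve/v2"
)

// searchOnce issues a single request and lets bleve return only the requested
// page, already sorted, instead of fanning the query out to several workers.
func searchOnce(indexPath, queryString string, size, from int) (*bleve.SearchResult, error) {
	idx, err := bleve.Open(indexPath)
	if err != nil {
		return nil, err
	}
	defer idx.Close()

	q := bleve.NewQueryStringQuery(queryString)
	req := bleve.NewSearchRequestOptions(q, size, from, false)
	req.SortBy([]string{"-_score"})
	return idx.Search(req)
}
```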

In any case, there is another problem with the workers: the maximum of 20 workers applies only to a single request; the worker pool isn't shared among requests. This means that, in the worst case, each request spawns 20 workers, so 20 requests in parallel could mean 400 goroutines just for searching.
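
A possible mitigation would be to make the limit process-wide instead of per request. A minimal sketch using a buffered channel as a shared semaphore (the limit of 20 is just the current per-request value reused for illustration):

```go
package search

// searchSlots is shared by all requests in the process, so 20 parallel
// requests can no longer add up to 400 concurrent search goroutines.
var searchSlots = make(chan struct{}, 20)

// runWorker blocks until one of the global slots is free and releases it
// when the work is done.
func runWorker(work func()) {
	searchSlots <- struct{}{} // acquire
	go func() {
		defer func() { <-searchSlots }() // release
		work()
	}()
}
```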

Some ideas to improve the situation here:


Taking into account both previous points, I only see a couple of options to reduce the load caused by the GetTags request:

For the add and remove operations for tags, I'll need a deeper look. I assume that bleve has some write locks somewhere and there could be delays on some operations, but the ones I've found seem to be released quickly (I don't think the operations inside the write locks should take a lot of time), so maybe they collide with some read locks from the GetTags operation?
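
To make that hypothesis concrete, here is a toy example of generic `sync.RWMutex` behaviour (not code taken from bleve or ocis): while a long GetTags-style read holds the read lock, a writer such as an add/remove operation has to wait, and new readers queue up behind the waiting writer as well:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex
	var wg sync.WaitGroup

	// Long-running "GetTags"-style read.
	wg.Add(1)
	go func() {
		defer wg.Done()
		mu.RLock()
		time.Sleep(2 * time.Second) // e.g. scanning a large result set
		mu.RUnlock()
	}()

	time.Sleep(100 * time.Millisecond)

	// An "add tag"-style write has to wait until the read finishes.
	wg.Add(1)
	go func() {
		defer wg.Done()
		start := time.Now()
		mu.Lock()
		fmt.Printf("write lock acquired after %v\n", time.Since(start))
		mu.Unlock()
	}()

	wg.Wait()
}
```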

jvillafanez commented 1 month ago

We might use another index for the tags. While it could improve performance, it also has some drawbacks, and we might need to reach a compromise.

Advantages of a new index for the tags:

Disadvantages:

A reasonable compromise could be to only let the admin completely remove tags. Anyone can create a new tag, but that tag stays available forever, even if no file carries it any more. Anyone can add or remove that tag on any file (assuming the user has permission to do so). However, only the admin can completely remove the tag, which also removes it from all files.

  1. Admin wants to remove the "company-only" tag because it won't be used any longer.
  2. Admin selects the tag from the list.
  3. Admin sees that there are X files still tagged with the "company-only" tag (maybe 0 files, or maybe 1500).
  4. Admin decides to remove the tag.

Step 4 could take some time to complete, but after that users won't be able to see the "company-only" tag as part of the available tags (although they could still create it again).

Note that the admin flow could be implemented through the command line.
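
A hypothetical shape of such a command (the name and flags are invented for illustration, not an existing ocis command), using urfave/cli/v2 like the rest of the ocis CLI:

```go
package command

import (
	"fmt"

	"github.com/urfave/cli/v2"
)

// RemoveTagCommand would let an admin delete a tag globally: strip it from every
// resource that still carries it and drop it from the list of known tags.
func RemoveTagCommand() *cli.Command {
	return &cli.Command{
		Name:  "remove-tag",
		Usage: "remove a tag from all resources and from the list of available tags",
		Flags: []cli.Flag{
			&cli.StringFlag{Name: "name", Usage: "tag to remove", Required: true},
			&cli.BoolFlag{Name: "dry-run", Usage: "only report how many resources still use the tag"},
		},
		Action: func(c *cli.Context) error {
			tag := c.String("name")
			// 1. query the search index for resources tagged with `tag`
			// 2. report the count (step 3 of the flow above)
			// 3. unless --dry-run, remove the tag from each resource and from the index
			fmt.Printf("would remove tag %q\n", tag)
			return nil
		},
	}
}
```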