owncloud / ocis

:atom_symbol: ownCloud Infinite Scale Stack

Goroutines leak in a single binary `ocis` #10577

Open 2403905 opened 2 days ago

2403905 commented 2 days ago

Describe the bug

A single-binary ocis setup shows high memory usage and leaks goroutines after running under load for a long time.

Steps to reproduce

  1. Run ocis locally.
  2. Use Prometheus + Grafana to watch the goroutine and memory metrics.
  3. Run any test, e.g. `make test-acceptance-api TEST_SERVER_URL=https://localhost:9200 TEST_WITH_GRAPH_API=true TEST_OCIS=true BEHAT_FEATURE=tests/acceptance/features/apiSpacesShares/shareUploadTUS`
  4. Grab the goroutine profile: `curl 'http://localhost:9114/debug/pprof/goroutine' > goroutine-running.out`
  5. Grab the full goroutine dump: `curl 'http://localhost:9114/debug/pprof/goroutine?debug=2' > goroutine-running.log`
  6. Stop the tests and wait until the number of goroutines stops going down.
  7. Grab the goroutine profile again: `curl 'http://localhost:9114/debug/pprof/goroutine' > goroutine-running.out`
  8. Grab the full goroutine dump again: `curl 'http://localhost:9114/debug/pprof/goroutine?debug=2' > goroutine-running.log` (a sketch for comparing the two captures follows this list)
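
To see where the leaked goroutines come from, the captured profiles can be inspected with the standard `go tool pprof` command. A minimal sketch, assuming the Go toolchain is installed and the second capture is saved under a different name (`goroutine-after.out` is a hypothetical filename, not one used in the steps above):

```console
# top goroutine creation sites in the profile captured while the test was running
go tool pprof -top goroutine-running.out

# diff the post-test profile against the in-test one; stacks whose count does
# not shrink after the load stops are the leak candidates
go tool pprof -top -base goroutine-running.out goroutine-after.out
```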

Expected behavior

The number of goroutines should go down after the load goes down.

Actual behavior

The number of goroutines grows during the test and should go down after the test stops. Instead, thousands of goroutines stay blocked and never go away. Most of them are in google.golang.org/grpc/internal/grpcsync.(*CallbackSerializer).run and github.com/owncloud/ocis/v2/ocis-pkg/natsjsregistry.(*Watcher).Next, which leads to a goroutine and memory leak. The time these goroutines have been blocked approaches the total run time of the ocis binary. The blocked goroutines are related to the go-micro/registry cache and the ocis-pkg/natsjsregistry Watcher.

```
goroutine 113937 [chan receive, 112 minutes]:
github.com/owncloud/ocis/v2/ocis-pkg/natsjsregistry.(*Watcher).Next(0xc001ea4030)
  /Users/roman/projects/owncloud/ocis/ocis-pkg/natsjsregistry/watcher.go:46 +0x33
go-micro.dev/v4/registry/cache.(*cache).watch(0xc007349a70, {0x112427058, 0xc001ea4030})
  /Users/roman/projects/owncloud/go-micro/registry/cache/cache.go:439 +0xc6
go-micro.dev/v4/registry/cache.(*cache).run(0xc007349a70, {0x1102758d8, 0x19})
  /Users/roman/projects/owncloud/go-micro/registry/cache/cache.go:394 +0x399
created by go-micro.dev/v4/registry/cache.(*cache).get in goroutine 81855
  /Users/roman/projects/owncloud/go-micro/registry/cache/cache.go:200 +0x2e5

goroutine 23171 [chan receive, 68 minutes]:
github.com/owncloud/ocis/v2/ocis-pkg/natsjsregistry.(*Watcher).Next(0xc000ff9140)
  /Users/roman/projects/owncloud/ocis/ocis-pkg/natsjsregistry/watcher.go:46 +0x33
go-micro.dev/v4/registry/cache.(*cache).watch(0xc005288cf0, {0x112427058, 0xc000ff9140})
  /Users/roman/projects/owncloud/go-micro/registry/cache/cache.go:439 +0xc6
go-micro.dev/v4/registry/cache.(*cache).run(0xc005288cf0, {0x1102758d8, 0x19})
  /Users/roman/projects/owncloud/go-micro/registry/cache/cache.go:394 +0x399
created by go-micro.dev/v4/registry/cache.(*cache).get in goroutine 2379
  /Users/roman/projects/owncloud/go-micro/registry/cache/cache.go:200 +0x2e5
```
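
For context, the `chan receive` state in the traces above corresponds to a watcher whose `Next()` is parked on a channel that nobody closes. A minimal, self-contained sketch of that pattern (this is not the actual natsjsregistry or go-micro code; all names are illustrative):

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

type Result struct{ Action string }

type Watcher struct {
	updates chan *Result  // registry events would be pushed here
	done    chan struct{} // closed by Stop
}

// Next blocks on a channel receive -- the state the leaked goroutines
// are parked in ("chan receive, 112 minutes").
func (w *Watcher) Next() (*Result, bool) {
	select {
	case r, ok := <-w.updates:
		return r, ok
	case <-w.done:
		return nil, false
	}
}

// Stop unblocks Next. A watcher that is never stopped keeps the goroutine
// looping over Next alive for the lifetime of the process.
func (w *Watcher) Stop() { close(w.done) }

func main() {
	for i := 0; i < 3; i++ {
		w := &Watcher{updates: make(chan *Result), done: make(chan struct{})}
		go func() {
			for {
				if _, ok := w.Next(); !ok {
					return
				}
			}
		}()
		// Without a matching w.Stop(), each iteration leaks one blocked goroutine.
	}
	time.Sleep(100 * time.Millisecond)
	fmt.Println("goroutines:", runtime.NumGoroutine()) // main + the 3 leaked watchers
}
```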
(Four screenshots from 2024-11-15 attached.)

goroutine-out.zip


kobergj commented 2 days ago

Do we know if the number of goroutines in natsjsregistry.Watcher is going up? It should not, as they are only spawned on startup. There should be no new goroutines added there during runtime.
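
One way to check that from the full goroutine dump captured in the reproduction steps is to count the stacks parked in `Watcher.Next` before and after the load. A small sketch using standard shell tools (the filename is the one from the steps above):

```console
# number of goroutines currently blocked in the natsjsregistry watcher
grep -Fc '(*Watcher).Next' goroutine-running.log
```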

butonic commented 2 days ago

AFAICT we are sometimes creating a new go-micro client on the fly, e.g. in services/graph/pkg/service/v0/drives.go:


```go
// ListStorageSpacesWithFilters List Storage Spaces using filters
func (g Graph) ListStorageSpacesWithFilters(ctx context.Context, filters []*storageprovider.ListStorageSpacesRequest_Filter, unrestricted bool) (*storageprovider.ListStorageSpacesResponse, error) {
    gatewayClient, err := g.gatewaySelector.Next()
    if err != nil {
        return nil, err
    }

    // A new go-micro client is constructed on every request; each one wraps
    // the registry in a cache registry that starts its own watcher goroutine.
    grpcClient, err := grpc.NewClient(append(grpc.GetClientOptions(g.config.GRPCClientTLS), grpc.WithTraceProvider(g.traceProvider))...)
    if err != nil {
        return nil, err
    }
    s := settingssvc.NewPermissionService("com.owncloud.api.settings", grpcClient)

    _, err = s.GetPermissionByID(ctx, &settingssvc.GetPermissionByIDRequest{
        PermissionId: settingsServiceExt.ListSpacesPermission(0).Id,
    })
    //...
```

The problem is that the actual RPC call under the hood uses a selector, and all go-micro selectors wrap the registry with a cache registry. That starts a new watcher, increasing the number of goroutines by one on every request that instantiates a new go-micro client.

So, for calls to reva we need to make sure not to create a reva gateway client on service startup. But for calls made with go-micro clients we need to make sure not to recreate them on every request.
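
A minimal sketch of that second point, assuming the client can be wired in when the service is constructed. The constructor shape, parameter types and the `permissionsService` field are illustrative, not the actual ocis code; `grpc.NewClient`, `grpc.GetClientOptions`, `grpc.WithTraceProvider` and `settingssvc.NewPermissionService` are taken from the snippet above, and imports are omitted:

```go
// Sketch: build the go-micro client once, at service construction time,
// and reuse it in every handler instead of calling grpc.NewClient inside
// ListStorageSpacesWithFilters.
func NewGraph(cfg *config.Config, gatewaySelector GatewaySelector, traceProvider TraceProvider) (*Graph, error) {
    // One client -> one cache registry -> one watcher goroutine, no matter
    // how many requests the service handles afterwards.
    grpcClient, err := grpc.NewClient(
        append(grpc.GetClientOptions(cfg.GRPCClientTLS), grpc.WithTraceProvider(traceProvider))...,
    )
    if err != nil {
        return nil, err
    }

    return &Graph{
        config:             cfg,
        gatewaySelector:    gatewaySelector,
        traceProvider:      traceProvider,
        permissionsService: settingssvc.NewPermissionService("com.owncloud.api.settings", grpcClient),
    }, nil
}

// Handlers then call g.permissionsService.GetPermissionByID(...) directly
// instead of creating a new client per request.
```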

🤪

butonic commented 2 days ago

one occurrence fixed with https://github.com/owncloud/ocis/pull/10583

micbar commented 2 days ago

Keeping it open; this could be an issue in more places.

2403905 commented 2 days ago

@butonic That is awesome!

The number of goroutines is now stable and memory usage is low.

(Screenshot from 2024-11-15 attached.)