ory / keto

The most scalable and customizable permission server on the market. Fix your slow or broken permission system with Google's proven "Zanzibar" approach. Supports ACL, RBAC, and more. Written in Go, cloud native, headless, API-first. Available as a service on Ory Network and for self-hosters.
https://www.ory.sh/?utm_source=github&utm_medium=banner&utm_campaign=keto
Apache License 2.0
4.8k stars 346 forks source link

Poor query performance as traversal width increases #1519

Open e-compton-bb opened 6 months ago

e-compton-bb commented 6 months ago

Preflight checklist

Ory Network Project

No response

Describe the bug

In our system, we have "projects" which a user can view and "clips" which a user can view. If a clip is in a project, then the user can view the clip transitively. This works fine when a clip is in a very small number of projects, but if a clip is in ~200 different projects, then it slows down to >1s to check if the user can view the clip.

Running Keto locally, it appears to be churning through requests against SQL, loading every relationship into memory before checking if the user has access to any of them.

Reproducing the bug

Permissions model:

class User implements Namespace {}

class UserGroup implements Namespace {}

class Project implements Namespace {
  related: {
    owners: User[];
    editors: (User | UserGroup)[];
  };
  permits = {
    view: (ctx: Context): boolean =>
      this.related.owners.includes(ctx.subject) || this.related.editors.includes(ctx.subject),
  };
}

class Clip implements Namespace {
  related: {
    owners: User[];
    parents: Project[];
  };
  permits = {
    view: (ctx: Context): boolean =>
      this.related.owners.includes(ctx.subject) || this.related.parents.traverse((parent) => parent.permits.view(ctx)),
  };
}

Create 200 relationships of different projects all pointing to the same clipId, make a user an owner or editor on any one of the project.

Check if the user can access the clip

Expected: The query to still be fairly fast <50ms Actual: On an M2 Mac with SQLite >150ms, but on a real database deployed in a container can take >1s

Relevant log output

No response

Relevant configuration

log:
  level: warn

namespaces:
  location: file:///home/nonroot/model.ts

serve:
  read:
    host: 0.0.0.0
    port: 4466
  write:
    host: 0.0.0.0
    port: 4467

Version

v0.12.0

On which operating system are you observing this issue?

macOS

In which environment are you deploying?

Docker

Additional Context

No response

e-compton-bb commented 6 months ago

After some more digging, the main problem here seems to be the breadth-first search in checkExpandSubject. It grabs every Project tuple from the DB and then it starts looking through each one for the user. Then each lookup is a separate query which has it's own round trip time. Which explains why it's so much slower in a real database.

aeneasr commented 6 months ago

Hey Edward, I believe that this is not yet optimized properly. If you have good ideas how this could be optimized we're happy to hear them!

e-compton-bb commented 6 months ago

Could switch from a breadth-first search to a depth-first search. In this case the relationship chain we're trying to find is: Clip -> Project -> User, so by enumerating all the Projects first we always get worst-case time.

There's also a lot of chattiness between the container and the DB, so if more work could be offloaded to the DB it could make a big difference. This seems like a really good use-case for Postgres's WITH RECURSIVE clauses, although the current implementation seems to be independent of particular SQL versions https://www.dylanpaulus.com/posts/postgres-is-a-graph-database

stephenpmurray commented 2 months ago

Hi, we're having performance issues with some instances of keto using sqlite which sound very similar.

Actual: On an M2 Mac with SQLite >150ms, but on a real database deployed in a container can take >1s

What is the 'real database' that you are using?

e-compton-bb commented 2 months ago

Hi, we're having performance issues with some instances of keto using sqlite which sound very similar.

Actual: On an M2 Mac with SQLite >150ms, but on a real database deployed in a container can take >1s

What is the 'real database' that you are using?

I think for that benchmark it was just the official postgres docker image, but our main app is using AWS Aurora RDS running Postgres which saw the worst performance of the three