permitio / opal

Policy and data administration, distribution, and real-time updates on top of Policy Agents (OPA, Cedar, ...)
https://opal.ac
Apache License 2.0
4.92k stars 176 forks source link

Issue with performance and iterating over tar policy having large number of files #241

Open prashil-g opened 2 years ago

prashil-g commented 2 years ago

Describe the bug Seeing timeout and crash of worker threads when policy tar containing large numeber of policy is pushed

To Reproduce running the api source docker compose reference provided in the permitio github repo and sending policy bundle attached bundle.tar.gz


[10:54](https://permit-io.slack.com/archives/C03BJREJGBU/p1651166648079409)
version: "3.8"
services:
  # When scaling the opal-server to multiple nodes and/or multiple workers, we use
  # a *broadcast* channel to sync between all the instances of opal-server.
  # Under the hood, this channel is implemented by encode/broadcaster (see link below).
  # At the moment, the broadcast channel can be either: postgresdb, redis or kafka.
  # The format of the broadcaster URI string (the one we pass to opal server as `OPAL_BROADCAST_URI`) is specified here:
  # https://github.com/encode/broadcaster#available-backends
  broadcast_channel:
    image: postgres:alpine
    environment:
      - POSTGRES_DB=postgres
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres

  opal_server:
    # by default we run opal-server from latest official image
    image: permitio/opal-server:latest
    environment:
      # the broadcast backbone uri used by opal server workers (see comments above for: broadcast_channel)
      - OPAL_BROADCAST_URI=postgres://postgres:postgres@broadcast_channel:5432/postgres
      # number of uvicorn workers to run inside the opal-server container
      - UVICORN_NUM_WORKERS=4
      # the url of the Api bundle server hosting our policy
      # - you can pass a token if you need to authentication via `POLICY_BUNDLE_SERVER_TOKEN`
      # - in this example we use nginx server that serve static bundle.tar.gz files without token
      # - our bundle server is compatible with OPA bundle server
      # - for more info, see: hthttps://www.openpolicyagent.org/docs/latest/management-bundles/
      - OPAL_POLICY_BUNDLE_URL=http://api_policy_source_server/
      - OPAL_POLICY_SOURCE_TYPE=API
      # - the base path for the local git in Opal server
      - OPAL_POLICY_REPO_CLONE_PATH=~/opal
      # in this example we will use a polling interval of 30 seconds to check for new policy updates (new bundle files).
      # however, it is better to utilize a api *webhook* to trigger the server to check for changes only when the bundle server has new bundle.
      # for more info see: https://github.com/permitio/opal/blob/master/docs/HOWTO/track_an_api_bundle_server.md
      - OPAL_POLICY_REPO_POLLING_INTERVAL=30
      # configures from where the opal client should initially fetch data (when it first goes up, after disconnection, etc).
      # the data sources represents from where the opal clients should get a "complete picture" of the data they need.
      # after the initial sources are fetched, the client will subscribe only to update notifications sent by the server.
      - OPAL_DATA_CONFIG_SOURCES={"config":{"entries":[{"url":"http://host.docker.internal:7002/policy-data","topics":["policy_data"]}]}}
      - OPAL_LOG_FORMAT_INCLUDE_PID=true
    ports:
      # exposes opal server on the host machine, you can access the server at: http://localhost:7002/
      - "7002:7002"
    depends_on:
      - broadcast_channel

  opal_client:
    # by default we run opal-client from latest official image
    image: permitio/opal-client:latest
    environment:
      - OPAL_SERVER_URL=http://opal_server:7002/
      - OPAL_LOG_FORMAT_INCLUDE_PID=true
      - OPAL_INLINE_OPA_LOG_FORMAT=http
    ports:
      # exposes opal client on the host machine, you can access the client at: http://localhost:7000/
      - "7000:7000"
      # exposes the OPA agent (being run by OPAL) on the host machine
      # you can access the OPA api that you know and love at: http://localhost:8181/
      # OPA api docs are at: https://www.openpolicyagent.org/docs/latest/rest-api/
      - "8181:8181"
    depends_on:
      - opal_server
    # this command is not necessary when deploying OPAL for real, it is simply a trick for dev environments
    # to make sure that opal-server is already up before starting the client.
    command: sh -c "./wait-for.sh opal_server:7002 --timeout=20 -- ./start.sh"

  # Demo bundle server to serve the policy
  api_policy_source_server:
    # we use nginx to serve the bundle files
    image: nginx
    # expose internal port 80 to localhost 8000
    ports:
      - 8000:80
    # map files into the docker to edit nginx conf and put the bundle files into the container
    volumes:
      - ./docker_files/bundle_files:/usr/share/nginx/html
      - ./docker_files/nginx.conf:/etc/nginx/nginx.conf
image

Expected behavior Tar ball containing large number of policy rego should not take more time to untar

Screenshots

image

OPAL version

Issue was identified with tarsafe wrapper. If tarsafe wrapper isnt used, untarring the tar with large number of files is very fast and no worker crash seen image

orweis commented 2 years ago

FYI: @obsd @roekatz @asafc

prashil-g commented 2 years ago

Hi @orweis Did you find any fix for this?

obsd commented 2 years ago

Hi @prashil-g, thanks for the detailed bug description. Do you consider the tar file to come from a safe source (no user input on file names etc..)? I'm investigating it now and it will help me to come up with more possible solutions.

prashil-g commented 2 years ago

We do trust the source. But that might not be the case for most users

obsd commented 2 years ago

@prashil-g sorry for a long time it took me I've pushed a suggestion that may help you, I will consult with @orweis and @asafc about it meanwhile, I would love to hear your thoughts as well https://github.com/permitio/opal/pull/284 Note that this is a draft PR

gemanor commented 1 month ago

@orweis have we got more requirements like this after the PR or we can close the issue?

orweis commented 1 month ago

Not that I'm aware of