14 - Moving from pulling policies from rego to pushing them from Anubis Policy API

Cerfoglg commented 1 year ago

Proposed changes

Instead of pulling all policies from OPA on every call being processed, we instead push the policies from the Policy API when a policy is created/edited/deleted

Types of changes

What types of changes does your code introduce to the project: Put an x in the boxes that apply

[ ] Bugfix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds functionality)
[x] Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

[x] I have read the CONTRIBUTING doc
[x] I have signed the Contributor License Agreement
[x] I have updated the RELEASE NOTES
[x] I have added tests that prove my fix is effective or that my feature works
[x] I have added necessary documentation (if appropriate)
[ ] Any dependent changes have been merged and published in downstream modules

Further comments

If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution you did and what alternatives you considered, etc...

"Martel Open Source Software Individual Contributor License Agreement"

"Contributing to Anubis"

"Anubis Release Notes"

github-actions[bot] commented 1 year ago

CLA Assistant Lite bot All contributors have signed the CLA ✍️

chicco785 commented 1 year ago

see comments from my side, also when the api starts, it should push policies, otherwise this are pushed to opa only when a change occurs. we should, in a future interaction also understand that happens when opa restart for whatever reason (is there persistency?)

alternatively, we should have a "sidecar" to opa that read data from the api on start. may be better as an overall option and solves the issue also of opa restarts without need for persistency.

chicco785 commented 1 year ago

also, a minimal test would be good (e.g. check that after a policy creation the linked data is stored in opa) - python testing are failing most probably because to run them opa is not deployed

chicco785 commented 1 year ago

hi @Cerfoglg testing, i got a few errors:

e.g., when deleting a policy:

anubis-policy-api-1             | Traceback (most recent call last):
anubis-policy-api-1             |   File "/.venv/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", line 404, in run_asgi
anubis-policy-api-1             |     result = await app(  # type: ignore[func-returns-value]
anubis-policy-api-1             |   File "/.venv/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
anubis-policy-api-1             |     return await self.app(scope, receive, send)
anubis-policy-api-1             |   File "/.venv/lib/python3.9/site-packages/fastapi/applications.py", line 269, in __call__
anubis-policy-api-1             |     await super().__call__(scope, receive, send)
anubis-policy-api-1             |   File "/.venv/lib/python3.9/site-packages/starlette/applications.py", line 124, in __call__
anubis-policy-api-1             |     await self.middleware_stack(scope, receive, send)
anubis-policy-api-1             |   File "/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 184, in __call__
anubis-policy-api-1             |     raise exc
anubis-policy-api-1             |   File "/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 162, in __call__
anubis-policy-api-1             |     await self.app(scope, receive, _send)
anubis-policy-api-1             |   File "/.venv/lib/python3.9/site-packages/starlette/middleware/cors.py", line 92, in __call__
anubis-policy-api-1             |     await self.simple_response(scope, receive, send, request_headers=headers)
anubis-policy-api-1             |   File "/.venv/lib/python3.9/site-packages/starlette/middleware/cors.py", line 147, in simple_response
anubis-policy-api-1             |     await self.app(scope, receive, send)
anubis-policy-api-1             |   File "/.venv/lib/python3.9/site-packages/starlette/exceptions.py", line 93, in __call__
anubis-policy-api-1             |     raise exc
anubis-policy-api-1             |   File "/.venv/lib/python3.9/site-packages/starlette/exceptions.py", line 82, in __call__
anubis-policy-api-1             |     await self.app(scope, receive, sender)
anubis-policy-api-1             |   File "/.venv/lib/python3.9/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
anubis-policy-api-1             |     raise e
anubis-policy-api-1             |   File "/.venv/lib/python3.9/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
anubis-policy-api-1             |     await self.app(scope, receive, send)
anubis-policy-api-1             |   File "/.venv/lib/python3.9/site-packages/starlette/routing.py", line 670, in __call__
anubis-policy-api-1             |     await route.handle(scope, receive, send)
anubis-policy-api-1             |   File "/.venv/lib/python3.9/site-packages/starlette/routing.py", line 266, in handle
anubis-policy-api-1             |     await self.app(scope, receive, send)
anubis-policy-api-1             |   File "/.venv/lib/python3.9/site-packages/starlette/routing.py", line 65, in app
anubis-policy-api-1             |     response = await func(request)
anubis-policy-api-1             |   File "/.venv/lib/python3.9/site-packages/fastapi/routing.py", line 227, in app
anubis-policy-api-1             |     raw_response = await run_endpoint_function(
anubis-policy-api-1             |   File "/.venv/lib/python3.9/site-packages/fastapi/routing.py", line 162, in run_endpoint_function
anubis-policy-api-1             |     return await run_in_threadpool(dependant.call, **values)
anubis-policy-api-1             |   File "/.venv/lib/python3.9/site-packages/starlette/concurrency.py", line 41, in run_in_threadpool
anubis-policy-api-1             |     return await anyio.to_thread.run_sync(func, *args)
anubis-policy-api-1             |   File "/.venv/lib/python3.9/site-packages/anyio/to_thread.py", line 31, in run_sync
anubis-policy-api-1             |     return await get_asynclib().run_sync_in_worker_thread(
anubis-policy-api-1             |   File "/.venv/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
anubis-policy-api-1             |     return await future
anubis-policy-api-1             |   File "/.venv/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 867, in run
anubis-policy-api-1             |     result = context.run(func, *args)
anubis-policy-api-1             |   File "/home/apiuser/anubis-management-api/./anubis/policies/routers.py", line 676, in delete_policy
anubis-policy-api-1             |     service_path_id=db_service_path_id)
anubis-policy-api-1             | NameError: name 'db_service_path_id' is not defined
anubis-policy-api-1             | time=2023-01-12T17:07:28.442Z  | lvl=INFO:     | comp=uvicorn.access | msg="172.20.0.5:51416 - "POST /v1/audit/logs HTTP/1.1" 200
anubis-mongo-1                  | {"t":{"$date":"2023-01-12T17:07:28.545+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"WTCheckpointThread","msg":"WiredTige

when running the load script:

anubis-opa_init_script-1        | Content-Length: 103
anubis-opa_init_script-1        | 
anubis-opa_init_script-1        | {
anubis-opa_init_script-1        |   "code": "invalid_parameter",
anubis-opa_init_script-1        |   "message": "invalid character 'H' looking for beginning of value"
anubis-opa_init_script-1        | }

that's why unit test in python should also use OPA (and not only the e2e).

chicco785 commented 1 year ago

some finding on performances, with 4 CPUs on my mac:

Obtaining token from Keycloak...

Create urn:ngsi-ld:AirQualityObserved:demo entity in ServicePath / for Tenant1
===============================================================
PASSED

Run load test with Anubis in front of Orion
===============================================================
Requests      [total, rate, throughput]         800, 80.10, 79.81
Duration      [total, attack, wait]             10.023s, 9.987s, 36.432ms
Latencies     [min, mean, 50, 90, 95, 99, max]  27.016ms, 43.623ms, 39.297ms, 57.396ms, 77.414ms, 108.25ms, 175.202ms
Bytes In      [total, mean]                     104800, 131.00
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:800  
Error Set:

Run load test without Anubis in front of Orion
===============================================================
Requests      [total, rate, throughput]         800, 80.10, 80.08
Duration      [total, attack, wait]             9.99s, 9.987s, 2.313ms
Latencies     [min, mean, 50, 90, 95, 99, max]  1.97ms, 3.533ms, 2.785ms, 5.041ms, 6.675ms, 14.332ms, 51.552ms
Bytes In      [total, mean]                     104800, 131.00
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:800  
Error Set:

Delete urn:ngsi-ld:AirQualityObserved:demo entity in ServicePath / for Tenant1
===============================================================
PASSED

so with the same capacity as the test run with pull mode, with push we move from 30 reqs to 80 reqs. we can't go above. during the test with opa, the cpu usage hits 390%, only with orion 120%.

assigning 6 cpus we can go up to 110 reqs:

Obtaining token from Keycloak...

Create urn:ngsi-ld:AirQualityObserved:demo entity in ServicePath / for Tenant1
===============================================================
PASSED

Run load test with Anubis in front of Orion
===============================================================
Requests      [total, rate, throughput]         1100, 110.10, 109.51
Duration      [total, attack, wait]             10.045s, 9.991s, 53.983ms
Latencies     [min, mean, 50, 90, 95, 99, max]  31.829ms, 54.936ms, 46.295ms, 87.215ms, 98.772ms, 118.341ms, 167.205ms
Bytes In      [total, mean]                     144100, 131.00
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:1100  
Error Set:

Run load test without Anubis in front of Orion
===============================================================
Requests      [total, rate, throughput]         1100, 110.10, 110.07
Duration      [total, attack, wait]             9.994s, 9.991s, 3.145ms
Latencies     [min, mean, 50, 90, 95, 99, max]  2.343ms, 3.468ms, 3.218ms, 4.165ms, 4.643ms, 7.457ms, 21.368ms
Bytes In      [total, mean]                     144100, 131.00
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:1100  
Error Set:

Delete urn:ngsi-ld:AirQualityObserved:demo entity in ServicePath / for Tenant1
===============================================================
PASSED

is this good or bad? compared to before is almost 260% gain (with the same resources, so test 1), so it's a forward leap, i think, with the help of @c0c0n3 in the future it may be worth to understand if rego / policy format can be optimized to increase evaluation time (cf. discussion https://www.openpolicyagent.org/docs/latest/policy-performance/#high-performance-policy-decisions).

chicco785 commented 1 year ago

Please also update also README.md with configuration information and updated performance test.

c0c0n3 commented 1 year ago

Great performance improvement, well done! and yea, maybe mention/update the figures it in the docs?

also, re: performance. one thing to look out for is to avoid a situation where you push a huge amount of records from the db to opa---e.g. control policy we talked about earlier in the mtg...

c0c0n3 commented 1 year ago

just a side note. as we talked about in the past, the general assumption when replicating data in anubis is that updates are infrequent and almost never happen concurrently.

why am I bringing this up again? well, I could be wrong, but if I understand what's happening under the bonnet (and it's a big if), then policies data replicated in OPA could be inconsistent if we bring concurrent writes into the mix. Here's the scenario

Anubis holds policies for tenant Tenny: [p1, p2].
p1 is Bob's policy whereas p2 is Alice's.
Alice writes a new policy p3.
At the same time, Bob deletes p1.

What will end up in OPA? It's possible OPA eventually contains [p2] (or even [p1, p2, p3]) instead of [p2, p3] if I interpret the code correctly. In fact, here's a possible execution history:

(Bob's tread) Anubis deletes p1 and then reads all of Tenny's policies, so it gets [p2].
(Alice's thread) Anubis writes p3 and reads all of Tenny's policies, so it gets [p2, p3].
(Alice's thread) Anubis PUTs [p2, p3] into OPA.
(Bob's tread) Anubis PUTs [p2] into OPA.

orchestracities / anubis