OpenFGA returns unexpected errors randomly

dlirai commented 1 year ago

I am using OpenFGA's HTTP API to perform authorization checks and have run into a strange issue: some checks intermittently receive an error response instead of "true" or "false". This happens in all three of the following scenarios:

Scenario 1: Use OpenFGA with the integrated Postgres database launched by the official Helm chart. The OpenFGA server and the integrated Postgres database server are launched as follows:
```
helm install openfga openfga/openfga \
    --set datastore.engine=postgres \
    --set datastore.uri="postgres://postgres:password@openfga-postgresql.default.svc.cluster.local:5432/postgres?sslmode=disable" \
    --set postgres.enabled=true \
    --set postgresql.auth.postgresPassword=password \
    --set postgresql.auth.database=postgres
```
The number of error responses varies from run to run: sometimes I get 1, sometimes 3, and sometimes none at all. The error message is as follows:
```
{
  "code": "deadline_exceeded",
  "message": "context deadline exceeded"
}
```
Also, with replicaCount=1 the issue almost never occurs, whereas replicaCount=3 (the default value) or replicaCount=5 makes it more likely.

Scenario 2: Use OpenFGA with an independent Postgres database. The independent Postgres database server is launched as follows:
```
helm install dlpostgres \
    --set auth.postgresPassword=password \
    oci://registry-1.docker.io/bitnamicharts/postgresql
```
The OpenFGA server is launched as follows, using the official Helm chart:
```
helm install openfga openfga/openfga \
    --set replicaCount=1 \
    --set log.level=error \
    --set datastore.engine=postgres \
    --set datastore.uri="postgres://postgres:password@dlpostgres-postgresql.default.svc.cluster.local:5432/postgres?sslmode=disable"
```
In general, the issue is more likely to occur in Scenario 2 than in Scenario 1. Even with replicaCount=1, it still occurs fairly often: usually 3 or 5 out of 5000 authorization checks receive error responses.

Scenario 3: Use OpenFGA with an Azure Postgres database server. I created the Azure Postgres database server first and then launched the OpenFGA server using the official Helm chart as follows:
```
helm install openfga openfga/openfga \
    --set replicaCount=1 \
    --set log.level=error \
    --set datastore.engine=postgres \
    --set datastore.uri="CONNECTION_STRING_FROM_AZURE_POSTGRES_DATABASE_SERVER"
```
The issue is even more likely to occur here than in Scenario 1 or Scenario 2: about 100 or more checks receive error responses, compared to just a couple in Scenario 1 or 2.

Note that I used the same model and data for the tests in all scenarios. The authorization checks that receive error responses differ from run to run, so I don't think the issue lies in my model or data.
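
For anyone trying to reproduce this, here is a minimal sketch of a test loop that counts error responses across repeated checks. It is not the exact script used for the numbers above. It assumes the OpenFGA HTTP `check` endpoint at `/stores/{store_id}/check`; the URL, store ID, and tuple values are hypothetical placeholders, and `classify_response`, `run_checks`, and `TOTAL_CHECKS` are illustrative names.

```shell
#!/bin/sh
# ASSUMPTIONS (hypothetical placeholders, replace with your own values):
FGA_API_URL="${FGA_API_URL:-http://localhost:8080}"
FGA_STORE_ID="${FGA_STORE_ID:-}"
CHECK_BODY='{"tuple_key":{"user":"user:anne","relation":"reader","object":"document:1"}}'

# Classify a response body: an error body carries a "code" field
# (e.g. "deadline_exceeded"), a successful one carries "allowed".
classify_response() {
  case "$1" in
    *'"code"'*)    echo error ;;
    *'"allowed"'*) echo ok ;;
    *)             echo unknown ;;
  esac
}

# Send the same check $1 times and report how many responses were errors.
run_checks() {
  total="$1"
  errors=0
  unknown=0
  for i in $(seq "$total"); do
    body="$(curl -s -X POST "$FGA_API_URL/stores/$FGA_STORE_ID/check" \
      -H 'Content-Type: application/json' \
      -d "$CHECK_BODY")"
    case "$(classify_response "$body")" in
      error)   errors=$((errors + 1)) ;;
      unknown) unknown=$((unknown + 1)) ;;
    esac
  done
  echo "$errors error(s), $unknown unrecognized response(s) out of $total checks"
}

# Only contact the server when a store ID has been provided.
if [ -n "$FGA_STORE_ID" ]; then
  run_checks "${TOTAL_CHECKS:-5000}"
fi
```

Running it with `FGA_STORE_ID` set sends the requests sequentially; a real test would likely issue different checks and may run them concurrently, which could change how often the deadline is hit.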

Besides, when I use an unofficial OpenFGA Helm chart (https://github.com/AlexandreBrg/openfga-helm) to run the same tests in Scenario 2 and Scenario 3, I never see this issue; the unofficial chart works correctly every time. Could someone help look into this?

jon-whit commented 1 year ago

@dlirai could you share the exact model, tuples, and requests that you are making that can reproduce this issue? A reproducible example is a good first step for us to troubleshoot.

Also, what version of OpenFGA are you running? Are you just using the defaults from the Helm chart?

dlirai commented 1 year ago

Yes, I am using the defaults from the Helm chart.

miparnisari commented 10 months ago

@dlirai hi! Could you retry your test with the latest release and let me know if it improves things? https://github.com/openfga/helm-charts/releases/tag/openfga-0.1.23

Also, when you test with more than 1 replica of OpenFGA, please note this: https://openfga.dev/docs/getting-started/running-in-production#database-recommendations

The server setting `OPENFGA_DATASTORE_MAX_OPEN_CONNS` should be set to be equal to your database's max connections. For example, in Postgres, you can see this value by running the SQL query `SHOW max_connections;`. If you are running multiple instances of the OpenFGA server, you should divide this setting equally among the instances. For example, if your database's `max_connections` is 100, and you have 2 OpenFGA instances, `OPENFGA_DATASTORE_MAX_OPEN_CONNS` should be set to 50 for each instance.
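
As a rough sketch of that arithmetic, the snippet below divides a connection budget across replicas and prints a Helm override without running it. The values 100 and 3 are example inputs, `per_instance_conns` is an illustrative helper, and the `datastore.maxOpenConns` chart value is an assumption to verify against the chart's `values.yaml` before use.

```shell
#!/bin/sh
# Split the database's connection budget evenly across OpenFGA replicas.
# Integer division rounds down, so the replicas together never exceed
# the database's max_connections.
per_instance_conns() {
  echo $(( $1 / $2 ))
}

MAX_CONNECTIONS=100   # example value; read the real one with: SHOW max_connections;
REPLICAS=3            # example value; use your actual replicaCount

conns="$(per_instance_conns "$MAX_CONNECTIONS" "$REPLICAS")"

# Print (rather than run) the resulting override.
# ASSUMPTION: the chart maps datastore.maxOpenConns to
# OPENFGA_DATASTORE_MAX_OPEN_CONNS; check values.yaml to confirm.
echo "helm upgrade openfga openfga/openfga --reuse-values --set datastore.maxOpenConns=${conns}"
```

If other clients share the same database, the budget passed in would need to be lower than `max_connections` to leave them some headroom.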

rhamzeh commented 5 months ago

@dlirai - did you manage to retry? Did you encounter the same issue?

openfga / helm-charts

OpenFGA returns unexpected errors randomly #44