Open dlirai opened 1 year ago
@dlirai could you share the exact model, tuples, and requests that you are making that can reproduce this issue? A reproducible example is a good first step for us to troubleshoot.
Also, what version of OpenFGA are you running? Are you just using the defaults from the Helm chart?
Yes, I am using the defaults from the Helm chart.
@dlirai hi! Could you retry your test with the latest release and let me know if it improves things? https://github.com/openfga/helm-charts/releases/tag/openfga-0.1.23
Also, when you test with more than 1 replica of OpenFGA, please note this: https://openfga.dev/docs/getting-started/running-in-production#database-recommendations
The server setting OPENFGA_DATASTORE_MAX_OPEN_CONNS should be set to be equal to your database's max connections. For example, in Postgres, you can see this value via running the SQL query SHOW max_connections;. If you are running multiple instances of the OpenFGA server, you should divide this setting equally among the instances. For example, if your database's max_connections is 100, and you have 2 OpenFGA instances, OPENFGA_DATASTORE_MAX_OPEN_CONNS should be set to 50 for each instance.
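To make the arithmetic in the quoted recommendation concrete, here is a minimal sketch of the per-instance calculation. The helper name `max_open_conns_per_instance` is hypothetical and not part of OpenFGA. It assumes the database's `max_connections` value is already known, for example from `SHOW max_connections;`.

```python
def max_open_conns_per_instance(db_max_connections: int, replica_count: int) -> int:
    """Split the database's connection limit evenly across OpenFGA replicas.

    Hypothetical helper for illustration only; the result is the value one
    would set as OPENFGA_DATASTORE_MAX_OPEN_CONNS on each instance.
    """
    if replica_count < 1:
        raise ValueError("replica_count must be at least 1")
    # Floor division, so the replicas together never exceed the database limit.
    return db_max_connections // replica_count


# The example from the docs: max_connections = 100 shared by 2 instances.
print(max_open_conns_per_instance(100, 2))  # 50
# With 3 replicas the even split rounds down.
print(max_open_conns_per_instance(100, 3))  # 33
```

Rounding down leaves a connection unused when the limit does not divide evenly, which seems a safer default than letting the replicas together exceed the database limit.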
@dlirai - did you manage to retry? Did you encounter the same issue?
I am using OpenFGA's HTTP API to perform authorization checks. I am running into a strange issue: some authorization checks "randomly" receive an error response instead of "true" or "false". This happens in all three of the following scenarios:
Scenario 1: Use OpenFGA along with the integrated Postgres database that is launched with the official Helm chart. The OpenFGA server along with the integrated Postgres database server is launched as follows:
The error responses are quite random: sometimes I get 1, sometimes 3, and sometimes none at all. The error message is as follows:
Also, it seems that replicaCount=1 makes the issue very unlikely to occur, while replicaCount=3 (the default value) or replicaCount=5 makes it more likely.
Scenario 2: Use OpenFGA with an independent Postgres database. The independent Postgres database server is launched as follows:
The OpenFGA server is launched as follows, using the official Helm chart:
In general, the issue is more likely to occur in Scenario 2 than in Scenario 1. Even with replicaCount=1, it can still easily happen. Usually, 3 or 5 out of 5000 authorization checks receive error responses.
Scenario 3: Use OpenFGA along with an Azure Postgres database server. I created the Azure Postgres database server first, and then launched the OpenFGA server using the official Helm chart as follows:
Here the issue is even more likely to occur than in Scenario 1 or Scenario 2. About 100 or more checks receive error responses, compared to just a couple in Scenario 1 or 2.
Note that I used the same model and data for the tests in all scenarios. The authorization checks that receive error responses differ from run to run. Thus, I don't think the issue is with my model or data.
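For anyone trying to reproduce this, the kind of test loop described above could be sketched roughly as follows. This is not the exact script used for the report. The names `http_check` and `tally`, the base URL, and the store ID are placeholders. It assumes the Check endpoint is reachable at `POST /stores/{store_id}/check` and returns a JSON body with an `allowed` field.

```python
import json
import urllib.request
from collections import Counter
from typing import Callable, Iterable, Tuple

CheckTuple = Tuple[str, str, str]  # (user, relation, object)


def http_check(base_url: str, store_id: str, key: CheckTuple) -> str:
    """Send one Check request and classify it as "true", "false", or "error"."""
    user, relation, obj = key
    body = json.dumps(
        {"tuple_key": {"user": user, "relation": relation, "object": obj}}
    ).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/stores/{store_id}/check",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return "true" if json.load(resp).get("allowed") else "false"
    except (OSError, ValueError):
        # Non-2xx responses, connection failures, timeouts, and malformed
        # JSON are all counted as errors.
        return "error"


def tally(keys: Iterable[CheckTuple], check: Callable[[CheckTuple], str]) -> Counter:
    """Run every check once and count how many fall into each outcome."""
    return Counter(check(key) for key in keys)
```

A run against a local server might then look like `tally(keys, lambda k: http_check("http://localhost:8080", "<store-id>", k))`, where `keys` holds the 5000 tuples to check. Comparing the `"error"` count across runs and across replicaCount values gives the numbers quoted in the scenarios.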
In addition, when I use an unofficial OpenFGA Helm chart (https://github.com/AlexandreBrg/openfga-helm) to run tests in Scenario 2 and Scenario 3, I never see this issue. That is, the unofficial OpenFGA Helm chart works correctly every time. Could someone help look into this issue?