telefonicaid / fiware-orion

Context Broker and CEF building block for context data management, providing NGSI interfaces.
https://github.com/telefonicaid/fiware-orion/blob/master/doc/manuals/orion-api.md
GNU Affero General Public License v3.0

Usage of MetricsAPI in K8s Cluster with >1 OCB instance #4549

Closed matzeteupel closed 4 months ago

matzeteupel commented 5 months ago

Bug Description: Hello everyone, we are currently running a FIWARE setup on a Kubernetes cluster with three instances of the Orion Context Broker (OCB) for enhanced robustness. We want to investigate the behavior of the FIWARE network when a large amount of data is transmitted via MQTT to the IoT Agent JSON (IoTA) and subsequently to the OCB. Originally, we planned to investigate the OCB's behavior using its Metrics API.

OCB, MongoDB and IoTA are each deployed with >=2 replicas. MongoDB is properly set up as a replica set. A service is created for each component so we can use the internal Kubernetes DNS (*.svc.cluster.local). The issue arises with both resetting and querying the Metrics API: all three instances of the OCB are reachable via the same domain through a Kubernetes ingress controller (I believe via round robin), and that is the domain we query. The responses yield metrics that cover only a fraction of the sent messages. Apparently, the Metrics API is not global for the Orion cluster but covers just one instance.

We would have guessed that it's a perfect round robin so each instance of the OCB is addressed after another, so that after 3 requests we would have the sum of all requests. But it seems entirely arbitrary which of the three instances is addressed. Since we know neither which instance we are communicating with nor whether the previous reset took effect on all three instances, the Metrics API does not seem to be usable for our use case. Do you have any other idea how we can investigate the performance of the OCB in our cluster?

Versions:

Python code example of the GET request:

import requests

# Excerpt from a method (it uses self and return); the method name is illustrative.
def get_metrics(self):
    # Set the endpoint URL for the Metrics API
    metrics_endpoint = "/admin/metrics"
    url = self.orion_url + metrics_endpoint

    response = requests.get(url,
                            verify=False,  # TLS certificate verification is disabled
                            headers={'Authorization': 'Bearer %s' % self.token})
    # Check response
    if response.status_code == 200:
        data = response.json()
        try:
            # print(json.dumps(data, indent=2, sort_keys=True))
            # Select the metrics block of the configured service and subservice
            endpoint_data = data["services"][self.fiware_service.lower()]["subservs"][
                self.fiware_service_path.lstrip('/')]

            transactions = endpoint_data["incomingTransactions"]
            average_time = endpoint_data["serviceTime"]  # average time per transaction
            print("Transactions: ", transactions)
            total_time = average_time * transactions
            print("Total Time: ", total_time)
            print("Time/Datapoint: ", average_time)
            datapoint_time_ratio = 1 / average_time
            print("Datapoint/Time: ", datapoint_time_ratio)
            errors = endpoint_data["incomingTransactionErrors"]
            print("Lost datapoints: ", errors)
        except (KeyError, ZeroDivisionError):
            # A metric is missing from the response, or serviceTime is 0
            print("Error while loading necessary metrics from successful HTTPS request!")
            return None
    else:
        print(f"Failed to get metrics from {url}, status code: {response.status_code}")
        return None

What I expect: Ideally, a single request to the shared URL would return cumulative metrics for all three instances. Otherwise, we would need to consider another working solution.

Additional Information: We use multiple security layers, where HTTP requests are encrypted using Kong (KeyCloak). The corresponding token is already integrated into the HTTP request, as visible in the code example above.

fgalan commented 5 months ago

Thank you for such a detailed report!

I think the key point is this:

We would have guessed that it's a perfect round robin so each instance of the OCB is addressed after another, so that after 3 requests we would have the sum of all requests. But it seems entirely arbitrary which of the three instances is addressed

Which load balancer are you using? The one provided by the Kubernetes platform?

mapedraza commented 5 months ago

If you really want to know the performance of the component, I suggest you load test Orion using a single instance. Once you have tested it, you just have to multiply the throughput by the number of instances running in your cluster (you always lose a bit of performance because of the load balancer, network, etc.).
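The estimate described above can be written down as a small calculation. This is only a sketch: the function name and the `overhead_fraction` parameter are hypothetical, the overhead value has to come from your own measurements, and the formula assumes that throughput scales linearly with the number of instances.

```python
def estimate_cluster_throughput(single_instance_tps, instances, overhead_fraction=0.0):
    """Scale a single-instance load test result to a cluster estimate.

    single_instance_tps: transactions per second measured against one Orion instance.
    instances: number of Orion instances behind the load balancer.
    overhead_fraction: hypothetical share of throughput lost to the load
        balancer, network, etc. (0.0 means no loss); must be measured, not guessed.
    """
    if instances < 1:
        raise ValueError("instances must be at least 1")
    if not 0.0 <= overhead_fraction < 1.0:
        raise ValueError("overhead_fraction must be in [0.0, 1.0)")
    # Linear scaling, reduced by the assumed overhead share
    return single_instance_tps * instances * (1.0 - overhead_fraction)
```

Calling it with the single-instance result, the instance count 3, and a measured overhead gives a rough figure for the whole cluster, which can then be compared against an actual load test through the load balancer.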

fgalan commented 5 months ago

(I'm removing the bug label from this issue, as for the moment no bug in Orion has been identified from the description of this issue.)

matzeteupel commented 4 months ago

Hello, thanks for the quick feedback. We use the kubernetes default rke2-ingress-nginx load balancer to distribute the data across the three OCB instances. We have noticed that the distribution to the three instances is very uneven. It often happens that despite having three instances, only two of them are used to process the transactions. Do you have any idea why this might be, or have you had any experience with this load balancer yourself and can you make any other recommendations for working with it? Thank you in advance for your feedback!

fgalan commented 4 months ago

Do you have any idea why this might be, or have you had any experience with this load balancer yourself and can you make any other recommendations for working with it?

It seems to be a question for the Kubernetes community ;) I'd suggest moving your question there.

As far as I understand, this has nothing to do with Orion and it is an issue in the LB. In other words, each Orion instance is behaving correctly and providing the right metrics information according to the request traffic that instance is receiving (please tell me if I'm wrong). You could try to implement a system that accesses each instance directly (i.e. without going through the LB) and accumulates the metrics.
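A minimal sketch of such an accumulation, under several assumptions that are not part of this thread: the Orion pods are exposed inside the cluster through a headless Service (so that a DNS lookup returns one address per pod), they listen on Orion's default port 1026, and they can be reached without the Kong token because the request does not pass through the ingress. All function names are hypothetical. The JSON path is the one used in the snippet from the report, and `serviceTime` is treated as a per-transaction average, as in that snippet.

```python
import json
import socket
import urllib.request


def resolve_instance_urls(headless_host, port=1026):
    """Return one base URL per pod IP behind a headless Service (assumed setup)."""
    infos = socket.getaddrinfo(headless_host, port,
                               family=socket.AF_INET, type=socket.SOCK_STREAM)
    ips = sorted({info[4][0] for info in infos})
    return [f"http://{ip}:{port}" for ip in ips]


def fetch_metrics(base_url, timeout=5):
    """GET /admin/metrics from a single Orion instance and decode the JSON body."""
    with urllib.request.urlopen(base_url + "/admin/metrics", timeout=timeout) as resp:
        return json.load(resp)


def reset_metrics(base_url, timeout=5):
    """DELETE /admin/metrics on a single Orion instance; returns the HTTP status."""
    req = urllib.request.Request(base_url + "/admin/metrics", method="DELETE")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def subservice_metrics(data, service, subservice):
    """Pick one subservice block; returns {} if this instance has no data for it."""
    return (data.get("services", {})
                .get(service.lower(), {})
                .get("subservs", {})
                .get(subservice.lstrip("/"), {}))


def merge_metrics(per_instance):
    """Sum the counters and compute a transaction-weighted average serviceTime."""
    transactions = sum(m.get("incomingTransactions", 0) for m in per_instance)
    errors = sum(m.get("incomingTransactionErrors", 0) for m in per_instance)
    total_time = sum(m.get("serviceTime", 0.0) * m.get("incomingTransactions", 0)
                     for m in per_instance)
    return {
        "incomingTransactions": transactions,
        "incomingTransactionErrors": errors,
        "serviceTime": total_time / transactions if transactions else 0.0,
    }
```

A caller would resolve the instance URLs once, call `reset_metrics` on each of them before a test run, and afterwards pass `[subservice_metrics(fetch_metrics(u), service, subservice) for u in urls]` to `merge_metrics`. Weighting `serviceTime` by the transaction count keeps an idle instance from distorting the cluster-wide average.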

If my understanding is correct, I'd suggest closing this issue (as it is not related to Orion itself, but to the clients or the LB interfacing with its API).

matzeteupel commented 4 months ago

Alright, yes I think you can close the issue then :)