stellio-hub / stellio-context-broker

Stellio is an NGSI-LD compatible context broker
https://stellio.readthedocs.io
Apache License 2.0

Performance and Subscription Issues During Operation in Kubernetes Using Custom Helm Chart #1237

Open michelbarnich opened 2 months ago

michelbarnich commented 2 months ago

Title: Performance and Subscription Issues During Deployment in Kubernetes Using Custom Helm Chart

Description:

We are currently deploying Stellio in a Kubernetes environment using our own Helm Chart.

We are using the following versions for the different components:

During performance tests, several key issues and observations were identified regarding resource usage, data insertion, and subscription behavior. Below is a detailed summary of our findings:

Performance Tests Overview

{
    "id": "urn:ngsi-ld:<Datasource>:<Location>:103280",
    "type": "<Datasource>",
    "direction_degree": {
        "type": "Property",
        "value": <float>,
        "observedAt": "2022-06-24T12:30:00+00:00"
    },
    "knots_mean_10m": {
        "type": "Property",
        "value": <float>,
        "observedAt": "2022-06-24T12:30:00+00:00"
    },
    "knots_max_10m": {
        "type": "Property",
        "value": <float>,
        "observedAt": "2022-06-24T12:30:00+00:00"
    },
    "degree_celsius_2m": {
        "type": "Property",
        "value": <float>,
        "observedAt": "2022-06-24T12:30:00+00:00"
    },
    "degree_celsius_dew": {
        "type": "Property",
        "value": <float>,
        "observedAt": "2022-06-24T12:30:00+00:00"
    },
    "air_pressure": {
        "type": "Property",
        "value": <float>,
        "observedAt": "2022-06-24T12:30:00+00:00"
    },
    "air_pressure_geopotential_height": {
        "type": "Property",
        "value": <float>,
        "observedAt": "2022-06-24T12:30:00+00:00"
    },
    "precipitation_10m": {
        "type": "Property",
        "value": <float>,
        "observedAt": "2022-06-24T12:30:00+00:00"
    },
    "radiation_10m": {
        "type": "Property",
        "value": <float>,
        "observedAt": "2022-06-24T12:30:00+00:00"
    },
    "seconds": {
        "type": "Property",
        "value": <float>,
        "observedAt": "2022-06-24T12:30:00+00:00"
    },
    "wind_speed_mps": {
        "type": "Property",
        "value": <float>,
        "observedAt": "2022-06-24T12:30:00+00:00"
    },
    "wind_gust_mps": {
        "type": "Property",
        "value": <float>,
        "observedAt": "2022-06-24T12:30:00+00:00"
    },
    "relative_humidity": {
        "type": "Property",
        "value": <float>,
        "observedAt": "2022-06-24T12:30:00+00:00"
    },
    "location": {
        "type": "Property",
        "value": {
            "type": "Point",
            "coordinates": [
                <x>,
                <y>
            ]
        }
    },
    "address": {
        "type": "Property",
        "value": {
            "PostalAddress": {
                "addressCountry": "DE",
                "addressLocality": "<Location>"
            }
        }
    },
    "dataProvider": {
        "type": "Property",
        "value": "<Sensor Name>-Sensor"
    },
    "dateObserved": {
        "type": "Property",
        "value": "2022-06-24T12:30:00+00:00",
        "observedAt": "2022-06-24T12:30:00+00:00"
    },
    "@context": [
        "https://sample.context-file"
    ]
}

The data is sent to the following endpoint: http://<stellio API service>/ngsi-ld/v1/entityOperations/upsert?options=update

This is our subscription:

{
    "type": "Subscription",
    "subscriptionName": "Subscription for entity type <type>.",
    "description": "This subscription triggers every time an entity of type <type> is updated. Only watching attribute energy to avoid duplications.",
    "entities": [
        {
            "type": "<type>"
        }
    ],
    "watchedAttributes": ["dateObserved"],
    "notificationTrigger": ["entityCreated", "attributeCreated", "attributeUpdated"],
    "notification": {
        "format": "normalized",
        "endpoint": {
            "uri": "http://<quantumleap service>/v2/notify",
            "accept": "application/json"
        }
    },
    "@context": [
        "https://<context host>"
    ]
}

Issues and Observations

1. Inserting/Updating Entities One-by-One:

* **Performance Limit:** The highest rate achieved was 10 msg/s. Increasing resources did not allow for a 20 msg/s rate.

* **Resource Usage:** Significant resources were required for the Stellio components (API, Search, Subscription), especially memory (~2.9 GB RAM per component). PostgreSQL and all Stellio components together reached over 400 IOPS, almost 5 CPU, and around 10 GB memory.

Graphs for CPU and IOPS during tests: [figures 1 and 2]

Is there a way for us to improve the resource usage?

2. Subscriptions and Insertion Behavior:

* Subscriptions trigger correctly, but data in QuantumLeap is often duplicated or incorrectly updated. We suspect a race condition between the Search and Subscription components.

Example of duplicated/wrong updates (query results): [figure 5]

The timestamp and/or entity_id ends up being equal to other entries, even though when the messages were originally sent to Stellio, the entity_id and timestamp were different. Stellio accidentally merges multiple messages together, resulting in wrong entries for certain entities or timestamps.

3. Batch Insertion Performance:

* Sending data in batches significantly improved resource efficiency. A rate of 10 msg/s allowed for sending ~15,000 entities using batches of 50. Higher rates (20 msg/s) were achievable with batches of 20 entities.

* Unfortunately, subscriptions often failed to store any data in QuantumLeap when batches were used.

Graph showing improved resource usage with batches: [figure 6]

4. Kafka Configuration:

* Kafka is writing data to local disk on the Kubernetes nodes, which causes problems because the disk eventually fills up. Are these the messages that are stored? Would it make sense to set a retention for the messages, since the Stellio components do not need the history of the messages?

5. Subscription Component Behavior:

6. API-Gateway Container Crashes in Load Tests (Other Environments):

* In some environments, we have observed that the API-Gateway container crashes a few minutes into a load test without any apparent reason. Resource usage remains normal, and there are no errors in the logs to indicate why this is happening.

* Could there be a way to increase the logging level for more detailed diagnostics to help identify the root cause?

7. Postgres max connections:

Stellio doesn't seem to use one (or maybe a couple of) open connections to its DB, but rather opens a new connection for each message it receives. Under high load, this leads to an issue in the database:

High message rates trigger the error "remaining connection slots are reserved for non-replication superuser connections".

This could be handled in two ways: pooling connections/transactions or using PgBouncer.
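
To illustrate the pooling option, here is a minimal sketch of a fixed-size connection pool. It is not Stellio's implementation; `FakeConnection` and `ConnectionPool` are hypothetical names standing in for a real driver and pool. The point is only that a bounded pool reuses a small set of connections, so the number of open connections stays below the database limit regardless of the message rate.

```python
import queue
from contextlib import contextmanager


class FakeConnection:
    """Hypothetical stand-in for a real database connection."""

    opened = 0  # counts how many connections were ever created

    def __init__(self):
        FakeConnection.opened += 1

    def execute(self, statement):
        return f"ok: {statement}"


class ConnectionPool:
    """Fixed-size pool: connections are created once and then reused."""

    def __init__(self, factory, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def connection(self, timeout=5.0):
        # Waits (up to `timeout` seconds) for a free connection instead of
        # opening a new one, so the database never sees more than `size`.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)


pool = ConnectionPool(FakeConnection, size=3)
results = []
for i in range(100):  # 100 "messages", still only 3 connections
    with pool.connection() as conn:
        results.append(conn.execute(f"upsert {i}"))
```

With this pattern, a burst of messages queues up for a free connection rather than exhausting the PostgreSQL connection slots.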

We hope this feedback is helpful, and we’d appreciate any insights or recommendations on addressing these issues, particularly around Kafka configurations, PostgreSQL reconnections, subscriptions during batch inserts, and the API-Gateway crashes.

bobeal commented 2 months ago

Hi @michelbarnich,

Many thanks for this detailed report and very sorry for this late reply.

I have first a general question: why are you using QuantumLeap? It was typically used for NGSIv2 context brokers because there was no temporal API in NGSIv2, but is generally not used with NGSI-LD context brokers which have a native temporal API. I had a quick look at the QuantumLeap repository on GH, not sure the NGSI-LD support is complete (nothing really new for NGSI-LD since 2021, which is a really long time for NGSI-LD!). We did something a bit similar with Apache NiFi to export denormalized representations of entities easily usable in a BI tool like Apache Superset and it is quite complex to support the many features of NGSI-LD...

The data is sent to the following endpoint: http://<stellio API service>/ngsi-ld/v1/entityOperations/upsert?options=update

I don't think it will change the numbers much, but you could use the Batch Entity Merge endpoint instead. Batch Entity Update will replace the attributes found in the payload, Batch Entity Merge will merge them.
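
As a rough sketch of how a client could switch between the two batch operations, the snippet below builds (but does not send) the HTTP requests. The host name `stellio.example` and the helper names are hypothetical, and the merge path `entityOperations/merge` is an assumption to verify against the Stellio API documentation; the upsert path is the one quoted above.

```python
import json
import urllib.request

BASE_URL = "http://stellio.example"  # hypothetical host, replace with your service


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batch_requests(entities, batch_size, operation="upsert?options=update"):
    """Build one POST request per batch; nothing is sent here."""
    url = f"{BASE_URL}/ngsi-ld/v1/entityOperations/{operation}"
    requests = []
    for batch in chunked(entities, batch_size):
        requests.append(urllib.request.Request(
            url,
            data=json.dumps(batch).encode("utf-8"),
            # The entities carry their own @context, hence JSON-LD.
            headers={"Content-Type": "application/ld+json"},
            method="POST",
        ))
    return requests


entities = [
    {"id": f"urn:ngsi-ld:Sensor:{i}", "type": "Sensor",
     "@context": ["https://sample.context-file"]}
    for i in range(120)
]
upsert_requests = build_batch_requests(entities, batch_size=50)
# Assumed path for Batch Entity Merge, check it before relying on it.
merge_requests = build_batch_requests(entities, batch_size=50, operation="merge")
```

Each request can then be sent with `urllib.request.urlopen(request)` or any HTTP client of your choice.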

1. Inserting/Updating Entities One-by-One:

* **Performance Limit:** The highest rate achieved was 10 msg/s. Increasing resources did not allow for a 20 msg/s rate.

The results are surprising to me. We ran a performance campaign at the beginning of 2024 and achieved considerably better results on a single "standard" VM. You can see some numbers on https://stellio.readthedocs.io/en/latest/admin/performance.html (we'll run them again after the fix with the DB connection pool, see the end of my answer).

The main difference between the two configurations is that we were running Stellio in docker-compose and you are deploying it in a Kubernetes cluster.

Do you mind if I use your Helm charts? I'd like to reproduce your environment as closely as possible and analyze the cause of the performance issues.

* **Resource Usage:** Significant resources were required for the Stellio components (API, Search, Subscription), especially memory (~2.9 GB RAM per component). PostgreSQL and all Stellio components together reached over 400 IOPS, almost 5 CPU, and around 10 GB memory.

Is there a way for us to improve the resource usage?

The total RAM used is what we typically see in our deployments. Typically, the components using most of it are PostgreSQL and Kafka. Then come search and subscription services. The IOPS are very high but this is not a problem. It only means that you have good storage performance :)

2. Subscriptions and Insertion Behavior:

* Subscriptions trigger correctly, but data in QuantumLeap is often duplicated or incorrectly updated. We suspect a race condition between the Search and Subscription components.

What is in the time_index column? Is the count column the number of unique (entity_id, time_index) pairs?

I don't know how QuantumLeap works but there is one thing to keep in mind. As explained in https://stellio.readthedocs.io/en/latest/user/internal_event_model.html, if you update two attributes in an entity, Stellio will internally trigger two events, which may end up as two notifications with the same content being sent (depending on the subscription). This behavior may cause some duplication if not properly handled.
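
For illustration, a receiver could guard against such repeated notifications with a small deduplication step. This is only a sketch under assumptions: the function names are hypothetical, and it assumes the standard NGSI-LD notification shape in which the notified entities are listed under `data`.

```python
import json


def entity_fingerprint(entity):
    """Key that is identical for two entities with the same id and content."""
    return (entity["id"], json.dumps(entity, sort_keys=True))


def deduplicate(notifications):
    """Return each distinct entity state once, in arrival order."""
    seen = set()
    unique = []
    for notification in notifications:
        for entity in notification.get("data", []):
            key = entity_fingerprint(entity)
            if key not in seen:
                seen.add(key)
                unique.append(entity)
    return unique


state_a = {"id": "urn:ngsi-ld:Sensor:1", "type": "Sensor",
           "dateObserved": {"type": "Property", "value": "2022-06-24T12:30:00+00:00"}}
state_b = {"id": "urn:ngsi-ld:Sensor:1", "type": "Sensor",
           "dateObserved": {"type": "Property", "value": "2022-06-24T12:40:00+00:00"}}

notifications = [
    {"id": "urn:ngsi-ld:Notification:1", "type": "Notification", "data": [state_a]},
    {"id": "urn:ngsi-ld:Notification:2", "type": "Notification", "data": [state_a]},
    {"id": "urn:ngsi-ld:Notification:3", "type": "Notification", "data": [state_b]},
]
unique = deduplicate(notifications)
```

In a long-running service the `seen` set would need to be bounded (for example by time window), which this sketch leaves out.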

3. Batch Insertion Performance:

* Sending data in batches significantly improved resource efficiency. A rate of 10 msg/s allowed for sending ~15,000 entities using batches of 50. Higher rates (20 msg/s) were achievable with batches of 20 entities.

* Unfortunately, subscriptions often failed to store any data in QuantumLeap when batches were used.

What was the problem? Subscription service not able to send the data or QuantumLeap not able to handle the rate?

4. Kafka Configuration:

* Kafka is writing data to local disk on the Kubernetes nodes, which causes problems because the disk eventually fills up. Are these the messages that are stored? Would it make sense to set a retention for the messages, since the Stellio components do not need the history of the messages?

Among other things (https://stellio.readthedocs.io/en/latest/user/internal_event_model.html), Kafka is used to decouple the communication between search and subscription services. If only used for the communication between the two services, you can safely set a low retention time in Kafka (IIRC, by default, Kafka has a 7 days retention period).
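
As a small sketch of what lowering the retention could look like, the helper below converts a retention period into the value for Kafka's per-topic `retention.ms` setting and assembles the matching `kafka-configs.sh` invocation. The broker address and topic name are hypothetical placeholders; the actual topic names used by Stellio are not given in this thread.

```python
def retention_ms(hours):
    """Convert a retention period in hours to milliseconds."""
    return int(hours * 60 * 60 * 1000)


def retention_command(bootstrap_server, topic, hours):
    """Argument list for altering a topic's retention with kafka-configs.sh."""
    return [
        "kafka-configs.sh",
        "--bootstrap-server", bootstrap_server,
        "--alter",
        "--entity-type", "topics",
        "--entity-name", topic,
        "--add-config", f"retention.ms={retention_ms(hours)}",
    ]


# Hypothetical broker address and topic name, for illustration only.
command = retention_command("kafka:9092", "example-topic", hours=1)
print(" ".join(command))
```

The 7-day default mentioned above corresponds to `retention_ms(168)`, so a retention of a few hours would reduce the disk footprint considerably.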

6. API-Gateway Container Crashes in Load Tests (Other Environments):

* In some environments, we have observed that the API-Gateway container crashes a few minutes into a load test without any apparent reason. Resource usage remains normal, and there are no errors in the logs to indicate why this is happening.

* Could there be a way to increase the logging level for more detailed diagnostics to help identify the root cause?

Yes, you can follow this to change the log level of a module: https://stellio.readthedocs.io/en/latest/admin/misc_configuration_tips.html#change-the-log-level-of-a-library-namespace (you can use LOGGING_LEVEL_ROOT to change the root log level)

7. Postgres max connections:

Stellio doesn't seem to use one (or maybe a couple of) open connections to its DB, but rather opens a new connection for each message it receives. Under high load, this leads to an issue in the database:

High message rates trigger the error "remaining connection slots are reserved for non-replication superuser connections".

This could be handled in two ways: pooling connections/transactions or using PgBouncer.

Indeed, you spotted a big issue here! There was an issue in the configuration of the connection pool and it was not properly running. I created a PR (https://github.com/stellio-hub/stellio-context-broker/pull/1241) which will be internally reviewed today. Once validated, I will publish a fix release of Stellio.

From the tests I've done, it should also fix the problem with the subscription service struggling to reconnect to the DB.

michelbarnich commented 2 months ago

Hello,

Thank you for your answer. Of course you can use our Helm Chart for testing. During the day, I will come back with answers to your questions and suggestions.

Thank you very much!

michelbarnich commented 2 months ago

Hi @bobeal,

I wanted to provide some additional context and updates regarding our setup and results:

General Question:

We use QuantumLeap for a de-normalized representation of entities.

1:

Test Results: After reviewing your test results, we suspect that the performance degradation might be due to larger entities in our system.

2:

We have had the problem you describe and we therefore only subscribe to one field in the entity which we know will always be updated, i.e. "dateObserved". It can therefore not be caused by the workings of the internal event model.

time_index is an attribute set by QuantumLeap. When QuantumLeap gets a notification, it checks all observed_at properties and uses the newest one to set the time_index.
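
To make the described rule concrete, here is a rough sketch of it. It is based only on the description above, not on QuantumLeap's source code; the function name is hypothetical and the attribute key `observedAt` follows the entity example in this issue. It shows how two notifications for the same entity can map to the same (entity_id, time_index) pair when their newest timestamp is identical.

```python
from datetime import datetime


def newest_observed_at(entity):
    """Return the latest observedAt among the entity's attributes, or None."""
    timestamps = [
        datetime.fromisoformat(attribute["observedAt"])
        for attribute in entity.values()
        if isinstance(attribute, dict) and "observedAt" in attribute
    ]
    return max(timestamps) if timestamps else None


first = {
    "id": "urn:ngsi-ld:Sensor:1",
    "type": "Sensor",
    "air_pressure": {"type": "Property", "value": 1013.2,
                     "observedAt": "2022-06-24T12:20:00+00:00"},
    "dateObserved": {"type": "Property", "value": "2022-06-24T12:30:00+00:00",
                     "observedAt": "2022-06-24T12:30:00+00:00"},
}
# Same entity, one attribute refreshed, newest timestamp unchanged.
second = dict(first, air_pressure={"type": "Property", "value": 1012.8,
                                   "observedAt": "2022-06-24T12:30:00+00:00"})

key_first = (first["id"], newest_observed_at(first))
key_second = (second["id"], newest_observed_at(second))
```

Here `key_first` and `key_second` are equal even though the two payloads differ, which is one way rows with the same entity_id and time_index could arise.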

3:

Subscription Service: From our analysis, it seems that the subscription service might struggle to check subscription triggers correctly, though this is still a suspicion at this stage.

Thanks for your suggestions and the pull request. We’ll test it out and report back.

bobeal commented 1 month ago

Hi @michelbarnich,

General Question:

We use QuantumLeap for a de-normalized representation of entities.

Ok, so similar to what we are doing with NiFi :)

1:

Test Results: After reviewing your test results, we suspect that the performance degradation might be due to larger entities in our system.

I am currently running our load test suite to get fresh new numbers. I'll add some tests with an entity similar to the one you are using and see how it behaves.

2:

We have had the problem you describe and we therefore only subscribe to one field in the entity which we know will always be updated, i.e. "dateObserved". It can therefore not be caused by the workings of the internal event model.

time_index is an attribute set by QuantumLeap. When QuantumLeap gets a notification, it checks all observed_at properties and uses the newest one to set the time_index.

OK, let me think and do some tests about this one.

michelbarnich commented 1 month ago

Hi there,

I hope you're doing well! I wanted to check in and see if there have been any updates on the performance issue. We’re planning to run some tests soon for choosing a broker for a lighthouse project and would love to know if there's anything new we should be aware of before proceeding. Thanks for your time and support!

Best regards, Michel

bobeal commented 1 month ago

Hi @michelbarnich,

The problem with the DB connection pool has been fixed and is part of the recently released version 2.17.1.

We took this opportunity to run our load test suite and results were indeed better.

We have then been a bit busy on other topics but in the next two weeks, we are planning to:

At the beginning of November, we should have some time to use your Helm charts and do some testing in a k8s environment.

I also keep in mind the problem with the subscription service (I was wondering if it was not also related to the DB connection pool...).

Btw, were you able to get more info about the API-Gateway container crashes during the load tests?

Regards,

Benoit.

bobeal commented 1 month ago

Hello @michelbarnich,

We just did some tests with a larger entity having the same number of attributes (and same "topology") as the one you provided in the issue.

We noticed (using the exact same hardware configuration as for our previous load tests):

At first sight, it seems consistent with our previous results. We have some ideas to improve the creation time, we'll work on them soon.

I'll let you know when we have some progress on the other topics.