[BUG] OMRS connectivity issues if Kafka is down during the metadata repository start up

odpi / egeria

Egeria core

https://egeria-project.org

Apache License 2.0

805 stars 260 forks source link

[BUG] OMRS connectivity issues if Kafka is down during the metadata repository start up #6791

Closed pmadugundu closed 2 years ago

pmadugundu commented 2 years ago

Is there an existing issue for this?

[X] I have searched the existing issues

Current Behavior

Note: At present, there is no proper capability in Egeria library to find the health of OMRS connection for a cohort. I think https://github.com/odpi/egeria/issues/5471 will fill this gap. The WKC services determines the status of OMRS health by calling OMRSRepositoryRESTServices.getMetadataCollectionId() method. If the method returns non-null metadata collection ID, it is considered as CONNECTED, otherwise DISCONNECTED. I understand that this is not the right approach.

I would like to report the following 2 issues to improve the resilience of OMRS connectivity from the metadata repository:

Consider the metadata repository that is already registered to a cohort is restarted , but the Kafka is down. Egeria attempts to connect to Kafka for 30 mins. In this 30 minutes period, the present method (explained above) gives wrong status, CONNECTED. And, the OMRS messages sent to the event mapper are lost. After 30 mins, the present method returns the right status, DISCONNECTED.
If Kafka is restored before 30 mins, OMRS connections is established and OMRS messages are sent to the cohort. But, if the Kafka is restored after 30 mins, the connection is never re-established.

Expected Behavior

The Egeria library should provide the right status all the time.
OMRS connection should be re-established event when Kafka is restored after 30 mins.

Steps To Reproduce

No response

Environment

- Egeria: 3.10
- OS:
- Java:
- Browser (for UI issues):
- Additional connectors and integration:

Any Further Information?

No response

mandy-chessell commented 2 years ago

I am struggling to understand the description above so forgive me as I seek clarification. Am I understanding the request correctly?

Background description of cohort functionality

Connectivity to a cohort is via multiple connections. There are the event bus connections (kafka in this case) to the three topics and there are the REST API connections to all other members to support federated queries. Kafka can be down and the cohort can still be delivering metadata to its callers without error and in that sense it is still connected.

Kafka is needed to connect a new member. If it is down the registration request needs to be retried. This is reflected in the cohort connection status. The cohort connection status is not a reflection of whether kafka is up or down but a reflection of the state of the registration to the cohort and whether the cohort manager is running. Longer term, the plan is to allow a server to control its connectivity to the cohort - such as registering and unregistering from a cohort, but this is not implemented at the moment.

The OMRSRepositoryRESTServices.getMetadataCollectionId() call listed above has no reflection on whether the kafka topic conection is up or down. It calls the local repository to see if it is initialised. A metadata server can run successfully without connecting to any cohorts. If the setting of this value seems to coincide with the connection to the cohort topics in your set up then it is a coincidence because both features are initialised at similar times.

Kafka provides background exchange of events about types and instances. The protocol is such that lost of these types of events are no problem because the federated queries drive refresh events. In general, small outages in kafka have no impact on the working of cohort, so changing the cohort connection status when kafka is down would be invalid.

My understanding of the request

What I think is being asked for is an ability to determine the health of each of the connections:

To the cohort topics.
To each of the remote members via REST

This is a good idea but it is a new feature request rather than a bug.

There has been some background work to give connectors the ability to post status and statistics. Neither the REST nor the kafka connector takes advantage of this feature yet and this would be an area of enhancement.

It is possible to see the status and statistics of the connectors though the audit log report - but since this gives the status of all components in the server, it is a cumbersome mechanism. Ideally there would be a new metadata highway REST API call to allow the status of the topic/REST connectors to be queried.

Is this what you are asking for?

pmadugundu commented 2 years ago

@mandy-chessell Thanks for your response.

The ability to determine the health of each of the cohort topic connection will address the 1st issue in my comment. And, please note that, at present, REST API is not used.

Regarding the 2nd issue in my first comment, it seem Egeria attempts to reconnect only for 30 mins. If Kafka is restored after 30 mins, the connection is never restored. Please consider this issue as well.

mandy-chessell commented 2 years ago

The federated queries is an important part of the design and is used by other organizations. If fact, although all members do not need to support it, we do not recommend running a cohort without at least one member using the federated queries.

A solution that does not consider the whole cohort protocol is not attractive.

mandy-chessell commented 2 years ago

Thinking about this more, it seems there are two design approaches.

The initial design was that the event bus (eg kafka) is an essential service and the server should terminate if kafka is down. Then we had a small refinement for a K8 environment where it would wait a short time if Kafka is down, and only exit if it did not come up during the wait time.
An alternative design is that kakfa is not considered an essential service and so the server keeps running is kafka is down. The connector is responsible for continuing to connect if there is an outage.

From the description above the connector tries to reconnect for 30 mins and then if kafka fails to start in time, the connector does not bring the server down as it is supposed to do. So, the behavior we have is in limbo between the two approaches whch makes the server difficult to control. We should make a choice about which one to go with.

If we want to go with the first design, the time that the server waits should be significantly reduced to say 5 mins and we need to make sure that the server exits if the event bus stops (or is never started).
If we want to go with the second design, we need to make the retry indefinate, and add the ability to query the status of each event bus connection - this includes the OMAS In/Out Topics as well as the cohort topics.

I am currently in favour of the second approach because a server could be connected to multiple event buses and if one was flakey it would be annoying to have it cause the server to fail - and even if all event buses were down, the server could still be operating correctly via the REST APIs.

However, I would also like @planetf1 's view on this since he is working through the design for the automated operator.

planetf1 commented 2 years ago

@pmadugundu Apologies for delay (vacation - through to Wed). First a response to the original text:

The current behaviour of the Kafka topic connector in retrying the initial connection to Kafka is configurable by the user. The docs for this are at https://egeria-project.org/connectors/resource/kafka-open-metadata-topic-connector/?h=kafka+topic#handling-kafka-cluster-bring-up-issues where the timing is explained. The defaults mean a default delay of 50s if the endpoint address is not resolvable (ie k8s service not defined), or requests fail, and 10 minutes if the endpoint is resolvable. For 30 minutes, you would have overridden the defaults. You could also set the retries to 0 which will mean we do not retry, and will simply aim to connect once, failing immediately (or at least after the Kafka api timeout of 60s if unreachable) if this does not work.
If you want more than 30 minutes, you could extend the retry period - this works. However I would suggest it's more confusing, and instead not having retries is clearer, and simpler, and you can ensure any orchestration you have to start the server is doing an appropriate query to check, and then acting accordingly. (Your second point)
More background and explanation can be seen in https://github.com/odpi/egeria/issues/6530 (the change from the fix to this issue was in ensuring the start fails if we get to the end of the retry period)
It has previously been suggested to ensure that you have a) a Kafka cluster with multiple brokers defined (>3) -- and to have a number of these set as the event broker endpoint address in egeria configuration (>3), and also to ensure you wait for Kafka before bringing the egeria server up through an initialisation container - a common practice.
If Kafka is down we cannot send messages to Kafka. There is further information on the behaviour in the above mentioned issue. By keeping the server from starting we should prevent the server from responding to requests, perhaps giving the impression it is working ok.
If during any retry period you believe the server is acting incorrectly, do point out the specifics so we can take a look at the behaviour. I'm inclined to think any REST / client API call to get the metadata collection id should fail within this period of the server not running (will check) - but see further discussion in a later post. (This relates to your first point)

planetf1 commented 2 years ago

@mandy-chessell Now to respond to your excellent description of behaviour

If the topic connector cannot connect to Kafka within the configured retry period, the server will not initialise. This was the bug fix added into 3.10 in PR #6611. So we should not be in an indeterminate state and was to address the limbo state you mention. Agree with the explanation of the metadata collection id above, use of the fine grained state query would be more appropriate imo.
The above fix targeted the first approach - ie to not start in a bad state, and provide the option for relatively immediate feedback (given connection timeout ~60s) to any request to start the server, so that the caller/orchestrator was clear on the state. Providing additional queries for status can always be useful -- however I think the intent of what is there now is workable and clear, accepting there is a limitation that if ANY topic connector fails (regardless of broker) then startup will fail
On the second approach - to some extent this is about delegation of responsibility. We would be saying the Kafka topic connector takes responsibility to 'make it so' and not bother any component with the details. That is appealing, but may set the wrong expectation since it's more likely a server would come up without Kafka, and then the user would be confused why it's not exchanging messages - even though there would be audit log messages and the ability to query status. We also don't know if the config error would be permanent or temporary. Whilst a similar situation could still occur after the server has started (making a status query still valuable) I think this abandoning of startup would keep things simple for more users more of the time.

My inclination therefore @pmadugundu , at least for now is to:

check the current calls to retrieve status are indeed showing the server is not active during the retry period (me)
suggest you either reduce the retries (to get immediate feedback), and act to orchestrate a restart in your code, or prolong the retry if you really need to
- orchestrate service startup to ensure Kafka is ready prior to starting the server with a k8s init container or similar
In addition we do still need to consider additional status reporting for connectors, but I think this is an enhancement that isn't required to satisfy your needs given the above

planetf1 commented 2 years ago

I took a look at reported status.

{{baseURL}}/open-metadata/platform-services/users/{{user}}/server-platform/servers/{{server}}/status

for cocoMDS2 during a period Kafka is down but the server is starting, is reporting as follows

This can appear confusing, since though the server is in the process of starting -- but we could argue the fact it is starting up, doing something, justifies a state of 'active', just that its not ready ie:

{
    "class": "ServerStatusResponse",
    "relatedHTTPCode": 200,
    "serverName": "cocoMDS2",
    "serverType": "Metadata Access Store",
    "serverStartTime": "2022-08-07T17:07:31.688+00:00",
    "active": true
}

{{baseURL}}/open-metadata/platform-services/users/{{user}}/server-platform/servers/{{server}}/services

{
    "class": "ServerServicesListResponse",
    "relatedHTTPCode": 200,
    "serverName": "cocoMDS2",
    "serverServicesList": [
        "Open Metadata Repository Services (OMRS)",
        "OMAG Server Operational Services"
    ]
}

{{baseURL}}/open-metadata/platform-services/users/{{user}}/server-platform/servers/active

{
    "class": "ServerListResponse",
    "relatedHTTPCode": 200,
    "serverList": [
        "cocoMDS2"
    ]
}

planetf1 commented 2 years ago

But the key is to use the more fine grained service, which shows whether the service is starting, or actually started. -- for example

{{baseURL}}/open-metadata/admin-services/users/{{user}}/servers/{{server}}/instance/status

{
    "class": "OMAGServerStatusResponse",
    "relatedHTTPCode": 200,
    "serverStatus": {
        "serverName": "cocoMDS2",
        "serverType": "Metadata Access Store",
        "serverActiveStatus": "STARTING",
        "services": [
            {
                "serviceName": "Open Metadata Repository Services (OMRS)",
                "serviceStatus": "STARTING"
            }
        ]
    }
}

planetf1 commented 2 years ago

Here you can see 'serverActiveStatus' says 'STARTING' - ie it is not ready yet.

You'll see audit log messages during this time

Sun Aug 07 18:50:34 BST 2022 cocoMDS2 Startup OCF-KAFKA-TOPIC-CONNECTOR-0015 The local server is attempting to connect to Kafka brokers at localhost:9092 [ attempt 3 of 10 ]

You can then wait, and if the timeout expires you will then get:

{
    "class": "OMAGServerStatusResponse",
    "relatedHTTPCode": 404,
    "exceptionClassName": "org.odpi.openmetadata.frameworks.connectors.ffdc.InvalidParameterException",
    "actionDescription": "getActiveServerStatus",
    "exceptionErrorMessage": "OMAG-MULTI-TENANT-404-001 The OMAG Server cocoMDS2 is not available to service a request from user garygeeke",
    "exceptionErrorMessageId": "OMAG-MULTI-TENANT-404-001",
    "exceptionErrorMessageParameters": [
        "cocoMDS2",
        "garygeeke"
    ],
    "exceptionSystemAction": "The system is unable to process the request because the server is not running on the called platform.",
    "exceptionUserAction": "Verify that the correct server is being called on the correct platform and that this server is running. Retry the request when the server is available.",
    "exceptionProperties": {
        "serverName": "cocoMDS2",
        "parameterName": "serverName"
    }
}

Or, if Kafka does come up in the time you will get:

{
    "class": "OMAGServerStatusResponse",
    "relatedHTTPCode": 200,
    "serverStatus": {
        "serverName": "cocoMDS2",
        "serverType": "Metadata Access Store",
        "serverActiveStatus": "RUNNING",
        "services": [
            {
                "serviceName": "Subject Area OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Security Officer OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Open Metadata Repository Services (OMRS)",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Data Privacy OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Community Profile OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Asset Consumer OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Asset Lineage OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Asset Catalog OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "IT Infrastructure OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Asset Owner OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Connected Asset Services",
                "serviceStatus": "STARTING"
            },
            {
                "serviceName": "Digital Architecture OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Glossary View OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Governance Program OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Project Management OMAS",
                "serviceStatus": "RUNNING"
            },
            {
                "serviceName": "Governance Engine OMAS",
                "serviceStatus": "RUNNING"
            }
        ]
    }
}

(in this example, from our labs, we have a variety of OMAS running too)

So I think you have the ability to

Have Kafka running before you even try to start the server (k8s init container)
have retries if you wish up to 'a very long time', or not ( I think easier not to, your choice)
Detect the response from the attempt to start - failing if Kafka not available
be able to query the server for an accurate status in terms of starting or not

This seems to be sufficient.. even without any changes to how connector status is reported, or any changes to whether Kafka is essential or not (which is discussion we can continue).

@pmadugundu does this make sense?

I'd also be inclined to suggest the DEFAULT for Kafka topic connector is to NOT do any retries - I feel it's much simpler then, though I wouldn't propose removing the capability.

planetf1 commented 2 years ago

I think we have addressed the startup case. I've opened #6813 as a follow-on to track the discussion around behaviour once the server is up.