Make sure logic around Kafka is handled properly

thoth-station / thoth-application

Thoth-Station ArgoCD Applications

GNU General Public License v3.0


Make sure logic around Kafka is handled properly #2017

Open fridex opened 3 years ago

fridex commented 3 years ago

Is your feature request related to a problem? Please describe.

As discussed on the Tech talk:

We should make sure the messages are handled properly and all the components work well with respect to message publishing and handling:

[x] make sure Kafka keeps messages published even on restarts
[ ] if a component cannot publish a message, it should be restarted and stay in the CrashLoopBackOff state until Kafka is up again
[ ] all components that both persist state to the database and use Kafka should handle these cases in the following order:
1. publish a message
2. write data to the database

Not the other way around. If publishing a message fails, an exception should be raised so that nothing is written to the database (and the component ends up in the crash loop backoff state discussed above).
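The ordering above can be sketched with in-memory stand-ins. This is only an illustration, not code from any Thoth component: `PublishError`, `failing_publish`, and both handler functions are hypothetical names, and a plain dict plays the role of the database.

```python
class PublishError(Exception):
    """Raised by the hypothetical publisher when Kafka is unreachable."""


def failing_publish(message):
    # Stand-in for a producer call that cannot reach the broker.
    raise PublishError("broker unavailable")


def publish_then_write(publish, database, key, message):
    # Requested order: if publish() raises, the write below never runs.
    publish(message)
    database[key] = message


def write_then_publish(publish, database, key, message):
    # Reversed order: the record is stored even though publish() then fails.
    database[key] = message
    publish(message)


good_db, bad_db = {}, {}
for handler, db in ((publish_then_write, good_db), (write_then_publish, bad_db)):
    try:
        handler(failing_publish, db, "analysis-1", {"status": "requested"})
    except PublishError:
        pass  # a real component would let this propagate and crash the pod
```

With the requested order, `good_db` stays empty after the failed publish; with the reversed order, `bad_db` holds a record for a message that was never sent.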

References: https://docs.google.com/document/d/1XmXYEWEwgOBHqdJkUTmhBvUb41pr5GIR2pePasvvZ1s/edit#

goern commented 3 years ago

/kind feature
/priority backlog
/assign @KPostOffice
/assign @pacospace
/triage accepted

sesheta commented 2 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

KPostOffice commented 2 years ago

/remove-lifecycle stale

sesheta commented 2 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

sesheta commented 2 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

harshad16 commented 2 years ago

/lifecycle frozen
/sig stack-guidance

KPostOffice commented 1 year ago

1: should already be handled by Strimzi. I don't think this is on us, as we don't own that deployment as far as I know.
2: add a check to the deployments' readiness probes; there is already a function in messaging that checks whether Kafka is reachable.
3: this will likely take the longest. It just requires a lot of reading through source code.
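A rough sketch of what such a readiness check could look like. This is not the existing function in messaging; `kafka_reachable` and `probe_exit_code` are hypothetical names, and a bare TCP connect only shows that the broker port accepts connections, not that Kafka itself is healthy.

```python
import socket


def kafka_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def probe_exit_code(host, port):
    # Exit code 0 marks the pod ready; any other value marks it not ready.
    return 0 if kafka_reachable(host, port) else 1
```

An exec-style probe script could then end with `sys.exit(probe_exit_code(host, port))`, with the host and port taken from the component's own configuration.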

goern commented 1 year ago

I ticked off 1 and 2

Why do we preserve state in the database at all? Or is this also meant as "we read our to-do list from the database"? If not the latter, wouldn't it be good enough to get the current state of a component from the Kafka topic?

@KPostOffice do you have an example handy?

KPostOffice commented 1 year ago

@goern

See here: https://github.com/thoth-station/user-api/blob/f9a2cfe6aa9a553240a18488cdf5ab66d8de95c0/thoth/user_api/api_v1.py#L279-L284

Message is sent, then metadata is persisted to the DB. You can see this pattern everywhere messages are sent in user-api. Item 3 basically requires making sure these two things always happen in this order whenever a message is produced. If a message fails to send, we want the application state to look as if the message had never been sent at all.

Say a new package is released and we want to send a message whenever the system sees this happen to trigger some other workflow. We only know that the package is new because it is not in the DB yet, so if the message fails to send but we add the package to the DB, then the application will never attempt to resend the message.
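The new-package scenario can be sketched as follows, assuming a set stands in for the DB table of known packages. `FlakyPublisher` and `process_release` are hypothetical names used only for illustration.

```python
class FlakyPublisher:
    """Hypothetical publisher that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.sent = []

    def __call__(self, message):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("broker unavailable")
        self.sent.append(message)


def process_release(package, known_packages, publish):
    # A package counts as new only while it is absent from the database.
    if package in known_packages:
        return False
    publish({"package": package})  # may raise; nothing is stored in that case
    known_packages.add(package)
    return True
```

Because the package is only recorded after a successful publish, a failed attempt leaves it "new", and the next run tries to send the message again instead of silently skipping it.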