Open yanivbh1 opened 12 months ago
@yanivbh1 I think this is a great idea! I think there is even some potential to take advantage of machine learning using the historical throughput of a station to alert on in conjunction with the manually set policy. Maybe automatic anomaly detection could be a cloud feature 👀
A simple "ping/pong" (a periodic exchange with the adapter) should be good enough. The adapter should run as a regular, external client (not part of the multi-container deployment).
@g41797, that doesn't address the challenge. The scenario I want to tackle here is, for example: in a certain station, every 24 hours there should be at least 100GB of produced data and 300GB of consumed data, and all of a sudden there is only 20GB in and 50GB out. It might be nothing, but it can also be a signal that something is not working. Btw, this arose from one of our customers.
A ping/pong won't help in such a scenario.
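For illustration, the drop described in the comment above can be expressed as a simple percentage deviation against the expected daily volume (the numbers come from the comment; the function itself is a hypothetical sketch, not part of Memphis):

```python
def volume_drop_pct(expected_gb: float, observed_gb: float) -> float:
    """Percentage drop of the observed daily volume relative to the expected volume."""
    return (expected_gb - observed_gb) / expected_gb * 100

# Produced: expected 100 GB, observed 20 GB -> 80% drop
print(volume_drop_pct(100, 20))   # 80.0
# Consumed: expected 300 GB, observed 50 GB -> ~83% drop
print(round(volume_drop_pct(300, 50), 2))  # 83.33
```

A ping/pong only proves the client is reachable; a deviation like this can only be caught by comparing measured throughput against a user-stated expectation.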
Description
Hey, in multiple scenarios, data stopped being produced to or consumed from a Memphis station for various reasons. On some occasions a bug was found; in others it was a client coding issue. In both scenarios there was no crash, so the clients did not write any logs. They appeared connected to Memphis, and Memphis itself did not run into an issue, so no report was made.
To overcome such scenarios and provide a higher level of observability and protection, I suggest adding a per-station ability to define a policy stating the expected range of messages per second produced to and consumed from a station, plus a deviation threshold in %. For example, "if the number of produced messages per second is 50% lower than expected", we have an issue and a notification should be sent.
That policy should be entirely defined by the users and per station; no pre-assumptions should be made.
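A per-station policy of this kind could be sketched roughly as follows. All names, fields, and the evaluation logic are hypothetical illustrations of the proposal, not an existing Memphis API:

```python
from dataclasses import dataclass

@dataclass
class StationThroughputPolicy:
    """Entirely user-defined, per station; no defaults are assumed."""
    min_produced_per_sec: float
    max_produced_per_sec: float
    min_consumed_per_sec: float
    max_consumed_per_sec: float
    deviation_threshold_pct: float  # e.g. 50.0 -> alert on a 50% drop below the minimum

def evaluate(policy: StationThroughputPolicy,
             produced_per_sec: float,
             consumed_per_sec: float) -> list[str]:
    """Return a list of alert reasons when observed rates violate the policy."""
    alerts = []
    checks = [
        ("produced", produced_per_sec,
         policy.min_produced_per_sec, policy.max_produced_per_sec),
        ("consumed", consumed_per_sec,
         policy.min_consumed_per_sec, policy.max_consumed_per_sec),
    ]
    for name, observed, lo, hi in checks:
        # Hard range check: observed rate outside the user-defined window
        if observed < lo or observed > hi:
            alerts.append(f"{name} rate {observed}/s outside [{lo}, {hi}]")
        # Deviation check: % drop relative to the expected minimum rate
        if lo > 0:
            drop_pct = (lo - observed) / lo * 100
            if drop_pct >= policy.deviation_threshold_pct:
                alerts.append(
                    f"{name} rate dropped {drop_pct:.0f}% below the expected minimum")
    return alerts
```

Usage sketch: with `StationThroughputPolicy(100, 1000, 300, 3000, 50.0)`, an observed rate of 40 produced/s would trigger both the range alert and the 60% deviation alert, and a notification would be sent through the station's configured integration.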
Involved components
Additional context
No response