Open yanivbh1 opened 12 months ago
@yanivbh1 I think this is a great idea! I think there is even some potential to take advantage of machine learning using the historical throughput of a station to alert on in conjunction with the manually set policy. Maybe automatic anomaly detection could be a cloud feature 👀
A simple "ping/pong" (a periodic exchange with the adapter) should be good enough. The adapter should run as a regular, external client (not part of the multi-container deployment).
@g41797, that doesn't address the challenge. The scenario I want to tackle here is, for example: in a certain station, every 24 hours there should be at least 100GB of produced data and 300GB of consumed data, and all of a sudden there is only 20GB in and 50GB out. It might be nothing, but it can also be a signal that something is not working. Btw, this arose from one of our customers.
A ping/pong won't help in such a scenario.
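For illustration, the drop described in the comment above can be expressed as a simple percentage deviation against the expected daily volume (the numbers come from the comment; the function itself is a hypothetical sketch, not part of Memphis):

```python
def volume_drop_pct(expected_gb: float, observed_gb: float) -> float:
    """Percentage drop of the observed daily volume relative to the expected volume."""
    return (expected_gb - observed_gb) / expected_gb * 100

# Produced: expected 100 GB, observed 20 GB -> 80% drop
print(volume_drop_pct(100, 20))   # 80.0
# Consumed: expected 300 GB, observed 50 GB -> ~83% drop
print(round(volume_drop_pct(300, 50), 2))  # 83.33
```

A ping/pong only proves the client is reachable; a deviation like this can only be caught by comparing measured throughput against a user-stated expectation.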
Description
Hey, in multiple scenarios, data stopped being produced to or consumed from a Memphis station for various reasons. On some occasions a bug was found; in others it was a client coding issue. In both scenarios there was no crash, so the clients did not write any logs. They appeared connected to Memphis, and Memphis itself did not run into an issue, so no report was made.
To overcome such scenarios and provide a higher level of observability and protection, I suggest adding a per-station ability to define a policy stating the expected range of messages per second produced to and consumed from a station, plus a deviation threshold in %. For example, "if the number of produced messages per second is 50% lower than expected", we have an issue and a notification should be sent.
That policy should be entirely defined by the users and per station; no pre-assumptions should be made.
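A per-station policy of this kind could be sketched roughly as follows. All names, fields, and the evaluation logic are hypothetical illustrations of the proposal, not an existing Memphis API:

```python
from dataclasses import dataclass

@dataclass
class StationThroughputPolicy:
    """Entirely user-defined, per station; no defaults are assumed."""
    min_produced_per_sec: float
    max_produced_per_sec: float
    min_consumed_per_sec: float
    max_consumed_per_sec: float
    deviation_threshold_pct: float  # e.g. 50.0 -> alert on a 50% drop below the minimum

def evaluate(policy: StationThroughputPolicy,
             produced_per_sec: float,
             consumed_per_sec: float) -> list[str]:
    """Return a list of alert reasons when observed rates violate the policy."""
    alerts = []
    checks = [
        ("produced", produced_per_sec,
         policy.min_produced_per_sec, policy.max_produced_per_sec),
        ("consumed", consumed_per_sec,
         policy.min_consumed_per_sec, policy.max_consumed_per_sec),
    ]
    for name, observed, lo, hi in checks:
        # Hard range check: observed rate outside the user-defined window
        if observed < lo or observed > hi:
            alerts.append(f"{name} rate {observed}/s outside [{lo}, {hi}]")
        # Deviation check: % drop relative to the expected minimum rate
        if lo > 0:
            drop_pct = (lo - observed) / lo * 100
            if drop_pct >= policy.deviation_threshold_pct:
                alerts.append(
                    f"{name} rate dropped {drop_pct:.0f}% below the expected minimum")
    return alerts
```

Usage sketch: with `StationThroughputPolicy(100, 1000, 300, 3000, 50.0)`, an observed rate of 40 produced/s would trigger both the range alert and the 60% deviation alert, and a notification would be sent through the station's configured integration.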
Involved components
Additional context
No response