nv-morpheus / Morpheus

Morpheus SDK
Apache License 2.0

[BUG]: Delay drift between kafka input data to kafka output data #1144

Open elishahaim opened 1 year ago

elishahaim commented 1 year ago

Version

streaming ransomware model

Which installation method(s) does this occur on?

No response

Describe the bug.

We have a service on the DPU that continuously extracts raw data (memory feature snapshots) and transmits it to Kafka. The time between two memory snapshots is ~5 seconds, so the ransomware detection pipeline must preprocess and run inference on each snapshot in less than 5 seconds to avoid an ever-growing delay. To test whether we are suffering from such a delay, I monitor the Kafka input to watch the input snapshot ID and, at the same time, monitor the Kafka output to watch the output snapshot ID.
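The throughput requirement above can be sketched as a small calculation. This is only an illustration under simplifying assumptions that are not from the pipeline itself: snapshots arrive exactly every 5 seconds, and they are processed strictly one at a time with a constant per-snapshot processing time. The function name `backlog_after` is hypothetical.

```python
SNAPSHOT_INTERVAL_S = 5.0  # from the report: ~5 seconds between snapshots


def backlog_after(elapsed_s: float, processing_time_s: float,
                  interval_s: float = SNAPSHOT_INTERVAL_S) -> int:
    """Snapshots produced but not yet processed after `elapsed_s` seconds.

    Assumes (hypothetically) one snapshot per `interval_s` and sequential
    processing that takes a constant `processing_time_s` per snapshot.
    """
    produced = int(elapsed_s // interval_s)
    # The pipeline cannot process more snapshots than were produced.
    processed = min(int(elapsed_s // processing_time_s), produced)
    return produced - processed


# Faster than the arrival rate: no backlog builds up.
print(backlog_after(600, processing_time_s=4.0))
# Slower than the arrival rate: the backlog grows without bound over time.
print(backlog_after(600, processing_time_s=6.0))
print(backlog_after(1200, processing_time_s=6.0))
```

Under these assumptions, a steadily growing backlog would point to per-snapshot processing time above 5 seconds, whereas a bounded, oscillating delay suggests something else, such as buffering.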

We seem to be seeing a strange phenomenon: the difference between the IDs keeps increasing, but we also receive bursts of large batches of messages on the Kafka output that shrink the difference again. So even after a long time the difference does not explode, yet we still see large delays between the current input snapshot and the output snapshot. In addition, this delay has a maximum of about 50 snapshots (3-4 minutes). In all the experiments I ran, we never exceeded the ~50-snapshot delay.

An example to illustrate: In the beginning, the input snapshot ID is 1 and the output snapshot ID is 1. After 3 minutes, the input snapshot is 36 and the output snapshot is 10. After another minute, the input snapshot is 48 and the output snapshot is 40. After another 2 minutes, the input snapshot is 72 and the output snapshot is 46. After another 2 minutes, the input snapshot is 96 and the output snapshot is 90. And so on… @bsuryadevara
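To make the example concrete, the lag can be computed directly from the numbers above. This is a minimal sketch: the elapsed minutes are accumulated from the "after another N minutes" wording, the 5-second interval is the approximate value from the report, and `lag_table` is a hypothetical helper name.

```python
SNAPSHOT_INTERVAL_S = 5  # approximate interval between snapshots, per the report

# (elapsed minutes, input snapshot ID, output snapshot ID) from the example
observations = [
    (0, 1, 1),
    (3, 36, 10),
    (4, 48, 40),
    (6, 72, 46),
    (8, 96, 90),
]


def lag_table(obs, interval_s=SNAPSHOT_INTERVAL_S):
    """Return (minute, lag in snapshots, lag in seconds) for each observation."""
    return [(minute, inp - out, (inp - out) * interval_s)
            for minute, inp, out in obs]


for minute, lag, lag_s in lag_table(observations):
    print(f"t={minute} min: lag={lag} snapshots (~{lag_s} s)")
```

In these numbers the lag alternates between 26 snapshots and single-digit values instead of growing steadily, which matches the burst pattern described.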

Minimum reproducible example

No response

Relevant log output

No response

Full env printout

No response

Other/Misc.

No response

jarmak-nv commented 1 year ago

Hi @elishahaim!

Thanks for submitting this issue - our team has been notified and we'll get back to you as soon as we can! In the meantime, feel free to add any relevant information to this issue.

elishahaim commented 1 year ago

@bsuryadevara

mdemoret-nv commented 1 year ago

@bsuryadevara Is this due to the unbounded array issue we ran into earlier?

bsuryadevara commented 1 year ago

@mdemoret-nv This issue is unrelated to the current ransomware detection pipeline within Morpheus.

Here is some context. Initially, the ransomware detection example in the Morpheus repository was deemed to be in a production-ready state. However, the Networking Business Unit (NBU) team later made significant changes to the data structure and data generation process. As a result of these changes, a new production version emerged. Instead of relying on file-based input, the system now streams snapshot messages from Kafka, which are generated by the OS inspector.

In response to Bartley's request, I assisted Haim by offering a Proof of Concept (POC) that accommodates the new production data structure for streaming input via Kafka. I made changes to the existing pipeline for this purpose. However, the feature creation and preprocessing stages did not scale well because of Dask and the creation of single-row dataframes. This is because the new version processes input snapshots sequentially, whereas previously multiple snapshots were fed into the pipeline all at once.

This drift issue is now resolved. @elishahaim is working on creating a PR for the new version of the ransomware detection pipeline example with new models (with reduced features).

mdemoret-nv commented 1 year ago

@bsuryadevara and @elishahaim Without seeing the pipeline, I can't really narrow down what could be causing this. It could be anything: batching in the pipeline, timeouts requesting services, Triton optimizing models, etc. There really is no way to narrow it down without a reproducer.
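As an illustration of just one of those possibilities (batching), here is a toy simulation, not a diagnosis of this pipeline. It assumes a hypothetical stage that buffers snapshots and only emits them once `batch_size` have accumulated; the function name and parameters are made up for the sketch and are not Morpheus APIs.

```python
def simulate_batched_stage(num_snapshots: int, batch_size: int) -> list[int]:
    """Lag (input ID minus last emitted output ID) after each arriving snapshot.

    Hypothetical model: one snapshot arrives per tick, and the stage flushes
    its whole buffer to the output only when it holds `batch_size` snapshots.
    """
    buffer = []
    last_output_id = 0
    lags = []
    for snapshot_id in range(1, num_snapshots + 1):
        buffer.append(snapshot_id)
        if len(buffer) >= batch_size:
            # Flush: every buffered snapshot appears on the output at once.
            last_output_id = buffer[-1]
            buffer.clear()
        lags.append(snapshot_id - last_output_id)
    return lags


lags = simulate_batched_stage(num_snapshots=30, batch_size=10)
print(lags)
print("max lag:", max(lags))
```

In this model the lag climbs between flushes and drops to zero on each flush, so it stays bounded by `batch_size - 1`. Any buffering stage with a size or time threshold would give a similar bounded sawtooth, which is why a reproducer is needed to tell the candidates apart.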

It sounds like a PR is on the way. Will this PR act as a reproducer for this issue? If not, can you provide a minimum reproducible example?

jarmak-nv commented 9 months ago

Removing the triage label here; I see PR #1176 is working on this, but it's been a while.

@elishahaim any plans to pick this up again in the 24.03 timeline?