verzulli commented 2 years ago

As for "Flow STATES" and "Flow EVENTS" (whose description is reported in the README), I'm trying to better understand what exactly they mean.

After running my realtime analyzer for ~24 hours, receiving the UDP stream of an nDPId instance running at the border of a small set of VPSs, I got these numbers:

numbers

Please note that I'm interested ONLY in "flows" tracking.

As you can see, I got 1.735.579 messages ("in"), successfully processed as JSONs ("ok"), with zero errors ("err").

From those JSONs the analyzer skipped the 2111 JSONs NOT related to flows, and focused on the other 1.733.468.

From those 1.733.468 flow JSONs, it extracted "flow_state" and "flow_event_name", combining them in a string and counting related groups.

With a counter I got the number of occurrences of those strings and, as you can see, I got:

info/new: 460232
info/detected: 429441
info/detection-update: 344050
info/not-detected: 28158
finished/end: 82356
info/guessed: 3112
info/end: 26788
finished/idle: 188938
finished/update: 7307
info/idle: 161786
info/update: 1300

whose sum is exactly 1.733.468
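
The grouping described above can be sketched in a few lines of Python. This is only an illustration, not the analyzer's actual code: it assumes each message is one JSON document per line, and that a message is flow-related exactly when it carries a `flow_event_name` key (an assumption, not something stated in this thread). The function name `count_state_event_pairs` and the sample messages are made up for the example.

```python
import json
from collections import Counter


def count_state_event_pairs(lines):
    """Count "flow_state/flow_event_name" combinations in JSON lines.

    Returns (counter, skipped), where skipped is the number of messages
    that were not treated as flow messages.
    """
    counter = Counter()
    skipped = 0
    for line in lines:
        msg = json.loads(line)
        # Assumption: only flow messages carry "flow_event_name".
        if "flow_event_name" not in msg:
            skipped += 1
            continue
        key = f'{msg.get("flow_state", "?")}/{msg["flow_event_name"]}'
        counter[key] += 1
    return counter, skipped


# Made-up sample input; the last message stands for a non-flow JSON.
sample = [
    '{"flow_state": "info", "flow_event_name": "new"}',
    '{"flow_state": "info", "flow_event_name": "detected"}',
    '{"flow_state": "finished", "flow_event_name": "end"}',
    '{"note": "not a flow message"}',
]
counts, skipped = count_state_event_pairs(sample)
print(counts, skipped)
```

The sum of all counter values plus `skipped` should equal the number of input messages, which is the same consistency check as the one done by hand above.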

I'm trying to figure out the state diagram used by nDPI, to understand exactly which event (and state) signals the termination of the activities performed by nDPI. I guess it's "finished/end".... but "info/end" puzzles me :-(

I sketched the following diagram:

state_diagram

Could you be so kind as to explain to me WHICH EVENT I should focus on, to know exactly when nDPI will finish processing a flow... so that I can expect that no other events related to that flow will be received by my analyzer?

At the moment, I'm keeping track of "everything", with an always-increasing memory-map of EVERY flow. What I want to achieve is to EXTRACT "completed flows" from such a table and forward them to the next processing stage.

Sorry if this sounds a bit cumbersome: I understand I'm not exactly clear with this request.... :-(

utoni commented 2 years ago

Thank you for your question. It shows me that I have to improve documenting my work. You can treat flow states and flow events separately.

Flow States

You can most likely skip this part. TL;DR: they are not really useful and are just a representation of nDPId's internal flow processing states.

Example (TCP connection):

1. nDPId receives a TCP-SYN: it allocates memory for internal flow tracking and libnDPI packet processing.
2. If memory allocation was successful, nDPId sets the state to info.
3. nDPId processes the next incoming packets until at least one of the following conditions is true:
   - max-packets-per-flow-to-process is reached, defaults to NDPI_DEFAULT_MAX_NUM_PKTS_PER_FLOW_TO_DISSECT and can be set via -o max-packets-per-flow-to-process=number
   - flow detection is completed and no further dissection is possible
4. nDPId sets the state to finished
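
As a rough mental model only (this is not nDPId's actual code), the transition from info to finished described above could be written like this. `FlowModel`, `on_packet` and `detection_complete` are hypothetical names, and `max_packets` merely stands in for max-packets-per-flow-to-process.

```python
from dataclasses import dataclass


@dataclass
class FlowModel:
    """Toy model of the two states described above; not nDPId's real code."""
    max_packets: int          # stands in for max-packets-per-flow-to-process
    packets_processed: int = 0
    state: str = "info"       # set once tracking memory was allocated

    def on_packet(self, detection_complete: bool) -> str:
        """Process one packet and return the resulting state."""
        if self.state == "finished":
            return self.state  # no further dissection for this flow
        self.packets_processed += 1
        if detection_complete or self.packets_processed >= self.max_packets:
            self.state = "finished"
        return self.state


flow = FlowModel(max_packets=3)
states = [flow.on_packet(detection_complete=False) for _ in range(4)]
print(states)
```

In this toy run the flow stays in info for two packets and switches to finished on the third, because the packet limit is reached before any detection completes.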

Flow Events

The most interesting part.

Example (TCP connection):

1. nDPId receives a TCP-SYN: a new event is generated
2. the first packet with layer 7 payload arrives and nDPId is able to detect the layer 7 protocol: a detected event is generated
3. more packets arrive and nDPId is able to dissect more data: subsequent detection-update events are generated
4. the flow is active for a long time (let's say an hour) and lots of data is transferred: an update event is generated
5. after several hours, nDPId receives a TCP-FIN packet: an end event is generated

ToDo

I am trying to understand what you want to achieve. If I am right, you are only interested in detected and detection-update events and you want to free/delete the dictionary/map entry as soon as possible after the detection is done (successful or not) and no more dissection is possible. Unfortunately this is not possible right now, at least not in an "immediate" way.

But a solution to this could be to send an update event right after the flow state changed to finished. That way you know immediately that no more detection/dissection will ever happen with this flow. Would that be sufficient for you?

utoni commented 2 years ago

By the way: You can ignore the flow state skipped, it is only relevant for command line options -I and -E and I will rework this soon.

verzulli commented 2 years ago

Thanks for your description: very enlightening.

I've briefly expanded it:

1. nDPId receives a packet that is unrelated to existing flows (as for the ones being tracked/processed by it, at that point in time). As such, it starts tracking a new flow and a new event is generated;
2. as soon as additional packets belonging to the previous flow are received, nDPId properly processes the related payload to detect the layer 7 protocol:

   a) when such detection succeeds, a detected event is generated. As an example, this happens when nDPI successfully detects the SNI-negotiation of a TLS-connection;

   b) (timeout? not-detected? guessed? idle?...)
3. while more packets arrive, nDPId is able to dissect more data to improve the overall visibility of the flow. Should it succeed in such improvement, one or more detection-update events are generated. As an example, this happens in TLS-connections, when the server-certificate is sent to the client, after the SNI-negotiation;
4. if the flow is active for a long time (let's say an hour) and lots of data is transferred, an update event is generated;
5. when the flow is going to be terminated (eg.: a TCP-FIN or a TCP-RST is received for a TCP-flow, or a DNS-reply is captured, for a UDP-flow) an end event is generated

Can you confirm it's correct?

More important, could you add a bit more detail about idle, guessed, and not-detected events?

verzulli commented 2 years ago

> I am trying to understand what you want to achieve.

I don't know, exactly, what I want to achieve.

What I want to achieve (at the moment), is:

"grabbing" the complete set of information related to "completed" flows (the whole set of JSONs related to those flows);
extract from those flow-JSONs a specific subset of information (eg: duration, number of packets, start and stop time, etc.) and push them in some external engine (elasticsearch/opensearch, for example);
apply some high-level visualization.

Again: I really don't know what I'm looking for. But as soon as I have a clear idea about the relationship between "events" and traffic (flow-traffic, I mean...), I bet I'll be able to start digging deeper with the analysis and... should be able to be more specific.

utoni commented 2 years ago

> Thanks for your description: very enlightening.
>
> I've briefly expanded it:
>
> 1. `nDPId` receives a packet that is unrelated to existing flows (as for the ones being tracked/processed by it, at that point in time). As such, it starts tracking a new flow and a `new` event is generated;

Correct.

> 2. as soon as additional packets belonging to the previous flow are received, `nDPId` properly processes the related payload to detect the layer 7 protocol:
>    a) when such detection succeeds, a `detected` event is generated. As an example, this happens when nDPI successfully detects the SNI-negotiation of a TLS-connection;

Correct

>    b) (_timeout_? _not-detected_? _guessed_? _idle_?...)

_timeout_ == _idle_ ;) For the rest, see below.

> 3. while more packets arrive, `nDPId` is able to dissect more data to improve the overall visibility of the flow. Should it succeed in such improvement, one or more `detection-update` events are generated. As an example, this happens in TLS-connections, when the server-certificate is sent to the client, **after** the SNI-negotiation;

Correct.

> 4. if the flow is active for a long time (let's say an hour) and lots of data is transferred, an `update` event is generated;

Correct.

> 5. when the flow is going to be terminated (eg.: a TCP-FIN or a TCP-RST is received for a TCP-flow, or a DNS-reply is captured, for a UDP-flow) an `end` event is generated

Those end events are only generated for TCP flows, because UDP is a datagram oriented protocol and thus can time out (idle) but not end.

> Can you confirm it's correct?
>
> More important, could you add a bit more detail about idle, guessed, and not-detected events?

Let's treat detected, not-detected and guessed as a oneOf relation.

"oneOf": {
  "detected": "Layer7 protocol detection succeeded.",
  "not-detected": "Layer7 protocol was not detected, either because of flow end/idle or because max-packets-per-flow-to-process was reached.",
  "guessed": "Layer7 protocol was not detected, either because of flow end/idle or because max-packets-per-flow-to-process was reached. But IP/Port based detection succeeded."
}

Similar to end and idle.

"oneOf": {
  "end": "TCP only; TCP-FIN or TCP-RST seen.",
  "idle": "Layer4 specific timeout reached"
}
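
On the consumer side, the two oneOf relations above can be checked with a small helper. This is a hedged sketch: `summarize_flow` is a hypothetical name, the input is assumed to be the list of event names already collected for a single flow, and treating a second event of the same group as an error is my reading of "oneOf", not documented nDPId behaviour.

```python
DETECTION_OUTCOMES = {"detected", "not-detected", "guessed"}
TERMINATION_EVENTS = {"end", "idle"}


def summarize_flow(event_names):
    """Return (detection outcome, termination event) for one flow's events.

    Either element is None if no event of that group was seen yet.
    Raises ValueError if more than one event of a group shows up, which
    would contradict the oneOf relation.
    """
    outcomes = [e for e in event_names if e in DETECTION_OUTCOMES]
    endings = [e for e in event_names if e in TERMINATION_EVENTS]
    if len(outcomes) > 1 or len(endings) > 1:
        raise ValueError(f"oneOf violated: {outcomes} / {endings}")
    return (outcomes[0] if outcomes else None,
            endings[0] if endings else None)


print(summarize_flow(["new", "detected", "detection-update", "end"]))
print(summarize_flow(["new", "guessed", "idle"]))
```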

verzulli commented 2 years ago

state_diagram

Here I am, again. Can you check if this schema is right?

In detail, may I assume that:

no event will be emitted BEFORE a NEW;
no event will be emitted AFTER a NOT-DETECTED, a GUESSED, an IDLE or an END
if I receive a DETECTION-UPDATE, then I surely received a previous DETECTED
after a DETECTED or a DETECTION-UPDATE, I'll surely receive some other events

My only problem, now, is related to UPDATE: where exactly does it fit, in the above schema?

utoni commented 2 years ago

If this "flow" chart is finished, I would appreciate a PR for README.md. =)

> Here I am, again. Can you check if this schema is right?
>
> In detail, may I assume that:
>
> * no event will be emitted _BEFORE_ a `NEW`;
> * no event will be emitted _AFTER_ a `NOT-DETECTED`, a `GUESSED`, an `IDLE` or an `END`

not-detected and guessed behave like detected:

new ---> not-detected ----------> idle
     `-> guessed ------>'    `--> end

> * if I receive a `DETECTION-UPDATE`, then I surely received a previous `DETECTED`
> * after a `DETECTED` or a `DETECTION-UPDATE`, I'll surely receive some other events
>
> My only problem, now, is related to UPDATE: where exactly does it fit, in the above schema?

update is a special case. It can occur anywhere between new and idle / end.
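
The ordering rules discussed so far (new comes first, idle / end come last, detection-update only follows detected, update may appear anywhere in between) can be turned into a small sanity check. This is only a sketch based on my reading of this thread: `check_event_order` is a hypothetical helper that takes the ordered event names of a single flow.

```python
TERMINAL = {"idle", "end"}


def check_event_order(events):
    """Return a list of rule violations for one flow's ordered event names."""
    problems = []
    if not events or events[0] != "new":
        problems.append("first event is not 'new'")
    seen_detected = False
    for i, name in enumerate(events):
        if name == "detected":
            seen_detected = True
        elif name == "detection-update" and not seen_detected:
            problems.append("'detection-update' before 'detected'")
        # idle / end must be the very last event of the flow.
        if name in TERMINAL and i != len(events) - 1:
            problems.append(f"event after terminal '{name}'")
    return problems


ok = ["new", "update", "detected", "detection-update", "update", "end"]
print(check_event_order(ok))
print(check_event_order(["new", "end", "update"]))
```

An empty list means the sequence is consistent with the rules; not-detected and guessed are deliberately not treated as terminal here, since they are followed by idle or end.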

utoni commented 2 years ago

There is by the way a new flow event called analysis (default disabled) which aims to provide me with extracted features required for ML.

verzulli commented 2 years ago

> There is by the way a new flow event called analysis (default disabled) which aims to provide me with extracted features required for ML.

I often spend time thinking about ML (indeed: Unsupervised Learning) applied to network traffic. I'd bet that "flows" (Netflow initially; nDPI flows, currently) could have a role there. But I'm still "investigating" (as I'm NOT an ML expert...)

My point, however, is that the "features" to extract from nDPI flows are "hard" to handle directly within the nDPId main engine. Hence, I'm a bit skeptical about analysis events, as they --I bet...-- need to compromise between richness (...so as to be helpful for training) and small/quick (...so as to not kill the nDPId engine).

That's exactly the reason why I'm reconstructing the nDPId event diagram in detail: once I'm sure to have collected ALL the events regarding a flow (aka: once I'm sure to NOT receive further events regarding a flow), then I can heavily work on extracting information from those flows and... use that information to train something.

But details are missing. Lots of details are missing (in my brain...).

verzulli commented 2 years ago

Here is an updated version. Please, review it and... as soon as it is "finished" (and EXACT), I'll publish/link it somewhere within the README.

state_diagram

utoni commented 2 years ago

> I often spend time thinking about ML (indeed: Unsupervised Learning) applied to network traffic. I'd bet that "flows" (Netflow initially; nDPI flows, currently) could have a role there. But I'm still "investigating" (as I'm NOT an ML expert...)

Same goes for me. I have no experience with Netflow and just basic ML knowledge.

> My point, however, is that the "features" to extract from nDPI flows are "hard" to handle directly within the nDPId main engine. Hence, I'm a bit skeptical about analysis events, as they --I bet...-- need to compromise between richness (...so as to be helpful for training) and small/quick (...so as to not kill the nDPId engine).

True. It consumes slightly more memory and CPU. For that reason, it is disabled by default.

> That's exactly the reason why I'm reconstructing the nDPId event diagram in detail: once I'm sure to have collected ALL the events regarding a flow (aka: once I'm sure to NOT receive further events regarding a flow), then I can heavily work on extracting information from those flows and... use that information to train something.
>
> But details are missing. Lots of details are missing (in my brain...).

There are some minor things missing:

* not-detected is missing an edge to end
* guessed is missing an edge to idle

Not sure how we can include the analysis event, since this event is sent for every flow after a fixed amount of captured flow packets. So I guess it should be placed close to update in your graph.

verzulli commented 2 years ago

Here we are, again. Could this be OK? state_diagram

P.S.: I added a title and the attribution. Hope this is OK for you...

utoni commented 2 years ago

Looks pretty good. Just one small remark: TCP flows can also time out (idle). If, for whatever reason, the client and server immediately stop sending and receiving packets, the flow needs to time out at some point to keep consistency.
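
With both end and idle treated as terminating events, the "ever-growing table" problem from the first comment could be handled roughly like this. It is a sketch under assumptions: `FlowTable` and `on_complete` are hypothetical names, and `flow_id` is assumed to be the JSON key that identifies the flow a message belongs to.

```python
class FlowTable:
    """Collect events per flow and hand over a flow once it has terminated."""

    TERMINAL = {"end", "idle"}

    def __init__(self, on_complete):
        self._flows = {}
        self._on_complete = on_complete

    def feed(self, msg):
        # Assumption: "flow_id" identifies the flow a message belongs to.
        flow_id = msg["flow_id"]
        self._flows.setdefault(flow_id, []).append(msg)
        if msg["flow_event_name"] in self.TERMINAL:
            # No further events are expected: evict and forward the flow.
            self._on_complete(flow_id, self._flows.pop(flow_id))

    def __len__(self):
        return len(self._flows)


completed = []
table = FlowTable(lambda fid, msgs: completed.append((fid, len(msgs))))
for name in ("new", "detected", "idle"):
    table.feed({"flow_id": 1, "flow_event_name": name})
table.feed({"flow_id": 2, "flow_event_name": "new"})
print(completed, len(table))
```

Flow 1 is forwarded with its three messages as soon as idle arrives, while flow 2 stays in the table until its own end or idle event shows up.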

utoni commented 2 years ago

(the correct event is called analyse, but I am open for better naming)

verzulli commented 2 years ago

I updated the schema (here is the final one) and prepared and opened a PR. flow_events_diagram

So... I'm going to close this issue! Thanks for your support!

utoni / nDPId

Details about "Flow STATES" and "Flow EVENTS" #7
