Closed verzulli closed 2 years ago
Thank you for your question. It shows me that I have to improve documenting my work. You can treat flow states and flow events separately.
You can most likely skip the flow states. TL;DR: they are not really useful and are just a representation of `nDPId`'s internal flow processing states.
Example (TCP connection):
1. `nDPId` receives a TCP-SYN: it allocates memory for internal flow tracking and `libnDPI` packet processing.
2. `nDPId` sets the state to `info`.
3. `nDPId` processes the next incoming packets until at least one of the following conditions is true:
   * `max-packets-per-flow-to-process` is reached; it defaults to `NDPI_DEFAULT_MAX_NUM_PKTS_PER_FLOW_TO_DISSECT` and can be set via `-o max-packets-per-flow-to-process=number`
4. `nDPId` sets the state to `finished`.
Flow events are the most interesting part.
Example (TCP connection):
1. `nDPId` receives a TCP-SYN: a `new` event is generated.
2. `nDPId` was able to detect the layer 7 protocol: a `detected` event is generated.
3. `nDPId` is able to dissect more data: subsequent `detection-update` events are generated.
4. An `update` event is generated.
5. `nDPId` receives a TCP-FIN packet: an `end` event is generated.

I am trying to understand what you want to achieve. If I am right, you are only interested in `detected` and `detection-update` events and you want to free/delete the dictionary/map entry as soon as possible after the detection is done (successful or not) and no more dissection is possible. Unfortunately, this is not possible right now, at least not in an "immediate" way.
But a solution to this could be to send an `update` event right after the flow state changes to `finished`. That way you know immediately that no more detection/dissection will ever happen with this flow. Would that be sufficient for you?
By the way: you can ignore the flow state `skipped`; it is only relevant for the command line options `-I` and `-E`, and I will rework this soon.
Thanks for your description: very enlightening.
I've briefly expanded it:
1. `nDPId` receives a packet that is unrelated to existing flows (that is, the ones it is tracking/processing at that point in time). As such, it starts tracking a new flow and a `new` event is generated;
2. as soon as additional packets belonging to that flow are received, `nDPId` processes the related payload to detect the layer 7 protocol:
   a) when such detection succeeds, a `detected` event is generated. As an example, this happens when nDPI successfully detects the SNI negotiation of a TLS connection;
   b) (_timeout_? _not-detected_? _guessed_? _idle_?...)
3. while more packets arrive, `nDPId` is able to dissect more data to improve the overall visibility of the flow. Should it succeed, one or more `detection-update` events are generated. As an example, this happens in TLS connections when the server certificate is sent to the client, **after** the SNI negotiation;
4. if the flow is active for a long time (let's say an hour) and lots of data is transferred, an `update` event is generated;
5. when the flow is going to be terminated (e.g. a TCP-FIN or a TCP-RST is received for a TCP flow, or a DNS reply is captured for a UDP flow), an `end` event is generated.

Can you confirm it's correct?
More important, could you add a bit more detail about the `idle`, `guessed`, and `not-detected` events?
> I am trying to understand what you want to achieve.

I don't know, exactly, what I want to achieve.
What I want to achieve (at the moment), is:
Again: I really don't know what I'm looking for. But as soon as I have a clear idea about the relationship between "events" and traffic (flow traffic, I mean...), I bet I'll be able to start digging deeper with the analysis and... should be able to be more specific.
> Thanks for your description: very enlightening.
> I've briefly expanded it:
> 1. `nDPId` receives a packet that is unrelated to existing flows (that is, the ones it is tracking/processing at that point in time). As such, it starts tracking a new flow and a `new` event is generated;

Correct.

> 2. as soon as additional packets belonging to that flow are received, `nDPId` processes the related payload to detect the layer 7 protocol: a) when such detection succeeds, a `detected` event is generated. As an example, this happens when nDPI successfully detects the SNI negotiation of a TLS connection;

Correct.

> b) (_timeout_? _not-detected_? _guessed_? _idle_?...)

_timeout_ == _idle_ ;)
For the rest, see below.

> 3. while more packets arrive, `nDPId` is able to dissect more data to improve the overall visibility of the flow. Should it succeed, one or more `detection-update` events are generated. As an example, this happens in TLS connections when the server certificate is sent to the client, **after** the SNI negotiation;

Correct.

> 4. if the flow is active for a long time (let's say an hour) and lots of data is transferred, an `update` event is generated;

Correct.

> 5. when the flow is going to be terminated (e.g. a TCP-FIN or a TCP-RST is received for a TCP flow, or a DNS reply is captured for a UDP flow), an `end` event is generated.
Those `end` events are only generated for TCP flows, because UDP is a datagram-oriented protocol and thus can time out (`idle`) but not `end`.

> Can you confirm it's correct?
> More important, could you add a bit more detail about the `idle`, `guessed`, and `not-detected` events?
Let's treat `detected`, `not-detected` and `guessed` as a oneOf relation.
"oneOf": {
    "detected": "Layer7 protocol detection succeeded.",
    "not-detected": "Layer7 protocol was not detected, either because of flow end/idle or because max-packets-per-flow-to-process was reached.",
    "guessed": "Layer7 protocol was not detected, either because of flow end/idle or because max-packets-per-flow-to-process was reached. But IP/Port based detection succeeded."
}
The same applies to `end` and `idle`.
"oneOf": {
    "end": "TCP only; TCP-FIN or TCP-RST seen.",
    "idle": "Layer4 specific timeout reached"
}
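The two oneOf relations above can be checked on the consumer side. What follows is only a minimal Python sketch, not part of nDPId: it assumes the analyzer has already collected the `flow_event_name` values of one flow into a list, and the helper name `summarize_flow` is hypothetical. It also assumes the oneOf reading is strict, i.e. at most one outcome and at most one termination event per flow.

```python
from typing import List, Optional, Tuple

# Event names as used in this thread.
DETECTION_OUTCOMES = {"detected", "not-detected", "guessed"}
TERMINATION_EVENTS = {"end", "idle"}


def summarize_flow(event_names: List[str]) -> Tuple[Optional[str], Optional[str]]:
    """Return (detection outcome, termination event) for one flow.

    Either element is None while the corresponding event has not arrived.
    Raises ValueError if one of the oneOf relations is violated.
    """
    outcomes = [n for n in event_names if n in DETECTION_OUTCOMES]
    terminations = [n for n in event_names if n in TERMINATION_EVENTS]
    if len(outcomes) > 1:
        raise ValueError(f"more than one detection outcome: {outcomes}")
    if len(terminations) > 1:
        raise ValueError(f"more than one termination event: {terminations}")
    return (outcomes[0] if outcomes else None,
            terminations[0] if terminations else None)
```

For a flow that saw `new`, `guessed` and `idle`, the helper returns the pair `("guessed", "idle")`; for a flow that has only seen `new` so far, both elements are `None`.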
Here I am, again. Can you check if this schema is right?
In detail, may I assume that:
* no event will be emitted _BEFORE_ a `NEW`;
* no event will be emitted _AFTER_ a `NOT-DETECTED`, a `GUESSED`, an `IDLE` or an `END`;
* if I receive a `DETECTION-UPDATE`, then I surely received a previous `DETECTED`;
* after a `DETECTED` or a `DETECTION-UPDATE`, I'll surely receive some other events.

My only problem, now, is related to `UPDATE`: where exactly does it fit in the above schema?
If this "flow" chart is finished, I would appreciate a PR for README.md. =)
> Here I am, again. Can you check if this schema is right?
> In detail, may I assume that:
> * no event will be emitted _BEFORE_ a `NEW`;
> * no event will be emitted _AFTER_ a `NOT-DETECTED`, a `GUESSED`, an `IDLE` or an `END`

`not-detected` and `guessed` behave like `detected`:
new ---> not-detected ----------> idle
`-> guessed ------>' `--> end
> * if I receive a `DETECTION-UPDATE`, then I surely received a previous `DETECTED`
> * after a `DETECTED` or a `DETECTION-UPDATE`, I'll surely receive some other events
>
> My only problem, now, is related to `UPDATE`: where exactly does it fit in the above schema?

`update` is a special case. It can occur anywhere between `new` and `idle` / `end`.
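The ordering rules discussed so far can be written as a small checker. This is a hedged Python sketch with a hypothetical function name (`order_violations`), operating on the list of event names seen for one flow. Two of its rules are the questioner's assumptions, not something confirmed in the thread: that nothing follows `idle` / `end`, and that `detection-update` requires an earlier `detected`.

```python
from typing import List

TERMINAL_EVENTS = {"end", "idle"}


def order_violations(event_names: List[str]) -> List[str]:
    """Return human-readable ordering violations; an empty list means consistent."""
    violations = []
    if not event_names or event_names[0] != "new":
        violations.append("first event is not 'new'")
    detected_seen = False
    terminal_seen = None
    for name in event_names:
        if terminal_seen is not None:
            # Assumption: nothing is emitted after 'end' or 'idle'.
            violations.append(f"'{name}' arrived after '{terminal_seen}'")
            continue
        if name == "detected":
            detected_seen = True
        elif name == "detection-update" and not detected_seen:
            # Assumption: 'detection-update' only follows 'detected'.
            violations.append("'detection-update' without a previous 'detected'")
        elif name in TERMINAL_EVENTS:
            terminal_seen = name
        # 'update' is deliberately unconstrained between 'new' and the terminal event.
    return violations
```

Running it over every flow of a capture is a cheap way to see whether the assumed diagram matches what `nDPId` actually emits; any non-empty result points at an edge missing from the diagram.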
There is, by the way, a new flow event called `analysis` (disabled by default), which aims to provide me with the extracted features required for ML.

> There is, by the way, a new flow event called `analysis` (disabled by default), which aims to provide me with the extracted features required for ML.
I often spend time thinking about ML (indeed: Unsupervised Learning) applied to network traffic. I'd bet that "flows" (Netflow initially; nDPI flows, currently) could have a role there. But I'm still "investigating" (as I'm NOT an ML expert...)
My point, however, is that the "features" to extract from nDPI flows are "hard" to handle directly within the nDPId main engine. Hence, I'm a bit skeptical about `analysis` events, as they (I bet...) need to compromise between richness (...so as to be helpful for training) and being small/quick (...so as not to kill the nDPId engine).
That's exactly the reason why I'm reconstructing the nDPId event diagram in detail: once I'm sure to have collected ALL the events regarding a flow (aka: once I'm sure NOT to receive further events regarding a flow), then I can heavily work on extracting information from those flows and... use that information to train something.
But details are missing. Lots of details are missing (in my brain...).
Here is an updated version. Please review it and... as soon as it is "finished" (and EXACT), I'll publish/link it somewhere within the README.
> I often spend time thinking about ML (indeed: Unsupervised Learning) applied to network traffic. I'd bet that "flows" (Netflow initially; nDPI flows, currently) could have a role there. But I'm still "investigating" (as I'm NOT an ML expert...)

Same goes for me. I have no experience with Netflow and just basic ML knowledge.

> My point, however, is that the "features" to extract from nDPI flows are "hard" to handle directly within the nDPId main engine. Hence, I'm a bit skeptical about `analysis` events, as they (I bet...) need to compromise between richness (...so as to be helpful for training) and being small/quick (...so as not to kill the nDPId engine).

True. It consumes slightly more memory and CPU. For that reason, it is disabled by default.

> That's exactly the reason why I'm reconstructing the nDPId event diagram in detail: once I'm sure to have collected ALL the events regarding a flow (aka: once I'm sure NOT to receive further events regarding a flow), then I can heavily work on extracting information from those flows and... use that information to train something.
> But details are missing. Lots of details are missing (in my brain...).
There are some minor things missing:
* `not-detected` is missing an edge to `end`
* `guessed` is missing an edge to `idle`

Not sure how we can include the `analysis` event, since this event is sent for every flow after a fixed amount of captured flow packets. So I guess it should be placed close to `update` in your graph.
Here we are, again. Could this be OK?
P.S.: I added a title and the attribution. Hope this is OK for you...
Looks pretty good. Just one small remark: TCP flows can also time out (`idle`). If, for whatever reason, the client and server suddenly stop sending and receiving packets, the flow needs to time out at some point to keep consistency.
(the correct event is called `analyse`, but I am open to better naming)
I updated the schema (here is the final one) and prepared and opened a PR.
So... I'm going to close this issue! Thanks for your support!
As for "Flow STATES" and "Flow EVENTS" (whose description is reported in the README), I'm trying to better understand what exactly they mean.
After running my realtime analyzer for ~24 hours, receiving the UDP stream of an nDPId instance running at the border of a small set of VPSs, I got these numbers:
please note that I'm interested ONLY in "flows" tracking.
As you can see, I got 1.735.579 messages ("in"), successfully processed as JSONs ("ok"), with zero errors ("err").
From those JSONs the analyzer skipped the 2111 JSONs NOT related to flows, and focused on the other 1.733.468.
From those 1.733.468 flow JSONs, it extracted "flow_state" and "flow_event_name", combining them in a string and counting the related groups.
With a `show counter` I got the number of occurrences of those strings and, as you can see, their sum is exactly 1.733.468.
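The counting step described above can be sketched in a few lines of Python. This is only an illustration, not the analyzer used here: the function name `count_state_event_pairs` is hypothetical, and it assumes each message has already been reduced to one JSON object per string (any transport framing stripped). Only the keys `flow_state` and `flow_event_name` come from the text above.

```python
import json
from collections import Counter
from typing import Iterable, Tuple


def count_state_event_pairs(lines: Iterable[str]) -> Tuple[Counter, int]:
    """Count "flow_state/flow_event_name" combinations.

    Returns (counts, skipped), where skipped is the number of messages that
    carry no flow state or no flow event name (i.e. not flow related).
    """
    counts: Counter = Counter()
    skipped = 0
    for line in lines:
        msg = json.loads(line)
        state = msg.get("flow_state")
        event = msg.get("flow_event_name")
        if state is None or event is None:
            skipped += 1
            continue
        counts[f"{state}/{event}"] += 1
    return counts, skipped
```

As a sanity check, `sum(counts.values()) + skipped` should equal the number of input messages, which mirrors the "sum is exactly" check above.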
I'm trying to figure out the state diagram used by nDPI, to understand exactly which event (and state) signals the termination of the activities performed by nDPI. I guess it's "finished/end".... but an "info/end" puts me in trouble :-(
I sketched the following diagram:
Could you be so kind as to explain to me WHICH EVENT I should focus on, to know when exactly nDPI will finish processing a flow... so that I can expect that no other events related to that flow will be received by my analyzer?
At the moment, I'm keeping track of "everything", with an always-increasing memory map of EVERY flow. What I want to achieve is to EXTRACT "completed flows" from such a table and forward them to the next processing stage.
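A bounded version of such a table could look like the sketch below. It is an assumption-laden illustration, not nDPId code: the class name `FlowTable` is hypothetical, the per-flow key `flow_id` is assumed to be present in every flow message, and `end` / `idle` are assumed to be the last events of a flow, as the replies in this thread suggest.

```python
from typing import Dict, List, Optional

TERMINAL_EVENTS = {"end", "idle"}  # assumed to be the last events of a flow


class FlowTable:
    """In-memory map of open flows that shrinks as flows terminate."""

    def __init__(self) -> None:
        self._open: Dict[int, List[dict]] = {}

    def feed(self, msg: dict) -> Optional[List[dict]]:
        """Store one flow message.

        Returns every message of the flow once a terminal event arrives,
        removing the flow from the table; returns None otherwise.
        """
        flow_id = msg["flow_id"]  # assumed key name
        self._open.setdefault(flow_id, []).append(msg)
        if msg.get("flow_event_name") in TERMINAL_EVENTS:
            return self._open.pop(flow_id)
        return None

    def __len__(self) -> int:
        return len(self._open)
```

Whatever `feed` returns can be forwarded to the next processing stage, and `len(table)` gives the number of flows still open, which should stay roughly constant under steady traffic instead of growing without bound.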
Sorry if this sounds a bit cumbersome: I understand I'm not exactly clear with this request.... :-(