tumi8 / vermont

Vermont (VERsatile MONitoring Toolkit) is an open-source software toolkit for the creation and processing of network flow data.
https://www.net.in.tum.de/research/software/#vermont
GNU General Public License v2.0

Vermont does not respect flowKey/nonFlowKey configuration #112

Open nickbroon opened 6 years ago

nickbroon commented 6 years ago

While examining the code related to flow aggregation I'm not sure I understand how flowKey/nonFlowKey interacts with the aggregation. I would assume that any field configured as nonFlowKey should be aggregated, and those configured as flowKey not aggregated.

The FlowHashtable::aggregateFlow function that performs the actual aggregation does not appear to take the key status of a field into consideration. It bases the choice to aggregate on the return value of isToBeAggregated(), which appears to use a fixed table of `type.id` values to determine this.

flowKey/nonFlowKey appears to be used only in AggregatorBaseCfg::readNonFlowKeyRule and AggregatorBaseCfg::readFlowKeyRule, which set ruleField->modifier = Rule::Field::AGGREGATE or ruleField->modifier = Rule::Field::KEEP. ruleField->modifier is then used in FlowHashtable::copyData while building a flow that is considered for insertion into, or aggregation with, the hashtable, but AGGREGATE and KEEP are not treated any differently there.

I simply don't see how the flowKey/nonFlowKey configuration affects how flows are aggregated together when a flow is found in the hash table.
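
To make the distinction concrete, here is a simplified model of the behaviour described above. It is not the actual Vermont source: `Modifier`, `Field`, and both function names are hypothetical stand-ins, and the set of type IDs in the fixed table is only illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for Rule::Field and its modifier.
enum class Modifier { Keep, Aggregate };

struct Field {
    std::uint16_t typeId;  // IPFIX Information Element ID
    Modifier modifier;     // set from flowKey / nonFlowKey in the config
};

// Models a decision driven by a fixed table of type IDs.
// The IDs listed are illustrative (octetDeltaCount = 1,
// packetDeltaCount = 2, tcpControlBits = 6), not Vermont's actual list.
bool isToBeAggregatedFixed(const Field& f) {
    switch (f.typeId) {
        case 1: case 2: case 6: return true;
        default: return false;
    }
}

// Models a decision driven by the configured modifier instead.
bool isToBeAggregatedConfigured(const Field& f) {
    return f.modifier == Modifier::Aggregate;
}
```

In this model, a field such as egressInterface (IPFIX ID 14) that is configured as nonFlowKey is still not aggregated by the fixed-table variant, because only the type ID is consulted and the modifier is ignored.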

(Originally discussed here: https://github.com/tumi8/vermont/issues/108#issuecomment-389514397)

muenz commented 6 years ago

I think when the aggregators were implemented, they were not meant to be flexible enough to support arbitrary fields as flow key or non-flow key fields. I admit that the configuration is confusing, and the documentation suggests more than Vermont can provide.

In practice, however, this limitation is of little relevance. If packet header fields like IP addresses, port numbers etc. are configured for a flow record, they are always keys. If not, you would need to come up with an aggregation scheme to aggregate different IP addresses, for example. If you need this, feel free to implement it. On the other hand, attributes like packet size are typically summed up and not used as a key. So, they are always "aggregated".

Maybe it is easier to correct the documentation of Vermont :)

nickbroon commented 6 years ago

The problem arises when a field other than the traditional 5-tuple key fields is configured as nonFlowKey. In my case, fields such as output interface or applicationID are treated as flowKey without any error or warning, which results in drastically more flows being created than expected.

If the behaviour is to remain as it is, then in addition to updating the documentation I think the config system needs to be updated: either remove the flowKey/nonFlowKey options and replace them with a simple list of the fields to collect, or, if flowKey/nonFlowKey is to remain in the config, print an error or warning when a field configured as nonFlowKey is not in the current fixed list of fields that are aggregated.

Better yet would be to actually implement the flowKey/nonFlowKey behaviour. As you mentioned, many fields don't have a meaningful aggregation semantic (unlike, say, packet size, which is simply accumulated, or TCP flags, which are OR'd together), but the traditional semantic given to these (and what Cisco/Juniper NetFlow/IPFIX probes do) is to simply take the value from the first packet in the flow.
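
The flow-count effect can be illustrated with a minimal sketch. This is not Vermont code: the `Packet` struct, its field names, and `countFlows` are hypothetical, and the hashtable is replaced by a plain `std::map` for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical packet summary; field names are illustrative, not Vermont's.
struct Packet {
    std::uint32_t srcIp, dstIp;
    std::uint16_t srcPort, dstPort;
    std::uint8_t  protocol;
    std::uint32_t egressInterface;  // intended as a non-key field
    std::uint64_t octets;
};

// Count the distinct flows produced for a packet stream. When
// interfaceIsKey is true, the egress interface becomes part of the key,
// which models a nonFlowKey field being silently treated as a flowKey.
std::size_t countFlows(const std::vector<Packet>& packets, bool interfaceIsKey) {
    using Key = std::tuple<std::uint32_t, std::uint32_t, std::uint16_t,
                           std::uint16_t, std::uint8_t, std::uint32_t>;
    std::map<Key, std::uint64_t> flows;  // key -> summed octets
    for (const Packet& p : packets) {
        Key k{p.srcIp, p.dstIp, p.srcPort, p.dstPort, p.protocol,
              interfaceIsKey ? p.egressInterface : 0u};
        flows[k] += p.octets;  // packet size is simply accumulated
    }
    return flows.size();
}
```

With three packets that share the same 5-tuple but leave through three different interfaces, this sketch yields one flow when the interface is a non-key field and three flows when it is treated as a key.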

nickbroon commented 5 years ago

The default aggregation behaviour for information elements configured as 'non-key' should be to take the value from the first packet/flow.

From RFC 6728:

For example, if a non-key field specifies an Information Element
   whose value is determined by the first packet observed within a Flow
   (which is the default rule according to [RFC5102] unless specified
   differently in the description of the Information Element), this
   field MUST be included in the resulting Flow Record if it can be
   determined from the first packet of the Flow.

That is, BaseHashtable::isToBeAggregated() (https://github.com/tumi8/vermont/blob/26d486439dd7057cfa79620ea573af3a050df1f4/src/modules/ipfix/aggregator/BaseHashtable.cpp#L329) should consult the configuration instead of a hard-coded list of supported fields; the function would simply return whether the field is configured as key or non-key. Importantly, PacketHashtable::aggregateField() (https://github.com/tumi8/vermont/blob/aceda692cfb6166e2d53ab1f24b2cd14ddb21bd9/src/modules/ipfix/aggregator/PacketHashtable.cpp#L1124) should be changed so that its default behaviour is to keep the value from the first packet. After that, the change in #114 to check the configuration can be removed.
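
A minimal sketch of the proposed default, under stated assumptions: `AggRule`, `ruleFor`, `NonKeyValues`, and `aggregateInto` are hypothetical names, fields are identified by their IPFIX Information Element names, and values are reduced to 64-bit integers. It is not how Vermont's hashtables are implemented.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical aggregation rules; names are illustrative, not Vermont's.
enum class AggRule { Sum, BitwiseOr, KeepFirst };

// Fields with a well-known aggregation semantic. Anything else falls
// back to KeepFirst, the first-packet default quoted above from RFC 6728.
inline AggRule ruleFor(const std::string& field) {
    static const std::map<std::string, AggRule> known = {
        {"octetDeltaCount",  AggRule::Sum},
        {"packetDeltaCount", AggRule::Sum},
        {"tcpControlBits",   AggRule::BitwiseOr},
    };
    auto it = known.find(field);
    return it == known.end() ? AggRule::KeepFirst : it->second;
}

// Non-key values of a flow record, indexed by field name.
using NonKeyValues = std::map<std::string, std::uint64_t>;

// Merge the non-key values of a new packet into an existing flow record.
void aggregateInto(NonKeyValues& flow, const NonKeyValues& packet) {
    for (const auto& [field, value] : packet) {
        auto [it, inserted] = flow.emplace(field, value);
        if (inserted) continue;  // first observation: value is stored as-is
        switch (ruleFor(field)) {
            case AggRule::Sum:       it->second += value; break;
            case AggRule::BitwiseOr: it->second |= value; break;
            case AggRule::KeepFirst: break;  // keep the first packet's value
        }
    }
}
```

In this sketch, any field configured as non-key gets a defined behaviour without needing its own entry in a fixed table: counters are summed, TCP flags are OR'd, and everything else, such as egressInterface or applicationId, keeps the value seen in the first packet.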