sflow / sflowtool

Print binary sFlow feed to ASCII, or forward it to other collectors.

Timestamps output for -l switch #12

Closed MarcinNaw closed 5 years ago

MarcinNaw commented 7 years ago

Currently, the -l switch outputs a nice CSV file containing the following columns:

['sampleType', 'agentAddress', 'inputPort', 'outputPort', 'src_MAC', 'dst_MAC', 'ethernet_type', 'in_vlan', 'out_vlan', 'src_IP', 'dst_IP', 'IP_protocol', 'ip_tos', 'ip_ttl', 'src_port', 'dst_port', 'tcp_flags', 'packet_size', 'IP_size', 'sampling_rate']

Unfortunately, it does not print the unixSecondsUTC field of the samples. I think knowing the timestamp of the packets would be very beneficial for further analysis.

Is an extension possible, maybe via an additional switch, e.g. -l -t?

Greets.

sflow commented 7 years ago

There is a plan to add a "-j" option which would output in JSON format, with a separate { ... } object on each row, representing one flow-sample or one counter-sample. I think that should take over from the -l option. Will that work for you too?

MarcinNaw commented 7 years ago

Yes, as long as the timestamp is included in the JSON output, that will suffice.

(In my case it won't be as convenient as reading in just a CSV file, but that's just a programming detail on my side.)

When will the -j option be available?

openbsod commented 6 years ago

Thank you for planned JSON-output addition, waiting for this update too.

powernap commented 6 years ago

I have a fork with -L (vs -l) implemented and compiling on TrueOS (FreeBSD), but I don't have the ability to test it right now. I have a volunteer testing it tomorrow morning, and if that works I'll submit a PR as soon as I can.

sflow commented 6 years ago

Thanks for submitting the pull-request.

Did you consider just writing out the number as an integer in unix-seconds? Since this feed is typically going to be consumed by a script, it seems likely that work will have to be done in that script to parse the timestamp string back into binary form.
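(For instance, converting an ISO-style stamp back to unix-seconds in a shell script typically means an extra call, as in this sketch assuming GNU date:

date -d '2018-06-01T12:00:00' +%s

whereas an integer timestamp needs no conversion at all.)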

Let's consider a real-world use-case, what do you do with it yourself?

(Sorry for the long delay in adding the JSON-encoded output. I think that might still be the better way forward.)

powernap commented 6 years ago

TL;DR

  1. I could go either way
  2. I think CSV users mostly expect ISO 8601-ish timestamps vs. UNIX timestamps
  3. I think there are other options for those wanting UNIX timestamps
  4. Having CSV output with SOME timestamps in the official sflowtool matters to me more than what format the timestamps are in

or, long form: I went back and forth over whether to use an ISO 8601-ish timestamp or just dump UNIX time, but it seems to me that if you're going to be parsing stuff heavily or in an automated manner after collection, where UNIX timestamps are more desirable and portable, there are other output options in sflowtool for that. (Or the coming JSON output...)

In this case in particular, I'm trying to make it possible for testers doing SNIA Emerald testing to dump stats to a CSV file and then run a tagging tool I created (powernap/tag2014) to help do data reduction and arrive at average network throughput over ports for disjointed time series. I imagine most of this processing of the final output of my tool will be done in Excel (unfortunately).

The advantage of ISO 8601-ish timestamps is that:

  1. my tool will autodetect the column via timestamp regex
  2. those not using my tool, or using the output for general purposes, who want CSV are likely to be minimally processing the output (looking at it in Excel, etc.), where an ISO 8601-ish timestamp is likely easier/more expected/etc.

I intend to extend my tool to convert the sflowtool output to rates anyway, so I'll be converting to and from time formats no matter what sflowtool outputs: if sflowtool outputs ISO 8601-ish times, then I only convert to UNIX timestamps for internal use; if sflowtool outputs UNIX time, I only convert to ISO 8601-ish timestamps for the final output.

powernap commented 6 years ago

Any further thoughts on whether the ISO 8601-ish timestamps are good enough or if reverting to UNIX time is required?

sflow commented 6 years ago

If we are going to define a "-L" option then it seems to me that it should do more than just add a timestamp. Especially when that timestamp is almost always going to be "now" (to the nearest second) and can usually be added by the next step in the pipeline:

sflowtool -l | awk -vOFS=, -- '{print strftime(),$0}'
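As an aside, strftime() here is a GNU awk extension, and it takes an optional format string, so an ISO 8601-style stamp is just as easy:

sflowtool -l | awk -vOFS=, -- '{print strftime("%Y-%m-%dT%H:%M:%S"),$0}'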

What if the -L option took arguments to say exactly what fields should be included in the CSV output? Then you could do something like this - e.g. if you were just locating sources:

sflowtool -L "FLOW,unixSecondsUTC,agent,inputPort,srcMAC,srcIP,srcIP6"

Clearly it's more work. We would probably define a lookup-table of all fields and then turn the -L argument into a bitmask that applies to them. But the end result could be more digestible than the proposed JSON output, and could burn fewer cycles. Having a field-table would likely mop up some other loose ends too.

Does it seem like the above would work for your use-case?
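In the meantime something close can be approximated from the existing -l output with awk; the column positions in this sketch follow the list quoted at the top of this issue (sampleType, agentAddress, inputPort, src_MAC, src_IP):

sflowtool -l | awk -F, -vOFS=, '$1=="FLOW"{print $1,$2,$3,$5,$10}'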

powernap commented 6 years ago

I think this is a bit of overkill, since filtering down data to relevant fields is really the better job for awk. I just wanted to add timestamp data since it is data that sflowtool was receiving but not emitting.

I believe we are envisioning very different usages for the CSV output. In the use cases I'm trying to cover, the CSV output is simply collected to a file and then used for time-correlated analysis later. This is why the timestamps are so important, and the fact that there is extra data we may not care about is of little importance. For more details on the use case I am trying to cover, see this training slide deck starting at slide 19: https://www.snia.org/sites/default/files/emerald/Training/EmeraldTraining_Feb-Mar2018/SNIAEmeraldTraining_Feb-Mar2018_SPEC_SFS2014_Within_Emerald.pdf

Apologies for the delay in response - day job has been taking priority.

sflow commented 6 years ago

The timestamps in the sFlow packets reflect the various different clocks that are running out on the agents (switches and hosts), which may or may not be reliable. So from a systems perspective it is simpler if you can use just one clock in the system - the receiver's clock. Since the transit delays and UDP stack delays are relatively short, you can get away with this. So I think in this case it would just mean changing:

sflowtool -4 -L > c:\tmp\sflowdata.txt

to something like:

sflowtool -4 -l | awk -vOFS=, -- '{print strftime(),$0}' > c:\tmp\sflowdata.txt

(or whatever timestamp format makes sense).

Of course, if the individual packet/counter samples in the sFlow feed came with accurate timestamps and there was an effective clock-sync protocol running too then it would make sense to include them. But that's a good example of the sort of thing that sFlow deliberately left out. Accurate timestamps may be valuable as an extension, but not valuable enough to be mandated in the base standard.

powernap commented 6 years ago

This seems like it is punishing those who actually set up their environment with proper time synchronization just because some people don't care to do it. Those without proper time synchronization always have the ability to use awk to add a local-time option, but why should that come at the expense of dumping the available timestamp that is already in the sFlow sample data we are dumping?

sflow commented 6 years ago

This 'uptime' timestamp has only millisecond resolution and only marks the send-time of the datagram. Unless you are sending over the WAN you will get it less than a millisecond later. An sFlow agent may choose to delay samples for up to 1000ms in order to fill a datagram (you can usually pack 5-10 samples into one datagram). So while it would probably have been OK to include it in the -l output from the start, you could also make the case that it would be misleading and redundant.

(A timestamp for each individual sample with nanosecond accuracy would be a different thing. It's been suggested before, but most agents would not be able to fill it in. It could be added as an extra optional structure that goes out with counter and flow samples.)

I actually quite like the idea of naming the fields you want, so please stay tuned for that in case I find the time...

powernap commented 6 years ago

I still think some timestamp included in the CSV output is far better than nothing. The send time of the datagram sounds like a reasonable compromise given there is no other time data available, and in my opinion it is far more philosophically correct than the local time when the sample was received. I'm not the only person who feels that way, as I was not the one who opened this issue.

Would you include this timestamp data from the datagram in the JSON output that is planned? If so, why exclude it from the CSV output? The timestamps from the datagrams are dumped in the "grep-friendly" output, so why should CSV output be excluded?

MarcinNaw commented 6 years ago

I realize this discussion is more about which timestamps to use (sender's or receiver's, and in what format) than about the specific CSV output mode. Maybe another issue is needed, since timestamps are neglected in many output modes:

sflowtool -t -r ${sflow.pcap} | tcpdump -r - -w flows.pcap

This creates a pcap file which takes neither ...

  1. ... the receiver's clock, saved directly in the pcap file,
  2. ... nor the sender's clock, set in the sflow packets that are archived in the pcap.

Just my two cents:

Using the timestamps of the agents would actually motivate administrators to fix their unsynchronised clocks. I believe the user usually expects to see the timestamp of the sensor (the agent). The senders' clocks in my case are a complete mess.

Since the packets need to traverse the network from the agents to the receiver, the receiver's clock also introduces a minor error due to this delay. As neither option is perfect, we could aim for both via a switch; using the receiver's clock should be easier to implement.

We should keep to the Unix philosophy and build a good tool suited to a single task, which means we do not need column selection as an output option. However, we could solve this the same way the Bro community did, with a separate slicing tool such as bro-cut (and accept the performance hit of one more pass over the data). I don't care about the timestamp format as long as it is not timezone-biased.
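For illustration, a bro-cut-style slice of the -l output could be as simple as this (field numbers assume the column order listed at the top of this issue: agentAddress, src_IP, dst_IP, src_port, dst_port):

sflowtool -l | cut -d, -f2,10,11,15,16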

MarcinNaw commented 5 years ago

Hey folks,

if you want the receiver's timestamps in your pcap output, just change sflowtool.c at line 1432:

- hdr.ts_sec = (uint32_t)time(NULL);
+ hdr.ts_sec = (uint32_t)sample->pcapTimestamp;
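Then rebuild and reinstall as usual (assuming the stock autotools build):

./configure && make && sudo make install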

This is enough for most of my use cases.

sflow-rt commented 5 years ago

I think it's worth clarifying how the sender timestamp is defined in sFlow version 5:

   unsigned int uptime;           /* Current time (in milliseconds since device
                                     last booted). Should be set as close to
                                     datagram transmission time as possible.
                                     Note: While a sub-agents should try and
                                           track the global sysUptime value
                                           a receiver of sFlow packets must
                                           not assume that values are
                                           synchronised between sub-agents. */

The uptime counter is not an absolute measure of time and is not useful for ordering metrics within or between agents.

Further, the spec states:

   While the sFlow Datagram structure permits multiple samples to be
   included in each datagram, the sFlow Agent must not wait for a buffer
   to fill with samples before sending the sFlow Datagram. sFlow is
   intended to provide timely information on traffic. The sFlow Agent
   may at most delay a sample by 1 second before it is required to send
   the datagram.

Network delays for UDP packets are typically orders of magnitude less than a second and can be ignored, so receiver timestamps are accurate to the second and provide a consistent representation of time across all agents and data sources.

It's much easier to rely on synchronized clocks on the sFlow receivers than on the sFlow agents, which are often implemented on low-cost embedded processors where clock drift can be significant. This drift means that time deltas computed from uptime are also prone to large errors compared to receiver timestamps.
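(For a sense of scale, as a back-of-the-envelope figure rather than anything from the spec: a typical 50 ppm crystal oscillator can drift by 50e-6 × 86400 s ≈ 4.3 seconds per day.)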

sflow commented 5 years ago

Changes have (finally) been checked in to master that add JSON output options.

sflowtool -j => compact JSON object per datagram, separated by newlines
sflowtool -J => indented JSON object per datagram, separated by empty lines
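For quick inspection, the compact form pipes straight into jq (assuming jq is available):

sflowtool -j | jq .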

Hopefully this has been done successfully without disturbing the other output formats, but please test and let me know.

I will review this issue thread to see if there is a missing time-related field. Should be no problem to add.

sflow commented 5 years ago

I also checked in changes to support a custom line-by-line CSV output where you can specify the fields you want, in the order you want. Like this:

sflowtool -4 -L localtime,srcIP,dstIP,IPProtocol,ip.tot_len

The tokens it expects here are the ones you see as field tags in the JSON output. I think this last feature finally addresses the original request, so I'm going to close this thread. However please test and let me know if it does the job.
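For example, combining it with the capture-to-file workflow discussed earlier in the thread would look something like:

sflowtool -4 -L localtime,srcIP,dstIP,IPProtocol,ip.tot_len > c:\tmp\sflowdata.csv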