logstash-plugins / logstash-codec-netflow

Apache License 2.0

Configs for Netflow (Citrix netscalers) #54

Closed gcyre closed 7 years ago

gcyre commented 8 years ago

I'm trying to set up Logstash to receive metrics from our Netscalers and am having issues getting the right settings.

My config looks like this:

input {
  udp {
    port => 5043
    codec => netflow {
      versions => [5, 9, 10]
      netflow_definitions => "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-codec-netflow-3.1.2/lib/logstash/codecs/netflow/ipfix.yaml"
      target => ipfix
    }
    type => ipfix
  }
}

When I run:

/opt/logstash/bin/logstash agent -f /etc/logstash/conf.d/logstash-netscaler.conf --configtest

I get:

No matching template for flow id 258 {:level=>:warn}
No matching template for flow id 262 {:level=>:warn}
No matching template for flow id 280 {:level=>:warn}
No matching template for flow id 257 {:level=>:warn}

I've updated Logstash to the latest:

logstash --version
logstash 2.4.0

and updated the plugin: logstash-codec-netflow (3.1.2)

Not 100% sure I have the right setup

thanks Garry

wrigby commented 8 years ago

@gcyre How long have you let it run? Netflow/IPFIX work by periodically sending out a template message (usually every 30 seconds to a minute) which describes the format of the individual flow records. Because of this, Logstash won't be able to parse the flow records from the router until it receives the template message.

The 'No matching template' messages in the first minute or so of runtime are normal and expected, but if they continue, or there are other errors, then there's probably a bigger problem to look into.

gcyre commented 8 years ago

Hi @wrigby, I let it run for a couple of hours and those warnings never went away. Someone on our team is working through the code and templates; hoping we have a working solution this week.

thanks for your help

jorritfolmer commented 8 years ago

@gcyre Any update on this?

jorritfolmer commented 7 years ago

If you can provide me with a small pcap sample, I'll look into it. My email is in my profile. Alternatively if you managed to fix it yourself, any feedback is also greatly appreciated ❤️

gjmoed commented 7 years ago

@jorritfolmer, I am/was working on the Netscaler support with @gcyre. I briefly had to switch to a different internal project, so things stalled a bit, but I'm looking to return to this shortly and wrap things up.

I see I have to rebase for your last merge, but that should be trivial looking at the changes. More importantly, though we already took things through some prod testing and the code runs great, I still need to add 2 more tests to have complete coverage here.

My work will include the var-length support which is also waiting here in PRs from others, though one of them is seriously failing and should not be merged; I'd have to check which one. On the other hand, since I included var-length support as well (it's needed for Netscaler support) and improved things a little, maybe the others will no longer be needed. I also included tests for var-length support, but still need to anonymize the included data a bit :-)

Anyways, long story short: I'm hoping to have this out the door in a PR within the next week or so. Apologies for the delay.

jorritfolmer commented 7 years ago

Awesome!

If it helps, I use this script to anonymise .pcaps that I replay to Logstash:

#!/bin/bash
set -euo pipefail
IFS=$'\n\t'

if [ $# -lt 3 ]; then
    # with set -u, referencing an unset $3 would abort the script
    echo "usage: $0 file org_ip new_ip"
    exit 1
fi
if [ ! -f "$1" ]; then
    echo "Not found: $1"
    exit 2
fi
#
# PREP: set vars and convert ip's to network order hex
#
FILE=$1
ORG_IP=$(perl -MSocket -wE 'say unpack "H*", inet_aton ($ARGV[0]);' $2)
NEW_IP=$(perl -MSocket -wE 'say unpack "H*", inet_aton ($ARGV[0]);' $3)
#
# OPTIONAL: fix source and dest for replay to Logstash vm
#
bittwiste -I ${FILE} -O ${FILE}_ -T eth -s 00:0c:29:70:86:09 -d 00:0c:29:b9:5c:8f 
bittwiste -I ${FILE}_ -O ${FILE}__ -T ip -s 172.16.32.201 -d 172.16.32.202 
bittwiste -I ${FILE}__ -O ${FILE} -T udp -d 2055
rm -f ${FILE}__
rm -f ${FILE}_
#
# MAIN: search and replace IP in payload, then have bittwiste recalc checksum
#
xxd -ps -c 20000 ${FILE} | sed "s/${ORG_IP}/${NEW_IP}/g" |xxd -r -ps -c 20000 >${FILE}_
bittwiste -I ${FILE}_ -O ${FILE} -T udp -d 2055
rm -f ${FILE}_
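Usage would then be something like this (hypothetical script name, file and IPs):

./anon_pcap.sh sample.pcap 10.1.2.3 192.0.2.1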

adelbot commented 7 years ago

Hello all,

I have the same problem

netflow.zip

gjmoed commented 7 years ago

@jorritfolmer, tnx for the script, nice hints. I need to replace much more though: various URLs and other text. I doubt we need to be concerned with checksums, but I'll find out soon enough :-)

@adelbot, what are you trying to say? The zip contains a pcap, but what exactly is 'the problem'? :-)

gjmoed commented 7 years ago

@jorritfolmer, ok, finally got to work on this a bit more: rebased, and finished all code and tests. However, I still need to anonymize the data and then account for the changed outcomes in the tests, which should be fairly trivial. I do have a couple of days off now though, so please expect my PR by coming Monday. Hopefully you can hold any bigger changes/merges until then, hint hint :-) Tnx!

jorritfolmer commented 7 years ago

Sure, thx for the update!

jorritfolmer commented 7 years ago

@adelbot Hi, thanks for your pcap. However, it doesn't contain any templates emitted from the Netscaler. Could you try capturing a little longer? Your current pcap spans 20 seconds; 120 seconds would be better, to make sure you get the templates too.
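For example (interface name, port, and duration here are assumptions; adjust them to your Netscaler's exporter settings):

timeout 130 tcpdump -n -i eth0 -w netscaler.pcap udp port 4739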

adelbot commented 7 years ago

@jorritfolmer this new pcap file is correct; all the fields look OK in Wireshark. But I don't understand how to generate my Netscaler yaml file to export the records.

ipfix.zip

gjmoed commented 7 years ago

@adelbot, patience is a virtue, coming Monday :-) Btw, IIRC, by default Netscalers export templates every 600 secs...

adelbot commented 7 years ago

@gjmoed TKS

gjmoed commented 7 years ago

@jorritfolmer, @adelbot, please see/review #60, Tnx

adelbot commented 7 years ago

Great! I'll run new tests tomorrow on our Netscaler test setup

TKS @gjmoed

adelbot commented 7 years ago

@gjmoed: It looks GOOD!!! Many tks

A small remark: couldn't we eliminate the empty fields in the JSON?

BR AL1

gjmoed commented 7 years ago

@adelbot, according to spec, no: the templates dictate the fields in use for the data set. However, I can imagine we could introduce a simple switch for skipping 'empty' fields, though we'd have to think about the consequences for Logstash processing. I wonder what happens if you do field processing but the field doesn't even come in; then again, I guess you could also check for that in your own processing. Anyways, is it really that important? How is this bugging you?
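In the meantime, a minimal sketch of how you could drop 'empty' fields yourself in a filter (illustrative only; it assumes your flow fields live under a netflow target):

filter {
  if [type] == "ipfix" {
    ruby {
      code => "
        netflow = event.get('netflow')
        if netflow
          # drop keys whose value is nil or an empty string
          event.set('netflow', netflow.reject { |_, v| v.nil? || v == '' })
        end
      "
    }
  }
}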

adelbot commented 7 years ago

No problem with the plugin. It's just that I'm having so many difficulties extracting intelligent data in Kibana

gjmoed commented 7 years ago

There are various ways of trying to deal with the huge amounts of data, for instance something like this:

filter {
  if [type] == "ipfix" {
    ruby {
      code => "
        whitelist_names = ['flowset_id', 'sourceIPv4Address', 'destinationIPv4Address', 'netscalerRoundTripTime', 'netscalerClientRTT', 'netscalerHttpRspLen', 'netscalerHttpRspStatus', 'netscalerTransactionId', 'sourceTransportPort', 'destinationTransportPort', 'netscalerServerTTFB', 'netscalerServerTTLB', 'netscalerHttpReqUrl', 'netscalerHttpReqHost', 'netscalerHttpReqMethod', 'netscalerHttpReqXForwardedFor', 'netscalerHttpDomainName']
        event.set('netflow', event.get('netflow').select {|k,_| whitelist_names.include?(k)})
      "
    }
  }
}

You get the idea ;-)

adelbot commented 7 years ago

Yes ;-)

Is there a timestamp in the netflow protocol?

gjmoed commented 7 years ago

Not sure what sorta timestamp you're looking for; there's the logstash '@timestamp' of course, and then depending on the netflow version and/or template in use, in the case of Netscalers for instance: flowStartMicroseconds and flowEndMicroseconds....

However, let me ask you this, what are you looking to accomplish? Reason I'm asking, don't count on the Netscaler sending flows in 'sequence' or proper order.

Allow me to give another hint which may be of help:

    if [netflow][flowset_id] {
      if [netflow][flowset_id] == 258 and [netflow][netscalerHttpReqMethod] in ['GET', 'POST'] {
        aggregate {
          task_id => "%{[netflow][netscalerTransactionId]}"
          code => "some code"
          map_action => "create"
        }
      } else if [netflow][flowset_id] == 257 {
        aggregate {
          task_id => "%{[netflow][netscalerTransactionId]}"
          code => "some more code, updating whatever you started above"
          map_action => "update"
        }
      } else if [netflow][flowset_id] == 262 {
        if [netflow][netscalerHttpRspLen] > 0 {
          aggregate {
            task_id => "%{[netflow][netscalerTransactionId]}"
            code => "mostly final code to update and finish whatever you started above"
            map_action => "update"
            push_map_as_event_on_timeout => true
            #end_of_task => true
            timeout => 60
            timeout_code => "your last bits of code"
          }
        }
      }
    }

The above should give you a direction for using the proper sequencing, relying on the new switch I added for ipfix (which enables the flowset_id), plus the presence of the netscalerTransactionId.

Do note, however, that I've worked with this a little myself and I'm not happy with the current state of the aggregate plugin: it doesn't expose the Logstash event for updating until you hit the 'timeout_code', whereas I'd rather not wait for a timeout since I know when a flow ends. I might have to contribute to the aggregate filter, or write something from scratch myself; not sure yet.

If you must know, I used the above to create a new event, for a different (new) index, which I'd then also use for output to InfluxDB. In InfluxDB we'd then graph HTTP response codes per endpoint, plus TTFB and TTLB, etc etc...
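For reference, the switch mentioned above should be the codec's include_flowset_id option for ipfix (the option name is taken from the codec docs; port and target are just examples):

    input {
      udp {
        port => 4739
        codec => netflow {
          versions => [10]
          include_flowset_id => true
          target => "netflow"
        }
      }
    }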

adelbot commented 7 years ago

Tks for this. But how does aggregate work with many Logstash instances in parallel?

gjmoed commented 7 years ago

"many logstash in"? As in, multiple netflow inputs from various devices? Use your imagination and either do further filtering based on origin or use the origin as part of the same filter. Note again though, I didn't find the aggregate plugin really perfect for this, though it's the one that comes closest to wanting to do something with the sequencing of flows. Thing is, the Netscaler information is sorta scattered across various flows with different flow ids, hence the need to sequence them and compile the info upon receiving final flows. My hint was to show you as much and get you into the same direction I'm working. I haven't come up with the perfect solution either just yet, will work on that again in the new year :-) That may involve contributing to the aggregate plugin or ditching logstash and simply write something from scratch for getting flows into influx, which was really the ask at the company I work for ;-)

Anyways, this conversation really hasn't much to do with the logstash netflow codec anymore, though I understand we're all after trying to make the most of the data it exposes.

Please do let me know if there are still issues/challenges with the netflow codec itself and/or possibly still missing field types for your particular flow sets in use, I'll be happy to look at them and fix things wherever I can. Cheers!

@jorritfolmer, maybe close this issue since it likely no longer is an issue :-)

adelbot commented 7 years ago

Last question:

Why is the timestamp in seconds? Is it the timestamp when we write to ES, or the timestamp when we read the flow?

BR

gjmoed commented 7 years ago

If you look at the source :-)

LogStash::Event::TIMESTAMP => LogStash::Timestamp.at(flowset.unix_sec)

Where the flowset.unix_sec comes from the Netflow PDU (v5, v9 and v10). See here for example.

This timestamp can be different from the recorded flow start and/or flow end of course.

So to answer your question, it does come from your exporting device so it's not directly related to when we process things in logstash.

And if you're really curious, Netflow v5 also does unix_nsec and hence the timestamp processing looks slightly different:

LogStash::Event::TIMESTAMP => LogStash::Timestamp.at(flowset.unix_sec.snapshot, flowset.unix_nsec.snapshot / 1000)

adelbot commented 7 years ago

ok, fine.

I don't understand the netflow.flowStartMicroseconds/netflow.flowEndMicroseconds format:

504353-06-13T02:04:35.841Z/504353-06-13T02:04:35.841Z

gjmoed commented 7 years ago

          when /^flow(?:Start|End)(Milli|Micro|Nano)seconds$/
            divisor =
              case $1
              when 'Milli'
                1_000
              when 'Micro'
                1_000_000
              when 'Nano'
                1_000_000_000
              end
            event[@target][k.to_s] = LogStash::Timestamp.at(v.snapshot.to_f / divisor).to_iso8601

And then here for the format.

However! You may have uncovered something weird; the output doesn't look quite right and may hint at a wrong divisor in use. The weird thing is, I never really bothered to check either, since I even took the same weird values into my spec testing for this :( I'll have a closer look...

gjmoed commented 7 years ago

Okay, finally got a chance to look into this some more. It's rather awkward, since the above 'original' code was never going to work. I say 'original' because I never touched these parts. Somehow the incorrect interpretation never got tripped, because nobody ever stumbled upon any microsecond fields? Nor was there any spec testing for these ipfix microsecond fields until I submitted my Netscaler patch. However, I went wrong by ignoring the output value and simply copy-pasting whatever was found. You asking about this, @adelbot, made me look closer and notice the value doesn't make any sense at all.

Anyways, the problem: one cannot keep dividing further and further. That actually stops at milliseconds, dividing by 1000 (from a 64-bit integer). As soon as we hit micro- and nanoseconds, it's a completely different story: we then need to split the 64 bits, with the first 32 bits being the datetime in seconds and the last 32 bits being the fraction of a second. Those can then be given to LogStash::Timestamp and converted to iso8601:

LogStash::Timestamp.at(seconds, micros).to_iso8601

Which, coincidentally can also be seen in the netflow v5 and v9 decoding.

So the code part I quoted above needs patching. I'll work on that as soon as I find a bit of time and create a PR ;-) In case someone else beats me to it, feel free, but do take into account that the correct logic will trip the rspec, because I copied in completely insane values without even looking.
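For the record, a minimal sketch of what the corrected branch could look like (my illustration only, not necessarily the exact code that will land in the PR):

    when /^flow(?:Start|End)(Milli|Micro|Nano)seconds$/
      if $1 == 'Milli'
        # milliseconds really are a plain 64-bit integer, so dividing works
        event[@target][k.to_s] = LogStash::Timestamp.at(v.snapshot.to_f / 1_000).to_iso8601
      else
        # micro- and nanoseconds use an NTP-style layout instead:
        # upper 32 bits are whole seconds, lower 32 bits the fraction of a second
        ntp_ts   = v.snapshot
        seconds  = ntp_ts >> 32
        fraction = ntp_ts & 0xFFFFFFFF
        # scale the 32-bit fraction to the microseconds LogStash::Timestamp.at expects
        micros   = (fraction.to_f / 2**32 * 1_000_000).round
        event[@target][k.to_s] = LogStash::Timestamp.at(seconds, micros).to_iso8601
      end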

jorritfolmer commented 7 years ago

Yes exactly and thanks for the analysis. That code was never covered by any rspec as we're rather light on IPFIX samples. Indeed the closest we get in the IPFIX tests is systemInitTimeMilliseconds or flowStartMilliseconds.

Thanks again for your continued contributions.

gjmoed commented 7 years ago

@adelbot, @jorritfolmer, please see #61 which should hopefully fix things with the funky microseconds timestamps. Now the rspec test values make sense as well :-)

There's still some debate going around the interwebs though, since Citrix may have messed up their implementation of what should be an NTP timestamp. However, we'd better stick to the strict NTP timestamp interpretation and only have things affect the fraction part for Netscalers, so other manufacturers remain unaffected. It's really only the fraction; I wonder how precise things are anyway, which also makes you wonder why we'd need microsecond resolution in the first place.

Anyways, long story short, this should get things in the right direction.

adelbot commented 7 years ago

Yes, the date format is correct, but I have 2 questions:

Is the date UTC (it seems so)? And why do we lose the microseconds? Every date is XXX-XX-XXTXX:XX:XX.000Z.

Tks for the job

jorritfolmer commented 7 years ago

@adelbot Yes, UTC: https://tools.ietf.org/html/rfc7011#section-6.1.9. Further reading from @gjmoed's commit: https://bugs.wireshark.org/bugzilla/show_bug.cgi?id=11047. Closing this issue.