monogon-dev / NetMeta

NetMeta is a scalable network observability toolkit optimized for performance.
https://netmeta.demo.monogon.dev/
Apache License 2.0

ASN Statistics broken #72

Open vidister opened 2 years ago

vidister commented 2 years ago

In our setups with exports from Junos-based routers (using both sFlow and NetFlow), the default dashboard doesn't display any ASN statistics. When I remove the AND FlowDirection = conditions from the queries, everything works fine. Maybe our routers don't add FlowDirection attributes to the flow samples? Is it possible to remove the condition?

leoluk commented 2 years ago

That's odd - if you query some raw data, does it have any FlowDirection values at all?

kubectl exec -it chi-netmeta-netmeta-0-0-0 -c clickhouse -- clickhouse-client <<< 'select * from flows_raw limit 10 format JSONEachRow'

vidister commented 2 years ago

Okay, this is a funny one: We have two distinct (or kinda related) problems here.

Junos IPFIX

On the IPFIX Setup I get flow entries like this:

{
  "Date": "2021-10-25",
  "FlowType": "FLOWUNKNOWN",
  "SequenceNum": "3014",
  "TimeReceived": "1635174093",
  "SamplingRate": "2000",
  "FlowDirection": 255,
  "SamplerAddress": "REDACTED",
  "TimeFlowStart": "1635174074",
  "TimeFlowEnd": "1635174074",
  "Bytes": "52",
  "Packets": "1",
  "SrcAddr": "REDACTED",
  "DstAddr": "REDACTED",
  "EType": 2048,
  "Proto": 6,
  "SrcPort": REDACTED,
  "DstPort": REDACTED,
  "InIf": 528,
  "OutIf": 552,
  "SrcMac": "0",
  "DstMac": "0",
  "SrcVlan": 101,
  "DstVlan": 0,
  "VlanId": 101,
  "IngressVrfId": 0,
  "EgressVrfId": 0,
  "IPTos": 0,
  "ForwardingStatus": 0,
  "IPTTL": 55,
  "TCPFlags": 16,
  "IcmpType": 0,
  "IcmpCode": 0,
  "IPv6FlowLabel": 0,
  "FragmentId": 0,
  "FragmentOffset": 0,
  "BiFlowDirection": 0,
  "SrcAS": REDACTED,
  "DstAS": REDACTED,
  "NextHop": "REDACTED",
  "NextHopAS": 0,
  "SrcNet": 15,
  "DstNet": 24
}

So every entry has "FlowDirection": 255 (WHERE FlowDirection != 255 returns 0 rows). There's this blog post describing the issue:

They export 255 to avoid reporting the wrong flow direction when a packet is sampled by both ingress and egress PFE

https://www.plixer.com/blog/juniper-mx240-ipfix-support-direction-problems/

So an easy solution would be to match FlowDirection != 1 instead of FlowDirection == 0 and vice versa.
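A minimal sketch of that relaxed match, written in Python rather than the dashboard's actual SQL. The function and constant names are illustrative only; the values 0, 1 and 255 are the ones that appear in this thread.

```python
# Values as seen in this thread: 0 and 1 are the explicit directions
# (the dashboard treats 1 as 'out'), 255 is what Junos IPFIX exports.
INGRESS = 0
EGRESS = 1


def counts_as_ingress(flow_direction: int) -> bool:
    # Relaxed form of "FlowDirection == 0": anything not explicitly egress.
    return flow_direction != EGRESS


def counts_as_egress(flow_direction: int) -> bool:
    # Relaxed form of "FlowDirection == 1": anything not explicitly ingress.
    return flow_direction != INGRESS


if __name__ == "__main__":
    for value in (0, 1, 255):
        print(value, counts_as_ingress(value), counts_as_egress(value))
```

Note that a flow exported with 255 satisfies both predicates, so it would be counted in both the ingress and the egress view.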

Junos sFlow

Then there's a second setup using sFlow on Junos. It only has entries with "FlowDirection": 0:

$ kubectl exec -it chi-netmeta-netmeta-0-0-0 -c clickhouse -- clickhouse-client <<< 'select count(*) from flows_raw WHERE FlowDirection == 0 limit 10 format JSONEachRow'
{"count()":"463967"}

$ kubectl exec -it chi-netmeta-netmeta-0-0-0 -c clickhouse -- clickhouse-client <<< 'select count(*) from flows_raw WHERE FlowDirection != 0 limit 10 format JSONEachRow'
{"count()":"0"}

It is definitely sampling both directions, so I don't know why it sets the FlowDirection to 0. I'll grab some pcaps and try to figure out what's going on there.

But even then it should display something on the SrcASN Graph, right? Well, just Reserved-ASN 0. This is because SrcAS and DstAS are both set to 0. So I guess we have to check if the value is zero and perform another lookup in the risinfo dict.
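Roughly, the fallback could look like the sketch below. It is only an illustration: PREFIX_TO_ASN is a hypothetical stand-in for the risinfo dict (filled with documentation prefixes and ASNs), and lookup_asn / effective_asn are made-up names, not NetMeta functions.

```python
import ipaddress

# Hypothetical stand-in for the risinfo dictionary: prefix -> origin ASN.
# Prefixes and ASNs are from the documentation ranges, not real data.
PREFIX_TO_ASN = {
    ipaddress.ip_network("192.0.2.0/24"): 64496,
    ipaddress.ip_network("198.51.100.0/24"): 64497,
}


def lookup_asn(addr: str) -> int:
    """Longest-prefix match against the table; returns 0 if nothing matches."""
    ip = ipaddress.ip_address(addr)
    best_len = -1
    best_asn = 0
    for net, asn in PREFIX_TO_ASN.items():
        if ip.version == net.version and ip in net and net.prefixlen > best_len:
            best_len = net.prefixlen
            best_asn = asn
    return best_asn


def effective_asn(exported_asn: int, addr: str) -> int:
    """Keep the exporter's ASN unless it is 0, then fall back to the lookup."""
    if exported_asn != 0:
        return exported_asn
    return lookup_asn(addr)
```

With this, a flow exported with SrcAS 0 and a source address inside 192.0.2.0/24 would be attributed to 64496, while a non-zero exported ASN is left untouched.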

sFlow Dump:

{
  "Date": "2021-10-22",
  "FlowType": "FLOWUNKNOWN",
  "SequenceNum": "51378",
  "TimeReceived": "1634864823",
  "SamplingRate": "4000",
  "FlowDirection": 0,
  "SamplerAddress": "REDACTED",
  "TimeFlowStart": "1634864823",
  "TimeFlowEnd": "1634864823",
  "Bytes": "1498",
  "Packets": "1",
  "SrcAddr": "REDACTED",
  "DstAddr": "REDACTED",
  "EType": 2048,
  "Proto": 17,
  "SrcPort": REDACTED,
  "DstPort": REDACTED,
  "InIf": 542,
  "OutIf": 508,
  "SrcMac": "REDACTED",
  "DstMac": "REDACTED",
  "SrcVlan": 10,
  "DstVlan": 1,
  "VlanId": 0,
  "IngressVrfId": 0,
  "EgressVrfId": 0,
  "IPTos": 0,
  "ForwardingStatus": 0,
  "IPTTL": 63,
  "TCPFlags": 0,
  "IcmpType": 0,
  "IcmpCode": 0,
  "IPv6FlowLabel": 0,
  "FragmentId": 9561,
  "FragmentOffset": 0,
  "BiFlowDirection": 0,
  "SrcAS": 0,
  "DstAS": 0,
  "NextHop": "::",
  "NextHopAS": 0,
  "SrcNet": 0,
  "DstNet": 0
}

leoluk commented 2 years ago

Thanks for debugging!

So an easy solution would be to match FlowDirection != 1 instead of FlowDirection == 0 and vice versa.

Happy to implement this - if I understood the linked article correctly, this would result in correct-ish data by counting all ingress/egress traffic in both tables, right?

I suppose we could also implement #60 and use interface IDs to figure out the flow direction instead, which would also solve the problem with sFlow where no FlowDirection is included.

We already do something similar for the other graphs: if(FlowDirection == 1, 'out', 'in') AS FlowDirection

But even then it should display something on the SrcASN Graph, right? Well, just Reserved-ASN 0.

It definitely should do exactly that; this is what it looks like on one of my sFlow samplers:

(screenshot not preserved)

The solution here is to fill in ASN data using the risinfo dict at capture time to have proper historic data - this is already on the short-term backlog and shouldn't be hard to do.

vidister commented 2 years ago

Happy to implement this - if I understood the linked article correctly, this would result in correct-ish data by counting all ingress/egress traffic in both tables, right?

this is correct.

We already do something similar for the other graphs: if(FlowDirection == 1, 'out', 'in') AS FlowDirection

Yes, in that case every flow is labeled as "in" right now, which can be a bit confusing. What are we doing with these? We could put a third string "Unknown" in there, but I think that could also be confusing in the dashboard. Maybe leaving the string empty could do the job? Still a bit messy when mixed sources with differently broken flow export implementations are involved, but probably as good as it gets...

leoluk commented 2 years ago

Sounds to me like the "correct" solution is to fix up the data at ingestion time. Your data seems to have correct InIf and OutIf data, so presumably one could deduce the correct flow direction from that?

vidister commented 2 years ago

But how do you know if an interface is an edge port or some internal/backbone interface? Set a flag in the interfaceMap?

leoluk commented 2 years ago

Yup, that was the idea - just have a map of all interfaces and which way they're facing, possibly determined from Netbox and/or SNMP. Does that sound workable?
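Something along these lines, as a rough sketch. The map layout, the "external"/"internal" labels and the sampler address are assumptions made up for illustration (this is not the existing interfaceMap format); the interface indices 528 and 552 are borrowed from the IPFIX sample above.

```python
from typing import Optional

# Hypothetical map: (sampler address, interface index) -> which way it faces.
# In practice this might be generated from Netbox and/or SNMP.
INTERFACE_FACING = {
    ("192.0.2.1", 528): "external",
    ("192.0.2.1", 552): "internal",
}


def infer_direction(sampler: str, in_if: int, out_if: int) -> Optional[str]:
    """Return 'in', 'out', or None when the map cannot decide."""
    in_facing = INTERFACE_FACING.get((sampler, in_if))
    out_facing = INTERFACE_FACING.get((sampler, out_if))
    if in_facing == "external" and out_facing == "internal":
        return "in"
    if in_facing == "internal" and out_facing == "external":
        return "out"
    # Transit (external -> external), internal-only, or unmapped interfaces.
    return None
```

Flows between two external or two internal interfaces, or involving an unmapped interface, stay undecided here; how to count those would still need a decision.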

vidister commented 2 years ago

Not nice, but probably the best solution.

leoluk commented 2 years ago

Yeah... I don't think we can avoid it unless the device tells us the physical flow direction, which it doesn't want to...

How about having a list of "local" CIDR ranges and using that to determine direction? AS won't work but IPs might.
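A quick sketch of what that could look like. LOCAL_NETWORKS and the function names are hypothetical, and the ranges are documentation prefixes rather than anyone's real networks.

```python
import ipaddress
from typing import Optional

# Hypothetical config: the CIDR ranges considered "local".
LOCAL_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_local(addr: str) -> bool:
    ip = ipaddress.ip_address(addr)
    return any(ip.version == net.version and ip in net for net in LOCAL_NETWORKS)


def direction_from_addresses(src: str, dst: str) -> Optional[str]:
    """'in' if only the destination is local, 'out' if only the source is."""
    src_local = is_local(src)
    dst_local = is_local(dst)
    if dst_local and not src_local:
        return "in"
    if src_local and not dst_local:
        return "out"
    # Both local (internal traffic) or neither local (transit): undecided.
    return None
```

Traffic where both or neither address is local cannot be classified this way and returns None.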

vidister commented 2 years ago

This would work fine for hosting-provider-like networks but not so well for ISP networks with downstream ASNs (like ours). I think FastNetMon solves this by receiving the routes via BGP, which seems a bit overkill here.

leoluk commented 2 years ago

Hmm...we could do that! No half measures 😆

What would that look like - use BGP/BMP to figure out local networks?

vidister commented 2 years ago

Yup. BGP/BMP integration could become useful anyways. We could work with BGP communities.

leoluk commented 2 years ago

Okay, let's do that - sounds like the "correct" solution.

(was meaning to have BMP support anyway to get AS path and avoid the risinfo trick)

We could work with BGP communities.

i.e. have a config setting which communities mark "local" networks?

vidister commented 2 years ago

i.e. have a config setting which communities mark "local" networks?

Yes. And we can generalize this to use other communities for filtering as well, for example customer networks, region/city communities, etc.
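To make the idea concrete, here is a small sketch of such a setting. Everything in it is hypothetical: the COMMUNITY_LABELS mapping, the community values (built on a documentation ASN) and the function names are invented for illustration, and it assumes the communities of each route have already been received via BGP/BMP as "ASN:value" strings.

```python
from typing import Iterable, List

# Hypothetical config setting: BGP community -> label used for filtering.
COMMUNITY_LABELS = {
    "64496:100": "local",
    "64496:200": "customer",
    "64496:3001": "region-example",
}


def labels_for_route(communities: Iterable[str]) -> List[str]:
    """Labels attached to a route, based on the communities it carries."""
    return sorted({COMMUNITY_LABELS[c] for c in communities if c in COMMUNITY_LABELS})


def is_local_route(communities: Iterable[str]) -> bool:
    """True if any community on the route is configured as 'local'."""
    return "local" in labels_for_route(communities)
```

A prefix carrying 64496:100 would then count as local for direction purposes, and the other labels could feed dashboard filters.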

fionera commented 2 years ago

The ASN lookup is now fixed on the development branch by #87 / #88. Can the issue be closed then, or do we still have the FlowDirection issue?

leoluk commented 2 years ago

This is now implemented, thanks again :)

fionera commented 1 year ago

The FlowDirection issue is still open :/

fionera commented 1 year ago

The FlowDirection issue also partly affects port-mirror deployments, because the flow data that gets ingested only has one InIf/OutIf set. As a result, all graphs display the sum of all traffic from the other direction. We should probably find a way to infer the direction ourselves, since that is the broader solution to issues like this.