redpanda-data / connect

Fancy stream processing made operationally mundane
https://docs.redpanda.com/redpanda-connect/about/

add processor "GeoLite2" #277

Open DpoBoceka opened 5 years ago

DpoBoceka commented 5 years ago

Sometimes, if we have IP addresses in our messages (especially if we are triaging web server logs), we want to enrich them with data from a GeoIP database, like this one:

https://dev.maxmind.com/geoip/geoip2/geolite2/

And here is a reader for it:

https://github.com/oschwald/maxminddb-golang

What do you think, should we expand benthos with such functionality? Of course, we could insert all that data into a cache or an SQL database and use the processors we already have, but that would be more of a workaround. Implementing this would mean another point taken from Logstash.

Jeffail commented 5 years ago

Hey @DpoBoceka, seems like a reasonable addition.

DpoBoceka commented 4 years ago

I'm on it. I think that at a later stage of implementing this we should have some sort of cache_size setting like the one Logstash uses, because it would be a waste to look up the same addresses all over again every time. Or perhaps the Linux filesystem cache would handle that and there would be no real overhead. Any words of advice?

Jeffail commented 4 years ago

Don't worry about it for now; eventually we can add a cache field to optionally point to a cache resource.
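
For reference, the caching idea discussed above can be sketched in plain Go as an in-process memoizing wrapper around a lookup function. This is only an illustration under stated assumptions: it does not use the Benthos cache resource mentioned above, and `lookupFunc`, `cachedLookup` and `newCachedLookup` are hypothetical names. Eviction here simply clears the whole map once it is full; an LRU policy would be a more likely choice in practice.

```go
package main

import (
	"fmt"
	"sync"
)

// lookupFunc stands in for a real GeoIP database query (hypothetical).
type lookupFunc func(ip string) (string, error)

// cachedLookup memoizes lookup results keyed by the IP string.
type cachedLookup struct {
	mu      sync.Mutex
	maxSize int
	entries map[string]string
	lookup  lookupFunc
}

// newCachedLookup wraps fn with a cache holding at most maxSize entries.
func newCachedLookup(maxSize int, fn lookupFunc) *cachedLookup {
	return &cachedLookup{
		maxSize: maxSize,
		entries: make(map[string]string),
		lookup:  fn,
	}
}

// Get returns the cached value for ip, falling back to the wrapped lookup.
// The lock is held during the lookup, which serializes misses; that keeps
// the sketch simple at the cost of concurrency.
func (c *cachedLookup) Get(ip string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.entries[ip]; ok {
		return v, nil
	}
	v, err := c.lookup(ip)
	if err != nil {
		return "", err
	}
	if len(c.entries) >= c.maxSize {
		// Crude eviction: drop everything once the cache is full.
		c.entries = make(map[string]string)
	}
	c.entries[ip] = v
	return v, nil
}

func main() {
	calls := 0
	c := newCachedLookup(1000, func(ip string) (string, error) {
		calls++ // counts how often the "database" is actually queried
		return "result-for-" + ip, nil
	})
	for i := 0; i < 3; i++ {
		if _, err := c.Get("203.0.113.7"); err != nil {
			fmt.Println("lookup failed:", err)
			return
		}
	}
	fmt.Println("underlying lookups:", calls) // prints "underlying lookups: 1"
}
```

Whether such a cache pays off depends on how much the memory-mapped database and the OS page cache already absorb, which is the open question raised above.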

jamesharr commented 3 years ago

Do you think this is going to make its way into benthos? Is there any work I can help with?

Jeffail commented 3 years ago

Hey @jamesharr, my plan was to adapt the processor from the existing PR into a bloblang method as it'd make it easier to compose but it's taking me a while to get around to it. If you're interested in having a go that'd be awesome, just let me know if I can help.

jamesharr commented 3 years ago

Hello Jeffail, I'm struggling to get started with this one. I took a wrong turn somewhere learning the code-base and I think I need to set it down for a little bit and pick it up again.

What all do I need to do to create a bloblang method? Is there a good example I can base some work off of?

In part, it's been a long time since I've written Go, but I also think my lack of Benthos experience probably isn't helping here. Any pointers would be helpful, thanks!

jamesharr commented 3 years ago

Hi @Jeffail,

So I have a "hello world" bloblang functioning, but not anything super useful at the moment.

I'm wondering a few things:

  1. What do you think the appropriate API would look like?

On the API topic, which makes more sense to you?

root.geo_city = this.ip_address.geoip_city()
root.geo_city.country.iso_code // == "US"
root.geo_city.country.name // == 'United States'
root.geo_city.city.name // == "Minneapolis"
// other fields as noted in https://github.com/maxmind/GeoIP2-python#city-database

root.geo_asn = this.ip_address.geoip_asn()
root.geo_asn.autonomous_system_number // == "1211"
root.geo_asn.autonomous_system_organization // == "Telstra Pty Ltd"

or how about this API?

root.geo_city = geoip_city(this.ip_address)
root.geo_asn = geoip_asn(this.ip_address)
  2. I'm not sure how to open (and keep open) the GeoIP file. This is probably where I'll need a pointer and/or example if there is one.

Jeffail commented 3 years ago

Hey @jamesharr, I would suggest taking a string argument for a file path. The constructor of a bloblang function/method gets called only once when the argument value is static, so in the case of something like foo.bar("baz") the method bar is created only once and called many times. That means you can simply read the file in the constructor and not worry about caching the result or anything, similar to the file function: https://github.com/Jeffail/benthos/blob/master/internal/bloblang/query/functions.go#L320

And I think we ought to go with the method approach as it generally looks cleaner when put at the end of a long coercion/coalesce chain:

root.foo = this.(bar | baz).string().trim().geoip_city(path: "./something/db.zip")

In my opinion that looks cleaner than:

root.foo = geoip_city(ip_address: this.(bar | baz).string().trim(), path: "./something/db.zip")

Having said all that, there are a few caveats that ought to be addressed. I'll take care of these myself afterwards; just noting them here for future reference:

jamesharr commented 3 years ago

Here's my first pass at getting a .geo_city structure.

https://github.com/Jeffail/benthos/pull/866/files

It seems to work so far, but it's missing a lot of polish. A few questions...

Blobl example:

        root = this
        let geoip_data = this.ip.geoip_city(path: "GeoLite2-City.mmdb")
        root.geoip_data = $geoip_data
        root.city_name = $geoip_data.City.Names.en # this always returns null

Output (for 2001:4860:4860::8844 / dns.google)

{
  "geoip_data": {
    "City": {
      "GeoNameID": 0,
      "Names": null
    },
    "Continent": {
      "Code": "NA",
      "GeoNameID": 6255149,
      "Names": {
        "de": "Nordamerika",
        "en": "North America",
        "es": "Norteamérica",
        "fr": "Amérique du Nord",
        "ja": "北アメリカ",
        "pt-BR": "América do Norte",
        "ru": "Северная Америка",
        "zh-CN": "北美洲"
      }
    },
    "Country": {
      "GeoNameID": 6252001,
      "IsInEuropeanUnion": false,
      "IsoCode": "US",
      "Names": {
        "de": "USA",
        "en": "United States",
        "es": "Estados Unidos",
        "fr": "États-Unis",
        "ja": "アメリカ合衆国",
        "pt-BR": "Estados Unidos",
        "ru": "США",
        "zh-CN": "美国"
      }
    },
    "Location": {
      "AccuracyRadius": 100,
      "Latitude": 37.751,
      "Longitude": -97.822,
      "MetroCode": 0,
      "TimeZone": "America/Chicago"
    },
    "Postal": {
      "Code": ""
    },
    "RegisteredCountry": {
      "GeoNameID": 6252001,
      "IsInEuropeanUnion": false,
      "IsoCode": "US",
      "Names": {
        "de": "USA",
        "en": "United States",
        "es": "Estados Unidos",
        "fr": "États-Unis",
        "ja": "アメリカ合衆国",
        "pt-BR": "Estados Unidos",
        "ru": "США",
        "zh-CN": "美国"
      }
    },
    "RepresentedCountry": {
      "GeoNameID": 0,
      "IsInEuropeanUnion": false,
      "IsoCode": "",
      "Names": null,
      "Type": ""
    },
    "Subdivisions": null,
    "Traits": {
      "IsAnonymousProxy": false,
      "IsSatelliteProvider": false
    }
  },
  "ip": "2001:4860:4860::8844"
}
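
One possible reading of the output above: the city record for this address carries no names, so the underlying Go map is nil and serializes to null, which would explain why `$geoip_data.City.Names.en` yields null. A minimal sketch of that behaviour follows, assuming the result is produced by marshalling a Go struct with encoding/json. `cityRecord` is a hypothetical type that mirrors only the shape of the "City" object shown above, not the library's actual type.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cityRecord is a hypothetical stand-in for the "City" object above.
// Without json struct tags, encoding/json uses the Go field names as keys.
type cityRecord struct {
	GeoNameID uint
	Names     map[string]string
}

// englishName returns the "en" entry; ok is false when it is absent.
func englishName(r cityRecord) (name string, ok bool) {
	name, ok = r.Names["en"] // reading from a nil map is safe in Go
	return name, ok
}

// toJSON marshals the record, returning the error text on failure.
func toJSON(r cityRecord) string {
	b, err := json.Marshal(r)
	if err != nil {
		return "marshal error: " + err.Error()
	}
	return string(b)
}

func main() {
	var empty cityRecord // no city matched: Names stays nil
	fmt.Println(toJSON(empty)) // prints {"GeoNameID":0,"Names":null}

	_, ok := englishName(empty)
	fmt.Println("has English name:", ok) // prints "has English name: false"
}
```

If that reading is right, the null is a property of the data for this address rather than of the mapping, and an address with a populated city record should return a name.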