vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Significantly update MMDB geoip enrichment data returns #19995

Closed johnhtodd closed 7 months ago

johnhtodd commented 7 months ago

Use Cases

The current methodology around using GeoIP lookups with MMDB files seems to be quite limited in capability, and does not allow for custom files which contain field names that are not hard-coded in Vector source code. We perform several MMDB lookups in our pipeline (three in some cases), and having this decreased to just one lookup per event would be ideal. To do that, we would need to "merge" the MMDB files with each other, and then Vector would need to be flexible enough to return an arbitrary set of fields from the single lookup. Also, we find that the existing MMDB lookups do not return the full set of data in the record, even when that would be useful.

The existing MMDB lookup method is quite inflexible, and forces specific choices of what fields are returned, and even looks at file names to make assumptions about what kind of fields should be sent back in the response. This does not allow us to extend existing MMDB tables with more information without really ugly overloading of field names, and even that is not workable past a certain point.

Attempted Solutions

Tried overloading field names; this is awful and has a limited ramp as more complexity encroaches. Also, there is no other way to get other fields without code changes.

Proposal

We would propose a much more generic method of doing MMDB lookups and data returns, which uses the format of the MMDB file to define the data structure handed back. This seems like it would be less work than the existing method, so I'm not sure what the intent was of the original effort that resulted in the way it works today. There are field names implicit within the MMDB file, and these could simply be exported as a Vector "default" object set into memory. This would be much more flexible by allowing arbitrary database fields to be named, and then the management and parsing of those fields would be done within Vector as all fields are currently managed.

Here would be an example future configuration idea:

enrichment_tables:
  GeoIP2-City:
     path: /etc/vector/maxmind/GeoIP2-City_20231117/GeoIP2-City.mmdb
     type: geoip2

transforms:
 .  .  . 
  .GeoData, err = get_enrichment_table_record("GeoIP2-City", { "ip": ."sourceAddress" })
 .  .  . 

So a lookup for (as an example) 128.151.224.17 would result in an object from the MMDB lookup in the GeoIP2-City.mmdb file inserted into the .GeoData object that resembles this:

{
  "city": {
    "geoname_id": "5112703",
    "names": {
      "en": "Churchville"
    }
  },
  "continent": {
    "code": "NA",
    "geoname_id": 6255149,
    "names": {
      "de": "Nordamerika",
      "en": "North America",
      "es": "Norteamérica",
      "fr": "Amérique du Nord",
      "ja": "北アメリカ",
      "pt-BR": "América do Norte",
      "ru": "Северная Америка",
      "zh-CN": "北美洲"
    }
  },
  "country": {
    "geoname_id": 6252001,
    "iso_code": "US",
    "names": {
      "de": "Vereinigte Staaten",
      "en": "United States",
      "es": "Estados Unidos",
      "fr": "États Unis",
      "ja": "アメリカ",
      "pt-BR": "EUA",
      "ru": "США",
      "zh-CN": "美国"
    }
  },
  "location": {
    "accuracy_radius": 20,
    "latitude": 43.078,
    "longitude": -77.8375,
    "metro_code": 538,
    "time_zone": "America/New_York"
  },
  "postal": {
    "code": "14428"
  },
  "registered_country": {
    "geoname_id": 6252001,
    "iso_code": "US",
    "names": {
      "de": "Vereinigte Staaten",
      "en": "United States",
      "es": "Estados Unidos",
      "fr": "États Unis",
      "ja": "アメリカ",
      "pt-BR": "EUA",
      "ru": "США",
      "zh-CN": "美国"
    }
  },
  "subdivisions": [
    {
      "geoname_id": 5128638,
      "iso_code": "NY",
      "names": {
        "de": "New York",
        "en": "New York",
        "es": "Nueva York",
        "fr": "New York",
        "ja": "ニューヨーク州",
        "pt-BR": "Nova Iorque",
        "ru": "Нью-Йорк",
        "zh-CN": "纽约州"
      }
    }
  ]
}

(Note: a developer of ours wrote some Python code to produce this output directly from MMDB files - https://github.com/sbng/mrt2mmdb/ - use the "lookup.py" script like this: "python3 lookup.py --mmdb /etc/vector/maxmind/GeoIP2-City_20231117/GeoIP2-City.mmdb --ipaddress 128.151.224.17" - you may need to alter single quotes to double quotes. You can also use mmdbctl (https://github.com/ipinfo/mmdbctl) for queries.)

This would be a breaking change, and is different enough to the old method that it would seem to warrant a new enrichment method entirely. I proposed "geoip2" but perhaps "mmdb" needs to start appearing in the name to be less ambiguous.

References

Additionally: there was discussion of the use of "rayon" to increase performance in the original issue (https://github.com/vectordotdev/vector/issues/847) - was that used? It was more than 3x improvement in speed. Our reason for looking at custom MMDB files is to summarize multiple lookups into one, or at most two lookups for the sake of speed, and anything else to make MMDB lookups faster would be appreciated.

Historical: Original PR: https://github.com/vectordotdev/vector/pull/1015 Further work: https://github.com/vectordotdev/vector/issues/1372

Version

vector 0.37.0 (x86_64-unknown-linux-gnu e2d8ad4 2024-03-02 04:01:47.120115067)

jszwedko commented 7 months ago

Thanks for the detailed request @johnhtodd ! I'm not terribly familiar with the mmdb format, but my quick investigation of the format seems to validate what you've written: records can have arbitrary fields. I think the difficulty would be that it appears that the names of the fields aren't embedded in the database structure itself, from what I can see, so I think this would need to be provided to Vector. All that the database has to indicate the structure of each record is the database_type field. I agree that this is different enough from the geoip lookups that it could be a different enrichment table type (mmdb).

The mmdb reader we use has the geoip database record structures "hardcoded" for each GeoIP database type (city, ISP, etc.): https://github.com/oschwald/maxminddb-rust/blob/9045bfe1ac1fa3e7a29258692a03ea6ab2e01069/src/maxminddb/geoip2.rs#L5-L90. That's how we can pull out the fields with names like autonomous_system_number.

> Additionally: there was discussion of the use of "rayon" to increase performance in the original issue (https://github.com/vectordotdev/vector/issues/847) - was that used? It was more than 3x improvement in speed. Our reason for looking at custom MMDB files is to summarize multiple lookups into one, or at most two lookups for the sake of speed, and anything else to make MMDB lookups faster would be appreciated.

The benchmark mentioned in that issue was just demonstrating that executing lookups in parallel with rayon is faster than serial lookups. This actually should already be happening by virtue of Vector executing remap transforms concurrently.

johnhtodd commented 7 months ago

I saw that comment from earlier in a prior MMDB thread about field names not being included in MMDB files - I don't think that is correct. The MMDB files I'm using here certainly do have the field names embedded in them, so I don't know where that impression came from. The parsers we wrote (as well as mmdbctl) create the structure of the file from the embedded field names for output.

Example proofs (these are field names of an example GeoIP2-City database):

root@dev01:/etc/vector/maxmind/GeoIP2-City_20231117# strings GeoIP2-City.mmdb |grep geoname_id
DcodeBASJgeoname_id
root@dev01:/etc/vector/maxmind/GeoIP2-City_20231117# strings GeoIP2-City.mmdb |grep longitude
4Ilongitudeh@\n5?|
root@dev01:/etc/vector/maxmind/GeoIP2-City_20231117# strings GeoIP2-City.mmdb |grep accuracy_radius
Oaccuracy_radius
root@dev01:/etc/vector/maxmind/GeoIP2-City_20231117#

...and for a custom MMDB that we built with mrt2mmdb, using an unmodified version of mmdbctl to export the data. The first line is the field names, and there is no magic location where mrt2mmdb is getting this field name data: we arbitrarily created the file with field names "path" and "prefix", which don't appear in any MMDB libraries, so those names are being pulled from the MMDB file. (It is easier to show this than the code that we used to generate the file, which I linked above.)

root@dev01:/tmp# mmdbctl export ams-20230306.out.mmdb|more
range,autonomous_system_number,autonomous_system_organization,path,prefix
1.0.0.0/24,13335,CLOUDFLARENET,42 13335,1.0.0.0/24
1.0.4.0/24,38803,Wirefreebroadband Pty Ltd,42 3356 174 7545 2764 38803,1.0.4.0/22

On the rayon topic: OK, great - so rayon is not relevant since Vector is already parallelizing these events. I didn't have context on that. Thanks!

In short: I think this can be done entirely self-contained with just the MMDB file as the single input, which could then define a full object of results including all key names.

jszwedko commented 7 months ago

> The MMDB files I'm using here certainly do have the field names embedded in them, so I don't know where that impression came from.

This impression comes from the spec doc they have: https://maxmind.github.io/MaxMind-DB/#:~:text=a%20double%20instead.-,Data%20Field%20Format,-Each%20field%20starts

Maybe the geoip databases are all storing a "map" as the record though which would allow for key/values. For the custom databases is that what you are doing? Using a "map" as the record?
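
For reference, the `strings` output earlier in the thread looks consistent with that guess. Per the Data Field Format section of the spec linked above, each field starts with a control byte whose top three bits give the type and whose low five bits give the payload size, so the `D`, `J` and `O` in front of `code`, `geoname_id` and `accuracy_radius` read as "UTF-8 string of 4, 10 and 15 bytes". What follows is a minimal Python sketch of that decoding, not code from Vector or maxminddb-rust. The `decode` helper and the hand-built `sample` bytes are hypothetical, and pointers, extended types and payload sizes of 29 or more are deliberately left out.

```python
import struct

# Type numbers as listed in the MaxMind DB spec ("Data Field Format").
TYPE_STRING, TYPE_DOUBLE, TYPE_UINT16, TYPE_UINT32, TYPE_MAP = 2, 3, 5, 6, 7


def decode(buf, pos=0):
    """Decode one data field starting at buf[pos]; return (value, next_pos).

    Sketch only: handles strings, doubles, uint16/uint32 and maps with
    sizes below 29. Pointers and extended types are not implemented.
    """
    ctrl = buf[pos]
    type_num, size = ctrl >> 5, ctrl & 0x1F  # top 3 bits: type, low 5: size
    pos += 1
    if size >= 29:
        raise ValueError("multi-byte sizes are not handled in this sketch")

    if type_num == TYPE_MAP:
        # For a map, "size" is the number of key/value pairs, and each key
        # is itself a string field stored inline in the data section.
        record = {}
        for _ in range(size):
            key, pos = decode(buf, pos)
            value, pos = decode(buf, pos)
            record[key] = value
        return record, pos

    payload = bytes(buf[pos:pos + size])
    pos += size
    if type_num == TYPE_STRING:
        return payload.decode("utf-8"), pos
    if type_num == TYPE_DOUBLE:
        return struct.unpack(">d", payload)[0], pos
    if type_num in (TYPE_UINT16, TYPE_UINT32):
        return int.from_bytes(payload, "big"), pos
    raise ValueError(f"type {type_num} is not handled in this sketch")


# Hand-built bytes (hypothetical, not read from a real database file).
sample = (
    b"\xe2"              # map (type 7) holding 2 key/value pairs
    b"\x44code"          # 0x44 is 'D': string of 4 bytes
    b"\x42NA"            # 0x42 is 'B': string of 2 bytes
    b"\x4ageoname_id"    # 0x4a is 'J': string of 10 bytes
    b"\xc3\x5f\x72\x2d"  # uint32 (type 6) stored in 3 bytes
)
record, _ = decode(sample)
print(record)
```

If the records in a given file are maps like this, the key names travel with every record, which would explain why mmdbctl can print column headers without any sidecar schema.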

johnhtodd commented 7 months ago

I'm not sure how that works (can't really dig into the code right now; I didn't write our parser) but I know that mmdbctl and our tools can both suss out the field names from the MMDB files without any difficulty or sidecar information, from both MMDB files that we create with custom names, as well as the files provided by MaxMind. I'm hopeful that this is enough to allow single-file consumption/mapping of names without a sidecar map file or schema definition built into the config in Vector, but we'll figure that out as we build the patch.

johnhtodd commented 7 months ago

A comment on configuration: I am very suspicious of code that uses the hardcoded filename as the method to determine what kind of file is being accessed. Making those assumptions based on a dot-separated three-letter suffix might be acceptable, but basing parsing logic on the starting part of the name seems very fragile and easy to misinterpret. Currently, the patch looks at the filename and says "If the file doesn't contain the starting characters of "GeoLite2-ASN", "GeoIP2-ISP", "GeoIP2-Connection-Type", or "GeoIP2-City" then it must be a special custom file and we'll treat it differently", even though this new special method should be able to ingest any of those other types equally well, or be more complete than the current method. Could we create a new "type" indicator that doesn't care about mmdb filenames? Perhaps saying "type: custom-mmdb" would allow us to use this new parsing method on any file, no matter what it has for a name, and would make the documentation easier to explain as well.

You asked about the intent of the original request. I haven't (yet) tested the patch and I'm somewhere that I can write comments but not run code, so I'll spend some time doing that instead to explain the intent a bit more by way of examples, and you can see if what has been written matches the hoped-for outcome.

Based on your example file, which I'll simplify here:

/tmp# mmdbctl export custom-type.mmdb
range,hostname,nested
8.8.8.0/24,google,"{""hostname"":""google"",""original_cidr"":""8.8.8.8/24""}"
208.192.1.0/24,vectortest,"{""hostname"":""vectortest"",""original_cidr"":""208.192.1.2/24""}"

...I would expect this configuration:

enrichment_tables:
  CustomStuff:
     path: /tmp/custom-type.mmdb
     type: custom-mmdb

transforms:
  . . .
    .geoData, err = get_enrichment_table_record("CustomStuff", { "ip": "8.8.8.8" })
 . . . 

...to result exactly in this data:

.geoData.hostname = "google"
.geoData.nested.hostname = "google"
.geoData.nested.original_cidr = "8.8.8.8/24"
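
As a rough illustration of that expectation, the Python sketch below takes the mmdbctl CSV export shown above, finds the row whose range contains the address, parses the JSON-encoded nested column, and prints the resulting paths. It is only a stand-in for the proposed behaviour: the `lookup` and `flatten` helpers are hypothetical, and a linear scan over exported rows replaces the real MMDB search tree.

```python
import csv
import io
import ipaddress
import json

# The export from above, pasted verbatim.
EXPORT = '''range,hostname,nested
8.8.8.0/24,google,"{""hostname"":""google"",""original_cidr"":""8.8.8.8/24""}"
208.192.1.0/24,vectortest,"{""hostname"":""vectortest"",""original_cidr"":""208.192.1.2/24""}"
'''


def lookup(export_text, ip):
    """Return the record whose range contains ip, or None if nothing matches."""
    addr = ipaddress.ip_address(ip)
    for row in csv.DictReader(io.StringIO(export_text)):
        network = ipaddress.ip_network(row.pop("range"))
        if addr in network:
            # mmdbctl writes nested maps and arrays as JSON text in a CSV cell.
            return {
                key: json.loads(cell) if cell.startswith(("{", "[")) else cell
                for key, cell in row.items()
            }
    return None


def flatten(value, path):
    """Yield (path, leaf_value) pairs for every leaf under value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{path}.{key}")
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, f"{path}[{index}]")
    else:
        yield path, value


record = lookup(EXPORT, "8.8.8.8")
for path, value in flatten(record, ".geoData"):
    print(f"{path} = {json.dumps(value)}")
```

The same `flatten` helper would also cover array fields such as `subdivisions` in the MaxMind city database, since list elements become indexed path segments.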

Here is a snippet of one of our custom MMDB files that has BGP data in it, along with some path data that we customized:

/tmp/mrt# mmdbctl export ams-20230306.out.mmdb|more
range,autonomous_system_number,autonomous_system_organization,path,prefix
1.0.0.0/24,13335,CLOUDFLARENET,42 13335,1.0.0.0/24
1.0.4.0/24,38803,Wirefreebroadband Pty Ltd,42 3356 174 7545 2764 38803,1.0.4.0/22
1.0.5.0/24,38803,Wirefreebroadband Pty Ltd,42 2914 6453 7545 2764 38803,1.0.5.0/24

So using this MMDB file in the "CustomStuff" enrichment, I'd get this when I looked up 1.0.0.1:

.geoData.autonomous_system_number = "13335"
.geoData.autonomous_system_organization = "CLOUDFLARENET"
.geoData.path = "42 13335"
.geoData.prefix = "1.0.0.0/24"

And another, more complex example from MaxMind with the same lookup, if I specified the GeoIP2-City.mmdb file as a custom-mmdb input (note that this is a slightly different result than what happens today with the filename-specific ingestion model):

/tmp/GeoIP2-City_20231117# mmdbctl export GeoIP2-City.mmdb |more
range,city,continent,country,location,postal,registered_country,subdivisions
1.0.0.0/24,"{""geoname_id"":2158177,""names"":{""de"":""Melbourne"",""en"":""Melbourne"",""es"":""Melbourne"",""fr"":""Melbourne"",""ja"":""メルボルン"",""pt-BR"":""Melbourne"",""ru"":""Мельбурн"",""zh-CN"":""墨尔本""}}","{""code"":""OC"",""geoname_id"":6255151,""names"":{""de"":""Ozeanien"",""en"":""Oceania"",""es"":""Oceanía"",""fr"":""Océanie"",""ja"":""オセアニア"",""pt-BR"":""Oceania"",""ru"":""Океания"",""zh-CN"":""大洋洲""}}","{""geoname_id"":2077456,""iso_code"":""AU"",""names"":{""de"":""Australien"",""en"":""Australia"",""es"":""Australia"",""fr"":""Australie"",""ja"":""オーストラリア"",""pt-BR"":""Austrália"",""ru"":""Австралия"",""zh-CN"":""澳大利亚""}}","{""accuracy_radius"":1000,""latitude"":-37.5297,""longitude"":144.9586,""time_zone"":""Australia/Melbourne""}","{""code"":""3064""}","{""geoname_id"":2077456,""iso_code"":""AU"",""names"":{""de"":""Australien"",""en"":""Australia"",""es"":""Australia"",""fr"":""Australie"",""ja"":""オーストラリア"",""pt-BR"":""Austrália"",""ru"":""Австралия"",""zh-CN"":""澳大利亚""}}","[{""geoname_id"":2145234,""iso_code"":""VIC"",""names"":{""en"":""Victoria"",""pt-BR"":""Vitória"",""ru"":""Виктория""}}]"

I would expect a lookup on 1.0.0.1 in that database to produce results that look like this (non-exhaustive example - the actual object would contain ALL results from the lookup but I only show a few for brevity):

.geoData.city.names.en = "Melbourne"
.geoData.continent.code = "OC"
.geoData.location.latitude = -37.5297
.geoData.location.longitude = 144.9586
.geoData.continent.names.en = "Oceania"
.geoData.subdivisions[0].names.en = "Victoria"

jszwedko commented 7 months ago

> A comment on configuration: I am very suspicious of code that uses the hardcoded filename as the method to determine what kind of file is being accessed. Making those assumptions based on a dot-separated three-letter suffix might be acceptable, but basing parsing logic on the starting part of the name seems very fragile and easy to misinterpret. Currently, the patch looks at the filename and says "If the file doesn't contain the starting characters of "GeoLite2-ASN", "GeoIP2-ISP", "GeoIP2-Connection-Type", or "GeoIP2-City" then it must be a special custom file and we'll treat it differently", even though this new special method should be able to ingest any of those other types equally well, or be more complete than the current method. Could we create a new "type" indicator that doesn't care about mmdb filenames? Perhaps saying "type: custom-mmdb" would allow us to use this new parsing method on any file, no matter what it has for a name, and would make the documentation easier to explain as well.

I don't think the patch looks at the file name. What it does is read the database_type field from the mmdb database. It looks like you can do mmdbctl metadata <file> to see the database type.

I do see your point about reading the record as a map being more general than specialized code for each database though. However, I don't think we could replace the current per-database-type handling with it as it is not simply reading a map, but it does some additional processing. For example, finding the last subdivision and renaming that to "region":

https://github.com/vectordotdev/vector/pull/20054/files#diff-961d6c19250ef7e564ca8d1e4fd8680c1ac36e5c32d955202d6aa07f5980ba84R207-L208
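
To make the difference concrete, here is a small Python sketch contrasting an as-is read with the kind of extra processing described above. It is an illustration under assumptions, not the code in the linked PR: the `city_style` helper is hypothetical, and the output keys `region_code` and `region_name` are assumed names for the renamed fields.

```python
def city_style(record):
    """Hypothetical mapping: pick the last subdivision and expose it as a region.

    The output key names are assumptions made for this sketch.
    """
    out = {}
    subdivisions = record.get("subdivisions") or []
    if subdivisions:
        last = subdivisions[-1]  # the last, most specific subdivision
        out["region_code"] = last.get("iso_code")
        out["region_name"] = last.get("names", {}).get("en")
    return out


# Subdivision data for 1.0.0.0/24 as shown in the export earlier in the thread.
record = {
    "subdivisions": [
        {
            "geoname_id": 2145234,
            "iso_code": "VIC",
            "names": {"en": "Victoria", "pt-BR": "Vitória", "ru": "Виктория"},
        }
    ]
}

print(record["subdivisions"][0]["names"]["en"])  # as-is path into the record
print(city_style(record))                        # mapped, renamed view
```

An as-is reader keeps the `subdivisions` array intact, while the mapped view discards structure in exchange for flat, stable field names, which is why one cannot simply replace the other.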

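We could, however, let the user configure if they just want to read the structure as-is though and not do any mapping, even for known database types. A couple of ways we could do this:

* A new `mmdb` enrichment table type that just reads the records as-is (this is what you suggest above)

* A `mmdb_type` field on the `geoip2` enrichment table type that could be used to override the database type. For example, it could be set to `custom` to read the structure as-is

I'm partial to the new enrichment table type since it keeps things nicely separated, but I worry about users being confused about whether to use mmdb or geoip2 for their database 🤔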

esensar commented 7 months ago

> I don't think the patch looks at the file name. What it does is read the database_type field from the mmdb database. It looks like you can do mmdbctl metadata <file> to see the database type.

Correct. It checked that before as well; I have just changed it to treat unknown types as custom instead of city.

> We could, however, let the user configure if they just want to read the structure as-is though and not do any mapping, even for known database types. A couple of ways we could do this:
>
> * A new `mmdb` enrichment table type that just reads the records as-is (this is what you suggest above)
>
> * A `mmdb_type` field on the `geoip2` enrichment table type that could be used to override the database type. For example, it could be set to `custom` to read the structure as-is
>
> I'm partial to the new enrichment table type since it keeps things nicely separated, but I worry about users being confused about whether to use mmdb or geoip2 for their database 🤔

I think the first option makes more sense. If we go for that, maybe it would make sense to make it an error if any unsupported type is used for geoip2 (instead of reading it as city type). As for overriding mmdb_type, it seems like only custom would make sense, since other options would likely result in errors.

johnhtodd commented 7 months ago

I'm all for the first option. It would permit non-breaking use of the existing model, would allow new custom mmdb files to be read, and would also allow standard MaxMind files to be read and interpreted in the same way as custom files if that was the intent of the administrator.

jszwedko commented 7 months ago

I'm 👍 on pursuing the first option to introduce a new enrichment table type: mmdb that would live alongside the geoip2 enrichment table type. We should just make it clear in the docs when the mmdb enrichment table type should be used instead of the geoip2 enrichment table type.

> If we go for that, maybe it would make sense to make it an error if any unsupported type is used for geoip2 (instead of reading it as city type).

I tend to agree.
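
The behaviour agreed on in this thread could be summarized with a short Python sketch. It is a hedged outline of the decision, not Vector's implementation: the `choose_reader` function and its return labels are hypothetical, the list of known types is just the four names mentioned earlier in the thread, and exact matching on `database_type` is an assumption.

```python
# The four database types named earlier in this thread (assumed list).
GEOIP2_DATABASE_TYPES = (
    "GeoLite2-ASN",
    "GeoIP2-ISP",
    "GeoIP2-Connection-Type",
    "GeoIP2-City",
)


def choose_reader(table_type, database_type):
    """Pick how a record is read, based on the configured enrichment table type.

    table_type is the `type` set in the Vector config; database_type is the
    value stored in the MMDB metadata. Both labels returned are hypothetical.
    """
    if table_type == "mmdb":
        # New type: hand back the record exactly as stored, for any database.
        return "as-is"
    if table_type == "geoip2":
        if database_type in GEOIP2_DATABASE_TYPES:
            return "mapped"
        # Proposed change: fail instead of silently treating it as a city database.
        raise ValueError(f"unsupported database_type for geoip2: {database_type}")
    raise ValueError(f"unknown enrichment table type: {table_type}")


print(choose_reader("mmdb", "GeoIP2-City"))    # as-is, even for a known type
print(choose_reader("geoip2", "GeoIP2-City"))  # existing mapped behaviour
```

Keeping the two types separate means existing `geoip2` configurations keep their current output, while the same MaxMind file can be read unmodified by pointing an `mmdb` table at it.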