quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
MIT License
12.03k stars 670 forks source link

IP Address Field - Fast field [RFC] #1383

Closed fulmicoton closed 1 year ago

fulmicoton commented 2 years ago

We would like support to store IP Addresses. See https://github.com/quickwit-oss/quickwit/issues/1381

These IP address should allow indexed / stored and fastfields with the same semantics as integer. This RFC is about the fast field implementation.

IP address should all be mapped to IpV6 (IpV4 addresses can be mapped using Ipv4 mapped address.

IPv6 can in turn be handled as u128. 128 bit per address is however inefficient. For internal traffic, the cardinality of ip addresses is likely very low, and dictionary encoding seems like a great solution. For external traffic, on the other hand, the cardinality of IP-addresses can be huged. The values taken are however not uniformly reparted.

The following describes a simple encoding that has the following property

IPs are special in that real life IP Address tend to not be uniformly spread. The idea would then be to:

Once the compact space is defined, decoding and encoding to the compact space is a matter of running a binary search to locate the right interval.

Identifying the right compact space can be done iteravely by sorting the input, and removing the largest delta (in other word, "holes").

Note

The approach only works on the representation of IP addresses and does not take benefit of the distribution of frequency of the set of IP addresses.

kstaken commented 2 years ago

Not sure if this is helpful but I took a look at some metrics to see counts and uniqueness ratios for ipv4 and ipv6 addresses in one of our target data sets. This data has 2 IP addresses in each record and if ip1 is ipv4 then ip2 will also be ipv4 but the uniqueness will vary. These are internet addresses. Ran two samples at 1,000,000 and 10,000,000 records.

This is the output from QPL queries run against the ES cluster array looking at a sample from a 1 hour window. These numbers would be representative of a per split target for a Quickwit use case on this data.

{
  "total": 1,
  "returning": 1,
  "results": [
    {
      "record_count": 1000000,
      "ip1_v6_unique_ratio": 0.033065,
      "ip1_v4_unique_ratio": 0.069054,
      "ip2_v6_unique_ratio": 0.032135,
      "ip2_v4_unique_ratio": 0.146869,
      "unique_ip1_v4": 69054,
      "unique_ip2_v6": 32135,
      "unique_ip1_v6": 33065,
      "unique_ip2_v4": 146869,
      "ip1_v4": 565050,
      "ip2_v4": 565050,
      "ip1_v6": 434950,
      "ip2_v6": 434950
    }
  ]
}
{
  "total": 1,
  "returning": 1,
  "results": [
    {
      "record_count": 10000000,
      "ip1_v6_unique_ratio": 0.0077621,
      "ip1_v4_unique_ratio": 0.0256247,
      "ip2_v6_unique_ratio": 0.0094701,
      "ip2_v4_unique_ratio": 0.0987462,
      "unique_ip1_v4": 256247,
      "unique_ip2_v6": 94701,
      "unique_ip1_v6": 77621,
      "unique_ip2_v4": 987462,
      "ip1_v4": 5737512,
      "ip2_v4": 5737512,
      "ip1_v6": 4262488,
      "ip2_v6": 4262488
    }
  ]
}

NOTE: the total field uniqueness at 10M records is 3.2% for ip1 and 10.7% for ip2 and these ratios will obviously vary dramatically depending on the dataset but this dataset likely has higher ratios than you would typically find. Probably a worst case scenario for dictionary encoding.

fulmicoton commented 2 years ago

Thank you very much @kstaken. It's higher than what I thought. Maybe dictionary encoding is a bad idea after all.

fulmicoton commented 2 years ago

@PSeitz I updated the RFC for IP address field. I think the data suggested we do not need to go for the more complicated code to handle high-freq / low-freq IPs right away (or do we)?

EIther way that Codec is "special" in that it needs to make it possible to allow range queries. Can you design an API that would allow that?

PSeitz commented 2 years ago

@PSeitz I updated the RFC for IP address field. I think the data suggested we do not need to go for the more complicated code to handle high-freq / low-freq IPs right away (or do we)?

Not on kimbro's dataset, but on the dataset of the sec gov website it's beneficial. I would add the high-freq handling, in order to increase the maximum compression for high-freq cases. E.g. for cases where one IP makes 80% of the whole dataset, we could encode it with a single bit, instead the number of bits used by the compressed space.

EIther way that Codec is "special" in that it needs to make it possible to allow range queries. Can you design an API that would allow that?

Yes, I think it would look similar like FastFieldReader, but for u128.

fulmicoton commented 2 years ago

@PSeitz It will make the code considerably more complex. Running range query in particular will require running two ranges. Can we start with a simple codec and improve it later?

PSeitz commented 2 years ago

I'll just reserve same space in the codec to be able to handle that later on.

@kstaken What's the cardinality of the ip field, single or multi values?

kstaken commented 2 years ago

@PSeitz We have fields with both scenarios.

fulmicoton commented 2 years ago

@kstaken can you explain the use case for IP fields with multiple cardinality? I've never seen one.

kstaken commented 2 years ago

It's just a field that contains an array of IP addresses. We have some fields that just contain an ip address like ip:123.2.32.3 and some that contain arrays ips:['12.32.4.4', '123.0.2.9'] but a field is always one or the other not mixed across records. A scenario where you might see this in something like an HTTP log would be the proxy chain in X-Forwarded-For.

Are you not planning to generally support multivalue fields? ES handles this pretty seamlessly and doesn't require any special treatment at search or indexing time. It even allows you to mix single value and array values for the same field across records but that is best avoided for other reasons.

PSeitz commented 2 years ago

@kstaken We plan to support both and it should also be seamless. The current fast field API requires to define Cardinality, but we consider to drop that and detect it instead.

PSeitz commented 1 year ago

Cardinality detection will be part of the next release