opensearch-project / geospatial

Future home of Geospatial features for OpenSearch
Apache License 2.0
33 stars · 34 forks

[BUG] ip2geo does not provide information from the database automatically #678

Open sasha2484 opened 2 weeks ago

sasha2484 commented 2 weeks ago

ip2geo does not provide information from the database automatically

How can one reproduce the bug? I followed these setup instructions: https://opensearch.org/docs/2.16/ingest-pipelines/processors/ip2geo/

1. I created a data source and verified that it works:

PUT /_plugins/geospatial/ip2geo/datasource/my-datasource
{
  "endpoint" : "https://geoip.maps.opensearch.org/v1/geolite2-city/manifest.json",
  "update_interval_in_days" : 1
}

{ "acknowledged": true }

GET /_plugins/geospatial/ip2geo/datasource/my-datasource

{
  "datasources": [
    {
      "name": "my-datasource",
      "state": "AVAILABLE",
      "endpoint": "https://geoip.maps.opensearch.org/v1/geolite2-city/manifest.json",
      "update_interval_in_days": 1,
      "next_update_at_in_epoch_millis": 1724839387155,
      "database": {
        "provider": "maxmind",
        "sha256_hash": "t7FahuRg6Pjw+kcP0F29ZFAni4HEbX5WJC+1M38hzLU=",
        "updated_at_in_epoch_millis": 1724427053000,
        "valid_for_in_days": 30,
        "fields": ["country_iso_code", "country_name", "continent_name", "region_iso_code", "region_name", "city_name", "time_zone", "location"]
      },
      "update_stats": {
        "last_succeeded_at_in_epoch_millis": 1724752680532,
        "last_processing_time_in_millis": 217775
      }
    }
  ]
}

2. I created a pipeline and checked that it works:

PUT /_ingest/pipeline/my-pipeline
{
  "description": "convert ip to geo",
  "processors": [
    {
      "ip2geo": {
        "field": "clientip",
        "datasource": "my-datasource"
      }
    }
  ]
}

{ "acknowledged": true }

POST _ingest/pipeline/my-pipeline/_simulate
{
  "docs": [
    {
      "_index": "testindex1",
      "_id": "1",
      "_source": { "clientip": "185.35.83.97" }
    }
  ]
}

{
  "docs": [
    {
      "doc": {
        "_index": "testindex1",
        "_id": "1",
        "_source": {
          "ip2geo": {
            "continent_name": "Europe",
            "country_name": "Norway",
            "location": "59.9452,10.7559",
            "country_iso_code": "NO",
            "time_zone": "Europe/Oslo"
          },
          "clientip": "185.35.83.97"
        },
        "_ingest": { "timestamp": "2024-08-28T08:55:16.048315377Z" }
      }
    }
  ]
}

PUT /nginx-2024.08.28/_doc/my-id?pipeline=my-pipeline
{
  "clientip": "185.35.83.97"
}

{
  "_index": "nginx-2024.08.28",
  "_id": "my-id",
  "_version": 4,
  "result": "updated",
  "_shards": { "total": 2, "successful": 2, "failed": 0 },
  "_seq_no": 24950455,
  "_primary_term": 1
}

GET /nginx-2024.08.28/_doc/my-id

{
  "_index": "nginx-2024.08.28",
  "_id": "my-id",
  "_version": 4,
  "_seq_no": 24950455,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "ip2geo": {
      "continent_name": "Europe",
      "country_iso_code": "NO",
      "country_name": "Norway",
      "location": "59.9452,10.7559",
      "time_zone": "Europe/Oslo"
    },
    "clientip": "185.35.83.97"
  }
}

3. I recreated the index nginx-2024.08.28 and saw the fields ip2geo.continent_name, ip2geo.country_name, and so on.

4. I can't find them through Discover, and I don't see them on the map. Screenshot 2024-08-28 at 12:06:40. Screenshot 2024-08-28 at 12:09:16.

I understand that the data comes in if I make a request manually. But why doesn't it work automatically? Data with the clientip field is constantly coming in.

GET /nginx-2024.08.28/

{
  "nginx-2024.08.28": {
    "aliases": {},
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        .....
        "clientip": {
          "type": "text",
          "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
        },
        ....

What is the expected behavior? I expect the data in these fields to be usable on the map.

What is your host/environment?

heemin32 commented 2 weeks ago

The "location": "59.9452,10.7559" value is not mapped to a geo point automatically. See https://opensearch.org/docs/latest/field-types/supported-field-types/geo-point/

You need to explicitly declare it as a geo_point in your index mapping.
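For example, ip2geo.location can be declared as a geo_point when the index is created (a sketch using the index name from this thread; note that an existing text mapping for the same field cannot be changed in place, so this has to be done on a fresh index):

```json
PUT /nginx-2024.08.28
{
  "mappings": {
    "properties": {
      "ip2geo": {
        "properties": {
          "location": { "type": "geo_point" }
        }
      }
    }
  }
}
```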

sasha2484 commented 2 weeks ago

I added point and ip mappings:

PUT /nginx*/_mapping?pretty
{
  "properties": {
    "point": { "type": "geo_point" }
  }
}

{ "acknowledged": true }

PUT /nginx*/_mapping?pretty
{
  "properties": {
    "ip_address": { "type": "ip" }
  }
}

{ "acknowledged": true }

After that I actually see a geospatial field, point. But it seems to contain no data, since nothing shows up on the map. I'm guessing that the pipeline itself is not working correctly and does not see the clientip field. It's like I'm missing something. Screenshot 2024-08-29 at 12:21:09.

Don’t pay attention to the different index names, I recreate them periodically

sasha2484 commented 2 weeks ago

I can't figure out why I see the fields in the index but don't see them in Discover. Maybe there is an error here and there is no data in the fields? Screenshot 2024-08-29 at 12:14:48. Screenshot 2024-08-29 at 12:14:04.

sasha2484 commented 1 week ago

It's very strange. I definitely have a database and a processor that should handle the clientip field, which contains IP addresses. At the same time, I looked at the statistics of this processor and everything is zero: no errors and nothing processed. Yet if I run a request through this processor manually, everything works. It's as if it does not see the clientip field, even though that field definitely contains IP addresses.

{
  "nodes": {
    "bjAXJdRNSn6BRwOWMXSgFA": {
      "ingest": {
        "total": { "count": 4379, "time_in_millis": 9274, "current": 0, "failed": 0 },
        "pipelines": {
          ....
          "my-processor": {
            "count": 0,
            "time_in_millis": 0,
            "current": 0,
            "failed": 0,
            "processors": [
              {
                "ip2geo": {
                  "type": "ip2geo",
                  "stats": { "count": 0, "time_in_millis": 0, "current": 0, "failed": 0 }
                }
              }
            ]
          }
          ...

Screenshot 2024-09-03 at 12:44:39.

I no longer know where to look. According to the instructions, everything is correct. The processor does not seem to see this field.

{
  "my-processor2": {
    "description": "convert ip to country",
    "processors": [
      {
        "ip2geo": {
          "datasource": "country-datasource",
          "field": "clientip",
          "properties": ["country_name"],
          "ignore_failure": true
        }
      }
    ]
  }
}

GET /_ingest/pipeline/my-processor2/_simulate
{
  "docs": [
    { "_source": { "clientip": "2001:2000::" } }
  ]
}

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_source": {
          "ip2geo": { "country_name": "Sweden" },
          "clientip": "2001:2000::"
        },
        "_ingest": { "timestamp": "2024-09-03T09:48:02.937300222Z" }
      }
    }
  ]
}

GET /nginx-2024.09.03/_mapping?pretty

{
  "nginx-2024.09.03": {
    "mappings": {
      "properties": {
        ....
        "clientip": {
          "type": "text",
          "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
        },

heemin32 commented 1 week ago

Hi @sasha2484. I am confused about what the actual issue is that you are facing now. Could you provide clear steps to reproduce the issue and describe the difference between the actual behavior and the expected behavior?

sasha2484 commented 1 week ago

The general point is that the pipeline does not read the IP from the clientip field. My steps are as follows:

1. I create my-datasource:

PUT /_plugins/geospatial/ip2geo/datasource/my-datasource
{
  "endpoint" : "https://geoip.maps.opensearch.org/v1/geolite2-city/manifest.json",
  "update_interval_in_days" : 1
}

=>

{ "acknowledged": true }

2. I see that it works:

GET /_plugins/geospatial/ip2geo/datasource/my-datasource

=>

{
  "datasources": [
    {
      "name": "my-datasource",
      "state": "AVAILABLE",
      "endpoint": "https://geoip.maps.opensearch.org/v1/geolite2-city/manifest.json",
      "update_interval_in_days": 1,
      "next_update_at_in_epoch_millis": 1724839387155,
      "database": {
        "provider": "maxmind",
        "sha256_hash": "t7FahuRg6Pjw+kcP0F29ZFAni4HEbX5WJC+1M38hzLU=",
        "updated_at_in_epoch_millis": 1724427053000,
        "valid_for_in_days": 30,
        "fields": ["country_iso_code", "country_name", "continent_name", "region_iso_code", "region_name", "city_name", "time_zone", "location"]
      },
      "update_stats": {
        "last_succeeded_at_in_epoch_millis": 1724752680532,
        "last_processing_time_in_millis": 217775
      }
    }
  ]
}

3. I created a pipeline:

PUT /_ingest/pipeline/my-pipeline
{
  "description": "convert ip to geo",
  "processors": [
    {
      "ip2geo": {
        "field": "clientip",
        "datasource": "my-datasource"
      }
    }
  ]
}

=>

{ "acknowledged": true }

4. I checked that the pipeline works:

POST _ingest/pipeline/my-pipeline/_simulate
{
  "docs": [
    {
      "_index": "testindex1",
      "_id": "1",
      "_source": { "clientip": "185.35.83.97" }
    }
  ]
}

=>

{
  "docs": [
    {
      "doc": {
        "_index": "testindex1",
        "_id": "1",
        "_source": {
          "ip2geo": {
            "continent_name": "Europe",
            "country_name": "Norway",
            "location": "59.9452,10.7559",
            "country_iso_code": "NO",
            "time_zone": "Europe/Oslo"
          },
          "clientip": "185.35.83.97"
        },
        "_ingest": { "timestamp": "2024-08-28T08:55:16.048315377Z" }
      }
    }
  ]
}

5. I see that everything works when I query manually. Now I expect that, for any new index containing a clientip field, the "my-pipeline" pipeline will be triggered and will create additional fields with information about the city, country, coordinates, etc. Once the data appears in these fields, I can map them as geo points, which will let me display this information on the map.
6. The main problem is that the pipeline itself apparently does not process the clientip field for some reason, although the IP addresses are there (see the screenshot above). I checked the pipeline stats and saw that it has not run even once. Moreover, there are no errors.

{
  "nodes": {
    "bjAXJdRNSn6BRwOWMXSgFA": {
      "ingest": {
        "total": { "count": 4379, "time_in_millis": 9274, "current": 0, "failed": 0 },
        "pipelines": {
          ....
          "my-pipeline": {
            "count": 0,
            "time_in_millis": 0,
            "current": 0,
            "failed": 0,
            "processors": [
              {
                "ip2geo": {
                  "type": "ip2geo",
                  "stats": { "count": 0, "time_in_millis": 0, "current": 0, "failed": 0 }
                }
              }
            ]
          }
          ...

heemin32 commented 1 week ago

And now I expect that in any new index, if there is a clientip field in it, the pipeline "my-pipeline" will be triggered

Could you tell me how you ingested the doc? I see you were able to process the clientip field in your previous example. Does it not work anymore?

PUT /nginx-2024.08.28/_doc/my-id?pipeline=my-pipeline
{
"clientip": "185.35.83.97"
}

{
"_index": "nginx-2024.08.28",
"_id": "my-id",
"_version": 4,
"result": "updated",
"_shards": {
"total": 2,
"successful": 2,
"failed": 0
},
"_seq_no": 24950455,
"_primary_term": 1
}

GET /nginx-2024.08.28/_doc/my-id

{
"_index": "nginx-2024.08.28",
"_id": "my-id",
"_version": 4,
"_seq_no": 24950455,
"_primary_term": 1,
"found": true,
"_source": {
"ip2geo": {
"continent_name": "Europe",
"country_iso_code": "NO",
"country_name": "Norway",
"location": "59.9452,10.7559",
"time_zone": "Europe/Oslo"
},
"clientip": "185.35.83.97"
}
}
sasha2484 commented 1 week ago

I used the instructions from here: https://opensearch.org/docs/2.15/ingest-pipelines/processors/ip2geo/ And from here: https://opensearch.net/blog/new-ip2geo-processor-with-automatic-update/

PUT /nginx-2024.09.05/_doc/my-id?pipeline=my-pipeline
{
  "clientip": "185.35.83.97"
}

=>

{
  "_index": "nginx-2024.09.05",
  "_id": "my-id",
  "_version": 5,
  "result": "updated",
  "_shards": { "total": 2, "successful": 2, "failed": 0 },
  "_seq_no": 75963415,
  "_primary_term": 1
}

GET /nginx-2024.09.05/_doc/my-id

=>

{
  "_index": "nginx-2024.09.05",
  "_id": "my-id",
  "_version": 2,
  "_seq_no": 75963412,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "ip2geo": {
      "continent_name": "Europe",
      "country_iso_code": "NO",
      "country_name": "Norway",
      "location": "59.9452,10.7559",
      "time_zone": "Europe/Oslo"
    },
    "clientip": "185.35.83.97"
  }
}

That is, when I send such a request manually through Dev Tools, everything works as expected. But the pipeline does not work automatically for the new indexes and new data that come in. I create a new nginx-{date} index every day, and the clientip field with IP addresses is present in it. It feels like I'm missing some little thing, but I can't find it. I've tried creating different pipelines with different names, but none of them work.

sasha2484 commented 1 week ago

I'm trying to see how many documents the "my-pipeline" pipeline has processed in total, but I get zeros:

GET /_nodes/stats/ingest?filter_path=nodes.*.ingest

=>

{
  "nodes": {
    "bjAXJdRNSn6BRwOWMXSgFA": {
      "ingest": {
        "total": { "count": 4379, "time_in_millis": 9274, "current": 0, "failed": 0 },
        "pipelines": {
          ....
          "my-pipeline": {
            "count": 0,
            "time_in_millis": 0,
            "current": 0,
            "failed": 0,
            "processors": [
              {
                "ip2geo": {
                  "type": "ip2geo",
                  "stats": { "count": 0, "time_in_millis": 0, "current": 0, "failed": 0 }
                }
              }
            ]
          },
          ....

I am creating a new index pattern "nginx*" in which I see the new fields ip2geo.continent_name and the like. Logically, I should see data in them, but I don't see it through Discover.
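Since a new index is created every day, note that PUT /nginx*/_mapping only changes indices that already exist; an index template would apply the geo_point mapping to every future index as well. A sketch, assuming the nginx-* naming used in this thread (the template name is arbitrary):

```json
PUT /_index_template/nginx-ip2geo
{
  "index_patterns": ["nginx-*"],
  "template": {
    "mappings": {
      "properties": {
        "ip2geo": {
          "properties": {
            "location": { "type": "geo_point" }
          }
        }
      }
    }
  }
}
```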

sasha2484 commented 1 week ago

At the same time, the simulation works:

GET /_ingest/pipeline/my-pipeline/_simulate
{
  "docs": [
    { "_source": { "clientip": "94.131.3.90" } }
  ]
}

=>

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_source": {
          "ip2geo": {
            "continent_name": "Europe",
            "region_iso_code": "BE",
            "city_name": "Bern",
            "country_iso_code": "CH",
            "country_name": "Switzerland",
            "region_name": "Bern",
            "location": "46.9786,7.4483",
            "time_zone": "Europe/Zurich"
          },
          "clientip": "94.131.3.90"
        },
        "_ingest": { "timestamp": "2024-09-06T08:31:40.035983293Z" }
      }
    }
  ]
}

heemin32 commented 1 week ago

Could you share how you ingest documents in automatic mode?
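For context: an ingest pipeline is not applied just because it exists. It runs only when an indexing request names it via the ?pipeline= query parameter, or when it is set as the index's default pipeline. A sketch of the latter, using the index and pipeline names from this thread:

```json
PUT /nginx-2024.09.05/_settings
{
  "index.default_pipeline": "my-pipeline"
}
```

With this setting in place, every document indexed into that index goes through my-pipeline even when the request does not pass ?pipeline=.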

sasha2484 commented 3 days ago

I receive messages from bots, process them with Logstash filters, and send them to OpenSearch. Logstash filter:

}  else if "nginx" in [tags] {
    grok {
        match => {
            "message" => [
                "%{IPORHOST:clientip} (?:-|(%{WORD}.%{WORD})) %{USER:ident} \[%{HTTPDATE:timestamp}\] \"(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})\" %{NUMBER:response} (?:%{NUMBER:bytes}|-) %{QS:referrer} %{QS:agent} %{QS:forwarder} route:(?:%{PROG:route}|) webfarm:(?:%{PROG:webfarm}|) host:%{PROG:site}",
                "%{IPORHOST:clientip} (?:-|(%{WORD}.%{WORD})) %{GREEDYDATA:ident} \[%{HTTPDATE:timestamp}\] \"%{URIPROTO:verb} %{URIPATHPARAM:request}(?: HTTP/%{NUMBER:httpversion})\" %{NUMBER:response} (?:%{NUMBER:bytes}|-) %{QS:referrer} %{QS:agent} %{QS:forwarder} route:(?:%{PROG:route}|) webfarm:(?:%{PROG:webfarm}|) host:%{PROG:site}",
                "(?<timestamp>%{YEAR}[./]%{MONTHNUM}[./]%{MONTHDAY} %{TIME})\s\[%{WORD:eventlevel}\]\s%{POSINT:pid}#%{NUMBER:threadid}\:\s\*%{NUMBER:connectionid}\s%{GREEDYDATA:error}zone\s\"%{WORD:zone}\"\,\sclient\:\s%{IPV4:client_ip}\,\sserver\:\s%{HOSTNAME:server}\,\srequest:\s\"%{WORD:method}\s\/?%{DATA}\/%{INT}\/%{DATA:base}\/%{WORD:service}\/%{GREEDYDATA}",
                "(?<timestamp>%{YEAR}[./]%{MONTHNUM}[./]%{MONTHDAY} %{TIME}) \[%{LOGLEVEL:severity}\] %{POSINT:pid}#%{NUMBER:threadid}\: \*%{NUMBER:connectionid} %{GREEDYDATA:eventmessage}, client: %{IP:client}, server: %{IPORHOST:server}, request: \"%{WORD:req.verb}%{SPACE}/%{NOTSPACE:req.webfarm}/%{NOTSPACE:req.clientname}_%{DATA:req.dbindex}/(?:|%{NOTSPACE:req.apppath})(?: HTTP/%{NUMBER:req.httpversion})\", upstream: \"%{GREEDYDATA:upstream}\", host: \"%{GREEDYDATA:host}\"",
                "(?<timestamp>%{YEAR}[./]%{MONTHNUM}[./]%{MONTHDAY} %{TIME}) \[%{LOGLEVEL:severity}\] %{POSINT:pid}#%{NUMBER:threadid}\: \*%{NUMBER:connectionid} %{GREEDYDATA:eventmessage}, client: %{IP:client}, server: %{IPORHOST:server}, request: \"%{WORD:req.verb}%{SPACE}/%{NOTSPACE:req.webfarm}/%{NOTSPACE:req.clientname}_%{DATA:req.dbindex}/(?:|%{NOTSPACE:req.apppath})(?: HTTP/%{NUMBER:req.httpversion})\", host: \"%{GREEDYDATA:host}\"",
                "(?<timestamp>%{YEAR}[./]%{MONTHNUM}[./]%{MONTHDAY} %{TIME}) \[%{LOGLEVEL:severity}\] %{POSINT:pid}#%{NUMBER:threadid}\: \*%{NUMBER:connectionid} %{GREEDYDATA:eventmessage}, client: %{IP:client}, server: %{IPORHOST:server}"
            ]
        }
    }
    date {
        match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z", "yyyy/MM/dd HH:mm:ss"]
        target => "@timestamp"
    }
    mutate {
        remove_field => ["timestamp"]
    }