grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0
23.38k stars 3.39k forks source link

Allow promtail GeoIP stage to extract data #9076

Open Tireg opened 1 year ago

Tireg commented 1 year ago

Is your feature request related to a problem? Please describe. The new GeoIP stage of Promtail has been just implemented with #3493 and it is great.

However, the GeoIP data flowing directly to labels seems problematic, because most of this data are high cardinality. Moreover, it seems we can't push the label values to the logs or use them in other stages (tried with the pack filter and the replace filter as described in the labeldrop documentation, but it seems that those labels values are not reachable at this point of the pipeline).

Wouldn't it be better to pick it as "extracted data" instead of labels, so that users can choose whether they want to add them as label or as log output in later stages ?

Describe the solution you'd like Do not send the data to labels at all in this stage. Only send the data to the "extracted data" map.

This would be more coherent with the stage being a "Parsing stage" according to the documentation.

Moreover, there already exists stages that allow to pick extracted data and put it in labels. This is easier to compose with.

Describe alternatives you've considered Keep the current implementation (which sends data to labels), but also send it to the extracted data map.

This seems a bit redundant and would ask users to frequently drop most labels with an extra labeldrop stage as they are high cardinality.

Additional context I already tried to patch promtail locally (the way described in this feature request) and it seems to work flawlessly. I'm planning to make multiple clean PRs (because my current patch also add new things and fixes others) ASAP.

In case someone wants to fix this before I do, here is an extract of the full patch that shows the idea of how to implement this:

-func (g *geoIPStage) populateLabelsWithASNData(labels model.LabelSet, record *geoip2.ASN) {
+func (g *geoIPStage) populateExtractedWithASNData(extracted map[string]interface{}, record *geoip2.ASN) {
    autonomousSystemNumber := record.AutonomousSystemNumber
    autonomousSystemOrganization := record.AutonomousSystemOrganization
    if autonomousSystemNumber != 0 {
-       labels[model.LabelName("geoip_autonomous_system_number")] = model.LabelValue(fmt.Sprint(autonomousSystemNumber))
+       extracted["geoip_autonomous_system_number"] = autonomousSystemNumber
    }
    if autonomousSystemOrganization != "" {
-       labels[model.LabelName("geoip_autonomous_system_organization")] = model.LabelValue(autonomousSystemOrganization)
+       extracted["geoip_autonomous_system_organization"] = autonomousSystemOrganization
    }
 }
chaudum commented 1 year ago

Hi @Tireg

I agree with the points you raised. geoIP should behave like other parsing stages. Unfortunately, this is would be a breaking change.

Let's wait for some feedback from the community.

megheaiulian commented 1 year ago

This could be a non-breaking change if there was a optional extra mapping or expressions option. When present user can control how fields are mapped (like expressions in json). When absent it could fallback to the current behavior.

PsySuck commented 1 year ago

I agree that geoip in this form is impossible to use. Too many labels.

nikolabura commented 1 year ago

Agreed, the fact that you can't remove geoip labels (via pack or similar) makes it almost useless. I'm not sure how we're expected to use it like this when the documentation specifically says to avoid high-cardinality labels. Why is GeoIP a client-side (promtail) stage anyways instead of being applied server-side (LogQL)?

esev commented 1 year ago

Would it be acceptable to populate both the labels and the extracted map with the same values?

I lost a few hours today trying to use the geoip output in pack and template stages. It was surprising that geoip doesn't work like other stages.

This is what I'd like to have work. Currently none of the geoip_* labels are packed, and as a result no geoip values are stored.

  # Pack ip_address and geoip values into the output message.
  - match:
      selector: >-
        {ip_address=~".+"}
      stages:
      - geoip:
          db: "/etc/geoip/GeoLite2-City.mmdb"
          source: "ip_address"
          db_type: "city"
      - geoip:
          db: "/etc/geoip/GeoLite2-ASN.mmdb"
          source: "ip_address"
          db_type: "asn"
      - pack:
          labels:
          - geoip_city_name
          - geoip_country_name
          - geoip_continet_name
          - geoip_continent_code
          - geoip_location_latitude
          - geoip_location_longitude
          - geoip_postal_code
          - geoip_timezone
          - geoip_subdivision_name
          - geoip_subdivision_code
          - geoip_autonomous_system_number
          - geoip_autonomous_system_organization
      - template:
          source: message
          template: >-
            {{- $entry := mustFromJson .Entry -}}
            {{- $orig := get $entry "_entry" | mustFromJson -}}
            {{- omit (mustMerge $orig $entry) "_entry" | mustToJson -}}
      - output:
          source: message
      - labelallow:
        - instance
        - job
        - project
        - service
        - stream
imtipi commented 1 year ago

I tried to build promtail on my own for use first, as it may take a while for PR to be accepted. But when I run apt install -y libsystemd-dev and go build --tags=promtail_journal_enabled . /clients/cmd/promtail, I still get prompted that journal logging is not enabled. am I the only one with this issue?

edit: solved,i forgot set cgo=1

superstes commented 9 months ago

Also ran into this issue. And searched for a solution for waaay too long..

I wanted to create different geoip-labels for source- & destination IPs:

    pipeline_stages:
      - match:
          selector: '{facility="kern"} |= "NFTables"'
          stages:
            - regex:
                expression: '\sSRC=(?P<packet_ip_src>.*?)\s'
            - regex:
                expression: '\sDST=(?P<packet_ip_dst>.*?)\s'
            - geoip:
                db: "/etc/geoip/asn.mmdb"
                source: "packet_ip_src"
                db_type: "asn"
            - template:
                source: 'packet_geoip_src_asn'
                template: '{{or .geoip_autonomous_system_number "-"}}'
            - labeldrop:
                - 'geoip_autonomous_system_number'
...

But as the data does not exist in extracted it will always return <no data> or the default-value - in my case.

superstes commented 9 months ago

To mention it: One could migrate to grafana-agent which implements a fork of promtail that has this issue fixed. See grafana-agent log-config, grafana-agent docs and grafana-agent PR