grafana / agent

Vendor-neutral programmable observability pipelines.
https://grafana.com/docs/agent/
Apache License 2.0

remote.s3 not pulling in yaml file - empty string #4704

Closed. MaxDiOrio closed this issue 9 months ago.

MaxDiOrio commented 1 year ago

What's wrong?

Grafana Agent is installed in Kubernetes (K8s) in flow mode.

I have an snmp.yml for prometheus.exporter.snmp stored in an S3 bucket. I generated the YAML file because I needed to add a module that isn't in the defaults. The UI shows that the S3 component is healthy, but its content is an empty string (""). I verified that the YAML file is valid.

To rule out a bucket permission issue, I uploaded a plain-text file containing a simple string. That file loads successfully and displays its contents in the UI.

The Prometheus scrape of the SNMP exporter is failing (even though it reports healthy) due to "unknown module", which tells me either the exporter isn't using the S3 file, or the S3 file isn't being loaded and the module is therefore missing because only the default configuration is in use.

The agent console shows no errors. As far as I can tell there are no debug options and no way to inspect the returned contents, since they are held in memory.

Steps to reproduce

1. Compile a new snmp.yml file with an additional module.
2. Upload the file to an S3 bucket.
3. Configure the River file to use remote.s3 to pull the file.
4. In the UI, the content shows as "" while the remote.s3 component reports healthy.

System information

No response

Software version

Grafana Agent v0.35.2 in flow mode

Configuration

remote.s3 "snmpconfig" {
      path = "s3://grafana-agent-bucket/snmp.yml"
      poll_frequency = "5m"
      is_secret = false
    }  <-- empty string returned

    remote.s3 "hi_folder" {
      path = "s3://grafana-agent-bucket/monitoring-cluster/hi.txt"
      poll_frequency = "5m"
      is_secret = false
    } <-- works fine

    remote.s3 "hi_root" {
      path = "s3://grafana-agent-bucket/hi.txt"
      poll_frequency = "5m"
      is_secret = false
    }  <-- works fine

    prometheus.exporter.snmp "netscaler" {
      config = remote.s3.snmpconfig.content

      target "publicProdNetscaler" {
        address = "172.0.0.123"
        module = "netscaler"
        auth = "network"
      }
      target "dmzNetscaler" {
        address = "netscalerADChostnameorip"
        module = "netscaler"
        auth = "network"
      }
    }

    prometheus.scrape "netscaler" {
      targets = prometheus.exporter.snmp.netscaler.targets
      forward_to = [prometheus.relabel.snmp.receiver]
    }

--------
Truncated snmp.yml

auths:
modules:
  if_mib:
  keepalived:
  netscaler:

Logs

Name                   Value
job                    "prometheus.scrape.netscaler"
url                    "http://agent.internal:12345/api/v0/component/prometheus.exporter.snmp.netscaler/metrics?auth=network&module=netscaler&target=172.0.0.123"
health                 "down"
labels                 { instance = "k8s-workernode", job = "integrations/snmp/publicProdNetscaler" }
last_error             "server returned HTTP status 400 Bad Request"
last_scrape            "2023-08-03T16:47:17.021091026Z"
last_scrape_duration   "373.166µs"

When I port-forward and hit that URL endpoint:

Unknown module 'netscaler'
MaxDiOrio commented 1 year ago

I suspect it could be related to the size of the file; the one I'm trying to load is about 20k lines. I uploaded a regular Helm YAML file and that works perfectly fine, albeit displayed on one line rather than formatted properly.

rfratto commented 1 year ago

If you hit the /metrics endpoint of the agent, what are the values for these two metrics?

MaxDiOrio commented 1 year ago

agent_remote_s3_errors_total{component_id="remote.s3.snmpconfig"} 0
agent_remote_s3_timestamp_last_accessed_unix_seconds{component_id="remote.s3.snmpconfig"} 1.6911665032845545e+09

From the UI:

Latest health message (2023-08-04T16:28:23.284554786Z)
s3 file updated

Arguments:
    path           "s3://bucketname/monitoring-cluster/snmp.yml"
    poll_frequency "5m0s"

Exports:
    content ""

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. The issue will not be closed automatically, but a label will be added to it for tracking purposes.

If the opened issue is a bug, check if newer releases have fixed the issue. If the issue is no longer relevant, please feel free to close it. Thank you for your contributions!

rfratto commented 9 months ago

We've done some digging into this and identified that the problem is with getObject, where we weren't properly handling reads from S3. There's a branch with what we believe is a fix, and a PR should be coming soon.
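For context on what "not properly handling reads" can look like in practice, below is a minimal, illustrative Go sketch of the read-the-whole-body pattern for an S3 GetObject call. This is not the actual grafana/agent patch; it assumes the AWS SDK for Go v2, and the getObjectContent helper name and the bucket/key values are hypothetical. The point is that io.ReadAll drains the response body to EOF, whereas a single Read call may return only part of a larger object's data (or nothing at all if it is handed a zero-length buffer).

    package main

    import (
        "context"
        "fmt"
        "io"
        "log"

        "github.com/aws/aws-sdk-go-v2/config"
        "github.com/aws/aws-sdk-go-v2/service/s3"
    )

    // getObjectContent fetches s3://bucket/key and returns the full body as a string.
    // Illustrative only; not the grafana/agent implementation.
    func getObjectContent(ctx context.Context, client *s3.Client, bucket, key string) (string, error) {
        out, err := client.GetObject(ctx, &s3.GetObjectInput{
            Bucket: &bucket,
            Key:    &key,
        })
        if err != nil {
            return "", fmt.Errorf("getting s3://%s/%s: %w", bucket, key, err)
        }
        defer out.Body.Close()

        // io.ReadAll keeps reading until EOF, so the whole object is returned.
        // A single out.Body.Read(buf) call is allowed to return fewer bytes
        // than the object contains, which is easy to get wrong for large files.
        data, err := io.ReadAll(out.Body)
        if err != nil {
            return "", fmt.Errorf("reading body of s3://%s/%s: %w", bucket, key, err)
        }
        return string(data), nil
    }

    func main() {
        ctx := context.Background()
        cfg, err := config.LoadDefaultConfig(ctx)
        if err != nil {
            log.Fatal(err)
        }
        content, err := getObjectContent(ctx, s3.NewFromConfig(cfg), "grafana-agent-bucket", "snmp.yml")
        if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("fetched %d bytes\n", len(content))
    }

If something along those lines was the failure mode, it would also fit the earlier observation that the small hi.txt files loaded fine while the much larger snmp.yml came back empty.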