vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

AWS ECS Fargate 1.4.0 with splunk and vector #9103

Open shopwareCloud opened 2 years ago

shopwareCloud commented 2 years ago

Vector Version

0.16.1

Vector Configuration File

# Set global options
data_dir = "/var/lib/vector"

# Receive events over the Splunk HEC protocol
[sources.xx]
type    = "splunk_hec"
address = "0.0.0.0:8088"
token   = "${splunk_token}"
valid_tokens = [ "${splunk_token}" ]

# Send data to a cost-effective long-term storage
[sinks.s3_archive]
inputs         = ["xx"]
type           = "aws_s3"
region         = "${region}"
bucket         = "${s3_bucket}" # todo variable
key_prefix     = "date=%Y-%m-%d"       # daily partitions, hive friendly format
compression    = "gzip"                          # compress final objects
encoding       = "ndjson"                         # new line delimited JSON
batch.max_size = 10000000                 # 10mb uncompressed
healthcheck    = false                             # relies on CreateBucket and is disabled

# Send data to datadog for real time monitoring
[sinks.datadog]
type            = "datadog_logs"
inputs          = [ "xx" ]
default_api_key = "${datadog_api_key}"
compression     = "gzip"
site            = "datadoghq.eu"
region          = "eu"
...

Debug Output

Nothing really to see here.

Expected Behavior

The main ECS container should connect to Vector via the Splunk HEC endpoint, and Vector should forward these logs to Datadog and S3.

Actual Behavior

With Fargate platform version 1.3.0 this works. With the latest platform version (1.4.0), the health check from the documentation failed first. I fixed that by adding the -s parameter to curl and rebuilding the Vector Alpine image with curl included. Probably related to https://github.com/aws/containers-roadmap/issues/898

But the task still fails with the error message:

Stopped reason: ResourceInitializationError: failed to validate logger args: Options http://127.0.0.1:8088/services/collector/event/1.0: dial tcp 127.0.0.1:8088: connect: connection refused : exit status 1

Example Data

-

Additional Context

ECS task definition

[
  {
    "name": "xxx",
    "image": "xxx:latest",
    "essential": true,
    "logConfiguration": {
      "logDriver": "splunk",
      "options": {
        "splunk-url": "http://localhost:8088",
        "splunk-token": "${splunk_token}"
      }
    },
    "dependsOn": [
      {"containerName": "vector", "condition": "HEALTHY"}
    ]
  },
  {
    "image": "jenskueper/vector:latest-alpine",
    "name": "vector",
    "essential": true,
    "healthCheck": {
      "command": [
        "CMD-SHELL",
        "curl -s -f http://127.0.0.1:8088/services/collector/health -H 'Authorization: Splunk ${splunk_token}' || exit 1"
      ],
      "interval": 5
    },
    "cpu": 10
  }
]

References

-

ConradKurth commented 2 years ago

@shopwareCloud Hey! We are having the same issue. Did you ever find a solution?

jaloren commented 3 months ago

I bypassed this by setting this log option in the log driver settings:

splunk-verify-connection = "false"
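
For reference, a minimal sketch of how that option might be placed in the logConfiguration of the main container from the task definition in the original report. The splunk-url and splunk-token values are copied from that report; only the last option is new. The "Options http://127.0.0.1:8088/..." part of the error suggests the failure comes from the log driver's connection check at startup, which presumably runs before the vector container is listening, so this assumes that skipping the check is acceptable for your setup:

```json
{
  "logConfiguration": {
    "logDriver": "splunk",
    "options": {
      "splunk-url": "http://localhost:8088",
      "splunk-token": "${splunk_token}",
      "splunk-verify-connection": "false"
    }
  }
}
```

Note that with verification turned off, a wrong URL or token would no longer stop the task at startup, so it may be worth checking that logs actually arrive in Vector after deploying.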