toni-moreno / snmpcollector

A full featured Generic SNMP data collector with Web Administration Interface for InfluxDB
MIT License

don't use alpine to run a container if you're building on a glibc system #391

Closed jensenja closed 4 years ago

jensenja commented 5 years ago

I've been switching my Influx servers and relays over to HTTPS, and as part of that I'm configuring my pollers to send to the relays over HTTPS as well.

All of my collectors and their associated databases are running in a Docker user-defined network. haproxy is also connected to this network to allow for terminating SSL and passing along the requests via HTTP to the web UI frontends of snmpcollector on 8090.

I have a wildcard SSL certificate for *.eng.domain.local (specific domain is redacted) that I have successfully configured on my Influx servers, the relays, Kapacitor, etc. The next logical step is to configure my snmpcollector instances to send to the FQDN of the Influx servers over SSL. The problem is that snmpcollector fails to ping and write points to the relays: either Go or snmpcollector refuses to resolve the hostname because it ends in the .local TLD.

Within snmpcollector, I can't use just the IP of the server anymore (because of SSL/FQDN/etc). I verified with tcpdumps that if I use just the short name, the Docker DNS resolver queries my upstream nameservers for the IP, but the ping in the web UI fails, as expected, because the certificate only matches the wildcard when the FQDN is used. The same thing happens when I add .eng to the short name, and then .domain to that. But when I add the .local TLD, no DNS queries are made at all and the ping fails with a "no such host" error.

The container itself can ping the FQDN just fine and get responses. I've also tried restarting the container with the --add-host flag to force an entry into /etc/hosts on the container and it still doesn't work. Here are the log entries:

time="2019-01-18 21:57:39" level=error msg="failed connecting to:  va1-netops-netmon-01.eng.domain.local"
time="2019-01-18 21:57:39" level=error msg="error:  Get https://va1-netops-netmon-01.eng.domain.local:9096/ping?wait_for_leader=30s: dial tcp: lookup va1-netops-netmon-01.eng.domain.local: no such host"
time="2019-01-18 21:57:39" level=error msg="failed connecting to:  va1-netops-netmon-01.eng.domain.local"
time="2019-01-18 21:57:39" level=error msg="error:  Get https://va1-netops-netmon-01.eng.domain.local:9096/ping?wait_for_leader=30s: dial tcp: lookup va1-netops-netmon-01.eng.domain.local: no such host"
time="2019-01-18 21:58:00" level=error msg="ERROR on Write batchPoint in DB snmp-east-brocade (8 points) | elapsed : 608.721µs | Error: Post https://va1-netops-netmon-01.eng.domain.local:9096/write?consistency=&db=snmp_brcd&precision=s&rp=30d: dial tcp: lookup va1-netops-netmon-01.eng.domain.local: no such host "
time="2019-01-18 21:58:00" level=error msg="ERROR on Write batchPoint in DB snmp-east-bip (104 points) | elapsed : 559.895µs | Error: Post https://va1-netops-netmon-01.eng.domain.local:9096/write?consistency=&db=snmp_bip&precision=s&rp=30d: dial tcp: lookup va1-netops-netmon-01.eng.domain.local: no such host "
time="2019-01-18 21:58:10" level=error msg="ERROR on Write batchPoint in DB snmp-east-brocade (8 points) | elapsed : 646.479µs | Error: Post https://va1-netops-netmon-01.eng.domain.local:9096/write?consistency=&db=snmp_brcd&precision=s&rp=30d: dial tcp: lookup va1-netops-netmon-01.eng.domain.local: no such host "
time="2019-01-18 21:58:10" level=error msg="ERROR on Write batchPoint in DB snmp-east-bip (104 points) | elapsed : 624.897µs | Error: Post https://va1-netops-netmon-01.eng.domain.local:9096/write?consistency=&db=snmp_bip&precision=s&rp=30d: dial tcp: lookup va1-netops-netmon-01.eng.domain.local: no such host "
time="2019-01-18 21:58:20" level=error msg="ERROR on Write batchPoint in DB snmp-east-brocade (8 points) | elapsed : 525.697µs | Error: Post https://va1-netops-netmon-01.eng.domain.local:9096/write?consistency=&db=snmp_brcd&precision=s&rp=30d: dial tcp: lookup va1-netops-netmon-01.eng.domain.local: no such host "
time="2019-01-18 21:58:20" level=error msg="ERROR on Write batchPoint in DB snmp-east-bip (101 points) | elapsed : 537.054µs | Error: Post https://va1-netops-netmon-01.eng.domain.local:9096/write?consistency=&db=snmp_bip&precision=s&rp=30d: dial tcp: lookup va1-netops-netmon-01.eng.domain.local: no such host "
time="2019-01-18 21:58:30" level=error msg="ERROR on Write batchPoint in DB snmp-east-brocade (8 points) | elapsed : 337.574µs | Error: Post https://va1-netops-netmon-01.eng.domain.local:9096/write?consistency=&db=snmp_brcd&precision=s&rp=30d: dial tcp: lookup va1-netops-netmon-01.eng.domain.local: no such host "
time="2019-01-18 21:58:30" level=error msg="ERROR on Write batchPoint in DB snmp-east-bip (7 points) | elapsed : 344.663µs | Error: Post https://va1-netops-netmon-01.eng.domain.local:9096/write?consistency=&db=snmp_bip&precision=s&rp=30d: dial tcp: lookup va1-netops-netmon-01.eng.domain.local: no such host "

Here are the results from me entering the container and trying to ping the influx relay:

jjensen@va1-netops-bastion-01:~$ docker exec -it snmpcol_vadc /bin/sh
/opt/snmpcollector # ping va1-netops-netmon-01.eng.domain.local
PING va1-netops-netmon-01.eng.domain.local (10.15.2.207): 56 data bytes
64 bytes from 10.15.2.207: seq=0 ttl=63 time=0.460 ms
64 bytes from 10.15.2.207: seq=1 ttl=63 time=0.350 ms
64 bytes from 10.15.2.207: seq=2 ttl=63 time=0.473 ms
^C
--- va1-netops-netmon-01.eng.domain.local ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.350/0.427/0.473 ms

For further testing, I went ahead and spun up a golang docker container running on alpine:3.8 (that's also connected to this same user-defined network to eliminate my Docker setup from the equation) and did a contrived example of using a ping with the Influx client library:

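package main

import (
    "fmt"
    "log"
    "net/url"

    client "github.com/influxdata/influxdb1-client"
)

func main() {
    fmt.Println("hello world\n")
    ExampleClient_Ping()
}

// ExampleClient_Ping pings the Influx relay by its .local FQDN, which
// forces the Go runtime to resolve the hostname before connecting.
func ExampleClient_Ping() {
    // Plain HTTP is enough here: the failure under test is name resolution, not TLS.
    host, err := url.Parse(fmt.Sprintf("http://%s:%d", "va1-netops-netmon-01.eng.domain.local", 9096))
    if err != nil {
        log.Fatal(err)
    }
    con, err := client.NewClient(client.Config{URL: *host})
    if err != nil {
        log.Fatal(err)
    }

    // Ping returns the round-trip duration and the server's reported version.
    dur, ver, err := con.Ping()
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("Happy as a hippo! %v, %s", dur, ver)
}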

Researching how Go's net package performs name resolution, I found that it treats .local names specially when choosing between the C library (cgo) resolver and the pure Go resolver.

https://golang.org/pkg/net/

Specifically:

By default the pure Go resolver is used, because a blocked DNS request consumes only a goroutine,
while a blocked C call consumes an operating system thread. When cgo is available, the cgo-based
resolver is used instead under a variety of conditions: on systems that do not let programs make direct
DNS requests (OS X), when the LOCALDOMAIN environment variable is present (even if empty), when
the RES_OPTIONS or HOSTALIASES environment variable is non-empty, when the ASR_CONFIG
environment variable is non-empty (OpenBSD only), when /etc/resolv.conf or /etc/nsswitch.conf
specify the use of features that the Go resolver does not implement, and when the name being looked
up ends in .local or is an mDNS name.
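
As a rough illustration of the last condition in that excerpt, the sketch below checks whether a hostname ends in .local. The function name endsInDotLocal is hypothetical, and this is a simplified reading of the documented rule, not the net package's actual code; it ignores every other condition in the excerpt (environment variables, resolv.conf, nsswitch.conf).

```go
package main

import (
	"fmt"
	"strings"
)

// endsInDotLocal reports whether name ends in the .local TLD.
// Hypothetical helper: it approximates only the ".local" condition
// from the quoted documentation, ignoring case and a trailing root dot.
func endsInDotLocal(name string) bool {
	name = strings.TrimSuffix(strings.ToLower(name), ".")
	return strings.HasSuffix(name, ".local")
}

func main() {
	names := []string{
		"va1-netops-netmon-01.eng.domain.local",
		"va1-netops-netmon-01.eng.domain",
		"va1-netops-netmon-01",
	}
	for _, n := range names {
		fmt.Printf("%-40s ends in .local: %v\n", n, endsInDotLocal(n))
	}
}
```

This matches the tcpdump observation above: only the name carrying the .local suffix falls under the special case.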

The code above succeeds in the golang:alpine container regardless of whether I use the cgo resolver or the pure Go resolver:

/go/src/influxtest # go version
go version go1.11.4 linux/amd64

/go/src/influxtest # GODEBUG=netdns=cgo+1 go run example.go
hello world

go package net: using cgo DNS resolver
2019/01/19 18:49:12 Happy as a hippo! 3.302823ms, relay

/go/src/influxtest # GODEBUG=netdns=go+1 go run example.go
hello world

go package net: GODEBUG setting forcing use of Go's resolver
2019/01/19 18:49:20 Happy as a hippo! 4.630835ms, relay

Also of note is that snmpcollector's behavior doesn't change even if it's HTTP or HTTPS - it just refuses to resolve influx server hostnames that end in .local.

Can someone help? Thanks.

jensenja commented 5 years ago

After further testing, it seems the cause is that the Go binary is built and statically linked on a glibc system, while the Docker build script uses alpine:latest, which uses musl libc. Relevant build log entries on a test system:

go build -ldflags -w -X github.com/toni-moreno/snmpcollector/pkg/agent.Version=0.8.0 -X github.com/toni-moreno/snmpcollector/pkg/agent.Commit=735ca2b -X github.com/toni-moreno/snmpcollector/pkg/agent.BuildStamp=1542215298 -linkmode external -extldflags -static -v -o ./bin/snmpcollector ./pkg/
# github.com/toni-moreno/snmpcollector/pkg
/tmp/go-link-932154974/000009.o: In function `unixDlOpen':
/home/jjensen/go/src/github.com/toni-moreno/snmpcollector/vendor/github.com/mattn/go-sqlite3/sqlite3-binding.c:35900: warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/tmp/go-link-932154974/000014.o: In function `_cgo_18049202ccd9_C2func_getaddrinfo':
/tmp/go-build/cgo-gcc-prolog:49: warning: Using 'getaddrinfo' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking

I changed the base OS in the container build script/Dockerfile from alpine:latest to debian:stretch and built a container from source, and the .local hostnames are resolved without issue.

Not sure what the desired fix for this should be. Either change the base container OS to one that uses glibc, or sort out the build requirements (sqlite seems to be the big one) to accommodate musl libc for running on alpine.

hyberdk commented 5 years ago

Hi jensenja,

I sorta had the same problem, so I wrote a Dockerfile that compiles the source in an alpine container and then uses the output from there to build a new image.

You are welcome to try it out: https://github.com/hyberdk/snmpcollector-docker

You can of course also just grab my image from Docker Hub using: "docker pull hyber/snmpcollector"

Esben

toni-moreno commented 5 years ago

Hi @hyberdk, I like your work a lot and would like to update our Docker image with yours. I'd like you to submit a PR yourself with this improvement.

Could you do it?

Thank you very much

jensenja commented 5 years ago

Thanks for your work on this @hyberdk - I think it would be good to merge your changes into snmpcollector. To get around my issue I just created a debian:stretch base container and installed the latest .deb package inside of it.

hyberdk commented 5 years ago

could you do it?

It's here: pull request #401

toni-moreno commented 4 years ago

Hi @jensenja, I've taken @hyberdk's PR and added some other improvements. Thank you very much, Esben!

Closed by 1910c3758917e219d3ae8c58548431e45cb527da