google / cadvisor

Analyzes resource usage and performance characteristics of running containers.

CoreOS cAdvisor 0.4.1 Influxdb Storage not working #271

Closed udomsak closed 9 years ago

udomsak commented 10 years ago

I found an issue with cAdvisor 0.4.1 on CoreOS: when used with a storage backend like InfluxDB, it can't send the data to InfluxDB.

InfluxDB itself is working fine.

My running environment

The error log said

E1015 10:46:18.415676 00001 memory.go:109] failed to write stats to influxDb - Post http://127.0.0.1:8086/db/cadvisor/series?u=root&p=root&time_precision=u: dial tcp 127.0.0.1:8086: connection refused

E1015 10:47:18.448986 00001 memory.go:109] failed to write stats to influxDb - Post http://127.0.0.1:8086/db/cadvisor/series?u=root&p=root&time_precision=u: dial tcp 127.0.0.1:8086: connection refused
xiangflytang commented 10 years ago

I met the same problem. Did you create the "cadvisor" database manually?

xiangflytang commented 10 years ago

If this works:

curl --noproxy localhost -X POST -d '[{"name":"foo","columns":["val"],"points":[[23]]}]' 'http://localhost:8086/db/mydb/series?u=root&p=root'

then maybe the problem comes from a proxy.

andrewwebber commented 10 years ago

I have the same issue. Even hosting an InfluxDB on the same host results in the same error. I am not able to run the /storage/influxdb tests as I can't build the project:

can't load package: package code.google.com/p/go.exp/inotify

vmarmol commented 10 years ago

@andrewwebber are you using godep for the build?

andrewwebber commented 10 years ago

No, I presume I should be? I just do a go get.

vmarmol commented 10 years ago

Yes, we moved to use godep for building and testing so you'll need to:

godep go build github.com/google/cadvisor
andrewwebber commented 10 years ago

I noticed inotify has only _linux Go files. Does this mean I can't build on my Mac?

I assumed the InfluxDB storage tests would not be Linux-specific.

vmarmol commented 10 years ago

Hmmm that may be true as inotify is Linux only. cAdvisor depends on that functionality so it probably is failing to build for the test.

We really need to beef up the documentation around building cAdvisor. I'll try to get a page together today.

andrewwebber commented 10 years ago

The fact is, I believe it is important to test the version of the InfluxDB Go dependency against the latest version of InfluxDB, which is what I am trying to achieve. I will create a Go program with the latest version of the InfluxDB client library and try to execute a POST from within a Docker container on a CoreOS machine; this would validate whether the library works. Then I'll do the same test with the version of the library specified in godep. This should help determine whether it is a library version issue or a networking issue.
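A minimal sketch of such a probe, using only the standard library so it is independent of any client-library version. The host, database name, and credentials here are assumptions copied from the curl examples and log lines earlier in this thread, and `buildSeriesPayload` is a helper of my own, not part of any InfluxDB client:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// series mirrors the InfluxDB 0.8 write format seen in this thread:
// [{"name":"foo","columns":["val"],"points":[[23]]}]
type series struct {
	Name    string      `json:"name"`
	Columns []string    `json:"columns"`
	Points  [][]float64 `json:"points"`
}

// buildSeriesPayload encodes one series in the 0.8 JSON wire format.
func buildSeriesPayload(name string, columns []string, points [][]float64) ([]byte, error) {
	return json.Marshal([]series{{Name: name, Columns: columns, Points: points}})
}

func main() {
	payload, err := buildSeriesPayload("foo", []string{"val"}, [][]float64{{23}})
	if err != nil {
		panic(err)
	}
	// Same endpoint the failing cAdvisor log lines show.
	url := "http://localhost:8086/db/cadvisor/series?u=root&p=root&time_precision=u"
	resp, err := http.Post(url, "application/json", bytes.NewReader(payload))
	if err != nil {
		fmt.Println("write failed:", err) // e.g. "connection refused"
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```

If this succeeds from inside a container where cAdvisor fails, the problem is more likely the client-library version; if it fails the same way, it points at networking.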

alessioguglielmo commented 10 years ago

... maybe the solution for my post?

jwalczyk commented 10 years ago

I'm new to both cadvisor and influxdb; also having this problem. On a Mac, I run google/cadvisor:0.5.0 in a docker container - Docker version 1.3.1, build 4e9bbfa

$ docker logs shows:

E1107 09:09:48.189645 00001 memory.go:109] failed to write stats to influxDb - Post http://localhost:8086/db/cadvisor/series?u=root&p=root&time_precision=u: dial tcp 127.0.0.1:8086: connection refused

This works fine:

curl --noproxy localhost -X POST -d '[{"name":"foo","columns":["val"],"points":[[123]]}]' 'http://localhost:8086/db/cadvisor/series?u=root&p=root'

I can view the metrics in the InfluxDB GUI via 'select * from foo', etc. @xiangflytang: what proxy are you referring to?

vmarmol commented 10 years ago

@jwalczyk do you see this consistently? As in, every 1s or 60s, or every now and then? Do you see the data for the time period on or before 1107 09:09:48.189645? It may just have dropped that request.

andrewwebber commented 10 years ago

It is consistent, because there is never any data logged to InfluxDB.

andrewwebber commented 10 years ago

My workaround is now to mine cAdvisor the way Heapster does, but implement a new sink in Heapster that logs to Logstash.

vishh commented 10 years ago

Is InfluxDB running on your host outside of Docker? Since cAdvisor is running in a separate network namespace, it will not be able to connect to applications running in the host network namespace. You can solve this by passing the '--net=host' option as part of docker run:

docker run --volume=/:/rootfs:ro --volume=/var/run:/var/run:rw --volume=/sys:/sys:ro --volume=/var/lib/docker/:/var/lib/docker:ro --publish=8080:8080 --detach=true --name=cadvisor --net=host google/cadvisor:latest -storage_driver=influxdb --logtostderr

jwalczyk commented 10 years ago

Thank you for the comments! I got it working; it was a silly mistake. @vishh got it: I had to change localhost to my host IP during cAdvisor container startup, with --storage_driver_host="my_host_ip:8086". I then tested this by running nc -kl 8086 on my host and saw all the nice data dumped. Thanks!

andrewwebber commented 10 years ago

Can confirm I am now seeing data; however, this is all running on one host (boot2docker).

vishh commented 10 years ago

@andrewwebber: Where are you running InfluxDB? Chatting over IRC (#google-containers) might be faster than going back and forth here on GitHub.

andrewwebber commented 10 years ago

@vishh Thanks for your support here. I have a Kubernetes CoreOS cluster and got this working with the following fleet systemd unit files.

Unfortunately the documentation for grafana did not work for me, as it insisted on looking for a metadata endpoint for influxdb (https://github.com/GoogleCloudPlatform/heapster/blob/master/influx-grafana/grafana/set_influx_db.sh#L11) and elasticsearch (https://github.com/GoogleCloudPlatform/heapster/blob/master/influx-grafana/grafana/set_elasticsearch.sh#L14). After deleting these and manually loading the kubernetes dashboard, everything worked (after editing a couple of graphs that were looking for a 'machines' time series where only 'stats' existed).

grafana (manually)

docker run -i -t --rm -p 80:80 -e INFLUXDB_HOST=192.168.89.161 -e INFLUXDB_PORT=8086 -e INFLUXDB_NAME=k8s -e INFLUXDB_USER=root -e INFLUXDB_PASS=root tutum/grafana

cadvisor (globally deployed)

[Unit]
Description=cAdvisor Service
After=docker.service
Requires=docker.service

[Service]
TimeoutStartSec=10m
Restart=always
ExecStartPre=-/usr/bin/docker kill cadvisor
ExecStartPre=-/usr/bin/docker rm -f cadvisor
ExecStartPre=/usr/bin/docker pull google/cadvisor
ExecStart=/usr/bin/docker run --volume=/:/rootfs:ro --volume=/var/run:/var/run:rw --volume=/sys:/sys:ro --volume=/var/lib/docker/:/var/lib/docker:ro --publish=4194:4194 --name=cadvisor --net=host google/cadvisor:latest --logtostderr --port=4194
ExecStop=/usr/bin/docker stop -t 2 cadvisor

[X-Fleet]
Global=true
MachineMetadata=role=kubernetes

Influxdb

[Unit]
Description=InfluxDB Service
After=docker.service
Requires=docker.service

[Service]
TimeoutStartSec=10m
Restart=always
ExecStartPre=-/usr/bin/docker kill influxdb
ExecStartPre=-/usr/bin/docker rm -f influxdb
ExecStartPre=/usr/bin/docker pull kubernetes/heapster_influxdb
ExecStart=/usr/bin/docker run --name influxdb -p 8083:8083 -p 8086:8086 -p 8090:8090 -p 8099:8099 kubernetes/heapster_influxdb
ExecStop=/usr/bin/docker stop -t 2 influxdb

Heapster agent (buddy)

[Unit]
Description=Heapster Agent Service
After=docker.service
Requires=docker.service

[Service]
TimeoutStartSec=10m
Restart=always
ExecStartPre=-/usr/bin/mkdir -p /home/core/heapster
ExecStartPre=-/usr/bin/docker kill heapster-agent
ExecStartPre=-/usr/bin/docker rm -f heapster-agent
ExecStartPre=/usr/bin/docker pull vish/heapster-buddy-coreos
ExecStart=/usr/bin/docker run --name heapster-agent --net host -v /home/core/heapster:/var/run/heapster vish/heapster-buddy-coreos
ExecStop=/usr/bin/docker stop -t 2 heapster-agent

[X-Fleet]
MachineOf=influxdb.service

Heapster

[Unit]
Description=Heapster Agent Service
After=docker.service
After=heapster-agent.service
Requires=docker.service
Requires=heapster-agent.service

[Service]
TimeoutStartSec=10m
Restart=always
ExecStartPre=-/usr/bin/docker kill heapster
ExecStartPre=-/usr/bin/docker rm -f heapster
ExecStartPre=/usr/bin/docker pull vish/heapster
ExecStart=/usr/bin/docker run --name heapster --net host -e INFLUXDB_HOST=127.0.0.1:8086 -v /home/core/heapster:/var/run/heapster vish/heapster
ExecStop=/usr/bin/docker stop -t 2 heapster

[X-Fleet]
MachineOf=heapster-agent.service

At the moment I'm being lazy and don't care if my heapster agents move around the cluster.

vishh commented 10 years ago

Awesome. Thanks for the write up. Splitting up grafana into a separate Pod is something I have been meaning to do, but there is no native support for external IPs in Kubernetes yet sadly. Your idea of manually configuring grafana sounds good for the short term.

Recent versions of heapster export a new table, 'machine', which contains all the root cgroup stats, so the grafana dashboard should work for you as-is with the latest version.

If you are running heapster in Kubernetes, you don't have to run the heapster-buddy container, unless you run kubernetes only on a subset of machines in your CoreOS cluster.


MaheshRudrachar commented 10 years ago

@andrewwebber: The Heapster Agent and Heapster services get restarted abruptly when I follow the above units using fleet. I am running all the units on one CoreOS instance, which is configured as a cluster in Kubernetes. Could you explain whether we need different CoreOS instances, or whether all the units can run on one CoreOS instance?

Thanks in advance....

MaheshRudrachar commented 10 years ago

@andrewwebber: BTW, I am running Kubernetes cluster with CoreOS on EC2 instance...

vishh commented 10 years ago

@MaheshRudrachar: Can you try using the 'kubernetes/heapster' image instead of 'vish/heapster'?

Heapster

[Unit]
Description=Heapster Agent Service
After=docker.service
After=heapster-agent.service
Requires=docker.service
Requires=heapster-agent.service

[Service]
TimeoutStartSec=10m
Restart=always
ExecStartPre=-/usr/bin/docker kill heapster
ExecStartPre=-/usr/bin/docker rm -f heapster
ExecStartPre=/usr/bin/docker pull kubernetes/heapster
ExecStart=/usr/bin/docker run --name heapster --net host -e INFLUXDB_HOST=127.0.0.1:8086 -v /home/core/heapster:/var/run/heapster kubernetes/heapster
ExecStop=/usr/bin/docker stop -t 2 heapster

[X-Fleet]
MachineOf=heapster-agent.service

andrewwebber commented 10 years ago

With respect to the discovery issue of the influxdb database for grafana, I am thinking about the following approaches.

Option 1:

Option 2:

Option 3:

andrewwebber commented 10 years ago

@MaheshRudrachar I have not specifically run into any of the issues you mentioned, only indirect ones, probably because I am low on hardware in my lab environment. I have influxdb, the heapster agents, and grafana all running on the single etcd machine serving my cluster, which also runs my private docker registry container :-).

Ultimately it makes sense to split these machines up into dedicated roles to better isolate where issues originate. For example, my etcd server was suddenly unable to start docker. In reality I shouldn't care, because a production etcd cluster probably should not be running docker containers at all.

I am also running the alpha branch with automatic updates (reboot when an update is found), which doesn't always help. So my tip would be to at least move your units off the etcd servers.

Also, in my case I don't really care if my cadvisors crash, or even if my influxdb and heapsters get rebuilt and rerun. Sacred are the etcd servers: if they go down, your whole kubernetes cluster goes down too, and you need to redeploy all of your pods, replication controllers, and services.

vishh commented 10 years ago

@andrewwebber: I am exploring using a proxy as part of grafana to get to InfluxDB and ElasticSearch containers. This will help split up influxdb, elastic search and grafana. Since you mentioned that you run multiple containers on your etcd server, you could try placing resource limits on all the containers, or at least InfluxDB.

MaheshRudrachar commented 10 years ago

@andrewwebber & @vishh

Thanks for your inputs. Still not able to resolve this issue.

Here is my Kubernetes Setup Details:

I have set up 5 CoreOS instances on AWS and followed kelseyhightower's kubernetes-fleet-tutorial. Basically I have 1 dedicated etcd server, 3 minions with 1 minion acting as the API server, and 1 dedicated minion for setting up Heapster. All minions point to the dedicated etcd server.

Now, when I run the units mentioned above:

  1. cAdvisor, Influxdb & Grafana: work fine with status running.
  2. Heapster Agent: fails with the following log:

Starting Heapster Agent Service...
Pulling repository vish/heapster-buddy-coreos
Started Heapster Agent Service.
INFO log.go:73: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001:
ERROR log.go:81: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 100m
timeout reached
heapster-agent.service: main process exited, code=exited, status=1/FAILURE
Unit heapster-agent.service entered failed state.

  3. Heapster: fails with the following log:

Pulling repository kubernetes/heapster
Started Heapster Agent Service.
/usr/bin/heapster --sink influxdb --sink_influxdb_host 127.0.0.1:8086
Heapster version 0.2
Cannot stat hosts_file /var/run/heapster/hosts. Error: stat /var/run/heapster/hosts: no such file or directory
heapster.service: main process exited, code=exited, status=1/FAILURE
Unit heapster.service entered failed state.
heapster.service holdoff time over, scheduling restart.
Stopping Heapster Agent Service...
Starting Heapster Agent Service...

Need your help in resolving this. Thanks

andrewwebber commented 10 years ago

@MaheshRudrachar @vishh This is due to the fact that the heapster buddy assumes fleet is running on the host. I had this issue and therefore had to run my heapster agents on the etcd node.

https://github.com/GoogleCloudPlatform/heapster/blob/master/clusters/coreos/buddy.go#L35

I believe we need to add a flag to the buddy to parameterise the fleet server url
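A flag along these lines would do it. This is only a sketch of the idea: the flag name, the default (taken from the 127.0.0.1:4001 address in the failing log above), and the resolveFleetEndpoint helper are my own assumptions, not the buddy's actual code:

```go
package main

import (
	"flag"
	"fmt"
)

// resolveFleetEndpoint parses args and returns the fleet API endpoint,
// defaulting to the local etcd address the buddy currently assumes.
func resolveFleetEndpoint(args []string) (string, error) {
	fs := flag.NewFlagSet("heapster-buddy", flag.ContinueOnError)
	endpoint := fs.String("fleet_endpoint", "http://127.0.0.1:4001",
		"URL of the fleet/etcd API endpoint to query for machines")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *endpoint, nil
}

func main() {
	// Agents not colocated with fleet could then point at it explicitly.
	ep, err := resolveFleetEndpoint([]string{"--fleet_endpoint", "http://etcd.example:4001"})
	if err != nil {
		panic(err)
	}
	fmt.Println(ep)
}
```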

vishh commented 10 years ago

Adding a flag to the buddy sounds good. I opened https://github.com/GoogleCloudPlatform/heapster/issues/11. Lets continue the discussion there.

I made a small change to the grafana container via https://github.com/GoogleCloudPlatform/heapster/pull/10. This should make the kubernetes version of heapster work outside of GCE. Give it a try and let me know if you face any issues.


MaheshRudrachar commented 10 years ago

Thanks @vishh and @andrewwebber. I will give a try with latest version.

jzelinskie commented 9 years ago

@vmarmol To avoid confusion about building your project, you can use godep -r to rewrite your imports. That way your godep'd project can still be "go get"-able.

vmarmol commented 9 years ago

We've been considering that and it does feel simpler (we'd also not need to use godep for build or test).

vmarmol commented 9 years ago

This issue seems over and taken over by other things :) Closing. Feel free to open other issues if you run into anything.