mitalhp opened this issue 10 years ago
I just saw the same myself with the latest version.
2015-07-07 16:03:09,444 [MainThread es2graphite.py :submi:174] [ERROR ] Communication to Graphite server failed: [Errno 32] Broken pipe
What's with the debug messages being URL-encoded, anyhow?
2015-07-07 16:06:09,224 [MainThread es2graphite.py :submi:175] [DEBUG ] Traceback+%28most+recent+call+last%29%3A%0A++File+%22.%2Fes2graphite.py%22%2C+line+172%2C+in+submit_to_graphite%0A++++graphite_socket%5B%27socket%27%5D.sendall%28+%22%25s%25s%22+%25+%28header%2C+payload%29+%29%0A++File+%22%2Fusr%2Flib%2Fpython2.7%2Fsocket.py%22%2C+line+228%2C+in+meth%0A++++return+getattr%28self._sock%2Cname%29%28%2Aargs%29%0Aerror%3A+%5BErrno+32%5D+Broken+pipe%0A

Decoded, that traceback reads:

Traceback (most recent call last):
  File "./es2graphite.py", line 172, in submit_to_graphite
    graphite_socket['socket'].sendall( "%s%s" % (header, payload) )
  File "/usr/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 32] Broken pipe
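To actually read these I've been decoding them by hand; something like this works (Python 2, and decode_log_payload is just a throwaway helper name, not part of es2graphite):

import urllib

def decode_log_payload(encoded):
    # Reverse the quote_plus encoding applied to multi-line log messages.
    return urllib.unquote_plus(encoded)

print decode_log_payload('error%3A+%5BErrno+32%5D+Broken+pipe%0A')
# error: [Errno 32] Broken pipe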
I'll look into this. I have yet to experience the issue myself.
As to the URL encoding: I added that for the traceback output so that those messages can be sent through a standard syslog application, which would normally break multi-line output into multiple messages. Encoding ensures the whole message reaches the remote destination in a usable form.
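The idea is essentially just quote_plus applied to the formatted traceback before it is logged. Roughly (a sketch of the approach, not the exact code in es2graphite):

import logging
import traceback
import urllib

def log_exception(logger):
    # Collapse the multi-line traceback into a single URL-encoded line so
    # syslog forwarders deliver it as one message instead of splitting it.
    logger.debug(urllib.quote_plus(traceback.format_exc()))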
@Ralnoc I think you probably haven't experienced the issue because you don't have enough Elasticsearch content to need chunking. Not sure why @mitalhp's chunking will not work with threading.
@drocsid Could you detail the exact arguments you are using? What health-level? Are you using shard-stats, etc.? I need to try to replicate the issue.
python2 ./es2graphite.py --stdout --log-level debug es.server:9200 -g graphite.server -o 2004
I also needed to comment out some lines to get the stats into my Graphite dashboard. I'm also curious about the round-robin approach: it appears that all the _GET requests use the same Elasticsearch host.
Which sections did you comment out? Also, I don't follow the question about round robin. The code always queries the same host; each _get request is for a different stats URI.
There was a stack trace like (URL-decoded here for readability):
2015-07-02 15:27:13,240 [MainThread] [ERROR ]
Traceback (most recent call last):
  File "./es2graphite.py", line 290, in <module>
    get_metrics()
  File "./es2graphite.py", line 240, in get_metrics
    indices_stats_metrics = process_indices_stats(args.prefix, indices_stats)
  File "./es2graphite.py", line 121, in process_indices_stats
    process_section(int(time.time()), metrics, (prefix, CLUSTER_NAME, 'indices'), stats['indices'])
TypeError: process_section() takes exactly 5 arguments (4 given)
2015-07-02 15:27:13,241 [MainThread] [INFO ] 2015-07-02 15:27:13: GET
So I had a look at this, and it looks like the issue was coming from https://github.com/mattweber/es2graphite/blob/master/es2graphite.py#L119, so I commented out the related lines...
OK, it looks like something is going on in the indices-level stat gathering. If you uncomment that section of code and just set --health-level cluster, you can run it and have it bypass that code. I'll have to run some tests and see why that issue is manifesting.
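For example, building on the command posted above (same flags, just adding the health level):

python2 ./es2graphite.py --stdout --log-level debug --health-level cluster es.server:9200 -g graphite.server -o 2004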
@drocsid - The issue you were experiencing is different from the one described by @mitalhp. Your issue turned out to be that the index-collection call to process_section wasn't updated to the new format. That issue has been moved to #14. I'm continuing to investigate the broken pipe issue, but I have yet to run into it.
@Ralnoc
The broken pipe is likely due to having a large number of indices and stats in the cluster. I re-used and modified some of the functions from these scripts, but hacked them heavily to create a custom-tailored Graphite dashboard. I was interested in different metrics, but this served as a good quick-start entry point for me. Unfortunately, I don't think my hacks are polished enough, but I might think about checking them in if there's any interest. Thanks.
I can confirm that the broken pipe issue is caused by a large number of indices and stats. To mitigate it, I modified the stats URL (L268) to request only the stats that I needed. This reduced the size of the JSON object and fixed the timeout.
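The Elasticsearch indices stats API takes a comma-separated list of metric groups in the path, so the change amounts to something like this (the metric list here is just an illustration, not the exact one I used):

import json
import urllib2

# Requesting only specific metric groups keeps the /_stats response small
# even on clusters with many indices.
url = 'http://es.server:9200/_stats/docs,store,indexing,search'
stats = json.load(urllib2.urlopen(url))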
I was getting a broken pipe error, I believe caused by socket size limits when sending to Graphite, so I changed send_to_graphite to chunk up the data, which seems to have fixed the issue. Not sure if this is the best way to handle it (it doesn't work with newer versions of the script since the threading was added).
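This isn't my exact patch, but the chunking idea looks roughly like this for Graphite's pickle receiver on port 2004 (the batch size is arbitrary):

import pickle
import struct

def send_to_graphite_chunked(sock, metrics, batch_size=500):
    # metrics is a list of (path, (timestamp, value)) tuples, the format
    # Graphite's pickle listener expects. Sending them in small batches
    # keeps each payload well under any socket buffer limits.
    for i in xrange(0, len(metrics), batch_size):
        payload = pickle.dumps(metrics[i:i + batch_size], protocol=2)
        header = struct.pack('!L', len(payload))
        sock.sendall(header + payload)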