mattweber / es2graphite

Send elasticsearch metrics to graphite
MIT License

socket.error: [Errno 32] Broken pipe #2

Open mitalhp opened 10 years ago

mitalhp commented 10 years ago

I was getting a broken pipe error, which I believe was caused by socket size limits when sending to graphite, so I changed send_to_graphite to chunk up the data, which seems to have fixed the issue. Not sure if this is the best way to handle it (it doesn't work with newer versions of the script since threading was added).

import pickle
import socket
import struct

def chunks(data, size):
    # Yield successive slices of the metric list, `size` items each
    # (Python 2, hence xrange).
    for i in xrange(0, len(data), size):
        yield data[i:i+size]

def send_to_graphite(metrics, chunksize=500):
    # `metrics` is a list of (path, (timestamp, value)) tuples; `args`
    # and `log` come from the surrounding script.
    if args.debug:
        for m, mval in metrics:
            log('%s %s = %s' % (mval[0], m, mval[1]), True)
    else:
        if chunksize:
            chunked_metrics = list(chunks(metrics, chunksize))
        else:
            # No chunking requested: send everything as a single chunk.
            chunked_metrics = [metrics]

        log('total %s chunks of %s size' % (len(chunked_metrics), chunksize))
        for c in chunked_metrics:
            # One connection per chunk keeps each pickled payload small
            # enough to avoid the broken pipe.
            log('sending chunk')
            payload = pickle.dumps(c)
            header = struct.pack('!L', len(payload))
            sock = socket.socket()
            sock.connect((args.graphite_host, args.graphite_port))
            sock.sendall('%s%s' % (header, payload))
            sock.close()
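For illustration, a call looks like this (the metric names here are made up); each element is a (path, (timestamp, value)) tuple, which is the shape Graphite's pickle receiver on port 2004 expects:

import time

metrics = [
    ('es.cluster.docs.count', (int(time.time()), 12345)),
    ('es.cluster.store.size_in_bytes', (int(time.time()), 67890)),
]
send_to_graphite(metrics, chunksize=500)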
apple-corps commented 9 years ago

I just saw the same myself with the latest version.

2015-07-07 16:03:09,444 [MainThread es2graphite.py :submi:174] [ERROR ] Communication to Graphite server failed: [Errno 32] Broken pipe

apple-corps commented 9 years ago

What's with the debug messages being URL-encoded, anyhow?

2015-07-07 16:06:09,224 [MainThread es2graphite.py :submi:175] [DEBUG ] Traceback+%28most+recent+call+last%29%3A%0A++File+%22.%2Fes2graphite.py%22%2C+line+172%2C+in+submit_to_graphite%0A++++graphite_socket%5B%27socket%27%5D.sendall%28+%22%25s%25s%22+%25+%28header%2C+payload%29+%29%0A++File+%22%2Fusr%2Flib%2Fpython2.7%2Fsocket.py%22%2C+line+228%2C+in+meth%0A++++return+getattr%28self._sock%2Cname%29%28%2Aargs%29%0Aerror%3A+%5BErrno+32%5D+Broken+pipe%0A

Ralnoc commented 9 years ago

I'll look into this. I have yet to experience the issue myself.

As to the URL encoding: I added that for the traceback output so that those messages can be sent through your standard syslog application, which would normally break up multi-line output into multiple messages. This ensures that the whole message reaches the remote destination in a usable form.
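The idea is roughly Python 2's urllib.quote_plus over the formatted traceback (the + and %28-style escapes in the output above match it), and urllib.unquote_plus reverses it:

import traceback
import urllib

try:
    raise RuntimeError('example')
except Exception:
    # Collapse the multi-line traceback into a single syslog-safe token.
    encoded = urllib.quote_plus(traceback.format_exc())

print encoded                       # Traceback+%28most+recent+call+last%29%3A%0A...
print urllib.unquote_plus(encoded)  # the original multi-line traceback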

apple-corps commented 9 years ago

@Ralnoc I think you probably haven't experienced the issue because you don't have enough elasticsearch content to need chunking. Not sure why @mitalhp's chunking will not work with threading.
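One guess, untested: the traceback earlier shows the threaded version writing to a shared socket (graphite_socket['socket']), so chunking there would either need one socket per chunk, like the snippet above, or a lock around the shared socket, roughly:

import pickle
import struct
import threading

send_lock = threading.Lock()

def send_chunk(chunk):
    # graphite_socket is the shared socket from the script; the lock keeps
    # worker threads from interleaving their header/payload writes.
    with send_lock:
        payload = pickle.dumps(chunk)
        header = struct.pack('!L', len(payload))
        graphite_socket['socket'].sendall('%s%s' % (header, payload))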

Ralnoc commented 9 years ago

@drocsid Could you detail the exact arguments you are using? What health-level? Are you using shard-stats, etc? I need to try and replicate the issue.

apple-corps commented 9 years ago

python2 ./es2graphite.py --stdout --log-level debug es.server:9200 -g graphite.server -o 2004

I also needed to comment out some lines to get the stats into my graphite dashboard. I'm also curious about the round-robin approach. It appears that all the _GET requests use the same elasticsearch host.

Ralnoc commented 9 years ago

What sections did you comment out? Also, I don't follow the question about round robin. The code is always querying the same host; each _get request is for a different stats URI.
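If the intent is to spread those requests across several nodes, that isn't implemented; a rough sketch of what it could look like (the host list here is hypothetical):

import itertools
import urllib2

# Hypothetical: rotate each stats request across a pool of ES hosts.
es_hosts = itertools.cycle(['es1:9200', 'es2:9200', 'es3:9200'])

def es_get(uri):
    host = next(es_hosts)
    return urllib2.urlopen('http://%s%s' % (host, uri)).read()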

apple-corps commented 9 years ago

There was a stack trace like:

2015-07-02 15:27:13,240 [MainThread] [ERROR   ] Traceback+%28most+recent+call+last%29%3A%0A++File+%22.%2Fes2graphite.py%22%2C+line+290%2C+in+%3Cmodule%3E%0A++++get_metrics%28%29%0A++File+%22.%2Fes2graphite.py%22%2C+line+240%2C+in+get_metrics%0A++++indices_stats_metrics+%3D+process_indices_stats%28args.prefix%2C+indices_stats%29%0A++File+%22.%2Fes2graphite.py%22%2C+line+121%2C+in+process_indices_stats%0A++++process_section%28int%28time.time%28%29%29%2C+metrics%2C+%28prefix%2C+CLUSTER_NAME%2C+%27indices%27%29%2C+stats%5B%27indices%27%5D%29%0ATypeError%3A+process_section%28%29+takes+exactly+5+arguments+%284+given%29%0A
2015-07-02 15:27:13,241 [MainThread] [INFO    ] 2015-07-02 15:27:13: GET

(decoded, the traceback ends with: TypeError: process_section() takes exactly 5 arguments (4 given))

So I had a look, and it appears the issue was coming from https://github.com/mattweber/es2graphite/blob/master/es2graphite.py#L119

I commented out the related lines to get past it...

Ralnoc commented 9 years ago

Ok. It looks like something is going on in the indices-level stat gathering. If you uncomment that section of code and just set --health-level cluster, you can run it and have it bypass that code. I'll have to run some tests and see why that issue is manifesting.
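For example, adapting the command from earlier in the thread:

python2 ./es2graphite.py --health-level cluster --stdout --log-level debug es.server:9200 -g graphite.server -o 2004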

Ralnoc commented 9 years ago

@drocsid - The issue you were experiencing is different from the one described by @mitalhp. Your issue turned out to be that the index collection call to process_section wasn't updated to the new format. That issue has been moved to #14. I'm continuing to investigate the broken pipe issue, but I have yet to run into it.

apple-corps commented 9 years ago

@Ralnoc

The broken pipe is likely due to having a large number of indices and stats in the cluster. I reused and modified some of the functions from these scripts, but hacked them heavily to create a custom-tailored graphite dashboard. I had an interest in different metrics, but this served as a good quickstart entry point for me. Unfortunately, I don't think my hacks are polished enough, but I might think about checking them in if there's any interest. Thanks.

AlexClineBB commented 8 years ago

I can confirm that the broken pipe issue is caused by a large number of indices and stats. To mitigate it, I modified the stats URL (L268) to request only the stats I needed. This reduced the size of the JSON object and fixed the timeout.
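For anyone else hitting this: the indices stats API takes a comma-separated list of metric groups in the URL path, so the narrowed request looks something like the sketch below (these groups are an example, not the exact set I used):

import urllib2

# Ask only for the stat groups you actually chart; the response JSON
# (and the pickled payload later sent to Graphite) shrinks accordingly.
stats_url = 'http://es.server:9200/_stats/docs,store,indexing,search'
indices_stats = urllib2.urlopen(stats_url).read()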