jamiealquiza / polymur

A fast carbon-relay with live routing controls + https Graphite forwarder
MIT License

Queue capacity issue #75

Open coreyappleby opened 6 years ago

coreyappleby commented 6 years ago

I've got Polymur running in a Docker container on a Mesos/Marathon cluster but I'm running into an issue with it reaching the carbon-cache backends. It looks like the queue capacity is set to 0, even though I've tried increasing it with the option flags. Here's what I'm using to start Polymur:

/go/bin/polymur -listen-addr "0.0.0.0:2003" -stat-addr "0.0.0.0:2020" -api-addr "0.0.0.0:2030" -distribution "hash-route" -console-out -outgoing-queue-cap 8192 -incoming-queue-cap 65536

It starts up fine, accepts metrics, and the carbon-cache instance is able to register with it and seems to be working correctly. However, Polymur's output shows that the queue capacity is zero, and it drops all messages destined for the carbon-cache even before I send any metrics.

2017/12/12 03:11:44 Adding destination to connection pool: hostname.domain.com:31501
2017/12/12 03:11:44 Destination hostname.domain.com:31501 queue is at capacity (0) - further messages will be dropped

Has anyone run into this before? Am I doing something incorrectly?
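For readers unfamiliar with why a capacity of 0 would trigger this message immediately, here is a minimal sketch. It assumes the destination queue behaves like a buffered Go channel with a drop-when-full policy; the helper names `enqueue` and `full` are hypothetical, and this is not Polymur's actual code.

```go
package main

import "fmt"

// full reports whether the queue has no free buffer slots.
// With capacity 0, len(q) >= cap(q) is 0 >= 0, so this is always true.
func full(q chan string) bool {
	return len(q) >= cap(q)
}

// enqueue attempts a non-blocking send and reports whether the message
// was accepted. Hypothetical helper illustrating a "drop when full" policy.
func enqueue(q chan string, msg string) bool {
	select {
	case q <- msg:
		return true
	default:
		// No buffer space and no receiver waiting: the message is dropped.
		return false
	}
}

func main() {
	zero := make(chan string, 0)  // capacity 0: nothing can be buffered
	sized := make(chan string, 2) // capacity 2: room for two messages

	fmt.Println(cap(zero), full(zero), enqueue(zero, "a.b.c 1 1513048304"))
	fmt.Println(cap(sized), full(sized), enqueue(sized, "a.b.c 1 1513048304"))
}
```

Under these assumptions, a queue created with capacity 0 reports itself as full before any metric arrives, which would match the log line above appearing right after the destination is added.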

jamiealquiza commented 6 years ago

I haven't seen this issue but can help look into it. Are you building Polymur from the current master?

jamiealquiza commented 6 years ago

Also one thing to check into is if the POLYMUR_OUTGOING_QUEUE_CAP env var is being overwritten somehow. Take note of the envy / env flag usage: https://github.com/jamiealquiza/polymur#usage

coreyappleby commented 6 years ago

Yes, the dockerfile pulls down the latest from github to build the image.

coreyappleby commented 6 years ago

I tried overriding the POLYMUR_OUTGOING_QUEUE_CAP directly (instead of using the runtime options) but it still insists it has a queue capacity of zero. I'm honestly stumped.

Also just to confirm I ran it locally (instead of on our mesos cluster) and get the same results. Very strange.

jamiealquiza commented 6 years ago

This is weird. Also, what version of Go are you building this with? One thing I can try is cutting you a branch that reports some diagnostic info that could help figure out what's going on here.

coreyappleby commented 6 years ago

I'm using 1.8.1 in the Docker image and running it outside Docker with 1.9.2. Both cases do the same thing.

jamiealquiza commented 6 years ago

I've created a branch that prints relevant diagnostic info prefixed with 'xxx' - if you could do a build from this and share the output (if any information is considered sensitive, only the 'xxx' entries are needed): https://github.com/jamiealquiza/polymur/tree/queue-cap-test

coreyappleby commented 6 years ago

Thanks for that! Here's the output from "polymur -console-out"

XXX queue cap env var unset
2017/12/13 23:09:08 ::: Polymur :::
XXX queue cap config: 4096
2017/12/13 23:09:08 Runstats started: localhost:2020
2017/12/13 23:09:08 API started: localhost:2030
2017/12/13 23:09:08 Metrics listener started: 0.0.0.0:2003
2017/12/13 23:09:20 Registered destination hostname.domain.com:31890
2017/12/13 23:09:20 Adding destination to connection pool: hostname.domain.com:31890
XXX queue for hostname.domain.com:31890 set to 0
2017/12/13 23:09:23 Destination hostname.domain.com:31890 queue is at capacity (0) - further messages will be dropped

jlytle-interactions commented 6 years ago

UPDATE: I got it to work! Something is going on with the arguments and their order (or I am missing something). At first I thought it was the "=" between the arg and the value, but eventually I found it has something to do with the positional order of the args. Notice the two invocations below: when I put -console-out on the end, some of the args don't get picked up. So with some trial and error I got it to work.

GOOD

$ /usr/local/go/polymur/bin/polymur-gateway -key "/usr/local/go/polymur/key.pem" -cert "/usr/local/go/polymur/cert.pem" -destinations "x.x.x.x:2203"
2018/03/16 13:39:07 ::: Polymur-gateway :::
2018/03/16 13:39:07 Registered destination x.x.x.x:2203
2018/03/16 13:39:07 Adding destination to connection pool: x.x.x.x:2203
2018/03/16 13:39:08 Running API key sync
2018/03/16 13:39:08 HTTP listening on 0.0.0.0:80
2018/03/16 13:39:08 API started: localhost:2030
2018/03/16 13:39:08 Runstats started: localhost:2020
2018/03/16 13:39:08 HTTPS listening on 0.0.0.0:443
2018/03/16 13:39:08 API keys refreshed: 2 new, 0 removed
2018/03/16 13:39:13 [client xx.xx.xx.xx:36067] Recieved batch from from test-api

BAD

$ /usr/local/go/polymur/bin/polymur-gateway -key "/usr/local/go/polymur/key.pem" -cert "/usr/local/go/polymur/cert.pem" -destinations "x.x.x.x:2203" -console-out
2018/03/16 13:39:20 ::: Polymur-gateway :::
2018/03/16 13:39:20 Running API key sync
2018/03/16 13:39:20 HTTP listening on 0.0.0.0:80
2018/03/16 13:39:20 API started: localhost:2030
2018/03/16 13:39:20 Runstats started: localhost:2020
2018/03/16 13:39:20 HTTPS listening on 0.0.0.0:443
2018/03/16 13:39:20 API keys refreshed: 2 new, 0 removed

OP:

Hi, I have the same issue. Any update on what the cause is? I'm using polymur-gateway, fwiw.

$ /usr/local/go/polymur/bin/polymur-gateway -key=/usr/local/go/polymur/key.pem -cert=/usr/local/go/polymur/cert.pem -destinations=xxx.xxx.xxx.xxx:2203 -console-out
2018/03/16 12:39:07 ::: Polymur-gateway :::
2018/03/16 12:39:07 Running API key sync
2018/03/16 12:39:07 HTTP listening on 0.0.0.0:80
2018/03/16 12:39:07 API started: localhost:2030
2018/03/16 12:39:07 Runstats started: localhost:2020
2018/03/16 12:39:07 HTTPS listening on 0.0.0.0:443
2018/03/16 12:39:07 API keys refreshed: 2 new, 0 removed
2018/03/16 12:39:10 Registered destination xxx.xxx.xxx.xxx:2203
2018/03/16 12:39:10 Adding destination to connection pool: xxx.xxx.xxx.xxx:2203
2018/03/16 12:39:12 Destination xxx.xxx.xxx.xxx:2203 queue is at capacity (0) - further messages will be dropped

carbon-cache is running on the destination host, listening on 2203/tcp. On the carbon-cache host I ran tcpdump and I can see the initial hit to the port, but the error above stops it from retrying. I also see the whisper files initially created, but obviously there is no data in them.

Also, I noticed the -destinations=xxx.xxx.xxx.xxx:2203 arg doesn't seem to do anything. Even after registering the destination with the API, a subsequent process restart doesn't re-establish the destination. Is there a step to make the destinations persistent?

In fact, there are a few args that don't seem to get picked up, or at least they are not reflected in the -console-out output. Neither of these seemed to have any effect on the process:

$ /usr/local/go/polymur/bin/polymur-gateway -outgoing-queue-cap=8192 -incoming-queue-cap=65535
$ POLYMUR_GW_OUTGOING_QUEUE_CAP=8192 POLYMUR_GW_INCOMING_QUEUE_CAP=65535 /usr/local/go/polymur/bin/polymur-gateway

for reference:

$ go version
go version go1.9.2 linux/amd64

thanks!!

jamiealquiza commented 6 years ago

Hey @jlytle-interactions, thanks for that extra info! The position of the args shouldn't matter, so there's definitely still something I need to look at. Glad you found a fix, apologies for my slow response today.
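As a quick illustration of why argument order should not matter, here is a small sketch. It assumes the binaries parse arguments with Go's standard `flag` package (not confirmed in this thread); the `opts` struct and `parse` function are hypothetical and only model the two flags from the invocations above.

```go
package main

import (
	"flag"
	"fmt"
)

// opts holds the two flags used in the GOOD/BAD invocations above.
type opts struct {
	destinations string
	console      bool
}

// parse runs a fresh FlagSet over args and returns the parsed options.
func parse(args []string) (opts, error) {
	var o opts
	fs := flag.NewFlagSet("polymur-gateway", flag.ContinueOnError)
	fs.StringVar(&o.destinations, "destinations", "", "comma-delimited destinations")
	fs.BoolVar(&o.console, "console-out", false, "dump output to console")
	err := fs.Parse(args)
	return o, err
}

func main() {
	last, _ := parse([]string{"-destinations", "x.x.x.x:2203", "-console-out"})
	first, _ := parse([]string{"-console-out", "-destinations", "x.x.x.x:2203"})

	// Both orderings yield identical parsed values.
	fmt.Println(last == first)
	fmt.Printf("%+v\n", last)
}
```

If the standard parser is in use, both orderings produce the same values, so the difference seen above would more likely come from how the `-console-out` value is used after parsing than from the parsing itself.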

jlytle-interactions commented 6 years ago

no worries I appreciate any response tbh and I also want to thank you for your great work! kickA!

-Jason

fsperling commented 6 years ago

Hi, I have the same issue. The queue is full although it has 0 messages. And also the destinations parameter doesn't work - I always have to set it with putdest.

2018/10/02 17:57:30 Destination 127.0.0.1:20004 queue is at capacity (0) - further messages will be dropped

I start it like this:

$GOPATH/bin/polymur -console-out -destinations="localhost:20003" -listen-addr=0.0.0.0:20033 -metrics-flush=10 -outgoing-queue-cap=50000

I'm running golang-1.7

Also are there other commands apart from "stats" for the runstats api? The following metrics don't show up there:

polymur.incoming-queue.current-size 0 1538496248
polymur.incoming-queue.limit 50000 1538496248

Cheers, Felix

zerosoul13 commented 5 years ago

I've been having this same issue and I think I managed to get it running using Docker along with docker-compose.

docker-compose.yml

services:
    polymur:
        image: diceone/docker-polymur
        command: ["/go/bin/polymur"]
        environment:
            - POLYMUR_API_ADDR=0.0.0.0:2030
            - POLYMUR_DESTINATIONS=primary:2003,secondary:2003
            - POLYMUR_DISTRIBUTION=broadcast
            - POLYMUR_INCOMING_QUEUE_CAP=32768
            - POLYMUR_OUTGOING_QUEUE_CAP=4096
            - POLYMUR_METRICS_FLUSH=60
        ports:
            - 2222:2003
            - 2030:2030
        links:
            - graphite_primary:primary
            - graphite_secondary:secondary

    grafana:
        image: grafana/grafana
        network_mode: host

    graphite_primary:
        image: sitespeedio/graphite:1.1.3
        ports:
            - 9999:80
            - 2203:2003

    graphite_secondary:
        image: sitespeedio/graphite:1.1.3
        ports:
            - 8888:80
            - 2303:2003

By using this docker-compose file I was able to get things running smoothly. No more "Destination 127.0.0.1:2222 queue is at capacity (0) - further messages will be dropped".

Hopefully this helps someone :)