Closed by ElmoVanKielmo 6 years ago
Hey @ElmoVanKielmo! I've only seen this error when CouchDB gets into an error state and starts spitting out bad JSON. I've fixed it by rebooting CouchDB.
Also, just to make sure:
Thanks for the reply. We're running a CouchDB 2.1.1 cluster and it doesn't report any other errors. We have multiple clients (PouchDB) syncing with the cluster seamlessly. However, I accept your explanation. The only concern is that this error causes Spiegel to terminate. So the update-listener is mandatory for the replicator to work, right?
I've discussed the issue with my peer and I need to amend my statement. If Spiegel really receives invalid JSON from CouchDB (which we have no evidence of, as according to our logs CouchDB is running seamlessly), then it should output a clear error message to stderr (or a logfile), ideally including the response body from CouchDB. The current traceback says nothing about where the "h" at position 1 came from. And the replicator definitely must not terminate in this case.
@ElmoVanKielmo did you try restarting CouchDB? I've typically only seen this error when CouchDB starts spitting out Erlang errors instead of JSON and a restart usually fixes it.
As for a better error, I agree it would be nice, but this is currently the only error that JSONStream gives: https://github.com/redgeoff/quelle/blob/06310550ddd53568d1813a847ac1166f41abf74c/scripts/persistent-stream-iterator.js#L86. I'd welcome a PR if you can figure out a better way to report the error :D
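For illustration, a better report might look roughly like the following. This is just a sketch and not quelle's actual code; the URL and buffer size are placeholders.

```js
// Rough sketch (not quelle's actual code): keep a bounded copy of the raw
// response so a JSONStream parse error can be logged together with what
// CouchDB actually sent. The URL below is a made-up example.
const https = require('https')
const JSONStream = require('JSONStream')

const url = 'https://username:password@couchdb-url/db/_changes?include_docs=true'

https.get(url, res => {
  let raw = ''

  res.on('data', chunk => {
    // Keep at most ~4 KB of the raw body for diagnostics
    if (raw.length < 4096) {
      raw += chunk.toString()
    }
  })

  const parser = JSONStream.parse('results.*')

  parser.on('error', err => {
    // Much more useful than just "Invalid JSON (h at position 1)"
    console.error('JSON parse error:', err.message)
    console.error('Raw response (truncated):', raw.slice(0, 4096))
  })

  parser.on('data', change => {
    // ... process each change as usual
  })

  res.pipe(parser)
})
```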
As for the background: we are in touch with the CouchDB creators, including https://github.com/janl, who visits our company when we need him, but we already have a lot of experience with CouchDB and we know why your approach is better for our use case than the built-in replicator. However, we need something reliable. We are in production with a CouchDB cluster and all other software deployed to the AWS cloud. We have ELK logging infrastructure in place, together with Zabbix and Prometheus, and I'm pretty sure we would notice CouchDB errors. Nevertheless, we want to try your solution in a dev environment (which is a mirror of production with the same monitoring). That said, yes, I tried restarting CouchDB (all cluster nodes), which didn't really help with this issue. My colleague should join this thread shortly; he's a NodeJS developer, so he might help with the logs. The most important thing, however, is that the replicator must not die because of a JSON parsing error. Even if we really missed something and CouchDB really returned a bad response, I'd expect it to log the error and proceed/retry. Thank you for the quick responses.
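To illustrate the behaviour we'd expect, something along these lines. This is just a sketch against a hypothetical `fetchChanges()` helper, not a proposal for Spiegel's actual internals.

```js
// Sketch of the requested behaviour: log JSON parse failures and retry with
// backoff instead of exiting. fetchChanges() is a hypothetical function that
// resolves when a pass over the changes feed completes and rejects on a
// parse/network error.
async function listenForever (fetchChanges, log = console) {
  let delay = 1000
  for (;;) {
    try {
      await fetchChanges()
      delay = 1000 // reset backoff after a successful pass
    } catch (err) {
      log.error('changes feed failed, retrying in', delay, 'ms:', err.message)
      await new Promise(resolve => setTimeout(resolve, delay))
      delay = Math.min(delay * 2, 60000) // exponential backoff, capped at 1 min
    }
  }
}
```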
If you run Spiegel via Docker then it will automatically restart and continue working (assuming the error is not permanent). If you don't want to use Docker, you can use systemd or something similar to take care of the restarts.
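For example, a minimal systemd unit along these lines would keep restarting the process. The ExecStart path and arguments are illustrative; adjust them for your install.

```ini
# Illustrative unit only - adjust the path and arguments for your setup.
[Unit]
Description=Spiegel update-listener
After=network.target

[Service]
ExecStart=/usr/local/bin/spiegel --type=update-listener --url=https://username:password@couchdb-url
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```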
Yes, we are using Docker, but we can't take your solution with Docker Swarm as we are using Kubernetes. Therefore the first tests are being done without Docker, and if they're promising we can create a Helm chart for Kubernetes and a Docker image with our hardening included. K8s will also restart the container - the problem is that in production, frequent restarts will trigger an alarm, waking up our 24/7 2nd-level support in the middle of the night ;)
Let me think about this. I think the retrying could happen in Spiegel for this error with a rather small enhancement.
So it turns out that Spiegel (via slouch/quelle) does retry when it receives these Invalid JSON errors: https://github.com/redgeoff/quelle/blob/master/scripts/persistent-stream.js#L102. I tried a few things like abruptly terminating CouchDB to try to generate the Invalid JSON error and I cannot seem to reproduce it. Perhaps you or your colleague could trace things through to see where the error is originating? Also, I’d recommend looking at the CouchDB logs to make sure there are no errors in there.
I’ve just released a new version of Spiegel that provides several fixes and may just help to narrow down the issue as it provides a little more detail in the logs: v0.1.0
We will try the new version.
@ElmoVanKielmo I hope the latest version was able to solve your problem. I'm going to close this issue now as there hasn't been any further response from you. Thanks!
Hi @redgeoff, I'm running 1.0.3 and am seeing an issue identical to the one reported above when running in ECS with CouchDB behind an AWS ELB. I suspect this may be because the load balancer has a fairly short idle timeout, so it eventually drops the connection during a long poll, which results in Spiegel erroring out.
I see that you've made updates to slouch and quelle, and there's a possibility that these may help with this problem. Would it be possible to update Spiegel with your latest versions of those libraries to see if it makes a difference?
Thanks!
Hey @dmakovec! The latest version of Spiegel actually contains all the latest fixes in slouch and quelle.
It sounds reasonable that your load balancer could be causing these errors. I seem to remember that the biggest gotcha with the AWS ELB is that its default idle timeout is around 60 seconds, which happens to be the default timeout for a CouchDB feed. I fixed this by simply extending the AWS ELB timeout to something like 120 seconds.
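If it helps, the idle timeout on a classic ELB can be raised from the CLI roughly like this. The load balancer name is a placeholder, and for an ALB the attribute is `idle_timeout.timeout_seconds` instead.

```sh
# Placeholder name "my-couchdb-elb"; raises the idle timeout to 120 s so it
# outlasts CouchDB's default 60 s feed timeout.
aws elb modify-load-balancer-attributes \
  --load-balancer-name my-couchdb-elb \
  --load-balancer-attributes '{"ConnectionSettings":{"IdleTimeout":120}}'

# ALB equivalent (the ARN is a placeholder):
# aws elbv2 modify-load-balancer-attributes \
#   --load-balancer-arn <load-balancer-arn> \
#   --attributes Key=idle_timeout.timeout_seconds,Value=120
```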
I've got the following issue. I tried to play with the replicator and ran

```
spiegel --type install -url https://username:password@couchdb-url
```

before starting the replicator - that went through smoothly. I created a document in the spiegel database like this:
Then tried to run the replicator like this:
and got the following error:
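(For context, a replicator document in the spiegel database typically looks something like the following - field names as I understand them from the Spiegel README, with placeholder URLs and database names - and the replicator itself is then started with something like `spiegel --type=replicator --url=https://username:password@couchdb-url`.)

```json
{
  "type": "replicator",
  "source": "https://username:password@couchdb-url/source-db",
  "target": "https://username:password@couchdb-url/target-db"
}
```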