redgeoff / spiegel

Scalable replication and change listening for CouchDB
MIT License

JSON parse error shortly after replicator start #71

Closed ElmoVanKielmo closed 6 years ago

ElmoVanKielmo commented 6 years ago

I've got the following issue. I tried to play with the replicator: I ran spiegel --type install --url https://username:password@couchdb-url before starting the replicator, and that went through smoothly. I then created a document in the spiegel database like this:

{
  "_id": "identity-index:contact-index",
  "_rev": "2-a01874be74d08a65bfbcf9a52e959287",
  "source": "https://username@couchdb-url/identity-index",
  "target": "https://username@couchdb-url/contact-index",
  "continuous": true
}

Then I tried to run the replicator like this:

spiegel --type replicator --url='https://username:password@couchdb-url' --passwords-file=replicator-passwords.json --log-level=debug
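
For completeness, replicator-passwords.json is just the map of host names to username/password pairs that the Spiegel README describes (the host and credentials below are placeholders):

{
  "couchdb-url": {
    "username": "password"
  }
}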

The replicator then failed with the following error:

{
  "name": "spiegel",
  "hostname": "develop",
  "pid": 31363,
  "level": 60,
  "err": {
    "message": "Invalid JSON (Unexpected \"h\" at position 1 in state STOP)",
    "name": "Error",
    "stack": "Error: Invalid JSON (Unexpected \"h\" at position 1 in state STOP)\n    at Parser.proto.charError (/home/lbiernot/Develop/spiegel/node_modules/jsonparse/jsonparse.js:90:16)\n    at Parser.proto.write (/home/lbiernot/Develop/spiegel/node_modules/jsonparse/jsonparse.js:267:27)\n    at Stream.<anonymous> (/home/lbiernot/Develop/spiegel/node_modules/JSONStream/index.js:23:12)\n    at Stream.stream.write (/home/lbiernot/Develop/spiegel/node_modules/through/index.js:26:11)\n    at Request.ondata (internal/streams/legacy.js:16:26)\n    at Request.emit (events.js:164:20)\n    at IncomingMessage.<anonymous> (/home/lbiernot/Develop/spiegel/node_modules/request/request.js:1082:12)\n    at IncomingMessage.emit (events.js:159:13)\n    at IncomingMessage.Readable.read (_stream_readable.js:478:10)\n    at flow (_stream_readable.js:849:34)\n    at resume_ (_stream_readable.js:831:3)\n    at _combinedTickCallback (internal/process/next_tick.js:138:11)\n    at process._tickCallback (internal/process/next_tick.js:180:9)"
  },
  "msg": "Invalid JSON (Unexpected \"h\" at position 1 in state STOP)",
  "time": "2017-12-06T12:08:17.112Z",
  "src": {
    "file": "/home/lbiernot/Develop/spiegel/node_modules/spiegel/src/log.js",
    "line": 10,
    "func": "log.fatal"
  },
  "v": 0
}
redgeoff commented 6 years ago

Hey @ElmoVanKielmo! I've only seen this error when CouchDB gets into an error state and starts spitting out bad JSON. I've fixed it by rebooting CouchDB.

Also, just to make sure:

ElmoVanKielmo commented 6 years ago

Thanks for the reply. We're running a CouchDB 2.1.1 cluster and it doesn't report any other errors. We have multiple clients (PouchDB) syncing with the cluster seamlessly. However, I accept your explanation. The only concern is that this error causes Spiegel to terminate. So, the update-listener is mandatory for the replicator to work, right?

ElmoVanKielmo commented 6 years ago

I've discussed the issue with my peer and I need to amend my statement. If Spiegel really receives invalid JSON from CouchDB (which we have no evidence for, as according to our logs CouchDB is running seamlessly), then it should output a clear error message to stderr (or a logfile), ideally including the response body from CouchDB. The current traceback says nothing about where the "h" at position 1 came from. And the replicator definitely must not terminate in this case.

redgeoff commented 6 years ago

@ElmoVanKielmo did you try restarting CouchDB? I've typically only seen this error when CouchDB starts spitting out Erlang errors instead of JSON and a restart usually fixes it.

As for a better error, I agree it would be nice, but this is currently the only error that JSONStream gives: https://github.com/redgeoff/quelle/blob/06310550ddd53568d1813a847ac1166f41abf74c/scripts/persistent-stream-iterator.js#L86. I'd welcome a PR if you can figure out a better way to report the error :D
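
If someone wants to experiment, one option (just a sketch, not what Spiegel/quelle actually does today; the URL and changes-feed path are placeholders) is to keep the last raw chunk from CouchDB around so that a JSONStream parse error can be logged together with the payload that triggered it:

const request = require('request')
const JSONStream = require('JSONStream')

// Sketch only: remember the last raw chunk so a parse error can be reported
// together with the data that caused it.
let lastChunk = null

const req = request('https://username:password@couchdb-url/identity-index/_changes?include_docs=true')
const parser = JSONStream.parse('results.*')

req.on('data', chunk => {
  lastChunk = chunk.toString()
})

parser.on('error', err => {
  // Instead of just "Invalid JSON (Unexpected ...)", show the raw payload too
  console.error('Invalid JSON from CouchDB:', err.message, 'raw chunk:', lastChunk)
})

parser.on('data', change => {
  console.log('change', change.id)
})

req.pipe(parser)

Seeing the raw body would at least tell us whether the "h" at position 1 comes from something like an HTML error page, an Erlang error report or something else entirely.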

ElmoVanKielmo commented 6 years ago

As for the background: we are at the production stage, with a CouchDB cluster and all our other software deployed to the AWS cloud. We have ELK logging infrastructure in place together with Zabbix and Prometheus, and I'm pretty sure we would notice CouchDB errors. Nevertheless, we want to try your solution in the dev environment (which is a mirror of production with the same monitoring). That said, yes, I tried restarting CouchDB (all cluster nodes), which didn't really help with this issue. My colleague should join this thread shortly; he's a Node.js developer, so he might help with the logs.

The most important thing, however, is that the replicator must not die because of a JSON parsing error. Even if we really missed something and CouchDB really returned a bad response, I'd expect Spiegel to log the error and proceed/retry. Thank you for the quick responses.

redgeoff commented 6 years ago

If you run Spiegel via Docker then it will automatically restart and continue working (assuming the error is not permanent). If you don't want to use Docker then you can use systemd or something similar to take care of the restarts.
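
For example, a minimal systemd unit along these lines would take care of it (the ExecStart path and options are placeholders for however you invoke Spiegel):

[Unit]
Description=Spiegel replicator
After=network-online.target

[Service]
ExecStart=/usr/local/bin/spiegel --type replicator --url=https://username:password@couchdb-url --passwords-file=/etc/spiegel/replicator-passwords.json
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target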



ElmoVanKielmo commented 6 years ago

Yes, we are using Docker, but we can't take your solution with Docker Swarm as we are using Kubernetes. Therefore the first tests are being done without Docker, and once it looks promising we can create a Helm chart for Kubernetes and a Docker image with our hardening included. K8s will also restart the container; the problem is that in production, frequent restarts will trigger an alarm and wake up our 24/7 2nd-level support in the middle of the night ;)

redgeoff commented 6 years ago

Let me think about this. I think the retrying could happen in Spiegel for this error with a rather small enhancement.


redgeoff commented 6 years ago

So it turns out that Spiegel (via slouch/quelle) does retry when it receives these Invalid JSON errors: https://github.com/redgeoff/quelle/blob/master/scripts/persistent-stream.js#L102. I tried a few things, like abruptly terminating CouchDB, to try to generate the Invalid JSON error, and I cannot seem to reproduce it. Perhaps you or your colleague could trace things through to see where the error is originating? Also, I'd recommend looking at the CouchDB logs to make sure there are no errors in there.
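
The retry logic is roughly of this shape (a simplified sketch, not the actual persistent-stream code):

// Simplified sketch: whenever the underlying stream errors out or ends,
// wait a bit and then re-establish it.
async function persistentStream (connect, retryAfterMs = 1000) {
  while (true) {
    try {
      await connect() // resolves when the stream ends, rejects on a stream error
    } catch (err) {
      console.error('stream error, will retry:', err.message)
    }
    await new Promise(resolve => setTimeout(resolve, retryAfterMs))
  }
}

The idea is that any transient error, including the Invalid JSON one, just causes the stream to be re-established rather than killing the process.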

I've just released a new version of Spiegel that provides several fixes and may help to narrow down the issue, as it provides a little more detail in the logs: v0.1.0

ElmoVanKielmo commented 6 years ago

We will try the new version.

redgeoff commented 6 years ago

@ElmoVanKielmo I hope the latest version was able to solve your problem. I'm going to close this issue now as there hasn't been any further response from you. Thanks!

dmakovec commented 6 years ago

Hi @redgeoff, I'm running 1.0.3 and am seeing an issue identical to the one reported above when running in ECS with CouchDB behind an AWS ELB. I suspect this may be because the load balancer has a shortish idle timeout, so it eventually drops the connection during a long poll, which causes Spiegel to error out.

I see that you've made updates to slouch and quelle, and there's a possibility that these may help with this problem. Would it be possible to update Spiegel with your latest versions of those libraries to see if it makes a difference?

Thanks!

redgeoff commented 6 years ago

Hey @dmakovec! The latest version of Spiegel actually contains all the latest fixes in slouch and quelle.

It sounds reasonable that your load balancer could be causing these errors. I seem to remember that the biggest gotcha with the AWS ELB is that the default idle timeout is around 60 seconds, which happens to be the default timeout for a CouchDB feed. I fixed this by simply extending the AWS ELB timeout to something like 120 seconds.
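
If it helps, bumping the idle timeout on a classic ELB can be done with something like this (the load balancer name is a placeholder, and the syntax is a bit different for an ALB):

aws elb modify-load-balancer-attributes \
  --load-balancer-name my-couchdb-elb \
  --load-balancer-attributes '{"ConnectionSettings":{"IdleTimeout":120}}'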