senecajs / seneca-mesh

Mesh your Seneca.js microservices together - no more service discovery!

problems with services losing connection to mesh #75

Open tswaters opened 7 years ago

tswaters commented 7 years ago

Apologies, this is a bit long, but it's the culmination of quite a lot of pain, struggling, and debugging with a pretty fatal problem.

--

We've got a bit of a problem. I've been trying to debug it myself, but I'm basically at wit's end trying to figure out what's going on. My suspicion is that it relates to having too many pins on the mesh, plus timing issues around when a service is actually ready to receive ping requests from swim: the pings arrive before the service is ready, and they fail.

When we started seeing this, the symptom was that services in the mesh would lose connections. Specifically, we have an auth service that wraps seneca-user, and on the front end an auth pattern is hit on almost every page visit. We would see the front-end service unable to find the auth pattern and fail with act_not_found. We're still under development, so the resolution has been to kill everything and start it up again.

The hope was that this was an isolated thing we would only see during development, but now we're looking to get this running outside of local developers' machines, using docker and rancher to spin up all the microservices, and we're seeing it there too: when upgrading nodes to the latest version (basically stop, pull the latest image, start) and when scaling to multiple instances.

I discovered I could see the problem clearly if I passed the following option on the base service:

seneca.use(SenecaMesh, {
  // etc...
  balance_client: {debug: {client_updates: true}}
})

With this present, we can see when nodes are added and when they are removed. When the symptoms above show up, things go, for lack of a better word, haywire. The log goes insane with several messages per second, both adds and removes, showing pins from all the microservices in the mesh.

It seems to spread like a virus, starting with one node and spreading to others. Once it gets into this state, the only way I've found to resolve it is to kill everything and start it up again. When this happens locally running everything (10 microservices), the system slows to a crawl... in rancher, nodes go unhealthy, get removed and re-added, and it's basically a perpetual reboot loop.

Our sample configuration for a service looks something like the following:

const seneca = Seneca({
  tag: '...',
  transport: {host: IP_ADDRESS},
})

seneca.use(SenecaMesh, {
  host: IP_ADDRESS,
  bases: [`${MESH_HOST_BASE}:${MESH_HOST_PORT}`],
  listen: ['array of pin strings'].map(pin => ({pin, host: IP_ADDRESS}))
})

The liberal use of IP_ADDRESS is to get it working with docker/rancher. The base node is the only one to use a host name because (for now) there is only one of them. The networking is handled by rancher here, and we get access to related services by host name. In practice this falls down when there are multiple instances of a service all using the same name (rancher appends numbers), so we just use the IP address in the config.

I've omitted pins, but in practice on each service there are 4 model:observe pins for cache clears and service startup notifications, and each service exposes ~7-9 pins in addition to this. The front-end one has an added model:observe for route:set for seneca-web. Each service has a health check setup with a pre-defined port - rancher uses this to perform health checks.
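
For illustration, the listen wiring ends up looking roughly like this; the pattern strings below are placeholders standing in for our real pins, and IP_ADDRESS / MESH_HOST_BASE / MESH_HOST_PORT are the same placeholders as in the config above:

// Placeholder pins; in reality each service has 4 model:observe pins plus ~7-9 more.
const pins = [
  'model:observe,cmd:clearCache',
  'model:observe,cmd:serviceUp',
  'domain:someDomain,cmd:*',
]

seneca.use(SenecaMesh, {
  host: IP_ADDRESS,
  bases: [`${MESH_HOST_BASE}:${MESH_HOST_PORT}`],
  // one listen entry per pin, each bound to the container's IP
  listen: pins.map(pin => ({pin, host: IP_ADDRESS}))
})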

Other notes and things I've noticed in debugging this (fruitlessly):

--

I have a feeling this can be worked around by providing options to swim by way of sneeze options, but for the life of me, I don't know what the best options are, nor do I know if this is just a stopgap until we add more pins/services and need to increase timeouts again.

tswaters commented 7 years ago

A bit of a follow-up here... I found the two things that were causing our immediate problem of being unable to boot all the services up. First, in the front-end service, we had something along the lines of the following in the startup of the application:

module.exports = app => new Promise((resolve, reject) => {
  const router = new Router()
  glob('some-files', (err, files) => {
    if (err) { return reject(err) }
    files.forEach(file => router.use(require(file)(app)))
    resolve(router)
  })
})

Changing it like so:

const Route1 = require('./some-path')
const Route2 = require('./some-path')
const Route3 = require('./some-path')
module.exports = app => new Promise((resolve, reject) => {
  const router = new Router()
  router.use(Route1(app))
  router.use(Route2(app))
  router.use(Route3(app))
  resolve(router)
})

made it function properly again. My guess is that the require calls here are blocking, and while the event loop is blocked, the service can't respond to swim in a timely fashion. The longer it stays blocked, the more likely it is that pins under this service get pinged by swim and marked as faulty.

I'm still not entirely sure why it goes into a loop once a service reaches a certain number of pins, though. Perhaps the added traffic and processing around marking nodes as failed and re-adding them is itself blocking responses to pings, so more things get marked as faulty, further propagating the problem as each service needs to receive the message telling it to remove the node.
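
One cheap way to check for that kind of blocking (not a seneca or swim feature, just a generic probe) is to watch event-loop lag and compare it against swim's ping window:

// Generic event-loop lag probe; the thresholds are arbitrary. If lag regularly
// exceeds swim's pingReqTimeout (444ms by default), pings will be missed.
const CHECK_INTERVAL = 100
let last = Date.now()
setInterval(() => {
  const now = Date.now()
  const lag = now - last - CHECK_INTERVAL
  if (lag > 200) {
    console.warn(`event loop blocked for ~${lag}ms`)
  }
  last = now
}, CHECK_INTERVAL)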

The second was with how the pins were registered. Before, we might have had something like this:

const pins = [
  {pin: 'domain:someDomain,cmd:someCmd1'},
  {pin: 'domain:someDomain,cmd:someCmd2'},
  {pin: 'domain:someDomain,cmd:someCmd3'},
  {pin: 'domain:someDomain,cmd:someCmd4'},
]

Making it like this:

const pins = [
  {pin: 'domain:someDomain,cmd:*'}
]

This makes it considerably faster to spin up, and I figure it also helps limit the number of things that swim needs to manage, and the resulting traffic storm, when a service is added to or removed from the mesh.
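
To make that concrete, here's a sketch with placeholder action names showing that the individual actions still resolve under the single wildcard pin:

// The concrete actions are still registered individually...
seneca.add('domain:someDomain,cmd:someCmd1', (msg, reply) => reply(null, {ok: 1}))
seneca.add('domain:someDomain,cmd:someCmd2', (msg, reply) => reply(null, {ok: 2}))

// ...but the mesh only advertises one wildcard pin for swim to track.
seneca.use(SenecaMesh, {
  host: IP_ADDRESS,
  bases: [`${MESH_HOST_BASE}:${MESH_HOST_PORT}`],
  listen: [{pin: 'domain:someDomain,cmd:*', host: IP_ADDRESS}]
})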

ghost commented 7 years ago

@tswaters Did some of the comments above help with issues you were having? Having a similar issue where services are not properly registering to Mesh.

tswaters commented 7 years ago

Comments? Above? I responded to my own post.

I will say the changes I've made do help quite a bit but haven't fixed it 100%. We're not in production yet and mostly focused on development, building the product. Right now we have the luxury of just rebooting everything if it goes into, as I've been affectionately calling it, the seneca mesh reboot loop.

But the point is, it does still happen occasionally... I think once I get some time to properly solve it, I will look into overriding the default sneeze options passed to swim (set via sneeze_opts in the seneca-mesh options); see the following for the defaults: https://github.com/rjrodger/sneeze/blob/master/sneeze.js#L93-L102

There's also one other option that can't currently be set -- see https://github.com/mrhooray/swim-js/issues/16 ... and hey, look at that, someone sent a PR! Hopefully that gets fixed and @rjrodger bumps the fixed swim dependency in sneeze.

If you're experiencing this issue, this very specific issue, I'd first look to ensure you don't have any stray blocking or otherwise cpu-intensive tasks that can cause the microservice to fail to respond to polls in a timely fashion. I mean, the services do work most of the time, get properly registered, and everything works... swimmingly. It's only sometimes that things go haywire.

The default timeout to get marked as suspect is controlled by pingReqTimeout, which is 444ms, so the process needs to receive the request and respond, and the process that sent the poll needs to receive the response within that window... or the node is marked as suspect. If you can avoid the cause, it should go away. I'm almost tempted to bump it to something like 2 seconds across all services to see if that fixes it 100%.

Still not sure why it goes into a tailspin if one service goes unhealthy.... that doesn't seem right at all. I'm going to have to read that swim paper again to figure out what disseminationFactor is.

otaviosoares commented 7 years ago

@tswaters I have similar issues here. I've also opened an issue https://github.com/senecajs/seneca-mesh/issues/48

I thought I had solved it by using multiple base services and a consul registry, but it's still happening.

Right now we have the luxury of just rebooting everything if it goes into, as I've been affectionately calling it, the seneca mesh reboot loop.

Yeah, I know your pain. I've created a fixmesh bash script that recreates all services. We're in production and I hope we can solve this as soon as possible.

MikeLindenau commented 7 years ago

@tswaters you using transport type web (default)?

tswaters commented 7 years ago

Yes, we are using web....

We actually tried to deploy a slew of application changes to our shared dev environment for the end of sprint today, and everything blew up -- so the priority of getting it fixed grew a few sizes.

We tried adding the following to the seneca-mesh options -- bumping up the timeouts that swim uses, and this appears to have fixed it:

sneeze: {swim: {
  interval: 500,
  joinTimeout: 2000,
  pingTimeout: 2000,
  pingReqTimeout: 2000
}}
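
For completeness, here's roughly where that block sits in the mesh options, combined with the sample config from earlier in the thread (a sketch, not our exact setup):

seneca.use(SenecaMesh, {
  host: IP_ADDRESS,
  bases: [`${MESH_HOST_BASE}:${MESH_HOST_PORT}`],
  listen: pins.map(pin => ({pin, host: IP_ADDRESS})),
  // swim's timing options are passed through sneeze
  sneeze: {
    swim: {
      interval: 500,
      joinTimeout: 2000,
      pingTimeout: 2000,
      pingReqTimeout: 2000
    }
  }
})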

We do still get the occasional add/remove showing up in the base node's log when upgrading services, but the whole mesh doesn't go into a tailspin anymore, so that's nice.

I'm going to keep this ticket open for now... I really think seneca-mesh should ship with higher default swim/sneeze timeouts.

otaviosoares commented 7 years ago

@tswaters is this workaround still working?

Tks

tswaters commented 7 years ago

Yes it appears to be working quite well.

One thing to note is that joinTimeout adds a delay when joining. If you have a lot of mesh pins, it's probably best to crank it down to 0 for development... I'm unsure whether that affects the health of the mesh as a whole.
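
In other words, something like this for local development (same option shape as above; whether a zero join timeout has side effects on the mesh is an open question):

// Development-only sketch: join immediately, keep the generous ping timeouts.
seneca.use(SenecaMesh, {
  host: IP_ADDRESS,
  bases: [`${MESH_HOST_BASE}:${MESH_HOST_PORT}`],
  sneeze: {swim: {joinTimeout: 0, pingTimeout: 2000, pingReqTimeout: 2000}}
})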

rjrodger commented 7 years ago

@tswaters @otaviosoares @MikeLindenau I'm updating seneca-mesh today with the latest version of swim-js. The timeout approach is indeed the correct approach - swim needs to be calibrated to your network.

Also, you may have hit the datagram max size if you had a lot of pins.

I added a monitoring option (see the README). It's meant for dev, but would work in production too for short tests; it will show you directly if services are "flapping".
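
A minimal sketch of enabling it, assuming the flag is the monitor option from the README (check the README for the exact name and output):

// Assumed option name; periodically reports the mesh members this node can
// see, which makes "flapping" services visible.
seneca.use(SenecaMesh, {
  host: IP_ADDRESS,
  bases: [`${MESH_HOST_BASE}:${MESH_HOST_PORT}`],
  monitor: true
})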

I've hit this issue myself in a different way - mismatched hostnames, so it does need more care and attention from seneca-mesh.

Could I ask for suggestions on what would make this easier to debug?

tswaters commented 7 years ago

I'll have to take a look at this monitoring option - that looks incredibly useful, thanks @rjrodger

In terms of debugging, I would say: when a service loses a node in the mesh, log a warning. Without balance_client: {debug: {client_updates: true}} in the options, there is no way to see that something is wrong unless you're looking at top and seeing the CPU pinned, or actions start returning act_not_found... client_updates can be a bit chatty, and I really only have it on the base node. It would be nice to see specifically "mesh pin xxx has been marked faulty" on the node that marked it faulty, without passing additional options.

Also, docs and/or best practices would be incredibly helpful... maybe a flashing, blinking marquee on the readme that says "if you use this in a non-trivial project, you will need to configure swim for your network". As it stands, there's no information about sneeze/swim opts. I had to dig into the code to figure out that I could even pass options along to sneeze/swim, then dig further to find out what the options were... and even further into the swim paper to find out what it all meant. Maybe an advanced "swim for dummies" page on the wiki or something.

rjrodger commented 7 years ago

@tswaters Good ideas! Leaving this issue open as important.

vforv commented 7 years ago

I have the same problem when deploying services into docker swarm. It works for a few minutes and after that I cannot access the service through the API.

danielo515 commented 6 years ago

In my case I can't even make my services communicate. Does anyone have a working example with rancher?

danielo515 commented 6 years ago

Does nobody have any guidance or insight? I don't want to open a new issue.

otaviosoares commented 6 years ago

@danielo515 Please give some code examples of what you're trying to do. I might be able to help.

tswaters commented 6 years ago

Rancher, eh? We used to use seneca-mesh in a rancher environment across 3 hosts. We found one day after we updated rancher that cross host communication stopped working. This was from 1.3 to 1.4 I think... almost a year ago now.

For some reason the UDP packets that swim was sending out weren't getting received on the target nodes -- they would get marked unhealthy and everything went sour. I have a feeling it was the network overlays being weird and blocking communication, but I wasn't able to figure it out in the end.

We rolled that rancher update back (rebuilt the entire environment) and proceeded to drop seneca-mesh from the code. We switched to amqp for consume and a redis transport for observe. This wasn't a small change by any stretch; it introduced new infrastructure pieces and isn't without its own problems... but it seems to work.
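
For anyone curious, that kind of replacement setup looks roughly like the following; the plugin names and option shapes are my best guess at a minimal sketch, not our exact code:

const Seneca = require('seneca')
const seneca = Seneca()

// queue-based transport for request/response ("consume") patterns
seneca.use('seneca-amqp-transport')
// pub/sub transport for fire-and-forget ("observe") patterns
seneca.use('seneca-redis-pubsub-transport')

seneca.listen({type: 'amqp', pin: 'domain:someDomain,cmd:*'})
seneca.client({type: 'amqp', pin: 'domain:otherDomain,cmd:*'})

seneca.listen({type: 'redis', pin: 'model:observe'})
seneca.client({type: 'redis', pin: 'model:observe'})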

danielo515 commented 6 years ago

Hello @otaviosoares and @tswaters ,

I collected my problems and findings on this stack overflow question: https://stackoverflow.com/questions/50930996/using-seneca-mesh-on-rancher

In any case, here is my mesh configuration, which is specifically tuned to see SWIM traces:

{
    pins: ['role:mesh,cmd:test'],
    host: '10.58.58.58',
    isbase: true,
    port: '39999',
    bases: ['10.40.40.1:39999', 'base-hostname:39001'],
    stop: false,
    balance_client: { debug: { client_updates: true } },
    jointime: 2000,
    sneeze:
    {
        silent: false,
        swim: { joinTimeout: 2777, pingTimeout: 2444, pingReqTimeout: 2333 }
    },
    discover:
    {
        custom: { active: true, find: dnsSeed },
        multicast: { active: false },
        registry: { active: false }
    }
}

@tswaters I think we run rancher 1.6 and the network in the containers is configured as managed. In any case, the packets are reaching the bases; as I stated in the SO question, I can see the base receive the join requests from the nodes, but for some reason the nodes never stop sending join requests, at some point they start sending remove requests, and in the end they die from a timeout. Funnily enough, if I tag all the microservices as isbase:true they join the mesh correctly, so it shouldn't be a networking or communication problem, but some kind of timeout issue or something like that.

danielo515 commented 6 years ago

I finally came to a solution, though after trying so many different configurations I'm not sure which one was the key. I'm almost sure that providing a custom function for base-discovery is what did the trick. So make sure to:

That did the trick for me: services are stable and communication works properly.
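
In case it helps someone, a custom base-discovery function of the kind mentioned above might look roughly like this; the (seneca, options, bases, next) signature is my assumption about the hook, and the DNS name and port are made up:

const dns = require('dns')

// Hypothetical dnsSeed: resolve the base service's DNS name to IPs and hand
// the resulting "ip:port" list back to seneca-mesh as the bases to join.
function dnsSeed (seneca, options, bases, next) {
  dns.resolve4('base.mesh.internal', (err, addresses) => {
    if (err || !addresses || addresses.length === 0) {
      return next(bases) // fall back to whatever bases were configured
    }
    next(addresses.map(ip => `${ip}:39999`))
  })
}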

danielo515 commented 6 years ago

I just opened an issue against the underlying swim library. It uses UDP for hello signaling, and that protocol has its limitations. Currently our hello messages are bigger than the allowed MTU, and because of that some of them are dropped, while others are split and some chunks get lost.
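
A rough way to sanity-check whether the pin metadata alone is likely to blow past a typical ~1500 byte MTU (placeholder values; the real sneeze/swim payload carries extra framing on top of this):

// Substitute the service's real pin list and bind address.
const IP_ADDRESS = '10.0.0.5'
const pins = ['domain:someDomain,cmd:*', 'model:observe,cmd:clearCache']

const listen = pins.map(pin => ({pin, host: IP_ADDRESS}))
const approxBytes = Buffer.byteLength(JSON.stringify(listen))
console.log(`~${approxBytes} bytes of pin metadata; a typical MTU is ~1500 bytes`)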

Is there a way to use seneca-mesh with another discovery mechanism? Maybe a centralized message broker like redis and then just use direct communication? Or maybe the other transports work that way?

Regards