Unable to match pattern after spawning a new base node after killing all.

senecajs / seneca-mesh

Mesh your Seneca.js microservices together - no more service discovery!

MIT License


Unable to match pattern after spawning a new base node after killing all. #74

Open varunnayal opened 7 years ago

varunnayal commented 7 years ago

Steps to reproduce:
i) Start a single base node.
ii) Register services with the network.
iii) Call any existing action that yields output.
iv) Kill the base node server and start it again.
v) Call the same action as in (iii); it now fails with the error "No matching action pattern found...".

Consider the example given in examples/20-local-dev-mesh.

To start the service we ran api-service.js first as it acts as the base node. This also opens up a hapi server on port 8000.

node api-service

Then the rest of the services, (hex|rgb)-color-service.js, were run; these register themselves with the network.

node hex-color-service

node rgb-color-service

Now a valid curl call, curl http://localhost:8000/api/color/hex?color=green, gives the correct result: {"color":"#008000","format":"hex"}.

However, if the api-service is killed and started again on the same port as before, the same curl call fails.

otaviosoares commented 7 years ago

As far as I can tell, in this example the base doesn't rejoin the existing network when it is killed and restarted.

To solve this you can run multiple base services, so if one of them is restarted it still has an entry point to the existing network.

To achieve this, you must set the bases option on the base service as well. Ramanujan has a good example of it. In production you'd want to use seneca-consul-registry to keep a record of your bases' IPs.
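As a rough illustration of the multiple-bases setup described above, here is a minimal sketch of the options each base might pass to the mesh plugin. The option names (isbase, host, port, bases) are assumed from how the Ramanujan example appears to configure seneca-mesh and may differ between versions; the addresses and the baseMeshOptions helper are hypothetical.

```javascript
// Sketch only: option names (isbase, host, port, bases) are an assumption
// based on the Ramanujan example; check the seneca-mesh version you use.
function baseMeshOptions(self, allBases) {
  return {
    isbase: true,
    host: self.host,
    port: self.port,
    // Every listed base is a possible entry point into the existing
    // network, so a restarted base can rejoin through any surviving one.
    bases: allBases.slice(),
  }
}

// Hypothetical addresses for two bases running on one machine.
const BASES = ['127.0.0.1:39000', '127.0.0.1:39001']

const baseA = baseMeshOptions({ host: '127.0.0.1', port: 39000 }, BASES)
const baseB = baseMeshOptions({ host: '127.0.0.1', port: 39001 }, BASES)

console.log(baseA)
console.log(baseB)

// With seneca and seneca-mesh installed, each base process would then be
// started roughly like this (not executed here):
//   require('seneca')().use('mesh', baseA)
```

Both bases receive the same list, so whichever one is restarted still knows the address of the other.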

varunnayal commented 7 years ago

Hi @otaviosoares, thanks for the insight. I am aware of the approach of using multiple bases. The problem I reported is a generalisation of that scenario: all N base nodes of a system go down, and restarting one or all of them should keep the system running.

Though this has a vanishingly small probability of occurring, I wanted to confirm whether it is a known case.

rjrodger commented 7 years ago

@varunnayal @otaviosoares yeah this is expected behavior - the new base has no knowledge of the old network, so you end up with split brain.

In production, you need to run multiple bases to have redundancy against this. In the https://github.com/senecajs/ramanujan example I use two, and they know about each other from hard-coded config.

In the more general production case, a base node joining a network should always be given the location of another node so it can join the current network. If you lose all your base nodes, then you'll need to start at least one that points at a worker node as if it were a base.
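The recovery rule above can be sketched as a small helper that decides which addresses a restarted base should be told to join. This is only an illustration: joinTargets and recoveryBaseOptions are hypothetical names, the addresses are made up, and the isbase/bases option names are assumed rather than taken from this thread. How you learn a surviving worker's address (logs, a registry, fixed ports) is left out.

```javascript
// Sketch only: helper names and addresses are hypothetical.
// Prefer surviving bases; if none are left, fall back to one surviving
// worker and treat it as if it were a base.
function joinTargets(liveBases, liveWorkers) {
  if (liveBases.length > 0) return liveBases.slice()
  return liveWorkers.slice(0, 1)
}

// Options for the base being (re)started. The isbase and bases option
// names are assumptions about the seneca-mesh plugin.
function recoveryBaseOptions(self, liveBases, liveWorkers) {
  return {
    isbase: true,
    host: self.host,
    port: self.port,
    bases: joinTargets(liveBases, liveWorkers),
  }
}

// All bases were lost, but two workers are still running.
const lostAllBases = recoveryBaseOptions(
  { host: '127.0.0.1', port: 39000 },
  [],
  ['127.0.0.1:50001', '127.0.0.1:50002']
)
console.log(lostAllBases.bases) // one worker address, used as the entry point

// One base survived, so the restarted base joins through it instead.
const oneBaseLeft = recoveryBaseOptions(
  { host: '127.0.0.1', port: 39000 },
  ['127.0.0.1:39001'],
  ['127.0.0.1:50001']
)
console.log(oneBaseLeft.bases)
```

If both lists are empty the helper returns an empty list, which corresponds to the situation in the original report: the new base has nothing to join and starts a separate network.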

I'll make a note to write a test case for this.