kalisio / feathers-distributed

Distribute your Feathers services as microservices
MIT License

Infinite event propagation #84

Closed: Elytum closed this issue 3 years ago

Elytum commented 3 years ago

Feathers-distributed doesn't seem to work when services share the same name, which can happen because of a redundantly exposed authentication service (for example when using the RS256 algorithm), horizontal scaling, ...

How to reproduce:

Run the example with the following setup (Export DEBUG as "feathers-distributed" to explicitly show the issue):

Gateway on port 3030
Service on port 3031
Service on port 3032

1) Create a user (POST http://localhost:3030/users)
2) Get a token (POST http://localhost:3030/authentication)
3) Create a todo (POST http://localhost:3030/todos)

You should see in the logs of all three servers a recursive series of "Publishing created local service" then "Dispatching created remote service", while your RAM and CPU die a slow and painful death.

claustres commented 3 years ago

Which versions of Node, feathers-distributed, etc. are you using? Indeed the example seems to work fine on my setup. Moreover, we are using feathers-distributed in production with multiple replicas of the same services successfully and I know that some others also succeeded with this kind of setup.

To make this work fine whenever your services restart you might need to fine-tune the setup, see https://github.com/kalisio/feathers-distributed/issues/70. We also did not yet succeed in distributing the user service, see https://github.com/kalisio/feathers-distributed/issues/12, any help is welcome.

Elytum commented 3 years ago

Yes, I guess it's a dependency issue, but the thing is I tried to only run "npm ci" to use the versions specified in the repository. Also, I tested the project a while ago and everything was fine (even though I didn't look for debug logs). I also suspect a configuration issue on my side, but I fail to see what and why.

Here's my setup:

To import the examples, I'm just cloning the git project and going into the "example" directory to launch the apps manually. I'm running "npm ci" to install the versions specified in the lock files: once for the library itself in the main directory, once in "gateway" and once in "service".

Here are the debug logs (obviously, each file has been trimmed to 100 lines, the loop being able to go on indefinitely): For gateway: https://pastebin.com/GSC4NLzF For service 0: https://pastebin.com/Ntxagd2j For service 1: https://pastebin.com/QTGrT5Nw

I continue to investigate the issue. I was also looking into the user distribution issue, I'll try to give a hand soon.

claustres commented 3 years ago

Are you using Docker or a specific network setup? Indeed, we configure cote by default to perform port scanning starting at 10000. If cote detects a port in use it should switch to another one (it actually relies on portfinder for this). Moreover, a port conflict should result in apps being unable to join each other, so this behaviour is really strange.

To run the example, simply launch npm start in the root folder; this should do the install and launch a gateway and a service. For a more complex scenario with 2 replicas of the service, use Docker Compose. In order to debug tricky things in the network setup, we have built this small "hello world" script for cote.

I am afraid our current Node.js target is v12; we never tested a higher version.

Elytum commented 3 years ago

(I've removed my answer saying the problem was fixed because, after some more tests, freeing port 5555 actually didn't solve anything.) I'm switching back to Node v12 to do some tests. I'm running the services locally and manually, following the README (https://github.com/kalisio/feathers-distributed/blob/master/example/README.md).

Elytum commented 3 years ago

Update: downgrading Node to v12.21.0 and npm to 6.14.11 didn't solve the issue. The fact that all services are run locally on my machine without Docker might be the issue; any idea as to how? I'll try to reproduce the issue with Docker. I'm running macOS 10.13.6.

claustres commented 3 years ago

Usually problems occur with Docker, not in a local network environment. I've just updated the Docker images of the example to Node v12 and tested them fine in my environment. Maybe the problem comes from the Mac; I never tested it personally.

Elytum commented 3 years ago

I've run some tests and reproduced the issue on my Mac, in an Ubuntu Docker container on the Mac, and on Ubuntu on an AWS EC2 instance, which brings me to this pretty silly question: is cote (and therefore feathers-distributed) intended to be usable with multiple servers on the same host? If it is in fact meant to be usable on the same machine, I can share the Docker setup reproducing the issue on Ubuntu 20 if it helps.

claustres commented 3 years ago

I don't really get you: cote aims at enabling communication between servers no matter whether they are located on the same host or not. What could be a problem is if multicasting/broadcasting on the network is not allowed by default (e.g. by a cloud provider). In this case you will need a centralized discovery setup.

By the way in my company, we are using multiple servers on the same host or in a cluster without a problem, either locally or on different cloud providers: AWS, OVH, Scaleway.

You first need to find a working setup with cote (you can try our "hello world" script). Then you can probably try to run feathers-distributed when everything is fine.

claustres commented 3 years ago

Are you experiencing the issue with the provided gateway example "as is" or did you build your own app? Are you using the docker-compose file to run the services or launching them manually? You should give more information if you'd like some help.

There is a strange message in your logs: "Warning: Accessing non-existent property 'padLevels' of module exports inside circular dependency" (probably not related, however).

Elytum commented 3 years ago

Sorry, my question was a bit stupid, but I was trying to figure out the issue and wondered if that was the cause, because, just as you stated, cote is supposed to work in each case. Discovery doesn't appear to be the issue, since the servers do manage to discover each other. The issue seems to be only about the events bouncing back indefinitely, resulting in exponentially increasing RAM usage until the servers go down due to a self-DDoS.

I've just run the "hello world" script and discovery looks fine with either two responders and one requester or one responder and two requesters, in each case (local machine, Docker on the local machine, and AWS with Ubuntu).

The main issue isn't discovery, propagating the request from gateway to service, or even getting a response: all of this works incredibly well, and that's what I love about both feathers-distributed and cote. The issue is that events bounce back as if there were no condition preventing event re-propagation on the receiving end.

The first time I discovered the issue was on my own project, which is why I then cloned the example gateway "as is", only using "npm ci" (once in the root directory to export the library into the server's "lib" folder, and once for each server) to keep the dependencies exactly as set in the uploaded lock file. Docker Compose wasn't used; instead I went the manual way. I therefore have three shells, each launching a service using "npm start": one for the gateway (port 3030) and two for the service (ports 3031 and 3032). Each instance has DEBUG set to "feathers-distributed" to show discovery and event propagation explicitly.

The scenario to reproduce the issue is pretty simple once the servers are set up as explained above:

Here's a short video of me reproducing the issue from start to finish

I will inspect the "padLevels" issue.

PS: When launched with a single gateway and a single service, everything works fine.

claustres commented 3 years ago

Thanks for all these details, now I am able to reproduce the issue. Unfortunately, I missed the point previously and created the todo on the todo service itself instead of using the gateway. I will have a look at it as soon as possible.

Elytum commented 3 years ago

Oh, so it doesn't work on your side either? Then I'll stop looking at the configuration side and investigate the library 😅 Isn't calling the gateway by default the reason why the gateway exists?

claustres commented 3 years ago

Yes, it usually is from the user/frontend point of view. However, some operations might also occur on the backend side, directly on some services, depending on the use case. Moreover, you can have different replicated gateways, so that some operations might happen on a different one.

Elytum commented 3 years ago

I think I figured out the issue:

In publishService, feathers-distributed listens to events by service name and immediately publishes them with no further check. Therefore, in the case of the "todos" service, both instances publish any event they come across. So whenever one of them publishes an event of type "todos", the other will receive it and immediately re-publish it. The emitter will once more publish it, and so on, until a crash.

There appear to be only three ways to get around this issue:

1) Keep a blacklist of events not to publish: when dispatching (line 121) a received event whose service is also registered locally, record it, then check and pop it as soon as "service.on" fires, so that it is ignored. It might produce unexpected behaviour in some peculiar cases, and it means allocating a bit of memory, even though the amount is pretty tiny.
2) Since feathers-distributed relies on the "path" and "key" properties of events, it is implied that neither should be set on objects outside of feathers-distributed. We could therefore use the presence of "path" and "key" as the condition for propagation. The downside is that it makes feathers-distributed rely even more on "path" and "key".
3) When calling "service.emit" (line 123), temporarily remove/skip the feathers-distributed propagation to the serviceEventsPublisher (line 53) with a custom implementation of "service.emit". I think this one doesn't need explanation: it's a big no.

I don't see any other way to solve this issue. Considering all the possibilities, option 2 seems pretty good and easy to implement. The only issue is the reliance on "path" and "key": if at some point the project decides to drop those two in order to keep the events "as is", that solution will have to be rethought.

In order to apply the fix, we only need to add an extra line at position 54 (just below "service.on"):

if (object.path && object.key) return

Do you need a pull request? Do you see any other way to fix this?

claustres commented 3 years ago

Thanks for your feedback, it helps a lot. I ended up with a more radical conclusion: the example has an architectural design problem. Indeed, by default feathers-distributed publishes all local services and consumes all remote services. As a consequence, the replicated todos services are listening to each other, which is not what we want: we only want the gateway to listen to them. So the distribution configuration of the service in the example should be something like this:

app.configure(distribution({
  // We don't consume services we only produce
  remoteServices: (service) => false
}))

The fact that services with similar names (actually paths) can't listen to each other is linked to the Feathers architecture, where the service name is a unique identifier. Because feathers-distributed adds the remote service as if it were a local one, in order to make calling it transparent, an app cannot simply have a local and a remote service with the same name. It might be possible to handle this by using a different path (e.g. a remote/ prefix) but this would break the current simplicity for a really small benefit: handling replication that requires state sync. Indeed, replicated microservices should usually be autonomous and don't need to be aware of each other. Otherwise, it means there is a specific shared state to be maintained across the replicated services; in that case, you should use https://github.com/feathersjs-ecosystem/feathers-sync.

However, even with the new configuration of the service shown above, the bug still appears. The problem is that the event publishers/subscribers are created even if no remote service is registered. Usually this is not a problem, because when the remote event is caught it is filtered here, since the app does not have any matching service registered. But in the specific case of replication, the app has a local service with the same name, causing the filtering not to be applied. As we have a flag identifying remote services (service.remote = true), I think the simplest solution is to also check whether the service is a remote one before dispatching the event.

I will try it and check if there is any side-effect. If none appears we will publish a patch release.

Thanks again for your time and patience in investigating.

claustres commented 3 years ago

@Elytum I've just pushed to master if you'd like to give it a try. I will also check it with our own use cases before release.

Elytum commented 3 years ago

Tested on both the example and my project, and it appears to solve the issue entirely!

claustres commented 3 years ago

v1.0.5 released with this patch.