kalisio / feathers-distributed

Distribute your Feathers services as microservices
MIT License

Improve reliability with faulty apps #129

Closed claustres closed 7 months ago

claustres commented 7 months ago

It appears that in some situations, when an app or a node hosting an app goes down or is restarted, a service operation targeting a remote app results in a timeout.

As discussed in https://github.com/kalisio/feathers-distributed/issues/80#issuecomment-934596712, nothing is currently done when a node goes down. We should handle this case, at a minimum by detecting that a distribution key no longer has any responding apps and unregistering it along with all its attached services.
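
One possible shape for this detection, as a rough sketch only: track the last time each app behind a distribution key was heard from, and unregister the key and its services once no app has responded within a timeout. Everything here (`DistributionRegistry`, `heartbeat`, `prune`, the timeout value) is hypothetical and is not the feathers-distributed or cote API. Timestamps are passed in explicitly so the logic can be tested without real timers.

```typescript
// Hypothetical sketch: none of these names come from feathers-distributed or cote.
class DistributionRegistry {
  // distribution key -> (app id -> time of last heartbeat, in ms)
  private apps = new Map<string, Map<string, number>>();
  // distribution key -> service paths registered for that key
  private services = new Map<string, Set<string>>();
  private timeoutMs: number;

  constructor(timeoutMs: number) {
    this.timeoutMs = timeoutMs;
  }

  // Record that the app `appId` serving `key` was alive at time `now`.
  heartbeat(key: string, appId: string, now: number): void {
    if (!this.apps.has(key)) this.apps.set(key, new Map<string, number>());
    this.apps.get(key)!.set(appId, now);
  }

  // Attach a service path to a distribution key.
  registerService(key: string, path: string): void {
    if (!this.services.has(key)) this.services.set(key, new Set<string>());
    this.services.get(key)!.add(path);
  }

  // True if any distribution key still exposes this service path.
  hasService(path: string): boolean {
    let found = false;
    this.services.forEach((paths) => {
      if (paths.has(path)) found = true;
    });
    return found;
  }

  // Drop apps silent for longer than timeoutMs, then unregister every key
  // left with no app, together with its services. Returns the removed keys.
  prune(now: number): string[] {
    const removed: string[] = [];
    this.apps.forEach((seen, key) => {
      seen.forEach((last, appId) => {
        if (now - last > this.timeoutMs) seen.delete(appId);
      });
      if (seen.size === 0) removed.push(key);
    });
    removed.forEach((key) => {
      this.apps.delete(key);
      this.services.delete(key);
    });
    return removed;
  }
}
```

In a real integration, `heartbeat` would presumably be driven by whatever liveness signal the discovery layer provides, and `prune` would run on an interval; both of those wiring points are assumptions left out of the sketch.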

We should also look more closely at how cote reacts to this situation, to ensure we do not keep any "dead" components.

Note: at present we do not have any recipe to reproduce the issue. We should probably create a test suite that simulates network failures or apps going down.
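
As a starting point for such a suite, the failure mode can be modelled entirely in memory before involving real processes or networks. The sketch below is a toy model, not a reproduction of the actual bug: `FakeApp`, `FakeCluster`, and `countTimeouts` are hypothetical names, and the round-robin dispatcher deliberately never forgets dead apps, which is an assumption about the failure rather than a confirmed description of how cote behaves.

```typescript
// Hypothetical test double; it does not use feathers-distributed or cote.
type CallResult = { status: "ok"; from: string } | { status: "timeout" };

class FakeApp {
  alive = true;
  id: string;

  constructor(id: string) {
    this.id = id;
  }
}

class FakeCluster {
  private apps: FakeApp[] = [];
  private next = 0;

  // Register a new app and return it so the test can "kill" it later.
  add(id: string): FakeApp {
    const app = new FakeApp(id);
    this.apps.push(app);
    return app;
  }

  // Round-robin dispatch that never removes dead apps: a request routed
  // to a dead app is reported as a timeout.
  call(): CallResult {
    if (this.apps.length === 0) return { status: "timeout" };
    const app = this.apps[this.next % this.apps.length];
    this.next += 1;
    return app.alive ? { status: "ok", from: app.id } : { status: "timeout" };
  }
}

// Issue `calls` requests and count how many of them time out.
function countTimeouts(cluster: FakeCluster, calls: number): number {
  let timeouts = 0;
  for (let i = 0; i < calls; i++) {
    if (cluster.call().status === "timeout") timeouts += 1;
  }
  return timeouts;
}
```

A test would add two apps, check that no call times out, set `alive = false` on one of them, and then assert on the number of timeouts. The same scenario could later be replayed against real apps by stopping a process instead of flipping a flag.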