Fault Tolerance Testing

visheshdembla commented 3 years ago

The idea here is to crash each service, either manually or with a tool such as KubeMonkey, and observe what keeps working and what does not. Based on the findings, we need to write a report and propose the architecture changes required to make the system fault tolerant. We also need to ensure that when a service runs with multiple replicas, the crash of a single pod does not make that service unavailable.

I will take up this issue.
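
The manual crash step described above could be scripted roughly as follows. This is a minimal sketch, not the procedure actually used for this issue: it assumes `kubectl` is configured for the cluster and that each service's pods carry a label of the form `app=<service>`. The service and namespace names are hypothetical placeholders.

```python
import random
import subprocess


def list_pods_command(service, namespace):
    # Assumption: pods are labelled app=<service>; adjust the selector to the real labels.
    return ["kubectl", "get", "pods", "-n", namespace, "-l", f"app={service}", "-o", "name"]


def delete_pod_command(pod, namespace):
    # `pod` is in the "pod/<name>" form that `kubectl get pods -o name` prints.
    return ["kubectl", "delete", pod, "-n", namespace]


def crash_random_pod(service, namespace, run=subprocess.run, choose=random.choice):
    """Delete one randomly chosen pod of `service`; return its name, or None if none exist."""
    listing = run(list_pods_command(service, namespace),
                  capture_output=True, text=True, check=True)
    pods = listing.stdout.split()
    if not pods:
        return None
    victim = choose(pods)
    run(delete_pod_command(victim, namespace),
        capture_output=True, text=True, check=True)
    return victim
```

Calling `crash_random_pod("auth", "default")` would remove one pod of a service labelled `app=auth`. The `run` and `choose` parameters exist only so the function can be exercised without a cluster.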

visheshdembla commented 3 years ago

The following observations were made regarding the fault tolerance of the microservice architecture:

When a service has multiple replicas and one pod crashes, subsequent requests are handled by the remaining replicas. This works seamlessly because all the services are stateless.
If all replicas of the gateway or the React UI crash, the application as a whole becomes unavailable, because these two services are single points of failure for the system. This limitation will be addressed in subsequent milestones using blue-green deployment.
If all replicas of the auth service crash, login and sign-up stop working; however, a user who is already logged in can still upload and download images.
If all replicas of the user service crash, sign-up stops working.
If all replicas of the image service crash, image upload, download, and listing stop working; login and sign-up are unaffected.
If all replicas of the session service are killed, the landing, sign-in, and sign-up pages still load correctly; however, the application as a whole cannot provide its service because session validation fails.
If all replicas of the session log service are killed, application functionality is unaffected, except that no session logs are available to re-create a lost session.
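
The observations above can be summarized as a feature-to-service dependency map. The sketch below is an illustrative model, not code from the project. The service and feature names are hypothetical labels. Treating login and sign-up as dependent on the session service is an assumption, based on reading "the application as a whole" as everything beyond page loads.

```python
# Services each feature needs, according to the observations in this thread.
# Names are illustrative labels, not identifiers from the code base.
REQUIRES = {
    "load pages": {"gateway", "ui"},
    # Assumption: login and sign-up also need the session service.
    "login": {"gateway", "ui", "auth", "session"},
    "sign up": {"gateway", "ui", "auth", "user", "session"},
    "image upload/download/list": {"gateway", "ui", "image", "session"},
    "session recovery": {"gateway", "ui", "session-log"},
}


def available_features(down):
    """Return the features that still work when every replica of the services in `down` is gone."""
    down = set(down)
    return sorted(name for name, deps in REQUIRES.items() if not deps & down)


if __name__ == "__main__":
    for service in ["gateway", "ui", "auth", "user", "image", "session", "session-log"]:
        print(service, "down ->", available_features({service}))
```

A table like this makes the single points of failure easy to spot: any service that appears in every dependency set (here the gateway and the UI) takes the whole application down with it.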

visheshdembla commented 3 years ago

I will update the wiki with the findings, the screenshots, and the process followed to obtain the results.

visheshdembla commented 3 years ago

Wiki updated with the findings.

visheshdembla / Panorama

Fault Tolerance Testing #200