Litmus helps SREs and developers practice chaos engineering in a cloud-native way. Chaos experiments are published at the ChaosHub (https://hub.litmuschaos.io). Community notes are at https://hackmd.io/a4Zu_sH4TZGeih-xCimi3Q
Proposed changes
In a setup where multiple infrastructures connect to a single "main" ChaosCenter/portal, the websocket connection can drop. When that happens, the subscriber exits with code -1 and its pod is restarted, but the restarted subscriber fails to re-establish the connection because the server responds with ALREADY_CONNECTED.
We propose changes in two components:
Subscriber
1) Should exit with a panic instead of a hard exit, so that the deferred close of the websocket connection is actually executed.
2) Should exit when it receives an ALREADY_CONNECTED message from the server, so that the pod is restarted and the connection is retried.
Portal/ChaosCenter
1) Should close the previous channel and return the ALREADY_CONNECTED error. Closing the channel triggers ctx.Done() for the previous connection, whose cleanup removes its entry from the in-memory map of connected infrastructures, so a restarted subscriber can connect again.
Types of changes
What types of changes does your code introduce to Litmus? Put an x in the boxes that apply.
[ ] New feature (non-breaking change which adds functionality)
[ ] Bugfix (non-breaking change which fixes an issue)
[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
[ ] Documentation Update (if none of the other choices applies)
Checklist
Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.
Dependency
Special notes for your reviewer: