Find a way to reproduce our bugs more reliable and efficient - Solution 4: Get continuous stress test to work

Dominik1999 commented 4 years ago

The scenario player is too inefficient and unreliable for reproducing bugs

We found our last three major bug categories (the presence bug, performance issues, and the general "No route found" errors) by running scenarios on the scenario player. However, it is hard to reproduce any bugs and to test bug fixes with the current scenario player setup. Running scenarios is simply too complex and slow to be used as a tool for bug fixing.

Possible solution 4 to that problem

@hackaugusto started setting up a stress test with three continuously running nodes on Görli that send transfers back and forth to each other. This test never ends and could help us see and reproduce bugs or symptoms such as:

presence bug (symptom)
PFS performance issues
"no route found" (symptom)
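
To make the idea concrete, here is a minimal sketch of such a loop. It is not @hackaugusto's actual setup. It assumes each node exposes the Raiden REST payments endpoint at `/api/v1/payments/<token_address>/<target_address>`; the names `send_payment` and `stress_loop`, the node list layout, and the fixed amount of 1 are all made up for illustration.

```python
import itertools
import json
import time
import urllib.request


def send_payment(api_url, token, target, amount):
    """POST one payment through a node's REST API.

    Assumption: the node serves the Raiden payments endpoint at
    /api/v1/payments/<token_address>/<target_address>.
    Returns True on a 2xx response, False on any HTTP or network error.
    """
    url = f"{api_url}/api/v1/payments/{token}/{target}"
    request = urllib.request.Request(
        url,
        data=json.dumps({"amount": amount}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=60) as response:
            return 200 <= response.status < 300
    except OSError:  # covers URLError, HTTPError and timeouts
        return False


def stress_loop(nodes, token, send=send_payment, max_transfers=None, delay=0.0):
    """Cycle through every ordered (sender, receiver) pair and send a payment.

    nodes is a list of (api_url, address) tuples (hypothetical layout).
    With max_transfers=None the loop never ends, as in the proposal.
    Returns a list of (transfer_index, sender_address, receiver_address)
    for every failed transfer, so failures can be matched against the logs.
    """
    failures = []
    pairs = itertools.cycle(itertools.permutations(nodes, 2))
    for count, (sender, receiver) in enumerate(pairs):
        if max_transfers is not None and count >= max_transfers:
            break
        if not send(sender[0], token, receiver[1], 1):
            failures.append((count, sender[1], receiver[1]))
        time.sleep(delay)
    return failures
```

Passing the `send` function in as a parameter keeps the loop testable without running nodes, and `max_transfers` allows a bounded run when checking a bug fix.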

czepluch commented 4 years ago

Not sure I understand how this is more efficient or reliable than the scenario player. Can you please elaborate a bit on that for me?

Concretely I am interested in how it can more reliably reproduce bugs.

Dominik1999 commented 4 years ago

Hey, sure let me try to elaborate.

Currently, we all debug with the scenario player. This helps us a lot to find bugs (compared to our other tests). However, with the Scenario Player, we cannot reproduce bugs all the time. Sometimes, we need to run the BF1 scenario multiple times to see a bug again. This is very time-consuming.

Therefore, we need to find a way to reproduce those bugs (e.g. the presence bug symptom) reliably with every test run. That means we want to see those bugs every time, in a very simple environment, so that we can debug them better.

Does that make sense?

czepluch commented 4 years ago

Yes, I understand what the goal you want to achieve is. But I don't understand how this specific solution of running 3 nodes 24/7 solves that problem.

Dominik1999 commented 4 years ago

We think that:

we save a lot of time on this continuous run compared to setting up channels and the scenario player
we have much lower maintenance for everything and fewer outside dependencies, e.g. on another team
some bugs are flaky and happen rarely. So the more transfers we execute, the more likely it is that we can reproduce the bug (compared to running the BF1 scenario many times in a row)
it is easier for us to get the logs, debug and rerun that
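
As a rough illustration of the point about flaky bugs: if we assume, purely for the sake of argument, that a bug shows up independently with a small probability `p` on each transfer, the chance of seeing it at least once in `n` transfers is `1 - (1 - p)**n`. The numbers and function names below are hypothetical, and real bugs are unlikely to be fully independent across transfers.

```python
import math


def prob_reproduce(p, n):
    """Probability of hitting the bug at least once in n transfers,
    assuming each transfer triggers it independently with probability p."""
    return 1.0 - (1.0 - p) ** n


def transfers_needed(p, target):
    """Smallest number of transfers for which prob_reproduce(p, n) >= target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


# Hypothetical example: a bug that shows up in 1 of 1000 transfers.
n = transfers_needed(0.001, 0.95)
print(n, prob_reproduce(0.001, n))
```

Under this simplified model, a node set that keeps sending transfers around the clock gets through the required number of attempts without anyone having to restart a scenario.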

There might be other pros and probably some cons. With this setup, we will not be able to replace the scenario player completely, but it will help us to debug current bugs more quickly. Do you agree with that approach?

czepluch commented 4 years ago

we save a lot of time on this continuous run compared to setting up channels and the scenario player

This I can agree with, since the maintenance is limited, as you say. You only need to restart the nodes once every 24 hours, or on demand, to have them run the newest version of Raiden. The exception is when a DB migration is needed; then it gets a bit messy.

we have much lower maintenance for everything and fewer outside dependencies, e.g. on another team

That is correct. And I think for debugging purposes, if having this can speed up development, then it's amazing. Maybe I got it wrong from the beginning and assumed that you wanted to replace the SP, which I think is a very bad idea, since we have a lot of QA going on through the SP. But of course, if you can have shorter iterations and bug hunting with the other setup, then it's a win-win situation.

some bugs are flaky and happen rarely. So the more transfers we execute, the more likely it is that we can reproduce the bug (compared to running the BF1 scenario many times in a row)

I am not entirely sure I agree here. At least the scenarios have quite often been good at reproducing bugs that couldn't be reproduced otherwise. It will only be very specific bugs that will be found through 3 nodes just doing payments to each other.

But yeah, in general, as you mention, I think the SP should exist together with this new proposal rather than being replaced by it. At least from a QA point of view, it's really nice to have the scenarios to ensure some degree of robustness in the Raiden client and services.

Dominik1999 commented 4 years ago

I am not entirely sure I agree here. At least the scenarios have quite often been good at reproducing bugs that couldn't be reproduced otherwise.

You are right. This tool helps us a lot and we should keep it as a User Acceptance Test. But for day-to-day debugging it is too complex. I hope we can find the same bugs with a lighter tool. But I am still unsure how to do that ... Let's try.

raiden-network / raiden

Find a way to reproduce our bugs more reliable and efficient - Solution 4: Get continuous stress test to work #5420
