reanahub / reana

REANA: Reusable research data analysis platform
https://docs.reana.io
MIT License

benchmark: progressive workflow submission in time #573

Closed: tiborsimko closed this issue 2 years ago

tiborsimko commented 2 years ago

Currently, the benchmark script focuses on the situation of running all N workflows in parallel: the workflows are submitted and then "suddenly" all started at once. This allows us to study the steady-state burn-down throughput, i.e. how many workflows can be run in the cluster.

The aim of this issue is to build on top of #572 and expand the "alternative submission" functionality to "delayed submission" scenarios. This will be useful for the test of running 100k pMSSM NoSys workflows.

For example, we shall submit 10k workflows and start them, then an hour later submit and start another 10k, and so on until all 100k are completed.

The "orchestration" of delayed submission can be done by outer shell script or cronjob. The changes necessary to the benchmarking script are concerned notably with allowing "partial workflow run ranges" for this deployment.

For example:
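A hypothetical orchestration sketch (in Python for illustration; a plain shell script or cron job would do equally well). The workflow name "pmssm-test" and the "-n MIN-MAX" range option are assumptions here; the exact spelling of the range option is settled later in this thread:

    # Hypothetical outer orchestration: submit and start 100k workflows in
    # batches of 10k, waiting one hour between batches.  The "-n MIN-MAX"
    # option and the workflow name "pmssm-test" are illustrative only.
    import subprocess
    import time

    BATCH = 10_000
    TOTAL = 100_000

    for lower in range(1, TOTAL + 1, BATCH):
        run_range = f"{lower}-{lower + BATCH - 1}"
        subprocess.run(["./reana_bench.py", "submit", "-w", "pmssm-test", "-n", run_range], check=True)
        subprocess.run(["./reana_bench.py", "start", "-w", "pmssm-test", "-n", run_range], check=True)
        if lower + BATCH <= TOTAL:
            time.sleep(3600)  # wait one hour before the next batch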

Note that this will also necessitate presentation changes to the plots. Let's take execution_progress.png as an example:

Here, the x-axis represents the workflow run number in the order the workflows were submitted/started/etc. But this number won't be representative anymore, because not all workflows are created/uploaded at the same time. So there can be a "large space" between workflow runs 49 and 50 and a "small space" between workflow runs 48 and 49, depending on the submission policy.

It would therefore be desirable to better represent the "real time arrow" and show when each workflow N was created, uploaded, started, etc. This will give a better representation of the flow of events in each test. For example, in test A everything would be submitted in advance and only started progressively, say 100 workflows each minute; whilst in test B the submission would happen alongside the running, say 50 workflows each minute, which would stress the cluster differently. We should therefore capture this on the plots.

For example, we could create one "real time graph" presenting the overall time flow including submission, and keep another "execution progress" graph that does not concern itself with the submission stages, only the "consumption" stages, but still uses a "real time" axis rather than the "artificial run number" axis.

Tasks:

VMois commented 2 years ago
  1. for submit, start and launch, I assume we will replace -n with -i. So, in case someone wants to run the full range (as before), they will need to indicate it explicitly, for example -i 0-200
VMois commented 2 years ago
  2. collect will not change; it will collect all available runs for a given -w (basically as before). We will filter the data in the analyze stage later.
tiborsimko commented 2 years ago

WRT 1, we don't really need two values, indeed; we can leave it up to the script caller to provide proper values. So we can keep -n, just change its logic to require an n_min-n_max range, and add it to all commands.
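A minimal sketch of parsing such an n_min-n_max value, assuming it arrives as a single string like "1-1000" (the names are illustrative, not the actual reana_bench.py code):

    # Illustrative parsing/validation of a "n_min-n_max" range string;
    # how reana_bench.py wires this into its -n option is up to the PR.
    from typing import Tuple

    def parse_run_range(value: str) -> Tuple[int, int]:
        try:
            n_min, n_max = (int(part) for part in value.split("-", 1))
        except ValueError:
            raise ValueError(f"expected a range like '1-1000', got '{value}'")
        if n_min > n_max:
            raise ValueError(f"range start {n_min} is larger than range end {n_max}")
        return n_min, n_max

    # e.g. parse_run_range("10001-20000") == (10001, 20000)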

WRT 2, this could do if we don't differentiate between "created" and "uploaded". I was thinking perhaps 'updated' might be necessary to fetch from the DB, but we can forget about it for now.

BTW one related suggestion: in order to better connect the commands such as "collect" and "analyze" with the names of the files, what about renaming "original_results.csv" to "collected_results.csv" and "processed_results.csv" to "analyzed_results.csv"?

One possible complication will be the need to have several processed files. E.g. whilst the 100k test run is going strong, one may want to produce a series of plots for 0-1000, 1001-2000, 2001-3000, etc. So we may need to amend the collected/processed file names to something like *_collected_0_1000.csv and *_analyzed_0_1000.csv, so that we don't have to always re-collect all 100k if we produce many plots whilst the big test is running...

VMois commented 2 years ago
  3. the analyze step looks the most complicated to me; I do not yet fully understand how to represent the plots in the new way, so it will require more thinking, making it work, and testing. I think I will make a couple of PRs for this issue: one with ranges and one for analyze.
VMois commented 2 years ago

Agree with renaming results files to collected and analyzed. Didn't like how they were named before :)

tiborsimko commented 2 years ago

WRT 3, yes, let's try various diagrams. E.g. here is one ASCII-art illustration of a plot that could more easily expand to the right (monitors are wider than they are tall):

              |
              |
    run 30    |                       *---------*--------------*-----------*
    run 29    |            *----------------*----------*-----------*
    run 28    |            *--------*----------------*-----------------*
              |         create   start            running          finished
              |                     (queued + pending)
              |
              +----+---------+---------+--------+-------+-------+-------+------->
                 10:01     10:02     10:03    10:04   10:05   10:06   10:07

                                                                            time

The complexity will be in how the graph looks with 1000 items...
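A rough matplotlib sketch of such a per-run timeline with a real-time x-axis; the timestamps, colours and field names below are made up purely for illustration:

    # Rough sketch: one horizontal line per workflow run, split into a
    # "waiting" segment (created -> started) and a "running" segment
    # (started -> finished), drawn against real time.  Data is invented.
    from datetime import datetime
    import matplotlib.dates as mdates
    import matplotlib.pyplot as plt

    runs = [
        # (run number, created, started, finished)
        (28, datetime(2022, 3, 1, 10, 1), datetime(2022, 3, 1, 10, 2), datetime(2022, 3, 1, 10, 6)),
        (29, datetime(2022, 3, 1, 10, 1), datetime(2022, 3, 1, 10, 3), datetime(2022, 3, 1, 10, 6, 30)),
        (30, datetime(2022, 3, 1, 10, 2), datetime(2022, 3, 1, 10, 4), datetime(2022, 3, 1, 10, 7)),
    ]

    fig, ax = plt.subplots(figsize=(10, 4))
    for run, created, started, finished in runs:
        ax.hlines(run, mdates.date2num(created), mdates.date2num(started), color="orange", lw=4)
        ax.hlines(run, mdates.date2num(started), mdates.date2num(finished), color="green", lw=4)
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%H:%M"))
    ax.set_xlabel("time")
    ax.set_ylabel("workflow run")
    fig.savefig("execution_timeline.png")

With 1000 runs the individual lines would merge into a density-like picture, so thinner lines or some decimation may be needed, but the real-time axis idea stays the same.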

VMois commented 2 years ago

opinion: Regarding ranged results files (*_collected_0_1000), in addition to changing the logic of collect, it will also require merging those ranged files at the end to get the final results, and it may introduce data inconsistencies (e.g. some workflows will be updated in one file but not in another).

suggestion: I would prefer to first optimize getting all workflows (as many as are available) from REANA (using pages and sizes + the list command) and see how it behaves when there is a big number of workflows. In case the time to collect is enormous, we can think about further improvements.

Example:

$ ./reana_bench.py collect -w test -f  # will collect all data about workflows it can get from REANA (as it is now)
$ ./reana_bench.py analyze -w test -i 100-200  # decide which subset to analyze
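A sketch of what the paginated fetching inside collect could look like; the list_page callable stands in for whatever REANA list call the script already uses, and its page/size signature is an assumption:

    # Hypothetical pagination loop: keep asking for the next page until an
    # empty page comes back.  "list_page(page, size)" is a stand-in for the
    # actual REANA list call used by reana_bench.py.
    from typing import Callable, Dict, List

    def collect_all(list_page: Callable[[int, int], List[Dict]], page_size: int = 100) -> List[Dict]:
        collected: List[Dict] = []
        page = 1
        while True:
            batch = list_page(page, page_size)
            if not batch:
                break
            collected.extend(batch)
            page += 1
        return collected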

question: From your experience, how long does the collect -w test command take to finish for 1000 workflows?

tiborsimko commented 2 years ago

We can also change the structure of the collected file and not use CSV but, say, some pickled Python dictionary, so that when we re-collect without force, if a certain run N was already collected and finished, it won't have to be re-collected later. But I reckon this would require lots of queries to the REST API for each run, which would also not be optimal...

WRT timings, for 1k workflows it takes about 40 seconds to collect the results. So this would mean more than one hour to collect 100k workflows. Now imagine this process dies in the middle: one would have to restart it. Or, if we would like to produce plots during the run after 10k workflows, would we still have to wait one hour?

It might be better to be ready to collect "partially", e.g. certain run number ranges only. Or, perhaps, as an MVP, we could collect only certain statuses, notably the "finished" ones. In this way list --filter name=... --filter status=... could be relatively fast.
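A sketch combining the two ideas above (keep a per-run cache and only re-query runs that are not yet in a final status); the pickle cache format, the set of final statuses and the fetch_run callable are assumptions, not existing reana_bench.py code:

    # Illustrative incremental collection: runs already recorded with a final
    # status are skipped, everything else is re-queried.  The pickle cache and
    # the "fetch_run" callable are hypothetical.
    import pickle
    from pathlib import Path
    from typing import Callable, Dict, Iterable

    FINAL_STATUSES = {"finished", "failed"}  # illustrative set of final statuses

    def collect_incrementally(cache_file: Path, run_numbers: Iterable[int],
                              fetch_run: Callable[[int], Dict]) -> Dict[int, Dict]:
        cache: Dict[int, Dict] = pickle.loads(cache_file.read_bytes()) if cache_file.exists() else {}
        for n in run_numbers:
            if cache.get(n, {}).get("status") in FINAL_STATUSES:
                continue  # already collected and final, no need to re-query
            cache[n] = fetch_run(n)
        cache_file.write_bytes(pickle.dumps(cache))
        return cache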