kruize / autotune-results

Recommendations and Results from Autotune

Techempower results feedback #20

Closed: tellison closed this issue 1 year ago

tellison commented 2 years ago

Thanks for sharing the results of running Autotune on the techempower workloads.

A couple of questions arise from reading the descriptions, e.g. experiment-5

I'm struggling to understand the meaning of this comment, looks like a typo "were be updated"?

Manual runs for the top 5 best configuration were run to reproduce the data and were be updated at manuals dir

In the README it doesn't state (unless I'm missing it) how many configuration changes were considered. Looking at the graph it would appear to be 100 different configurations, is that correct? Clearly there are a great many potential configuration combinations given the variables stated and their ranges, so that doesn't feel like significant coverage.

Once a configuration experiment was chosen, how long did the system run? I'm curious to know if it reached a steady state in terms of the JVM and GC dynamic behaviors.

It would be interesting to also show which configurations had a significant impact on the results and which made little or no difference; that might be helpful in refining the variables considered in the future.

The improvements over base look very interesting!

kusumachalasani commented 2 years ago

Thanks for the feedback, Tim!

A couple of questions arise from reading the descriptions, e.g. experiment-5

I'm struggling to understand the meaning of this comment, looks like a typo "were be updated"?

Manual runs for the top 5 best configuration were run to reproduce the data and were be updated at manuals dir

Thanks for pointing that out. It is supposed to be "were updated".

In the README it doesn't state (unless I'm missing it) how many configuration changes were considered. Looking at the graph it would appear to be 100 different configurations, is that correct? Clearly there are a great many potential configuration combinations given the variables stated and their ranges, so that doesn't feel like significant coverage.

Yes, it was run with 100 trials (different configurations). I agree that there could be many more good combinations given the ranges. To address this, we are looking at two different approaches:

  1. Increasing the number of trials (perhaps 150-200 or more) depending on the range of tunables.
  2. Since Bayesian optimization explores in the first few trials and starts exploiting later, we can use that data to update the ranges and run further experiments (see the sketch after this list). We take this approach only when hardware availability is limited.
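
To make the second approach concrete, here is a minimal, hypothetical sketch of that explore-then-exploit loop using Optuna as a stand-in optimizer. The tunable names, ranges, and the `run_benchmark` stub are illustrative only and not the actual Autotune experiment setup.

```python
import optuna

def run_benchmark(heap_pct, gc_policy, gc_threads):
    # Placeholder for deploying the configuration and measuring mean response time.
    # Returns a synthetic score here so the sketch is runnable end to end.
    penalty = 0 if gc_policy == "gencon" else 5
    return abs(heap_pct - 60) + abs(gc_threads - 4) + penalty

def objective(trial):
    # Hypothetical JVM tunables; the real experiments use ~31 tunables.
    heap_pct = trial.suggest_int("MaxRAMPercentage", 30, 80)
    gc_policy = trial.suggest_categorical("gcPolicy", ["gencon", "balanced"])
    gc_threads = trial.suggest_int("gcThreads", 1, 8)
    return run_benchmark(heap_pct, gc_policy, gc_threads)

# Lower response time is better, so minimize.
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)

# After the first 100 trials, look at where the sampler converged and narrow the
# ranges around the best configurations before a follow-up study with more trials.
print(study.best_params)
top5 = study.trials_dataframe().sort_values("value").head(5)
print(top5)
```

The point is only that the first study's results can inform tighter ranges for the next one when the trial budget is constrained by hardware availability.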

The hardware on which we started our experiments has limited availability. When we began the experiments with 31 tunables, experiment-4 used wide ranges for the tunables, and we narrowed a few of them in the subsequent experiments based on the configurations obtained from the earlier ones. Although there is still room for a better configuration, this helped us reach an optimized configuration within the 100 trials. Now that we have access to different hardware, we are running experiments with more trials and I will update the data soon.

Once a configuration experiment was chosen, how long did the system run? I'm curious to know if it reached a steady state in terms of the JVM and GC dynamic behaviors.

Each configuration is run for 3 iterations, and each iteration takes 6 minutes: the first 3 minutes are treated as warmup and the last 3 minutes are used for measurements. Details of the benchmark configuration are captured in benchmark.yaml for each experiment. We also did some manual runs on the same hardware where the autotune experiments run to determine how long this benchmark needs to produce consistent results (i.e., to give the JVM enough time to optimize), and that is how we arrived at 3 minutes of warmup time.
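
For illustration only, here is a tiny sketch (not the actual harness) of how a 6-minute iteration could be split into a 3-minute warmup window that is discarded and a 3-minute measurement window that feeds the reported response time; the sample data is synthetic.

```python
WARMUP_S = 180    # first 3 minutes: JVM/JIT/GC still settling, discarded
MEASURE_S = 180   # last 3 minutes: used for the reported metric

def summarize_iteration(samples):
    """samples: list of (elapsed_seconds, response_time_ms) pairs for one 6-minute run."""
    measured = [rt for t, rt in samples if WARMUP_S <= t < WARMUP_S + MEASURE_S]
    return sum(measured) / len(measured)

# Synthetic example: response time improves while the JVM warms up, then stabilizes.
one_iteration = [(t, 20.0 if t < 120 else 8.0) for t in range(0, 360, 10)]
print(summarize_iteration(one_iteration))  # averages only the steady-state window
```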

It would be interesting to also show which configurations had a significant impact on the results and which made little or no difference; that might be helpful in refining the variables considered in the future.

Yes, I agree. As of now, experiment-data.csv has the data for all the configurations, sorted by response time (lower is better), so the top configurations can be considered the good ones. We are in the process of updating the results with more graphs that make it easier to see which combinations of tunables worked best.
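
As a rough illustration of how a reader might explore that today with the existing CSV, here is a hedged pandas sketch that ranks the trials by response time and checks which tunables are near-constant among the top trials (a hint that they mattered) versus still spread out (a hint that they mattered less). The column names, including "response_time", are assumptions; the real headers in experiment-data.csv may differ.

```python
import pandas as pd

df = pd.read_csv("experiment-data.csv")
ranked = df.sort_values("response_time")   # lower is better
top = ranked.head(10)

# Tunables that the best trials all agree on likely had a strong impact;
# tunables that still vary widely among the best trials likely mattered less.
tunables = [c for c in df.columns if c != "response_time"]
spread = top[tunables].nunique().sort_values()
print(spread)
```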