ouankou / llnl-work-notes

Record TODO, Work-in-Progress and Done tasks in LLNL.

Provide a new metric to compare the models being evaluated #28

Open ouankou opened 4 years ago

ouankou commented 4 years ago

From Dr. Liao: The goal is to consider both model accuracy and overhead. There may be existing metrics like this; please look for them. In investment, there is a concept of risk-adjusted return: https://www.investopedia.com/terms/r/riskadjustedreturn.asp You can borrow this concept and apply it to models by defining something like overhead-adjusted accuracy.

ouankou commented 4 years ago

Given two models and considering all problem sizes, suppose the accuracy of model 1 (p1) is 30% and the accuracy of model 2 (p2) is 95%. In general, model 2 is much better than model 1. However, if an application only uses the problem sizes that model 1 is good at, model 1 could be the better choice. The overhead of applying the models is also an important factor, since model 2 might be expensive to use. Therefore, we need to consider the specific application and the overhead, not just the accuracy.

I think the overhead-adjusted accuracy can be designed as follows.

S = sum(t(x) * (g(x) XNOR p(x))) - t_overhead

S: overhead-adjusted accuracy, i.e. the expected profit of a model.
x: problem size.
sum: over all iterations in an application; duplicated problem sizes can occur across iterations.
g(x): the ground-truth choice of CPU or GPU for a particular problem size, encoded as 0 (CPU) or 1 (GPU).
p(x): the model's predicted choice of CPU or GPU, encoded as 0 (CPU) or 1 (GPU).
t(x): the time saved when the correct choice is made, t(x) = |t_cpu - t_gpu|.
g(x) XNOR p(x): 1 when the prediction matches the ground truth, 0 otherwise (1 XNOR 1 = 1, 0 XNOR 0 = 1).
t_overhead: the overhead of calling the model.
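
As a sanity check, here is a minimal Python sketch of this iteration-based form; all input values and the function name are made up for illustration only.

```python
# Minimal sketch of the iteration-based form of S. All values are hypothetical.
# g and p hold 0/1 device choices per iteration (0 = CPU, 1 = GPU); t holds the
# time saved (|t_cpu - t_gpu|, in ms) when the correct device is chosen.

def overhead_adjusted_profit(g, p, t, t_overhead):
    """S = sum_i t[i] * [g[i] == p[i]] - t_overhead."""
    correct = [1 if gi == pi else 0 for gi, pi in zip(g, p)]  # the XNOR term
    return sum(ti * ci for ti, ci in zip(t, correct)) - t_overhead

# Four hypothetical iterations; the model is right on the 1st and 4th.
g = [0, 1, 1, 0]
p = [0, 0, 0, 0]
t = [0.5, 0.5, 0.1, 1.0]
print(round(overhead_adjusted_profit(g, p, t, t_overhead=0.01), 2))  # 1.49
```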

ouankou commented 4 years ago

The design above enumerates the problem sizes of all the iterations. We could instead use the probability that a problem size occurs in the equation. The two forms mean the same thing, just represented differently.

S = sum(q(x) * t(x) * (g(x) XNOR p(x))) - t_overhead

q(x): the probability that a certain problem size occurs in the application. For example, an application may have 10 iterations: 9 with problem size 50 and 1 with problem size 100. In this case, q(50) = 0.9 and q(100) = 0.1.
sum: over all unique problem sizes. In the example above, these are 50 and 100.
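
A small sketch of the probability-weighted form, again with made-up inputs (the dictionaries g, p, and t are illustrative):

```python
from collections import Counter

# Sketch of the probability-weighted form. g/p map a problem size to the 0/1
# device choice, t maps it to |t_cpu - t_gpu| in ms. All values are hypothetical.

def overhead_adjusted_profit_q(sizes, g, p, t, t_overhead):
    """S = sum_x q(x) * t(x) * [g(x) == p(x)] - t_overhead over unique sizes x."""
    counts = Counter(sizes)
    n = len(sizes)
    s = 0.0
    for x, c in counts.items():
        q = c / n                                   # q(x): probability of size x
        s += q * t[x] * (1 if g[x] == p[x] else 0)  # the XNOR term
    return s - t_overhead

# The 10-iteration example above: nine iterations of size 50, one of size 100.
sizes = [50] * 9 + [100]
g = {50: 0, 100: 1}
p = {50: 0, 100: 0}        # correct on 50, wrong on 100
t = {50: 0.5, 100: 1.0}
print(round(overhead_adjusted_profit_q(sizes, g, p, t, t_overhead=0.01), 2))  # 0.44
```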

ouankou commented 4 years ago

Assume we have the following data. Each call to the CPM model has an average overhead of 0.01 ms; each call to the FPM model has an average overhead of 0.05 ms.

| Problem size | Performance difference \|CPU - GPU\| (ms) |
| --- | --- |
| 32 | 0.5 |
| 64 | 0.5 |
| 128 | 0.1 |
| 256 | 1 |

CPM only makes the correct choice for problem size 32, so its accuracy is 25%. FPM only makes the wrong choice for problem size 64, so its accuracy is 75%.

With the ground truth above, take a test case using a 2D stencil as an example, with 1000 iterations for each of the problem sizes 32, 64, 128, and 256.

CPM: S1 = 0.25 * 0.5 - 0.01 = 0.115
FPM: S2 = 0.25 * 0.5 + 0.25 * 0.1 + 0.25 * 1 - 0.05 = 0.35

For this application, FPM is better than CPM.

In another example, there are 2000 iterations for each of the problem sizes 32 and 64.

CPM: S1 = 0.5 * 0.5 - 0.01 = 0.24
FPM: S2 = 0.5 * 0.5 - 0.05 = 0.2

In this case, CPM works better than FPM.
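
The arithmetic in both examples can be reproduced with a few lines of Python; the values come directly from the table and the stated accuracies:

```python
# Reproducing the two worked examples, using the table values.
t = {32: 0.5, 64: 0.5, 128: 0.1, 256: 1.0}   # |CPU - GPU| time difference (ms)

# Example 1: 1000 iterations of each size -> q(x) = 0.25 for every x.
q1 = {x: 0.25 for x in t}
s_cpm = q1[32] * t[32] - 0.01                              # CPM is right only on 32
s_fpm = sum(q1[x] * t[x] for x in (32, 128, 256)) - 0.05   # FPM is wrong only on 64
print(round(s_cpm, 3), round(s_fpm, 3))   # 0.115 0.35

# Example 2: 2000 iterations of sizes 32 and 64 only -> q(x) = 0.5 each.
s_cpm = 0.5 * t[32] - 0.01     # right on 32, wrong on 64
s_fpm = 0.5 * t[32] - 0.05     # also right on 32 only
print(round(s_cpm, 2), round(s_fpm, 2))   # 0.24 0.2
```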

ouankou commented 4 years ago

Besides accounting for overhead and profit, this new metric is a reversed usage of balanced accuracy. Balanced accuracy is used to correct the bias from an imbalanced dataset when determining the actual performance of a model.

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html#sklearn.metrics.balanced_accuracy_score
https://statisticaloddsandends.wordpress.com/2020/01/23/what-is-balanced-accuracy/

However, in our case, an application itself can be imbalanced: it may have many iterations with problem sizes x1 and x2 but none with other sizes. Therefore, to compare two models on this application, we have to calibrate their accuracy based on the imbalanced set {x1, x2} and ignore the remaining problem sizes.

chunhualiao commented 4 years ago

So the final value of S is an unbounded execution time in [0, infinity). This is not portable across different applications or platforms. How do you measure a model's S when predicting for a set of applications on different machines?

The metric should be a ratio to be portable.

chunhualiao commented 4 years ago

g(x) XNOR p(x) generates 0 or 1. This is also a problem if a model generates floating-point values. Again, it is best to use a ratio against the ground truth.

ouankou commented 4 years ago

> So the final value of S is an unbounded execution time in [0, infinity). This is not portable across different applications or platforms. How do you measure a model's S when predicting for a set of applications on different machines?
>
> The metric should be a ratio to be portable.

With the ground truth of CPU and GPU performance data, we can make the right decision every time and save the most time. We can therefore use the ratio of a model's actual profit to the optimal profit from the ground truth, which lies in [0, 1). The value 1 is excluded because the optimal profit can never be reached once overhead is added.

S = (model_profit - overhead) / optimal_profit
  = (sum(q(x) * |x/g_c(x) - x/g_g(x)| * (sign(p_c(x) - p_g(x)) XNOR sign(g_c(x) - g_g(x)))) - t_overhead) / sum(q(x) * |x/g_c(x) - x/g_g(x)|)

S: overhead-adjusted accuracy, i.e. the percentage of the optimal profit a model obtains for a given application and platform.
x: problem size.
q(x): the probability that a certain problem size occurs in the application. For example, an application may have 10 iterations: 9 with problem size 50 and 1 with problem size 100. In this case, q(50) = 0.9 and q(100) = 0.1.
sum: over all unique problem sizes. In the example above, these are 50 and 100.
sign(...): the sign of a given value, 1 for non-negative and 0 for negative.
g_c(x): the ground-truth speed of the CPU for problem size x, e.g. 301.45 elements/ms.
g_g(x): the ground-truth speed of the GPU for problem size x, e.g. 8010.60 elements/ms.
p_c(x): the predicted speed of the CPU for problem size x, in elements/ms.
p_g(x): the predicted speed of the GPU for problem size x, in elements/ms.
t_overhead: the average overhead of calling the model.
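
A rough Python sketch of this normalized score; the speed numbers below are hypothetical and only serve to show how the terms fit together:

```python
# Sketch of the normalized score S. All speeds are hypothetical (elements/ms);
# x / speed converts a problem size into an execution time, so
# abs(x/g_c - x/g_g) is the time saved by picking the faster device.

def sign(v):
    return 1 if v >= 0 else 0   # 1 for non-negative, 0 for negative

def normalized_score(q, g_c, g_g, p_c, p_g, t_overhead):
    model_profit = 0.0
    optimal_profit = 0.0
    for x, qx in q.items():
        saving = abs(x / g_c[x] - x / g_g[x])
        # XNOR of the two signs: 1 when the model picks the same device as the ground truth
        correct = 1 if sign(p_c[x] - p_g[x]) == sign(g_c[x] - g_g[x]) else 0
        model_profit += qx * saving * correct
        optimal_profit += qx * saving
    return (model_profit - t_overhead) / optimal_profit

# Hypothetical data: the model picks the right device for x = 1024 but not for x = 4096.
q   = {1024: 0.5, 4096: 0.5}
g_c = {1024: 300.0, 4096: 310.0}   # ground-truth CPU speed
g_g = {1024: 250.0, 4096: 8000.0}  # ground-truth GPU speed
p_c = {1024: 290.0, 4096: 320.0}   # predicted CPU speed
p_g = {1024: 240.0, 4096: 300.0}   # predicted GPU speed (wrongly below CPU for 4096)
print(round(normalized_score(q, g_c, g_g, p_c, p_g, t_overhead=0.05), 3))  # ~0.044
```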

For the two examples above:

Example 1:

CPM: S1 = (0.25 * 0.5 - 0.01) / (0.25 * 0.5 + 0.25 * 0.5 + 0.25 * 0.1 + 0.25 * 1) = 0.115 / 0.525 = 0.219
FPM: S2 = (0.25 * 0.5 + 0.25 * 0.1 + 0.25 * 1 - 0.05) / (0.25 * 0.5 + 0.25 * 0.5 + 0.25 * 0.1 + 0.25 * 1) = 0.35 / 0.525 = 0.667

Example 2:

CPM: S1 = (0.5 * 0.5 - 0.01) / (0.5 * 0.5 + 0.5 * 0.5) = 0.24 / 0.5 = 0.48
FPM: S2 = (0.5 * 0.5 - 0.05) / (0.5 * 0.5 + 0.5 * 0.5) = 0.2 / 0.5 = 0.4
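
These normalized values can be checked directly:

```python
# Checking the normalized scores for the two examples above.
optimal_1 = 0.25 * (0.5 + 0.5 + 0.1 + 1.0)   # 0.525
print(round(0.115 / optimal_1, 3), round(0.35 / optimal_1, 3))   # 0.219 0.667
optimal_2 = 0.5 * (0.5 + 0.5)                # 0.5
print(round(0.24 / optimal_2, 2), round(0.2 / optimal_2, 2))     # 0.48 0.4
```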

With this revised score, for a given application on a given machine, we can directly see how good a model is: the score is the percentage of the ideal profit the model captures. For instance, in example 1, CPM has a score of 0.219, so we can say the model is not good without comparing it to other models. Given an application x on a machine y, if a model scores 0.99, we know it is nearly perfect for that case.

ouankou commented 4 years ago

The time cost of building and using a model is an unbounded value, such as 1 ms or 10 s. To fit this overhead into a bounded metric (0 to 1), we have to convert it to a ratio. Accuracy itself is unitless, so to include the overhead, which carries a time unit, we have to divide it by another time quantity so that the units cancel. The natural choice is the time saved by choosing the correct computing device, which we treat as the profit.

overhead-adjusted accuracy = accuracy - usage overhead percentage per iteration - building overhead percentage per iteration

usage overhead percentage per iteration: time cost of calling the model / profit based on the ground truth
building overhead percentage per iteration: time cost of building the model / total iterations

There are two cases for using a model in iterative algorithms.

  1. Infinite iterations: if the model is built only once instead of being rebuilt every iteration, the amortized building overhead is zero. By default, we assume an unlimited number of iterations.

  2. Finite iterations: the overhead is calculated as described above.

Case 1: Given the CPM model and infinite iterations, the accuracy is 20%, the usage overhead is 0.1%, and the building overhead is 0, so its overhead-adjusted accuracy is 19.9%. For the FPM model, the accuracy is 80%, the usage overhead is 0.5%, and the building overhead is 0, so its overhead-adjusted accuracy is 79.5%.

Case 2: With the same models as in case 1, but CPM needs to be rebuilt every 10 iterations, so its average building overhead is 10%. Now its overhead-adjusted accuracy is 9.9%.

Case 3: With the same models as in case 1, but the application only runs 1000 iterations. The building overhead of CPM is 0.2%, while the building overhead of FPM becomes 50% due to its very long building time. Now the overhead-adjusted accuracy of CPM is 19.7% and that of FPM is 29.5%.
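
A quick check of the three cases (numbers taken from the text):

```python
# Checking the three cases: adjusted = accuracy - usage overhead - building overhead.
cases = [
    ("Case 1, CPM", 0.20, 0.001, 0.0),
    ("Case 1, FPM", 0.80, 0.005, 0.0),
    ("Case 2, CPM", 0.20, 0.001, 0.10),
    ("Case 3, CPM", 0.20, 0.001, 0.002),
    ("Case 3, FPM", 0.80, 0.005, 0.50),
]
for name, acc, usage, building in cases:
    print(name, round(acc - usage - building, 3))
# 0.199, 0.795, 0.099, 0.197, 0.295
```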

ouankou commented 4 years ago

Consider the following time costs as part of the building time: 1. code update, 2. training, 3. postprocessing, 4. curve fitting.