google-research / nasbench

NASBench: A Neural Architecture Search Dataset and Benchmark

y-axis in Fig 7(left) #10

Open bkj opened 5 years ago

bkj commented 5 years ago

The left plot in Fig 7 in the paper shows test regret -- can you explain how that's computed exactly?

I know it's log10(y_best - y) -- but what is y_best exactly? Is that the best validation or test accuracy, from a single model run or averaged across the 3 model runs?

I think the four possibilities would be:

| accuracy | aggregation across the 3 runs | value |
| --- | --- | --- |
| test | mean | 0.943175752957662 |
| test | maximum | 0.9466145634651184 |
| validation | mean | 0.9505542318026224 |
| validation | maximum | 0.9518229365348816 |
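Roughly how these candidates can be computed, as a sketch only: it assumes the nasbench API's `hash_iterator()` and `get_metrics_from_hash()`, and the tfrecord path and the 108-epoch key are placeholders for whichever dataset file is used.

```python
# Sketch: best architecture under each of the four candidate definitions of y_best.
import numpy as np
from nasbench import api

nasbench = api.NASBench('nasbench_full.tfrecord')  # placeholder path

best = {'test_mean': 0.0, 'test_max': 0.0, 'valid_mean': 0.0, 'valid_max': 0.0}
for unique_hash in nasbench.hash_iterator():
    _, computed_stats = nasbench.get_metrics_from_hash(unique_hash)
    runs = computed_stats[108]  # the 3 independent training runs at 108 epochs
    test = [r['final_test_accuracy'] for r in runs]
    valid = [r['final_validation_accuracy'] for r in runs]
    best['test_mean'] = max(best['test_mean'], float(np.mean(test)))
    best['test_max'] = max(best['test_max'], float(np.max(test)))
    best['valid_mean'] = max(best['valid_mean'], float(np.mean(valid)))
    best['valid_max'] = max(best['valid_max'], float(np.max(valid)))

print(best)
```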

Thanks!

aaronkl commented 5 years ago

It's the mean test accuracy, i.e. 0.943175752957662.

bkj commented 5 years ago

Ok thanks. So it sounds like it’s

log(best_mean_test_acc - arch_mean_test_acc)

then? Otherwise it would be possible to have -inf regret?

I was also wondering — do you have a similar plot for validation accuracy that you could share?

aaronkl commented 5 years ago

Yes, exactly: it's log(best_mean_test_acc - arch_mean_test_acc).
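In code, that regret would look roughly like the following minimal sketch; the constant is the best mean test accuracy quoted earlier in the thread.

```python
import numpy as np

# best mean test accuracy over all architectures (value quoted above)
BEST_MEAN_TEST_ACC = 0.943175752957662

def test_regret(arch_mean_test_acc):
    """log10 test regret, as plotted on the y-axis of Fig 7 (left)."""
    return np.log10(BEST_MEAN_TEST_ACC - arch_mean_test_acc)
```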

I attached Figure 7, just with the validation regret on the y-axis: comparison_time_all_mean.pdf, comparison_time_all_mean_valid.pdf

Note that we found some slightly better hyperparameters for SMAC and BOHB, which is why they improved. For comparison, I also added the original Fig 7 with the updated test regret.

bkj commented 5 years ago

I put code that attempts to reproduce the results of the random search here: https://gist.github.com/bkj/8ae8da3c84bbb0fa06d144a6e7da8570

The results don't look exactly the same as in the paper -- the best regret is around 5.5 * 1e-3 vs what looks like about 4.1 * 1e-3 in the paper. Any thoughts on where the differences might be coming from?

Roughly the procedure is:

1) sample a sequence of N random architectures
2) sample a validation accuracy for each architecture
3) plot log10(best_mean_test_acc - arch_mean_test_acc) for the architecture with the best validation accuracy seen so far
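A condensed sketch of that procedure (not the exact code from the gist linked above; it assumes the same nasbench API calls as earlier, with `nasbench` an already-loaded `api.NASBench` instance):

```python
import random
import numpy as np

BEST_MEAN_TEST_ACC = 0.943175752957662  # best mean test accuracy, quoted above

def random_search_regret(nasbench, n_samples, seed=0):
    """Return the log10 test regret of the incumbent after each sample."""
    rng = random.Random(seed)
    all_hashes = list(nasbench.hash_iterator())
    best_valid = -np.inf
    incumbent_mean_test = None
    regrets = []
    for _ in range(n_samples):
        # 1) sample a random architecture
        _, stats = nasbench.get_metrics_from_hash(rng.choice(all_hashes))
        runs = stats[108]  # the 3 training runs at 108 epochs
        # 2) sample the validation accuracy of a single run for model selection
        run = rng.choice(runs)
        if run['final_validation_accuracy'] > best_valid:
            best_valid = run['final_validation_accuracy']
            # the incumbent's test accuracy is averaged over the 3 runs
            incumbent_mean_test = np.mean([r['final_test_accuracy'] for r in runs])
        # 3) log10 test regret of the incumbent so far
        regrets.append(np.log10(BEST_MEAN_TEST_ACC - incumbent_mean_test))
    return regrets
```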

Plot of results:

(attached screenshot of the random-search regret plot)

Edit: Perhaps the issue is line 73 -- do you use the mean validation accuracy across the 3 runs for model selection, as opposed to sampling a single run? I updated the plot above to show the difference.
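For reference, the two selection rules being compared, as a small sketch (`runs` is the list of 3 run dicts for an architecture, as in the snippets above; the function name is hypothetical):

```python
import random
import numpy as np

def selection_score(runs, use_mean_valid=True, rng=random):
    """Validation score used to pick the incumbent during the search."""
    valid = [r['final_validation_accuracy'] for r in runs]
    # either the mean over the 3 runs, or a single randomly sampled run
    return float(np.mean(valid)) if use_mean_valid else float(rng.choice(valid))
```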