Stabilize runtime measurement for predictions

Including time_in_ms in the output predictions works in principle. However, runtimes vary so much across datacenters that the value is unreliable. To measure it reliably, we need a more controlled experiment: fix the accelerator and measure every model's runtime on that same accelerator. (See Uncertainty Baselines' internal profile Colab notebooks.)
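
As a rough illustration of what a controlled measurement could look like, here is a minimal sketch using only the Python standard library. It is not the approach from the internal notebooks, which are not shown here. The names `measure_runtime_ms` and `dummy_model` are hypothetical, and the sketch assumes the process is already pinned to one fixed accelerator and that the timed callable blocks until its result is ready (with asynchronously dispatched frameworks, the callable would need to wait on the result explicitly).

```python
import statistics
import time
from typing import Callable, Dict


def measure_runtime_ms(predict_fn: Callable[[], object],
                       warmup: int = 3,
                       repeats: int = 10) -> Dict[str, float]:
    """Times predict_fn repeatedly and summarizes the wall-clock runtimes.

    Hypothetical helper: assumes the process is pinned to a single, fixed
    accelerator and that predict_fn returns only once its result is ready.
    """
    # Warm-up calls absorb one-off costs such as compilation and cache fills.
    for _ in range(warmup):
        predict_fn()

    timings_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict_fn()
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    # The median is less sensitive to occasional slow runs than the mean.
    return {
        "median_ms": statistics.median(timings_ms),
        "min_ms": min(timings_ms),
        "max_ms": max(timings_ms),
        "num_runs": float(len(timings_ms)),
    }


def dummy_model() -> int:
    """Stand-in for a model's forward pass (purely illustrative)."""
    return sum(i * i for i in range(10_000))


stats = measure_runtime_ms(dummy_model, warmup=2, repeats=5)
print(stats)
```

Reporting the median together with the min and max makes it easier to see whether a given run was noisy, which is the failure mode described above.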

google-research / robustness_metrics

Stabilize runtime measurement for predictions #5