On GPU-accelerated implementations, the first iteration also includes device initialization and the initial data copy to the device. This one-off cost makes the first iteration a significant outlier and skews the reported iterations/second. It would be better to measure and report the time of the first iteration separately from the time of the remaining iterations.
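A minimal sketch of what separate reporting could look like, assuming each iteration can be wrapped in a callable. `benchmark`, `step` and the result keys are hypothetical names, not part of any existing harness. GPU work is usually launched asynchronously, so `step` should block until the device has finished (for example via the framework's synchronize call). Without that, the host-side timings will not reflect the actual GPU work.

```python
import time
from typing import Callable, Dict


def benchmark(step: Callable[[], None], n_iters: int,
              clock: Callable[[], float] = time.perf_counter) -> Dict[str, float]:
    """Time the first iteration separately from the remaining ones.

    `step` is a hypothetical callable that runs one iteration and, on a GPU
    backend, waits for the device to finish before returning.
    """
    if n_iters < 2:
        raise ValueError("need at least 2 iterations to separate the first one")

    # First iteration: may include device initialization and data copies.
    t0 = clock()
    step()
    first = clock() - t0

    # Remaining iterations: steady-state cost, reported as throughput.
    t1 = clock()
    for _ in range(n_iters - 1):
        step()
    rest = clock() - t1

    return {
        "first_iteration_s": first,
        "rest_total_s": rest,
        "rest_iterations_per_s": (n_iters - 1) / rest if rest > 0 else float("inf"),
    }
```

Injecting `clock` makes it possible to test the harness with a fake timer. In a real run, the default `time.perf_counter` is used, and both numbers are reported side by side instead of being folded into a single iterations/second figure.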