Closed mikowals closed 4 years ago
Great job on the profiling. I had forgotten that we'd done the same for other internal models to avoid the use of `scalarized()` and its performance consequences. This also simplifies the training loop code.
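For context, the slow pattern being avoided looks roughly like this (illustrative names, not the actual model code): calling `scalarized()` inside the batch loop forces a device-to-host transfer, and on XLA devices it also materializes the pending computation on every step.

```swift
import TensorFlow

// Anti-pattern (sketch): scalarizing every batch forces a host sync per step.
// var runningLoss: Float = 0
// for batch in dataset {
//     let loss = lossFunction(model(batch.data), batch.labels)
//     runningLoss += loss.scalarized()   // device-to-host copy each iteration
// }

// Preferred: accumulate as a Tensor on-device, scalarize once after the loop.
// var runningLoss = Tensor<Float>(0, on: device)
// for batch in dataset { runningLoss += loss }
// print(runningLoss.scalarized())       // single host transfer
```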
Would it be possible to clean the execution state of your notebook, removing the output from the cells and their execution order? I'd like for the notebook to start blank, and it would be preferable for the diff here to only be the code that was changed.
I removed the outputs and execution state. I did not realise those got saved as part of the notebook.
Thanks for bearing with me as I add silly mistakes while hurrying commits. I found your suggestions all very useful.
Keeping accumulated training statistics as tensors and only calling `scalarized()` for printing appears to speed up performance ~2x with XLA devices. The timings below were done on Colab free instances.

The changes I made are:
- `scalarized()` called outside batch loops
- `Statistics.init(on: device)` to avoid device mismatch errors
- `update()` method

There may be a simpler way to get the same impact in the training tutorial. Also, I have seen similar Statistics code that might be a more general solution.
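The changes listed above can be sketched as follows. This is a minimal illustration, not the PR's actual code; the field names and the exact `update()` signature are assumptions.

```swift
import TensorFlow

// Hypothetical accumulator that keeps running totals as tensors on one device,
// deferring the host transfer until the values are printed.
struct Statistics {
    var totalLoss: Tensor<Float>
    var correctCount: Tensor<Int32>
    var sampleCount: Int = 0

    // Create the accumulators directly on the target device so that adding
    // per-batch tensors does not trigger a device mismatch error.
    init(on device: Device) {
        totalLoss = Tensor(0, on: device)
        correctCount = Tensor(0, on: device)
    }

    // Accumulate on-device; no scalarized() inside the batch loop.
    mutating func update(loss: Tensor<Float>, correct: Tensor<Int32>, batchSize: Int) {
        totalLoss += loss
        correctCount += correct
        sampleCount += batchSize
    }
}

// After the batch loop, a single scalarized() per statistic pulls the
// values to the host for printing:
// print("loss: \(stats.totalLoss.scalarized() / Float(stats.sampleCount))")
```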