Open nchristensen opened 1 year ago
For tuning purposes, it may be sufficient to measure the flop rates of single batches rather than the entire kernel.
With loops only tagged 'for' it is probably fast enough.
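A rough sketch of the single-batch idea, in plain Python rather than the actual kernel code. Everything here is an assumption for illustration: `batch_matvec` is a hypothetical stand-in for the real kernel, `flop_rate` is a hypothetical timing helper, and the sizes are made up. It times one batch and the whole problem and reports a flop rate for each, so the two can be compared.

```python
import time


def batch_matvec(mats, vecs):
    """Hypothetical stand-in kernel: one dense mat-vec per batch element."""
    return [[sum(a * x for a, x in zip(row, v)) for row in m]
            for m, v in zip(mats, vecs)]


def flop_rate(fn, args, flops, repeats=5):
    """Return the best observed flop/s for fn(*args) over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero elapsed time on very coarse timers.
    return flops / max(best, 1e-12)


# Made-up problem sizes, chosen only so the sketch runs quickly.
n = 16          # matrix dimension
nbatches = 8    # batches in the full problem
batch_size = 4  # elements per batch

nelem = nbatches * batch_size
mats = [[[1.0] * n for _ in range(n)] for _ in range(nelem)]
vecs = [[1.0] * n for _ in range(nelem)]

# Counted as one multiply and one add per matrix entry.
flops_per_elem = 2 * n * n

# Time a single batch only.
single_rate = flop_rate(
    batch_matvec,
    (mats[:batch_size], vecs[:batch_size]),
    flops_per_elem * batch_size,
)

# Time the entire problem for comparison.
full_rate = flop_rate(batch_matvec, (mats, vecs), flops_per_elem * nelem)

print(f"single batch: {single_rate:.3e} flop/s")
print(f"full problem: {full_rate:.3e} flop/s")
```

If the two rates track each other across candidate configurations, timing one batch should be a cheaper proxy for tuning. Whether that holds for the real kernel would need to be checked.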