The RAII wrapper that we used to deallocate the output of the SegmentReduction ops that were run on the CPU through the Eager API wasn't actually deallocating the memory because we were passing a pointer that was always null. To fix this, we needed to pass a pointer to the pointer (TFE_TensorHandle**).
There's also a slight performance increase from the new CopyDeviceTensorsToCPU function, which lets kernels copy multiple tensors at once and flush/sync only once, instead of doing it once per tensor.