data_flow_ops.RecordInput outperforms tf.data

@reedwm I ran some performance tests of ResNet50 (also _v1.5 and v2) on Tesla T4 and V100 GPUs (1-8 GPUs). I found that the input pipeline built from data_flow_ops.RecordInput + data_flow_ops.StagingArea (enabled with --datasets_use_prefetch=False --use_datasets=False) generally outperforms both tf.data + multi_device_iterator_ops.MultiDeviceIterator (--datasets_use_prefetch=True --use_datasets=True) and tf.data + data_flow_ops.StagingArea (--datasets_use_prefetch=False --use_datasets=True). However, all the models I have come across so far use the tf.data API in their input pipelines. Since my tests showed better performance with data_flow_ops.RecordInput than with tf.data, which one do you suggest we use?
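
For quick reference, the three flag combinations compared above can be written as a small lookup table. This is only a sketch for keeping the configurations straight: `PIPELINES` and `describe_input_pipeline` are hypothetical names made up for illustration, not part of tf_cnn_benchmarks, and the combination --use_datasets=False --datasets_use_prefetch=True is not covered in this post, so it is left undefined here.

```python
from typing import Optional

# Flag combinations exactly as listed in the post above.
# Keys are (use_datasets, datasets_use_prefetch).
# PIPELINES is a hypothetical name used only for this illustration.
PIPELINES = {
    (False, False): "data_flow_ops.RecordInput + data_flow_ops.StagingArea",
    (True, True): "tf.data + multi_device_iterator_ops.MultiDeviceIterator",
    (True, False): "tf.data + data_flow_ops.StagingArea",
}


def describe_input_pipeline(use_datasets: bool,
                            datasets_use_prefetch: bool) -> Optional[str]:
    """Return the pipeline named in the post for a flag combination.

    Returns None for a combination the post does not describe.
    """
    return PIPELINES.get((use_datasets, datasets_use_prefetch))


if __name__ == "__main__":
    # Print each combination in the same flag syntax used in the post.
    for (use_datasets, prefetch), name in PIPELINES.items():
        print(f"--use_datasets={use_datasets} "
              f"--datasets_use_prefetch={prefetch} -> {name}")
```

Keying on the pair of booleans keeps the mapping explicit and makes the one unlisted combination visible as a missing entry rather than silently falling through to a default.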

tensorflow / benchmarks

data_flow_ops.RecordInput outperforms tf.data #496