intel / webml-polyfill

Deprecated, the Web Neural Network Polyfill project has been moved to https://github.com/webmachinelearning/webnn-polyfill
Apache License 2.0

Optimize Execution I/O implementation #1358

Open fujunwei opened 4 years ago

fujunwei commented 4 years ago

Test Environments:
• DNNL and Discrete IE-MKLDNN tested on e306860
• Integrated IE-MKLDNN tested on 6f1b378
• Native OpenVINO IE-MKLDNN tested with OpenVINO toolkit 2020.3
• CPU: Intel i7-1065G7 @ 1.30 GHz, 1.50 GHz
• OS: Windows

There seems to be a larger gap to native when executing small models, e.g. MobileNet (70% of native) and SqueezeNet (50% of native). I guess the reason is the ratio between Execution I/O and compute time: small models have a relatively short compute time, so the fixed Execution I/O cost weighs more heavily. It looks like there is room to improve the Execution I/O implementation.
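To illustrate the amortization argument, here is a minimal sketch (the millisecond figures are illustrative assumptions, not measurements from this issue): if the per-inference Execution I/O cost is roughly fixed, the achievable fraction of native performance shrinks as model compute time shrinks, which is consistent with the 70%/50% numbers above.

```python
def native_fraction(compute_ms: float, io_ms: float) -> float:
    """Fraction of native speed achieved, assuming native inference has
    negligible Execution I/O overhead and the polyfill adds a fixed io_ms
    per inference on top of the same compute_ms."""
    return compute_ms / (compute_ms + io_ms)

# Hypothetical numbers: with a fixed ~2 ms I/O cost per inference,
# a model with 5 ms of compute reaches ~71% of native,
# while a model with only 2 ms of compute drops to 50%.
print(round(native_fraction(5.0, 2.0), 2))  # larger model, I/O better amortized
print(round(native_fraction(2.0, 2.0), 2))  # small model, I/O dominates
```

Under this model, halving the Execution I/O cost helps small models disproportionately, which is why optimizing the I/O path mostly closes the gap for MobileNet- and SqueezeNet-class networks.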