Optimize output allocation for inputs that can be forwarded

When in-place execution is possible (mostly for element-wise operators), forward one of the inputs as the output if all of the following conditions hold:

The input isn't used by another operator that hasn't been executed yet.
The input has the same data type, shape, and size as the output.
For composite operators, the input isn't used more than once. This matters because we write to the shared input/output buffer in a non-deterministic order, so an element can't be relied upon to keep its original value for a second read.
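
The three conditions above can be sketched as a small selection routine. This is only an illustration in Python, not the code from this change: `TensorInfo`, `pending_consumers`, `input_use_counts`, and `pick_forwardable_input` are hypothetical names, and the caller is assumed to have already decided that the operator supports in-place execution.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class TensorInfo:
    """Hypothetical tensor description (not a TensorFlow type)."""
    dtype: str
    shape: Tuple[int, ...]
    size_in_bytes: int
    # Number of other operators, not yet executed, that still read this tensor.
    pending_consumers: int = 0


def pick_forwardable_input(
    inputs: Sequence[TensorInfo],
    output: TensorInfo,
    input_use_counts: Sequence[int],
    is_composite: bool,
) -> Optional[int]:
    """Return the index of the first input whose buffer may be reused, else None."""
    for index, (tensor, use_count) in enumerate(zip(inputs, input_use_counts)):
        # Condition 1: no unexecuted operator still needs the original values.
        if tensor.pending_consumers > 0:
            continue
        # Condition 2: data type, shape, and size must match the output.
        if (tensor.dtype, tensor.shape, tensor.size_in_bytes) != (
            output.dtype, output.shape, output.size_in_bytes
        ):
            continue
        # Condition 3: a composite operator must read the input only once,
        # because writes to the shared buffer happen in no guaranteed order.
        if is_composite and use_count > 1:
            continue
        return index
    return None  # No candidate: the caller allocates a fresh output buffer.


out = TensorInfo("float32", (2, 3), 24)
a = TensorInfo("float32", (2, 3), 24, pending_consumers=1)  # still needed elsewhere
b = TensorInfo("float32", (2, 3), 24)
print(pick_forwardable_input([a, b], out, [1, 1], is_composite=False))  # 1
```

Returning `None` rather than raising keeps the fallback path (a normal output allocation) in the caller, which is one plausible way to structure it.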
