There are a couple of options here:

1. Always expand_copy factory function calls. This has its own tradeoffs: a program that works today could stop working, and there are performance differences (if the expanded tensor didn't actually need to be expanded, the extra copy is a perf hit).
2. Have torch.zeros return something lazy that later determines whether or not it needs to be expanded. This needs design work, and I'm not sure it can avoid all the edge cases, e.g. https://pytorch.org/functorch/stable/ux_limitations.html#mutation-in-place-pytorch-operations
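To make option 2 concrete, here is a minimal sketch (plain Python, not PyTorch internals; the class name and methods are hypothetical) of a lazy zeros placeholder: reads can share a single unexpanded buffer, and per-batch storage is only materialized when an in-place mutation forces each batch element to own its data.

```python
# Hypothetical sketch of a "lazy zeros" under a vmap-like transform.
# Not PyTorch's implementation; names and semantics are illustrative.

class LazyZeros:
    """Stands in for torch.zeros(shape) inside a batched computation.

    Expansion to per-batch storage is deferred until an in-place op
    actually requires each batch element to have distinct values.
    """

    def __init__(self, shape, batch_size):
        self.shape = shape
        self.batch_size = batch_size
        self._materialized = None  # per-batch buffers, created on demand

    def is_expanded(self):
        return self._materialized is not None

    def read(self):
        # A pure read never forces expansion: every batch element sees
        # the same zeros, so one shared (logical) buffer suffices.
        if self._materialized is None:
            return [0.0] * self.shape[0]
        return self._materialized

    def add_(self, batch_values):
        # An in-place op forces expansion: after this, batch elements
        # differ, so each needs its own storage to avoid aliasing.
        if self._materialized is None:
            self._materialized = [
                [0.0] * self.shape[0] for _ in range(self.batch_size)
            ]
        for row, v in zip(self._materialized, batch_values):
            for i in range(len(row)):
                row[i] += v
        return self


z = LazyZeros((3,), batch_size=2)
print(z.is_expanded())   # False: nothing has mutated it yet
z.add_([1.0, 2.0])       # in-place op triggers materialization
print(z.is_expanded())   # True: each batch element now owns storage
```

The tricky part this sketch glosses over is exactly the linked limitation: deciding, at the time of an in-place op, whether the lazy tensor must be expanded, which is where the edge cases live.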