Refac: Straightforward output shape permutation

This PR is about where and when we call MaybePermuteHWC2CHW(). It's not about tensor allocation; that will come later.

At a high level, this PR changes all conditional call patterns like:

if (cond) {
  output.frames = MaybePermuteHWC2CHW(output.frames)
}

to a plain, unconditional call:

output.frames = MaybePermuteHWC2CHW(output.frames)

This makes our output shape permutation a lot simpler to reason about. In main, cond is typically input-dependent (but really, caller-dependent), which makes it hard to tell at any given point whether a tensor is HWC or CHW.
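To illustrate the idea, here is a minimal sketch in which the decision to permute lives entirely inside the helper, so call sites never need a condition. It is not the code from this PR: `Shape` is a hypothetical stand-in for a tensor that only tracks dimension sizes, and the `DimensionOrder` enum and the free-function signature are assumptions made for the example.

```cpp
#include <array>
#include <cstdint>

// Hypothetical stand-in for a tensor: only the three dimension sizes are kept.
using Shape = std::array<int64_t, 3>;

// Assumed output-layout setting; the name is illustrative, not the real option.
enum class DimensionOrder { HWC, CHW };

// The single place that decides whether to permute. The input is always HWC.
constexpr Shape MaybePermuteHWC2CHW(DimensionOrder requested, const Shape& hwc) {
  if (requested == DimensionOrder::HWC) {
    return hwc;  // no-op: the requested layout is already HWC
  }
  return Shape{hwc[2], hwc[0], hwc[1]};  // (H, W, C) -> (C, H, W)
}
```

With a helper of this shape, every caller can write `frames = MaybePermuteHWC2CHW(order, frames)` unconditionally, because the helper itself is a no-op when HWC is requested.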

Another benefit of this PR is that now all low-level decoding routines (like convertAVFrameToDecodedOutputOnCPU()) have a simpler interface: they only ever take and return HWC tensors.

At a lower level, the following changes were made:

MaybePermuteHWC2CHW() is now a method, so we can pass it a streamIndex. This makes its interface slightly simpler.
It's now up to every high-level decoding function to call MaybePermuteHWC2CHW().
Some methods like getFrameAtIndex() and getNextDecodedOutputNoDemux() were used both as a high-level decoding entry point and as a low-level subroutine of other entry points. I split those into getFrameAtIndex()/getFrameAtIndexInternal() and getNextFrame()/getNextDecodedOutputNoDemux() to clearly distinguish between the public entry point and the underlying private helper. Note that this isn't just a "nice-to-have" or a nitpick: it's necessary for the goal of this PR, because an entry point that permutes unconditionally can no longer double as a subroutine for callers that expect HWC tensors.
getNextFrame() is the new public entry point; getNextDecodedOutputNoDemux() is now private.
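The public/private split described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual class: `DecoderSketch`, `Shape`, `DimensionOrder`, the fixed two-stream layout, the parameter lists, and the hard-coded frame size are all hypothetical. Only the method names come from this PR.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a tensor: only the three dimension sizes are kept.
using Shape = std::array<int64_t, 3>;

// Assumed per-stream output-layout setting (illustrative name).
enum class DimensionOrder { HWC, CHW };

class DecoderSketch {
 public:
  constexpr DecoderSketch(DimensionOrder stream0, DimensionOrder stream1)
      : orders_{{stream0, stream1}} {}

  // Public entry point: decode in HWC, then permute exactly once on the way out.
  constexpr Shape getFrameAtIndex(std::size_t streamIndex, int64_t frameIndex) const {
    return MaybePermuteHWC2CHW(
        streamIndex, getFrameAtIndexInternal(streamIndex, frameIndex));
  }

  // Public entry point for sequential decoding, same pattern.
  constexpr Shape getNextFrame(std::size_t streamIndex) const {
    return MaybePermuteHWC2CHW(
        streamIndex, getNextDecodedOutputNoDemux(streamIndex));
  }

 private:
  // Private helpers only ever produce HWC, so other entry points can reuse
  // them without worrying about a permutation having already been applied.
  constexpr Shape getFrameAtIndexInternal(std::size_t, int64_t) const {
    return Shape{270, 480, 3};  // placeholder for a decoded HWC frame
  }
  constexpr Shape getNextDecodedOutputNoDemux(std::size_t) const {
    return Shape{270, 480, 3};  // placeholder for a decoded HWC frame
  }

  // As a method, the helper can look up the stream's setting from streamIndex.
  constexpr Shape MaybePermuteHWC2CHW(std::size_t streamIndex, const Shape& hwc) const {
    if (orders_[streamIndex] == DimensionOrder::HWC) {
      return hwc;
    }
    return Shape{hwc[2], hwc[0], hwc[1]};  // (H, W, C) -> (C, H, W)
  }

  std::array<DimensionOrder, 2> orders_;
};
```

In this sketch, a decoder built as `DecoderSketch(DimensionOrder::HWC, DimensionOrder::CHW)` returns an unchanged HWC shape for stream 0 and a permuted CHW shape for stream 1, while the private helpers see only HWC in both cases.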

A follow-up to this PR will be to unify the tensor allocation. I think it'd make sense for tensors to always be pre-allocated by the high-level decoding entry points, which would let us keep the allocation logic in a single place.

pytorch/torchcodec #317