This is needed since we no longer want to instantiate multiple Trainer instances.
Also fixes a device-related bug in the process (see last commit):
This was previously masked by the old usage pattern, where we instantiated a trainer for each transformer (semantic/coarse/fine). Now that transformers must be loadable without their corresponding trainers (because we can only load one accelerator at a time), the trainer code that checks the wrapper's device no longer applies, and we get device mismatches. This commit fixes that.
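
For context, here is a minimal sketch of the kind of fix involved, assuming a PyTorch setup. The function name `load_transformer_standalone` and the checkpoint handling are hypothetical illustrations, not this repo's actual API; the point is that when a transformer is loaded without its Trainer, nothing moves it to the right device anymore, so we have to do it explicitly:

```python
import torch

def load_transformer_standalone(checkpoint_path, transformer):
    # Hypothetical helper: pick the device ourselves instead of relying on
    # the Trainer's accelerator wrapper, which is no longer instantiated
    # per transformer.
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # Load the checkpoint tensors directly onto the target device so they
    # don't end up on CPU while the model parameters sit on GPU (or vice
    # versa), which is the device mismatch described above.
    state_dict = torch.load(checkpoint_path, map_location=device)
    transformer.load_state_dict(state_dict)

    # Make sure the module itself lives on the target device before use.
    return transformer.to(device)
```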
--
I tested this code and was able to get a small training run of a few hundred steps working, so it works e2e!