refactors the expert models so that they are disentangled from the Lightning logic
keeps both the Lightning trainer and the HF trainer
renamed config.py to arguments.py; in the next PR, ExpertConfig will be renamed to ExpertArgs
ExpertModel now returns a CausalLMOutput, to align with the HF interface
forward no longer accepts a batch argument, to align with how HF expects models to behave
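To illustrate the signature change, here is a minimal sketch. Everything here is an illustrative assumption rather than the actual mttl code: plain lists stand in for tensors, and CausalLMOutputSketch stands in for transformers' CausalLMOutput; only the keyword names (input_ids, attention_mask, labels) follow the usual HF convention.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CausalLMOutputSketch:
    # stand-in for transformers.modeling_outputs.CausalLMOutput
    loss: Optional[float]
    logits: List[List[float]]


class ExpertModelSketch:
    """Hypothetical sketch: forward takes HF-style keyword arguments,
    not a single `batch` dict, and returns a CausalLMOutput-like object."""

    def forward(self, input_ids, attention_mask=None, labels=None, **kwargs):
        # dummy "logits": one score per token, just to have a return value
        logits = [[float(tok) for tok in seq] for seq in input_ids]
        loss = 0.0 if labels is not None else None
        return CausalLMOutputSketch(loss=loss, logits=logits)


# before the refactor, callers passed the whole dict (Lightning-style):
#   model.forward(batch)
# after, they unpack it, which is how HF's Trainer calls models:
#   model.forward(**batch)
model = ExpertModelSketch()
batch = {"input_ids": [[1, 2, 3]], "attention_mask": [[1, 1, 1]]}
out = model.forward(**batch)
```

The point of the change is exactly the `forward(**batch)` call at the end: HF's Trainer unpacks the batch dict into keyword arguments, so the model must accept them individually.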
added a Serializable class that handles serialization and deserialization of dataclasses by dumping the class's module path under mttl. This could break if we move modules around, so we may want a better solution later. Dynamic loading is also implemented: if the module stored in the serialized form can no longer be found, the stored class is searched among the currently loaded modules.
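A minimal sketch of how such a mixin could work, assuming the payload stores the dotted import path under a `_class` key; the method names (dumps/loads), the key name, and ExampleArgs are all illustrative assumptions, not the actual mttl API:

```python
import importlib
import sys
from dataclasses import asdict, dataclass, fields


@dataclass
class Serializable:
    """Hypothetical sketch: dump the concrete class's import path
    alongside its fields, and resolve it back when loading."""

    def dumps(self) -> dict:
        data = asdict(self)
        # record where the class lives, e.g. "mttl.arguments.ExpertConfig"
        data["_class"] = f"{type(self).__module__}.{type(self).__qualname__}"
        return data

    @classmethod
    def loads(cls, data: dict) -> "Serializable":
        data = dict(data)
        module_path, _, class_name = data.pop("_class").rpartition(".")
        try:
            klass = getattr(importlib.import_module(module_path), class_name)
        except (ImportError, AttributeError):
            # dynamic fallback: the stored module moved or was renamed,
            # so look for the class name in the currently loaded modules
            klass = next(
                getattr(mod, class_name)
                for mod in list(sys.modules.values())
                if mod is not None
                and isinstance(getattr(mod, class_name, None), type)
            )
        known = {f.name for f in fields(klass)}
        return klass(**{k: v for k, v in data.items() if k in known})


@dataclass
class ExampleArgs(Serializable):  # illustrative, not the real ExpertConfig
    lr: float = 1e-3
    model: str = "gpt2"
```

Round-tripping then looks like `ExampleArgs.loads(ExampleArgs(lr=0.01).dumps())`. The fragility noted above is visible in `_class`: it is a plain string, so moving the class to another module invalidates old payloads unless the fallback search happens to find it.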
added an HF training script and a DownstreamEvalCallback for the HF trainer