When training models, the bulk of evaluation currently runs on the main worker only. If we train with 8 GPUs, we should be able to get roughly an 8x speedup on eval, which would make a real difference with large evaluation sets.
The main culprit is this method: https://github.com/jxmorris12/vec2text/blob/master/vec2text/trainers/base.py#L363C5-L365C27 and the subsequent call to _get_decoded_sequences in the Base trainer class. We explicitly enumerate over an eval dataloader of the first n samples, which (I think) happens redundantly in every worker. Instead, we should split that work across the GPUs.
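For illustration, here is a minimal sketch of what the sharded version could look like, assuming a standard torch.distributed setup (as under a multi-GPU Trainer launch). The function name `shard_eval_generation`, the round-robin sharding scheme, and the plain `input_ids`/`attention_mask` generation call are all hypothetical simplifications, not the repo's actual embedding-based generation path:

```python
import torch
import torch.distributed as dist


def shard_eval_generation(model, tokenizer, eval_dataloader, n, device):
    """Hypothetical sketch: each rank decodes its slice of the first n eval
    batches, then results are gathered so the main process can compute metrics.

    Assumes torch.distributed is already initialized and that each batch is a
    dict of tensors containing input_ids / attention_mask.
    """
    rank = dist.get_rank() if dist.is_initialized() else 0
    world_size = dist.get_world_size() if dist.is_initialized() else 1

    local_decoded = []
    for i, batch in enumerate(eval_dataloader):
        if i >= n:
            break
        # Round-robin shard: rank r handles batches r, r + world_size, ...
        if i % world_size != rank:
            continue
        inputs = {
            k: v.to(device)
            for k, v in batch.items()
            if k in ("input_ids", "attention_mask")
        }
        with torch.no_grad():
            generated = model.generate(**inputs, max_new_tokens=64)
        local_decoded.extend(
            tokenizer.batch_decode(generated, skip_special_tokens=True)
        )

    if not dist.is_initialized():
        return local_decoded

    # Gather every rank's decoded strings (cheap for text) and flatten on rank 0.
    gathered = [None] * world_size
    dist.all_gather_object(gathered, local_decoded)
    if rank == 0:
        return [s for shard in gathered for s in shard]
    return []
```

Round-robin sharding keeps the change local to the enumeration loop, and `all_gather_object` avoids having to pad or tensorize the decoded strings before collecting them on the main process.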