Barrier tries to allocate 8 MB that my run can't afford:
File "/home/ubuntu/carlos/torchtitan/train.py", line 451, in main
utils.set_pg_timeouts(
File "/home/ubuntu/carlos/torchtitan/torchtitan/utils.py", line 75, in set_pg_timeouts
torch.distributed.barrier(device_ids=[torch.cuda.current_device()])
File "/home/ubuntu/.pyenv/versions/titan/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
return func(*args, **kwargs)
File "/home/ubuntu/.pyenv/versions/titan/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4376, in barrier
work = group.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:340, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Failed to CUDA calloc 8388608 bytes
Since the script only runs this code after the first step, it should not be too expensive to call `torch.cuda.empty_cache()` right before the barrier.

My run already sets `export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"`.
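For concreteness, a minimal sketch of what I mean (the wrapper name and placement are my own; the actual change would go right before the existing barrier call in `set_pg_timeouts` in torchtitan/utils.py):

```python
import torch
import torch.distributed as dist


def barrier_with_cache_flush() -> None:
    # Return cached-but-unused allocator segments to the driver so the small
    # CUDA allocation the NCCL barrier needs has headroom.
    # Sketch only: in torchtitan this would sit just before the existing
    # torch.distributed.barrier call in set_pg_timeouts.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    dist.barrier(device_ids=[torch.cuda.current_device()])
```

`empty_cache()` only releases segments the allocator is no longer using; it does not free live tensors, so calling it once after the first step should be cheap.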
This seems ok, but I am worried there might be some other underlying issue. I am surprised that you cannot afford 8 MB. Is that in line with what you expect / see in your GPU memory profiling?
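If it helps to answer that, here is a quick way to compare what the caching allocator holds against what the driver reports right before the barrier (the helper name and output format are just a suggestion, not existing torchtitan code):

```python
import torch


def log_cuda_memory(tag: str) -> None:
    dev = torch.cuda.current_device()
    free_b, total_b = torch.cuda.mem_get_info(dev)  # free/total as seen by the driver
    allocated = torch.cuda.memory_allocated(dev)    # bytes held by live tensors
    reserved = torch.cuda.memory_reserved(dev)      # bytes held by the caching allocator
    print(
        f"[{tag}] driver free {free_b / 2**20:.1f} / {total_b / 2**20:.1f} MiB | "
        f"allocated {allocated / 2**20:.1f} MiB | reserved {reserved / 2**20:.1f} MiB"
    )
```

A large gap between reserved and allocated would point at allocator caching/fragmentation rather than the model genuinely using all of the device memory.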