Debug formatter for `Tensor` is confusing with more than one GPU

When running a model across several processes using NCCL, the `Debug` output for a `Tensor` shows the same `DeviceId` for tensors that live on two different GPUs:

GPU 0: (dev=Cuda(CudaDevice(DeviceId(1))), shape=[1, 128256], len=128256)
GPU 1: (dev=Cuda(CudaDevice(DeviceId(1))), shape=[1, 128256], len=128256)

This makes it hard to tell from the logs which GPU is doing what.

Would it be a problem to include the CUDA device ordinal in the debug formatter? If not, I'll open a PR.
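
As a minimal sketch of the idea, a hand-written `Debug` impl could print the ordinal next to the existing ID. The types below are hypothetical stand-ins, not candle's actual definitions: `CudaDeviceSketch` and its `ordinal` field are made up for illustration. The sketch assumes the `DeviceId` printed today is a per-process identifier rather than the CUDA ordinal.

```rust
use std::fmt;

/// Hypothetical stand-in for the ID shown in the current output.
/// Assumption: it is unique per process, not per physical GPU.
#[derive(Clone, Copy, PartialEq, Eq)]
pub struct DeviceId(pub usize);

/// Hypothetical stand-in for a CUDA device handle (not candle's real type).
pub struct CudaDeviceSketch {
    pub id: DeviceId,
    /// CUDA device ordinal, i.e. the index of the physical GPU.
    pub ordinal: usize,
}

// Manual `Debug` impl so the ordinal shows up in logs alongside the ID.
impl fmt::Debug for CudaDeviceSketch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "CudaDevice(ordinal={}, id={})", self.ordinal, self.id.0)
    }
}
```

With something like this, two processes that both hold `DeviceId(1)` but sit on ordinals 0 and 1 would produce distinguishable log lines.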

huggingface / candle