radna0 opened this issue 4 weeks ago
Do you have a small repo with the code?
You can clone the repo here:
git clone https://github.com/radna0/Video-Infinity.git
install the requirements with
pip install -r requirements.txt
and test out the code with
accelerate launch tpu_inference.py --config examples/config.json
Let me know if I am missing anything.
Were you able to reproduce the error? @JackCaoG
It's been a week, and I'm still encountering this problem. I have tried different approaches, for example dist.gather(), tensor.cpu(), tensor.contiguous(), and other ways of saving tensors or moving them to the CPU, and they all run into the same problem, even with xm.mark_step(). There seems to be no way around it; it is always the same error: replica groups should contain 8 replicas, but found 2. Is there something wrong that I could be doing here? A minimal sketch of one such attempt is below.
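Roughly, one such attempt looks like this (simplified sketch; gather_to_cpu is just an illustrative name, not the exact code): gather a tensor inside one of the two-rank adjacent groups, force execution with xm.mark_step(), then copy the results to the host.

import torch
import torch.distributed as dist
import torch_xla.core.xla_model as xm

def gather_to_cpu(tensor, group):
    # Run the collective on the XLA device within the given sub-group.
    group_size = dist.get_world_size(group=group)
    outputs = [torch.zeros_like(tensor) for _ in range(group_size)]
    dist.all_gather(outputs, tensor, group=group)
    # Cut the lazy graph so the collective is actually executed.
    xm.mark_step()
    # The device-to-host copy is where the replica-group error shows up.
    return [t.cpu() for t in outputs]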
What I'm basically doing is the following:
1. For each rank, declare a distributed controller:
import os
import torch.distributed as dist
import torch_xla

class DistController(object):
    def __init__(self, rank, world_size, config) -> None:
        super().__init__()
        self.rank = rank
        self.world_size = world_size
        self.config = config
        self.is_master = rank == 0
        self.device = torch_xla.device()
        self.init_dist()
        self.init_group()

    def init_dist(self):
        print(
            f"Rank {self.rank}, {self.device} / {self.world_size} is running on XLA device."
        )
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = str(self.config.get("master_port") or "29500")
        dist.init_process_group("xla", rank=self.rank, world_size=self.world_size)

    def init_group(self):
        # One two-rank group per adjacent pair of ranks, e.g. [0, 1], [1, 2], ...
        self.adj_groups = [
            dist.new_group([i, i + 1]) for i in range(self.world_size - 1)
        ]
        print(f"Rank {self.rank} initialized groups: {self.adj_groups}")
3. Init the model and move it to the XLA device, then run inference.
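Putting it together, each worker process does roughly the following (simplified sketch; load_model and run_pipeline are illustrative placeholders, not the actual functions):

import torch
import torch_xla.core.xla_model as xm

def worker(rank, world_size, config):
    ctrl = DistController(rank, world_size, config)
    # Step 3: build the model and move its weights to the XLA device.
    model = load_model(config).to(ctrl.device)
    with torch.no_grad():
        frames = run_pipeline(model, config)  # lazily builds the XLA graph
    xm.mark_step()                            # execute the graph
    return frames.cpu()                       # copying results back to host hits the error

The run then fails with the truncated traceback below.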
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 205, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 205, in <listcomp>
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/kojoe/Video-Infinity/tpu_inference.py", line 102, in <module>
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Should move tensors to CPU.
Environment
Additional context