tenstorrent / tt-buda

Tenstorrent TT-BUDA Repository

[minor issue] deadlock when no TT devices present #26

Open jaebaek opened 1 month ago

jaebaek commented 1 month ago

I tested this example. Since I do not have a TT device on my machine, I expected it to fail, but I did not expect a deadlock. With no TT devices present, execution stopped at this line.

Log from tt-buda:

...
2024-06-02 19:46:05.045 | DEBUG    | pybuda.device:run_next_command:455 - Received COMPILE command on TTDevice 'auto_tt0' / 11430
2024-06-02 19:46:05.045 | DEBUG    | pybuda.ttdevice:compile_for:785 - Compiling for Inference mode on TTDevice 'auto_tt0'
2024-06-02 19:46:05.045 | INFO     | Runtime         - No silicon devices detected.
2024-06-02 19:46:05.046 | INFO     | Runtime         - No silicon devices detected.
2024-06-02 19:46:05.046 | ERROR    | pybuda.device:run_next_command:469 - Compile error: No Tenstorrent devices present.
Traceback (most recent call last):
  File "/home/jaebaek/tt-buda/pybuda/pybuda/device.py", line 458, in run_next_command
    ret = self.compile_for(
  File "/home/jaebaek/tt-buda/pybuda/pybuda/ttdevice.py", line 808, in compile_for
    device_cfg=self.get_device_config(compiler_cfg),
  File "/home/jaebaek/tt-buda/pybuda/pybuda/ttdevice.py", line 199, in get_device_config
    raise RuntimeError("No Tenstorrent devices present.")
RuntimeError: No Tenstorrent devices present.

Traceback (most recent call last):
  File "test.py", line 26, in <module>
    test_module_direct_pytorch()
  File "test.py", line 21, in test_module_direct_pytorch
    output = pybuda.PyTorchModule("direct_pt", PyTorchTestModule()).run(input1, input2)
  File "/home/jaebaek/tt-buda/pybuda/pybuda/module.py", line 95, in run
    output_q = pybuda.run_inference(self, inputs=[args])
  File "/home/jaebaek/tt-buda/pybuda/pybuda/run/api.py", line 90, in run_inference
    return _run_inference(module, inputs, input_count, output_queue, _sequential, _perf_trace, _verify_cfg)
  File "/home/jaebaek/tt-buda/pybuda/pybuda/run/impl.py", line 277, in _run_inference
    return _run_devices_inference(
  File "/home/jaebaek/tt-buda/pybuda/pybuda/run/impl.py", line 467, in _run_devices_inference
    output_queue = _initialize_pipeline(False, output_queue, sequential=sequential, verify_cfg=verify_cfg)
  File "/home/jaebaek/tt-buda/pybuda/pybuda/run/impl.py", line 414, in _initialize_pipeline
    _compile_devices(sequential, training=training, sample_inputs=sample_inputs, sample_targets=sample_targets, microbatch_count=microbatch_count, verify_cfg=verify_cfg)
  File "/home/jaebaek/tt-buda/pybuda/pybuda/run/impl.py", line 1248, in _compile_devices
    raise ret
RuntimeError: No Tenstorrent devices present.
2024-06-02 19:46:05.047 | DEBUG    | pybuda.run.impl:_shutdown:1265 - PyBuda shutdown

milank94 commented 1 month ago

@jaebaek can you give more details on what the issue is? From our perspective, TT-Buda threw the correct error, as no TT devices were present.

jaebaek commented 1 month ago

Hi, this is not a critical issue (it does not really bother me).

The problem is line 1276 of impl.py (impl.py#L1276). IIUC, it is supposed to complete the _shutdown(..) function and end program execution. However, it does not actually end the program; it gets stuck in ctx.final_barrier.wait(). It looks like a deadlock: the program never terminates, and I have to kill it with Ctrl + C.
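
To illustrate the kind of hang described here, below is a minimal sketch, not the actual pybuda code. It assumes that ctx.final_barrier behaves like a standard-library barrier; `threading.Barrier` is used as a stand-in, and `demo_barrier_hang` and `worker` are hypothetical names. One party fails before reaching the barrier, so the other party's wait can never complete. A timeout is passed only so the demo terminates.

```python
import threading


def demo_barrier_hang(timeout_s=0.5):
    """Return True if the main thread's barrier wait times out."""
    # Two parties: the main thread and one worker
    # (a stand-in for a device process; this is an assumption).
    barrier = threading.Barrier(2)

    def worker():
        # The worker fails during "compile" and exits
        # without ever calling barrier.wait().
        try:
            raise RuntimeError("No Tenstorrent devices present.")
        except RuntimeError:
            return

    t = threading.Thread(target=worker)
    t.start()
    t.join()

    try:
        # Without a timeout this call would block forever,
        # because the second party never arrives.
        barrier.wait(timeout=timeout_s)
    except threading.BrokenBarrierError:
        return True
    return False


if __name__ == "__main__":
    print("hung at barrier:", demo_barrier_hang())
```

If the real barrier is waited on without a timeout, the same situation would look like a program that never exits.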

jaebaek commented 1 month ago

I tested tt-buda more. For some other failures, execution does not terminate either. It seems to be the same deadlock.

staylorTT commented 1 month ago

@nvukobratTT I see you have worked on the impl.py file before. Do you know who could take a look at fixing this? Maybe this needs to be wrapped in a try/except clause? Let me know your thoughts.
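
One possible shape for the try/except idea is sketched below. It is only an illustration under assumptions: `threading.Barrier` stands in for whatever ctx.final_barrier really is, and `run_with_abort` and `worker` are hypothetical names, not pybuda functions. The failing party calls `abort()` on the barrier in its except branch, so the other party's `wait()` raises `BrokenBarrierError` instead of blocking.

```python
import threading


def run_with_abort():
    """Return (status, error messages) for a run where the worker fails."""
    barrier = threading.Barrier(2)
    errors = []

    def worker():
        try:
            # Simulated compile failure (message taken from the log above).
            raise RuntimeError("No Tenstorrent devices present.")
        except Exception as exc:
            errors.append(exc)
            # Break the barrier so current and future waiters are released.
            barrier.abort()
        else:
            # Only reached when no error occurred.
            barrier.wait()

    t = threading.Thread(target=worker)
    t.start()
    try:
        barrier.wait()
        status = "clean"
    except threading.BrokenBarrierError:
        # The worker aborted the barrier; shut down instead of hanging.
        status = "aborted"
    t.join()
    return status, [str(e) for e in errors]


if __name__ == "__main__":
    print(run_with_abort())
```

Whether this applies directly depends on how the devices are run (threads or processes) and on the barrier type actually used in impl.py, which is not shown in this thread.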