Grayskull e150 hangs during long inference loop

JonathanALevine commented 4 months ago

For context this code is run on a tenstorrent grayskull e150.

If I run this

    # try the forward pass on the tt-card
    module = pybuda.PyTorchModule("direct_pt", myNet())

    tt0 = pybuda.TTDevice("tt0", 
                          module=module, 
                          arch=pybuda.BackendDevice.Grayskull,
                          devtype=pybuda.BackendType.Silicon)

    its = 1000
    for i in range(its):
        input = torch.rand(1,1)
        print(f"PyBUDA forward pass ..... {i}")
        output = pybuda.run_inference(inputs=[input])
        print(output.get())

Everything runs as expected. But If I run this again with its=10000, the card seems to hang at the end. I get stuck with these info messages:

2024-03-04 13:49:09.189 | DEBUG    | pybuda.run.impl:_shutdown:1262 - PyBuda shutdown
2024-03-04 13:49:09.189 | DEBUG    | pybuda.device:run_next_command:416 - Received SHUTDOWN command on TTDevice 'tt0'
2024-03-04 13:49:09.189 | DEBUG    | pybuda.device:run_next_command:419 - Waiting for barrier on TTDevice 'tt0'
2024-03-04 13:49:09.190 | DEBUG    | pybuda.device:run_next_command:421 - Shutting down on TTDevice 'tt0'
2024-03-04 13:49:09.190 | DEBUG    | pybuda.run.impl:_shutdown:1278 - Waiting until processes done
2024-03-04 13:49:09.190 | DEBUG    | pybuda.backend:feeder_thread_main:142 - Run feeder thread cmd: quit
2024-03-04 13:49:09.194 | INFO     | Runtime         - Waiting for cluster completion
2024-03-04 13:49:09.194 | INFO     | PerfPostProcess - Writing the host postprocess report in /tmp/jonathan/cea3632bcf87/perf_results//host/device_alignment_th_567154466_proc_21750.json
2024-03-04 13:49:09.249 | INFO     | Runtime         - Closed all devices successfully
2024-03-04 13:49:09.249 | INFO     | PerfCheck       - Starting performance check for host events
2024-03-04 13:49:09.249 | INFO     | PerfCheck       - Finished host performance check successfully
2024-03-04 13:49:09.249 | INFO     | Debuda          - Debug server ended on 
2024-03-04 13:49:09.249 | INFO     | Always          - finish_child_process called on pid 21750

myNet is defined as:

class myNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(1, 3)
        self.output = nn.Linear(3, 1)

    def forward(self, x):
        x = self.hidden(x)
        return self.output(x)

milank94 commented 4 months ago

@JonathanALevine Do you see the outputs and info statement printed up to the 1000? i.e.

print(f"PyBUDA forward pass ..... {i}")

JonathanALevine commented 4 months ago

@milank94 Yes, outputs and infos get printed even up to 10000. See here:

PyBUDA forward pass ..... 9998
2024-03-04 14:07:34.364 | DEBUG    | pybuda.run.impl:_run_forward:641 - Running concurrent device forward: TTDevice 'tt0'
2024-03-04 14:07:34.364 | DEBUG    | pybuda.device:run_next_command:426 - Received RUN_FORWARD command on TTDevice 'tt0' / 39342
2024-03-04 14:07:34.364 | DEBUG    | pybuda.ttdevice:forward:862 - Starting forward on TTDevice 'tt0'
2024-03-04 14:07:34.365 | DEBUG    | pybuda.backend:feeder_thread_main:142 - Run feeder thread cmd: fwd
2024-03-04 14:07:34.365 | INFO     | Runtime         - Running program 'run_fwd_0' with params [("$p_loop_count", "1")]
2024-03-04 14:07:34.365 | DEBUG    | pybuda.backend:read_queues:316 - Reading output queue direct_pt.output_add_5
2024-03-04 14:07:34.365 | DEBUG    | pybuda.device_connector:pusher_thread_main:159 - Pusher thread pushing tensors
2024-03-04 14:07:34.366 | DEBUG    | pybuda.backend:push_to_queues:407 - Pushing to queue input_1
2024-03-04 14:07:34.367 | DEBUG    | pybuda.backend:read_queues:376 - Done reading queues
2024-03-04 14:07:34.367 | DEBUG    | pybuda.backend:pop_queues:382 - Popping from queue direct_pt.output_add_5
[PyBuda Tensor: tensor([[-0.3359]], dtype=torch.bfloat16, requires_grad=True), DataFormat.Float16_b]
tensor([[0.9505, 0.7036, 0.0594,  ..., 0.4409, 0.2454, 0.0907]])
PyBUDA forward pass ..... 9999
2024-03-04 14:07:34.368 | DEBUG    | pybuda.run.impl:_run_forward:641 - Running concurrent device forward: TTDevice 'tt0'
2024-03-04 14:07:34.369 | DEBUG    | pybuda.device:run_next_command:426 - Received RUN_FORWARD command on TTDevice 'tt0' / 39342
2024-03-04 14:07:34.369 | DEBUG    | pybuda.ttdevice:forward:862 - Starting forward on TTDevice 'tt0'
2024-03-04 14:07:34.369 | DEBUG    | pybuda.backend:feeder_thread_main:142 - Run feeder thread cmd: fwd
2024-03-04 14:07:34.369 | INFO     | Runtime         - Running program 'run_fwd_0' with params [("$p_loop_count", "1")]
2024-03-04 14:07:34.369 | DEBUG    | pybuda.backend:read_queues:316 - Reading output queue direct_pt.output_add_5
2024-03-04 14:07:34.370 | DEBUG    | pybuda.device_connector:pusher_thread_main:159 - Pusher thread pushing tensors
2024-03-04 14:07:34.370 | DEBUG    | pybuda.backend:push_to_queues:407 - Pushing to queue input_1
2024-03-04 14:07:34.371 | DEBUG    | pybuda.backend:read_queues:376 - Done reading queues
2024-03-04 14:07:34.371 | DEBUG    | pybuda.backend:pop_queues:382 - Popping from queue direct_pt.output_add_5
[PyBuda Tensor: tensor([[-0.3320]], dtype=torch.bfloat16, requires_grad=True), DataFormat.Float16_b]
2024-03-04 14:07:34.374 | DEBUG    | pybuda.run.impl:_shutdown:1262 - PyBuda shutdown
2024-03-04 14:07:34.374 | DEBUG    | pybuda.device:run_next_command:416 - Received SHUTDOWN command on TTDevice 'tt0'
2024-03-04 14:07:34.374 | DEBUG    | pybuda.device:run_next_command:419 - Waiting for barrier on TTDevice 'tt0'
2024-03-04 14:07:34.374 | DEBUG    | pybuda.device:run_next_command:421 - Shutting down on TTDevice 'tt0'
2024-03-04 14:07:34.374 | DEBUG    | pybuda.run.impl:_shutdown:1278 - Waiting until processes done
2024-03-04 14:07:34.375 | DEBUG    | pybuda.backend:feeder_thread_main:142 - Run feeder thread cmd: quit
2024-03-04 14:07:34.381 | INFO     | Runtime         - Waiting for cluster completion
2024-03-04 14:07:34.381 | INFO     | PerfPostProcess - Writing the host postprocess report in /tmp/jonathan/cea3632bcf87/perf_results//host/device_alignment_th_3282791031_proc_39342.json
2024-03-04 14:07:34.404 | INFO     | Runtime         - Closed all devices successfully
2024-03-04 14:07:34.404 | INFO     | PerfCheck       - Starting performance check for host events
2024-03-04 14:07:34.404 | INFO     | PerfCheck       - Finished host performance check successfully
2024-03-04 14:07:34.404 | INFO     | Debuda          - Debug server ended on 
2024-03-04 14:07:34.405 | INFO     | Always          - finish_child_process called on pid 39342

But it might be the way I am running the forward pass. I am reading these docs now: https://docs.tenstorrent.com/tenstorrent/v/tt-buda/pybuda/first_model

milank94 commented 4 months ago

Ok awesome -- so that is all normal!

Your program finished and the pybuda_shutdown() sequence was initiated to close the connection.

If you want to keep the program active i.e. waiting for new inputs, then you either need to setup some sort of continuous loop that waits for new input i.e. While True: or you can use a serving backend like FastAPI (see example here: https://github.com/tenstorrent/tt-buda-demos/blob/main/first_5_steps/5_serving_tt_models.ipynb)

JonathanALevine commented 4 months ago

But why would pybuda_shutdown() sequence be much longer when its=10000 then when its=1000?

For example when its=1000, the program exits:

PyBUDA forward pass ..... 999
2024-03-04 14:29:24.798 | DEBUG    | pybuda.run.impl:_run_forward:641 - Running concurrent device forward: TTDevice 'tt0'
2024-03-04 14:29:24.799 | DEBUG    | pybuda.device:run_next_command:426 - Received RUN_FORWARD command on TTDevice 'tt0' / 52758
2024-03-04 14:29:24.799 | DEBUG    | pybuda.ttdevice:forward:862 - Starting forward on TTDevice 'tt0'
2024-03-04 14:29:24.799 | DEBUG    | pybuda.backend:feeder_thread_main:142 - Run feeder thread cmd: fwd
2024-03-04 14:29:24.799 | INFO     | Runtime         - Running program 'run_fwd_0' with params [("$p_loop_count", "1")]
2024-03-04 14:29:24.799 | DEBUG    | pybuda.backend:read_queues:316 - Reading output queue direct_pt.output_add_5
2024-03-04 14:29:24.800 | DEBUG    | pybuda.device_connector:pusher_thread_main:159 - Pusher thread pushing tensors
2024-03-04 14:29:24.800 | DEBUG    | pybuda.backend:push_to_queues:407 - Pushing to queue input_1
2024-03-04 14:29:24.801 | DEBUG    | pybuda.backend:read_queues:376 - Done reading queues
2024-03-04 14:29:24.801 | DEBUG    | pybuda.backend:pop_queues:382 - Popping from queue direct_pt.output_add_5
[PyBuda Tensor: tensor([[-0.3320]], dtype=torch.bfloat16, requires_grad=True), DataFormat.Float16_b]
2024-03-04 14:29:24.802 | DEBUG    | pybuda.run.impl:_shutdown:1262 - PyBuda shutdown
2024-03-04 14:29:24.803 | DEBUG    | pybuda.device:run_next_command:416 - Received SHUTDOWN command on TTDevice 'tt0'
2024-03-04 14:29:24.803 | DEBUG    | pybuda.device:run_next_command:419 - Waiting for barrier on TTDevice 'tt0'
2024-03-04 14:29:24.803 | DEBUG    | pybuda.device:run_next_command:421 - Shutting down on TTDevice 'tt0'
2024-03-04 14:29:24.803 | DEBUG    | pybuda.run.impl:_shutdown:1278 - Waiting until processes done
2024-03-04 14:29:24.804 | DEBUG    | pybuda.backend:feeder_thread_main:142 - Run feeder thread cmd: quit
2024-03-04 14:29:24.811 | INFO     | Runtime         - Waiting for cluster completion
2024-03-04 14:29:24.812 | INFO     | PerfPostProcess - Writing the host postprocess report in /tmp/jonathan/cea3632bcf87/perf_results//host/device_alignment_th_3424747444_proc_52758.json
2024-03-04 14:29:24.855 | INFO     | Runtime         - Closed all devices successfully
2024-03-04 14:29:24.855 | INFO     | PerfCheck       - Starting performance check for host events
2024-03-04 14:29:24.855 | INFO     | PerfCheck       - Finished host performance check successfully
2024-03-04 14:29:24.855 | INFO     | Debuda          - Debug server ended on 
2024-03-04 14:29:24.855 | INFO     | Always          - finish_child_process called on pid 52758
2024-03-04 14:29:24.856 | DEBUG    | pybuda.device:atexit_handler:919 - atexit handler called for (TTDevice 'tt0',)
2024-03-04 14:29:24.856 | INFO     | Always          - finish_child_process called on pid 52758
2024-03-04 14:29:24.856 | DEBUG    | pybuda.device:atexit_handler:923 - atexit handler completed
2024-03-04 14:29:25.527 | INFO     | Always          - finish_child_process called on pid 52709

but when its=10000, the end shutdown hangs:

PyBUDA forward pass ..... 9999
2024-03-04 14:31:29.522 | DEBUG    | pybuda.run.impl:_run_forward:641 - Running concurrent device forward: TTDevice 'tt0'
2024-03-04 14:31:29.522 | DEBUG    | pybuda.device:run_next_command:426 - Received RUN_FORWARD command on TTDevice 'tt0' / 54932
2024-03-04 14:31:29.522 | DEBUG    | pybuda.ttdevice:forward:862 - Starting forward on TTDevice 'tt0'
2024-03-04 14:31:29.523 | DEBUG    | pybuda.backend:feeder_thread_main:142 - Run feeder thread cmd: fwd
2024-03-04 14:31:29.523 | INFO     | Runtime         - Running program 'run_fwd_0' with params [("$p_loop_count", "1")]
2024-03-04 14:31:29.523 | DEBUG    | pybuda.backend:read_queues:316 - Reading output queue direct_pt.output_add_5
2024-03-04 14:31:29.523 | DEBUG    | pybuda.device_connector:pusher_thread_main:159 - Pusher thread pushing tensors
2024-03-04 14:31:29.524 | DEBUG    | pybuda.backend:push_to_queues:407 - Pushing to queue input_1
2024-03-04 14:31:29.525 | DEBUG    | pybuda.backend:read_queues:376 - Done reading queues
2024-03-04 14:31:29.525 | DEBUG    | pybuda.backend:pop_queues:382 - Popping from queue direct_pt.output_add_5
[PyBuda Tensor: tensor([[-0.3320]], dtype=torch.bfloat16, requires_grad=True), DataFormat.Float16_b]
2024-03-04 14:31:29.527 | DEBUG    | pybuda.run.impl:_shutdown:1262 - PyBuda shutdown
2024-03-04 14:31:29.528 | DEBUG    | pybuda.device:run_next_command:416 - Received SHUTDOWN command on TTDevice 'tt0'
2024-03-04 14:31:29.528 | DEBUG    | pybuda.device:run_next_command:419 - Waiting for barrier on TTDevice 'tt0'
2024-03-04 14:31:29.528 | DEBUG    | pybuda.device:run_next_command:421 - Shutting down on TTDevice 'tt0'
2024-03-04 14:31:29.528 | DEBUG    | pybuda.run.impl:_shutdown:1278 - Waiting until processes done
2024-03-04 14:31:29.528 | DEBUG    | pybuda.backend:feeder_thread_main:142 - Run feeder thread cmd: quit
2024-03-04 14:31:29.529 | INFO     | Runtime         - Waiting for cluster completion
2024-03-04 14:31:29.530 | INFO     | PerfPostProcess - Writing the host postprocess report in /tmp/jonathan/cea3632bcf87/perf_results//host/device_alignment_th_3671482638_proc_54932.json
2024-03-04 14:31:29.576 | INFO     | Runtime         - Closed all devices successfully
2024-03-04 14:31:29.576 | INFO     | PerfCheck       - Starting performance check for host events
2024-03-04 14:31:29.576 | INFO     | PerfCheck       - Finished host performance check successfully
2024-03-04 14:31:29.576 | INFO     | Debuda          - Debug server ended on 
2024-03-04 14:31:29.577 | INFO     | Always          - finish_child_process called on pid 54932

milank94 commented 4 months ago

But why would pybuda_shutdown() sequence be much longer when its=10000 then when its=1000?

My apologies, I miss read the 10,000..

Can you show a snippet of what you see when it is 10,000?

JonathanALevine commented 4 months ago

No worries! I added extra infos. See above.

milank94 commented 4 months ago

I see, you're missing the last two Debug statements:

DEBUG    | pybuda.device:atexit_handler:919 - atexit handler called for (TTDevice 'tt0',)
INFO     | Always          - finish_child_process called on pid
DEBUG    | pybuda.device:atexit_handler:923 - atexit handler completed
INFO     | Always          - finish_child_process called on pid

Two questions to help debug:

Does it eventually fully shutdown like the 1000 sample case or is it it forever stuck there?
If you manually kill the process after it's stuck there, is the card in a bad state i.e. do you need to do a reset?

JonathanALevine commented 4 months ago

Does it eventually fully shutdown like the 1000 sample case or is it it forever stuck there?

Running it again, I will let it be for a while (maybe an hour?) and see if the card shuts down on its own.

If you manually kill the process after it's stuck there, is the card in a bad state i.e. do you need to do a reset?

If I manually kill the process with 'CTRL+C' in terminal, the card seems to be fine. I can rerun the script no problem with its=1000.

Addtionally I will change how myNet() is defined to only use PyBUDA and no PyTorch and see if that makes any difference.

JonathanALevine commented 4 months ago

Does it eventually fully shutdown like the 1000 sample case or is it it forever stuck there?

Running it again, I will let it be for a while (maybe an hour?) and see if the card shuts down on its own.

Let it attempt to shutdown on its for over an hour. The card never shutdown.

If you manually kill the process after it's stuck there, is the card in a bad state i.e. do you need to do a reset?

If I manually kill the process with 'CTRL+C' in terminal, the card seems to be fine. I can rerun the script no problem with its=1000.

Addtionally I will change how myNet() is defined to only use PyBUDA and no PyTorch and see if that makes any difference.

Running inference multiple times like this: https://github.com/JonathanALevine/tt-buda/blob/main/inference_stress_tester.py seems to run as expected.

tenstorrent / tt-buda

Grayskull e150 hangs during long inference loop #12