lwa-project / ng_digital_processor

The Next Generation Digital Processor for LWA North Arm
Apache License 2.0

kernel: watchdog: BUG: soft lockup #11

Closed by jaycedowell 11 months ago

jaycedowell commented 1 year ago

These soft lockups seem to have started on ndp now that the system is under some load. I saw them on cetus as well but never came up with a good solution.

jaycedowell commented 1 year ago

I haven't seen this on any node other than the head node.

jaycedowell commented 1 year ago

This could be a GPU problem of some sort. If I run the T-engines in packet-capture-only mode, I don't get the lockup. The lockup message also always (?) references a python3 process, which points to the T-engine's GPU usage. Should we try to clean the pins/reseat the GPUs/NICs on the next site trip?
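One way to check the "always (?) references python3" impression is to tally the process named in each soft lockup message. This is only a sketch: it assumes the usual kernel message shape, `watchdog: BUG: soft lockup - CPU#N stuck for Ns! [comm:pid]`, and the sample lines are made up for illustration, not copied from ndp.

```python
import re
from collections import Counter

# Assumed message shape:
#   watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [python3:4321]
LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]+):(?P<pid>\d+)\]"
)

def tally_lockups(lines):
    """Count soft lockup messages per process name (comm)."""
    counts = Counter()
    for line in lines:
        m = LOCKUP_RE.search(line)
        if m:
            counts[m.group("comm")] += 1
    return counts

# Hypothetical sample lines, for illustration only.
SAMPLE = [
    "kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [python3:4321]",
    "kernel: watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [python3:4321]",
    "kernel: usb 1-1: new high-speed USB device number 2 using xhci_hcd",
    "kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:2:88]",
]

if __name__ == "__main__":
    for comm, n in tally_lockups(SAMPLE).most_common():
        print(comm, n)
```

On a real node the same function could be fed the output of `dmesg` or `journalctl -k` split into lines.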

dentalfloss1 commented 11 months ago

The more I read into this, the more it sounds like a driver/firmware issue. Updating the BIOS didn't fix the same issue on cetus. However, most of the forum threads on this topic talk about changing NVIDIA drivers to try to fix it.

jaycedowell commented 11 months ago

Yeah, worth a shot. Both cetus and ndp servers are on 525.125.06.
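To confirm which driver each node is actually running before and after a change, something like the following could be run on cetus and the ndp servers. It is a sketch that assumes `nvidia-smi` is on the PATH; the helper names `parse_driver_versions` and `query_driver_versions` are made up here.

```python
import subprocess

def parse_driver_versions(text):
    """Return the sorted, de-duplicated driver versions from nvidia-smi
    CSV output (one line per GPU, no header)."""
    return sorted({line.strip() for line in text.splitlines() if line.strip()})

def query_driver_versions():
    """Ask nvidia-smi for the driver version of every GPU on this host.

    Assumes nvidia-smi is installed; raises if it is missing or fails.
    """
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_driver_versions(out)
```

Calling `query_driver_versions()` on a host returns one entry per distinct driver version, so a single-element list means all GPUs on that host agree.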

jaycedowell commented 11 months ago

There's an interesting pattern emerging here. I don't get soft lockups on the T-engine if I don't use the PFB inverter (even running all four beams). I do get the soft lockups on ndp1 and ndp2 if I try large buffers in the cuda_host memory space. This points to something related to large memory allocations.
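Since the pattern points at large allocations, it can help to put a number on each buffer before requesting it. A back-of-the-envelope sketch follows; the buffer names, shapes, sample sizes, and the 1 GiB threshold are placeholders, not the values used on ndp1/ndp2.

```python
import math

def buffer_bytes(shape, bytes_per_sample):
    """Size in bytes of a dense buffer with the given shape."""
    return math.prod(shape) * bytes_per_sample

def flag_large(buffers, limit_bytes):
    """Return the names of buffers whose size exceeds limit_bytes."""
    return [name for name, (shape, bps) in buffers.items()
            if buffer_bytes(shape, bps) > limit_bytes]

# Hypothetical buffers: name -> (shape, bytes per sample)
EXAMPLE = {
    "capture": ((4096, 128, 2), 2),
    "pfb_inverse": ((65536, 4096, 2), 8),
}

if __name__ == "__main__":
    for name, (shape, bps) in EXAMPLE.items():
        print(f"{name}: {buffer_bytes(shape, bps) / 2**20:.1f} MiB")
    print("over 1 GiB:", flag_large(EXAMPLE, 2**30))
```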

jaycedowell commented 11 months ago

I'm no longer seeing this problem, even with the PFB running. I'm going to say that this was just a symptom of the non-optimal memory use that we had.