longhorn / longhorn

Cloud-Native distributed storage built on and for Kubernetes
https://longhorn.io
Apache License 2.0

[BUG] Hang occurs when repeatedly reading files #8150

Open ziippy opened 6 months ago

ziippy commented 6 months ago

I'm using Longhorn v1.6.0, and I created the volume with 2 replicas.

I am training an AI model by reading image files from a Longhorn volume, and recently, the training often hangs unexpectedly.

I am reading a total of 500,000 image files and repeating this process more than 300 times, because AI model training is iterative.

The hang occurs at random: sometimes after just 10 iterations, other times after 100.
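For anyone trying to reproduce this, the access pattern can be sketched roughly as below. This is only an illustration of "read every file, repeat many times", not the actual training code; the function names `read_all_files` and `run_epochs` and the mount path in the usage note are hypothetical.

```python
import os
import time


def read_all_files(root):
    """Read every regular file under root once; return (file_count, total_bytes)."""
    count = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            with open(os.path.join(dirpath, name), "rb") as f:
                total += len(f.read())
            count += 1
    return count, total


def run_epochs(root, epochs):
    """Repeat the full read pass `epochs` times and print per-pass timing."""
    results = []
    for epoch in range(epochs):
        start = time.monotonic()
        count, total = read_all_files(root)
        elapsed = time.monotonic() - start
        # flush=True so the last completed pass is visible if a later one hangs
        print(f"epoch {epoch}: {count} files, {total} bytes, {elapsed:.2f}s", flush=True)
        results.append((count, total))
    return results
```

For example, `run_epochs("/data/images", 300)` (hypothetical path) would mimic 300 passes over the dataset. The last line printed shows how many passes completed before a hang.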

When training stops progressing, I go to the path and try "ls" or "cp", and the shell hangs as well.
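To check whether the path is hung without blocking the interactive shell, one option is to probe it from a child process with a timeout. The following is a minimal sketch using only the Python standard library; `probe_path` is a hypothetical helper, not part of Longhorn. It deliberately does not wait for the child after a timeout, since a process blocked in uninterruptible I/O may not exit even when killed.

```python
import subprocess
import sys


def probe_path(path, timeout_s=5.0):
    """Return True if listing `path` finishes within timeout_s, else False.

    The listing runs in a child process so that a hung mount blocks the
    child rather than the caller.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.listdir(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        # Exit code 0 means the listing succeeded; non-zero means it failed
        # quickly (for example, the path does not exist).
        return child.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: do not wait again, the child may be stuck in I/O.
        child.kill()
        return False
```

A result of `False` after the full timeout suggests the mount is not responding. A quick `False` more likely means an ordinary error such as a missing path.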

To recover from this, I tried the following:

1) I deleted the replica to force a rebuild ---> but it did not recover.

2) I forcibly deleted the share-manager pod for the volume ---> but this caused my pod that was mounting the volume to be forcibly restarted as well. ---> After my pod restarted, "ls" works on that path again and training runs normally.

But I do not want to restart my pods (because many pods use the same volume). I also want to be rid of the worry that training will hang again.

Why is this happening? What could be the reason? What should I suspect?

ejweber commented 6 months ago

It is expected that:

I think we will need a support bundle to help understand what is going on here. Hopefully either the share-manager pod or the kernel on the node(s) that see the hang logs something interesting while it is hanging. Can you take one after the hang occurs (before attempting to resolve it) and post it here or send it to longhorn-support-bundle@suse.com?

A couple of clarifying questions:

ejweber commented 6 months ago

Potentially related to https://github.com/longhorn/longhorn/issues/4195.