Open ziippy opened 8 months ago
It is expected that:
I think we will need a support bundle to help understand what is going on here. Hopefully either the share-manager pod or the kernel on the node(s) that see the hang logs something interesting while it is hanging. Can you take one after the hang occurs (before attempting to resolve it) and post it here or send it to longhorn-support-bundle@suse.com?
A couple of clarifying questions:
Potentially related to https://github.com/longhorn/longhorn/issues/4195.
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
I'm using Longhorn v1.6.0, and I created the volume with 2 replicas.
I am training an AI model by reading image files from a Longhorn volume, and recently, the training often hangs unexpectedly.
I read a total of 500,000 image files and repeat this process more than 300 times, because AI model training is iterative.
Sometimes it hangs after just 10 iterations, and other times after 100 iterations; it occurs at random.
When training no longer progresses, I go to the volume's mount path and try "ls" or "cp", and the shell hangs as well.
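For anyone who wants to detect this state without getting their own shell stuck, here is a minimal sketch of a probe. It is only an illustration, not something from Longhorn itself: the function name `probe_mount` and the 10-second timeout are assumptions. It lists the directory in a child process and gives up after the timeout, so the caller does not block on the hung mount.

```python
import subprocess
import sys


def probe_mount(path, timeout_s=10.0):
    """Return True if listing `path` completes within `timeout_s` seconds.

    Hypothetical helper: the directory listing runs in a child process,
    so a blocked filesystem call cannot stall the calling script.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.listdir(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        # Exit code 0 means the listing succeeded in time.
        return proc.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best-effort kill; we deliberately do not wait for the child again,
        # because a process stuck in a filesystem call may not exit promptly.
        proc.kill()
        return False
```

Calling `probe_mount("/data/images")` (a placeholder path) from a sidecar or a cron job returns `False` both when the path is missing and when the listing times out, so a `False` result is a hint to collect a support bundle before trying any recovery step.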
To recover from this, I tried the following:
1) I deleted the replica to force a rebuild ---> but it did not recover.
2) I forcibly deleted the share-manager pod for the volume ---> but this caused my pod that was mounting the volume to be forcibly restarted as well. ---> After my pod restarted, "ls" on that path works again and training runs normally.
But I do not want to restart my pods (many pods are using the same volume), and I want to get rid of the anxiety that training will hang again.
Why is this happening? What could be the reason? What should I suspect?