[BUG] Hang occurs when repeatedly reading files #8150

ziippy commented 8 months ago

I'm using Longhorn v1.6.0, and I created the volume with 2 replicas.

I am training an AI model by reading image files from a Longhorn volume, and recently, the training often hangs unexpectedly.

I read a total of 500,000 image files and repeat this process more than 300 times, because AI model training is iterative.

Sometimes it hangs after just 10 iterations, and other times after 100 iterations; it occurs at random.

When training stops making progress, I go to the path and try "ls" or "cp", and the shell hangs as well.
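
The same check can be scripted instead of run by hand in a shell. The following is a minimal sketch, not something from the original report: the function names `responds_within` and `mount_responds` and the 10-second default are assumptions chosen for illustration. It runs a directory listing in a daemon thread so that the caller gets an answer even if the listing itself never returns.

```python
import os
import threading


def responds_within(op, timeout_s=10.0):
    """Run op() in a daemon thread; return True if it finishes within timeout_s.

    Hypothetical helper. The worker thread is never joined, so a call that is
    stuck inside the kernel does not block the caller.
    """
    done = threading.Event()

    def worker():
        try:
            op()
        except OSError:
            # An error (e.g. missing path) is still a response, not a hang.
            pass
        finally:
            done.set()

    threading.Thread(target=worker, daemon=True).start()
    # Event.wait returns True if the flag was set, False on timeout.
    return done.wait(timeout_s)


def mount_responds(path, timeout_s=10.0):
    """Return True if listing `path` completes within timeout_s seconds."""
    return responds_within(lambda: os.listdir(path), timeout_s)
```

Calling `mount_responds("/path/to/volume")` periodically from a side process would give a timestamp for when the hang starts, which can help narrow down the relevant log window. The path here is a placeholder.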

To recover from this, I tried the following:

1) I deleted the replica to force a rebuild ---> but it did not recover.

2) I forcibly deleted the share-manager pod for the volume ---> but this caused my pod that was mounting the volume to be forcibly restarted as well. ---> After my pod restarted, "ls" works on that path and training proceeds normally.

However, I do not want to restart my pods (many pods use the same volume), and I want to be rid of the worry that training will hang again.

Why is this happening? What could be the reason? What should I suspect?

ejweber commented 8 months ago

It is expected that:

Deleting one replica does not "fix" or otherwise affect the volume. If a replica needed to be rebuilt, the engine would almost certainly trigger this operation itself. Deleting one replica while others are healthy does not significantly disrupt data flow.
Deleting the share manager causes the pod to be deleted. This is by design. When the share manager is gone, the volume is effectively broken. Longhorn deletes workload pods so that the new pods can remount the volume using the new share manager.

I think we will need a support bundle to help understand what is going on here. Hopefully either the share-manager pod or the kernel on the node(s) that see the hang logs something interesting while it is hanging. Can you take one after the hang occurs (before attempting to resolve it) and post it here or send it to longhorn-support-bundle@suse.com?

A couple of clarifying questions:

You mention many pods use the same volume. Do all of the pods simultaneously hang? Or are some pods successfully using the volume while others hang?
What is the kernel version and OS distribution of the worker nodes?
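
One way to collect the kernel version and OS distribution on each worker node is sketched below. This is only an illustration under assumptions: it assumes a Linux node that provides `/etc/os-release`, and the helper names `parse_os_release` and `node_info` are hypothetical.

```python
import platform


def parse_os_release(text):
    """Parse KEY=value lines in the os-release format into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be wrapped in double or single quotes.
        info[key] = value.strip().strip('"').strip("'")
    return info


def node_info(os_release_path="/etc/os-release"):
    """Return the running kernel release and the distribution's pretty name."""
    try:
        with open(os_release_path, encoding="utf-8") as f:
            distro = parse_os_release(f.read()).get("PRETTY_NAME", "unknown")
    except OSError:
        # File missing or unreadable: report the distribution as unknown.
        distro = "unknown"
    return {"kernel": platform.release(), "distro": distro}
```

Running `print(node_info())` on a node prints a small dict with both values, which can be pasted into the issue.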

ejweber commented 8 months ago

Potentially related to https://github.com/longhorn/longhorn/issues/4195.

github-actions[bot] commented 2 weeks ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
