seung-lab / igneous

Scalable Neuroglancer compatible Downsampling, Meshing, Skeletonizing, Contrast Normalization, Transfers and more.
GNU General Public License v3.0

Troubleshooting with ptq #167

Open manoaman opened 7 months ago

manoaman commented 7 months ago

Hi @william-silversmith ,

I have a situation where my igneous execution gets stuck at one point and does not progress, and I don't see any notable logs or output. Could you guide me on how to troubleshoot the cause?

Thanks, -m

ptq status ./queue/

Inserted: 5586
Enqueued: 1 (0.0% left)
Completed: 5617 (100.6%)
Leased: 1 (100.0% of queue)

william-silversmith commented 7 months ago

Hi m! What kind of task are you running? Can you let me know what its parameters are?

Is your CPU cooking or your network or disk working?

manoaman commented 7 months ago

Hi Will,

Notable parameters used in xfer task:

igneous image xfer --mip 0 --chunk-size 128,128,64 --fill-missing --sharded

Utilizing 36 cpu cores.

Indeed, the disk has been unstable at times, although both CPU and network seem fine to me. Any suggestions on how to identify the cause?

william-silversmith commented 7 months ago

How's your memory usage? Sharded transfers can potentially use a lot of memory. If you start swapping, that would cause low network and CPU utilization but odd disk access patterns.

Try setting the memory parameter lower (which will create more shards but make each task smaller).
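
One way to check for swapping on Linux is to read /proc/meminfo. The sketch below is a minimal illustration, not part of igneous; the helper names parse_meminfo, swap_used_kib, and read_meminfo are made up for this example, and it assumes a Linux host where /proc/meminfo is available.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def swap_used_kib(info):
    """Swap currently in use (SwapTotal - SwapFree); 0 if fields are missing."""
    return info.get("SwapTotal", 0) - info.get("SwapFree", 0)


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live meminfo file (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

Calling swap_used_kib(read_meminfo()) a few times while the job runs gives a rough signal: a value that keeps growing suggests the node is swapping, while a value near zero points away from memory pressure.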

manoaman commented 7 months ago

I'm using a compute node with 36 CPU cores and 1TB of memory. I understand the default is 3.5GB. Should I set it much lower than the default? Maybe 1GB (--memory 1000000000)?

  --memory INTEGER                Task memory limit in bytes. Task shape will
                                  be chosen to fit and maximize downsamples.
                                  [default: 3500000000.0]

william-silversmith commented 7 months ago

Hmm... 1TB should be more than enough. Can you check how much RAM is being used?

manoaman commented 7 months ago

I reset the queue and am running it once again. Judging from the htop summary, total memory usage is fluctuating between 130GB and 150GB at the moment and gradually increasing.
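
As a rough plausibility check on that figure, the arithmetic below multiplies the default --memory value by the 36 worker processes mentioned earlier in the thread. It assumes each worker holds about one task's worth of data at the limit, which is only an assumption: the help text describes --memory as a limit used to choose the task shape, and actual usage may be higher or lower.

```python
workers = 36              # parallel processes used in this thread (-p 36)
memory_per_task = 3.5e9   # default --memory value, in bytes

# Assumption: one in-flight task per worker, each near the memory limit.
estimate_gb = workers * memory_per_task / 1e9
print(f"Estimated working set: {estimate_gb:.0f} GB")
```

Under that assumption the estimate is 126 GB, the same order as the observed usage and far below the 1TB available on the node.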

william-silversmith commented 7 months ago

If it gets stuck again and RAM isn't a problem, one thing you can try is turning off parallel execution and seeing if it runs.

manoaman commented 7 months ago

Hi @william-silversmith ,

After observing another hang, I tried running without the -p 36 parameter this time. Judging from ptq status, igneous still seems to be stuck in the same state.

Inserted: 5586
Enqueued: 1 (0.0% left)
Completed: 5610 (100.4%)
Leased: 1 (100.0% of queue)

htop shows two processes: one sits at CPU 0.0%, MEM 0.1%, and the other doesn't seem to be using any resources. Both show state "S", which seems to mean sleeping?
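
The state letter htop displays can also be read from /proc/<pid>/stat on Linux. The sketch below is illustrative only; parse_proc_state, STATE_NAMES, and describe_pid are names invented for this example. "S" is an interruptible sleep (the process is waiting for some event), while "D" is an uninterruptible sleep, which usually means the process is blocked in the kernel on I/O.

```python
STATE_NAMES = {
    "R": "running",
    "S": "interruptible sleep (waiting for an event)",
    "D": "uninterruptible sleep (usually waiting on I/O)",
    "T": "stopped",
    "Z": "zombie",
}


def parse_proc_state(stat_text):
    """Return the one-letter process state from /proc/<pid>/stat contents.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so the state is taken after the last ')'.
    """
    after_comm = stat_text.rsplit(")", 1)[1]
    return after_comm.split()[0]


def describe_pid(pid):
    """Look up the state of a live process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        state = parse_proc_state(f.read())
    return state, STATE_NAMES.get(state, "other")
```

A worker that stays in "S" or "D" with no CPU use for a long time is waiting on something outside the process, which fits a stalled read or write to storage but does not prove it.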

-m

manoaman commented 7 months ago

I'm testing whether the storage is the problem by switching to a different storage system.

william-silversmith commented 7 months ago

This is a good strategy! Let me know how it goes.

manoaman commented 7 months ago

Okay, so this definitely had something to do with the storage. After switching to different storage, I no longer see igneous hang.
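
For anyone hitting a similar hang, a small probe can help tell whether a storage mount stalls without swapping the whole dataset. This is a minimal sketch, not an igneous feature; probe_storage and probe_with_timeout are hypothetical helpers, and the file size and timeout are arbitrary choices.

```python
import os
import time
import uuid
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def probe_storage(directory, size=1 << 20):
    """Write, fsync, read back, and delete a small file; return elapsed seconds."""
    path = os.path.join(directory, f".probe-{uuid.uuid4().hex}")
    payload = os.urandom(size)
    start = time.monotonic()
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # force the write through to the storage backend
    with open(path, "rb") as f:
        if f.read() != payload:
            raise IOError("read-back mismatch")
    os.remove(path)
    return time.monotonic() - start


def probe_with_timeout(directory, timeout=30.0):
    """Return elapsed seconds, or None if the probe did not finish in time."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(probe_storage, directory)
    try:
        return future.result(timeout=timeout)
    except TimeoutError:
        return None
    finally:
        # Do not block here; a truly hung probe thread may still delay exit.
        pool.shutdown(wait=False)
```

Running probe_with_timeout on the source and destination directories in a loop while the job is stuck should show either steadily small timings or occasional None results. The read-back may be served from the page cache, so this mainly exercises the write path.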