gfetterman opened this issue 5 years ago
Is the hang reliable? I.e., does it happen for all files above a certain size, or only a subset?
On Apr 8, 2019, at 6:24 PM, Graham Fetterman notifications@github.com wrote:
When we run ms4alg.sort with certain combinations of (a) file length, (b) sorting parameters, and (c) number of threads, the process hangs.
When this happens, it reliably happens after the "Reassigning events" phase - i.e., the last status update line in the terminal reads `Reassigning events for channel [channel_number] (phase 1)`. Once this occurs, the process never advances (we've let it sit there for >24 hours with no change; usual run time on "good" inputs is ~4-5 hours). The number of workers visible in `top` remains constant, as does memory usage. However, the workers are not using any CPU time.
We can conclude three things:

- File length is a factor: ~1-hour files complete just fine; longer files (3-4 hours) hang.
- Parameter choice is a factor: `{adjacency_radius: 100, detect_threshold: 2, detect_interval: 5}` will complete without a problem, while `{adjacency_radius: 150, detect_threshold: 1, detect_interval: 5}` will hang.
- Number of threads is a factor: with `num_workers: 1` the process completes; with `num_workers: 12` it hangs. (We're running this on a 12-core machine.)

A sketch of the kind of call we're making is below. (NB: we've also exchanged a couple of emails on this issue with @tjd2002.)
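Concretely, the failing runs look roughly like this (a minimal sketch: the extractor, dataset path, and detect_sign are placeholders, and the exact entry point may differ by version - we call ms4alg.sort, but the pip-package wrapper takes the same parameters):

```python
import spikeextractors as se
import ml_ms4alg

# Placeholder dataset; in our case a 3-4 hour recording.
recording = se.MdaRecordingExtractor('/path/to/dataset')

sorting = ml_ms4alg.mountainsort4(
    recording=recording,
    detect_sign=-1,          # placeholder
    adjacency_radius=150,    # 100 completes; 150 hangs
    detect_threshold=1,      # 2 completes; 1 hangs
    detect_interval=5,
    num_workers=12,          # 1 completes; 12 hangs (12-core machine)
)
```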
We haven't titrated the size, but the cutoff sits somewhere between 1 hour and 3 hours.
Above that, the hang is mostly reliable, but there has been the odd time (invariably using what we've taken to thinking of as the "easy" parameter set described above) when a longer file hasn't hung.
Have you checked what the memory usage is during this process (all of those changes would increase it), both in terms of available RAM and in terms of temporary disk space?
Both before and during the hang, each thread is using between 100MB and 2GB of memory. The machine has 64GB of memory, and in total it never rises above about 50% in use. This doesn't appear to vary significantly between the two file sizes.
MountainSort does appear to be using a significant volume of temporary disk space - on the order of 4-5x the size of the file being sorted.
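For anyone who wants to reproduce the measurement, this is roughly how to watch it (a sketch assuming psutil; the pid, temp directory, and polling interval are whatever fits your setup):

```python
import shutil
import time

import psutil

def watch(pid, tmp_dir='/tmp', interval=60):
    """Poll the sorter's process tree: worker count, total RSS, free temp space."""
    root = psutil.Process(pid)
    while root.is_running():
        try:
            procs = [root] + root.children(recursive=True)
            rss_gib = sum(p.memory_info().rss for p in procs) / 2**30
        except psutil.NoSuchProcess:
            break  # a worker (or the sorter itself) exited mid-poll
        free_gib = shutil.disk_usage(tmp_dir).free / 2**30
        print(f'{len(procs)} processes, RSS {rss_gib:.1f} GiB, '
              f'{free_gib:.1f} GiB free in {tmp_dir}')
        time.sleep(interval)
```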
@gfetterman was this ever resolved for you?
Hi, recently I ran into the same "doing nothing" problem, even with the toy data. I'm using MS4 with spikeinterface and installed it with pip (ml-ms4alg==0.3.2). When I stop the process via the keyboard, I can see from the traceback that it's been stuck in the pool, even though I set num_workers to 1.
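One low-effort way to see where a hung run is stuck without killing it (a sketch using the standard library's faulthandler; the signal choice is arbitrary):

```python
import faulthandler
import signal

# Call this once before starting the sort. Afterwards, running
#   kill -USR1 <pid>
# from another shell prints the Python stack of every thread in this
# process without terminating it.
faulthandler.register(signal.SIGUSR1)
```

Note that the pool workers are separate processes, so each would need its own handler; alternatively, a tool like py-spy (`py-spy dump --pid <pid>`) can attach to any of them after the fact.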
Just want to chime in and say that I've encountered the same problem. I have an old conda env that works properly, but when I tried to create a new one, it gets stuck as early as the PCA step. I enclose two environment files for reference:
Working env: working.txt
Not working env: notworking.txt
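For anyone comparing the two, a quick way to list the packages that differ (a sketch assuming each file contains one name==version spec per line; adjust for the actual export format):

```python
def read_specs(path):
    """Return the set of non-comment, non-blank lines in an env/requirements file."""
    with open(path) as f:
        return {line.strip() for line in f
                if line.strip() and not line.startswith('#')}

working = read_specs('working.txt')
broken = read_specs('notworking.txt')

print('only in working env:', *sorted(working - broken), sep='\n  ')
print('only in not-working env:', *sorted(broken - working), sep='\n  ')
```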