gfetterman opened this issue 5 years ago
Is the hang reliable? I.e., does it happen for all files above a certain size, or only a subset?
On Apr 8, 2019, at 6:24 PM, Graham Fetterman notifications@github.com wrote:
When we run ms4alg.sort with certain combinations of (a) file length, (b) sorting parameters, and (c) number of threads, the process hangs.
When this happens, it reliably happens after the "Reassigning events" phase - i.e., the last status update line in the terminal reads `Reassigning events for channel [channel_number] (phase 1)`. Once this occurs, the process never advances (we've let it sit there for >24 hours with no change; usual run time on "good" inputs is ~4-5 hours). The number of workers visible in `top` remains constant, as does memory usage. However, the workers are not using any CPU time.
We can conclude three things:

- File length is a factor: ~1-hour files complete just fine; longer files (3-4 hours) hang.
- Parameter choice is a factor: `{adjacency_radius: 100, detect_threshold: 2, detect_interval: 5}` will complete without a problem, while `{adjacency_radius: 150, detect_threshold: 1, detect_interval: 5}` will hang.
- Number of threads is a factor: with `num_workers: 1` the process completes; with `num_workers: 12` it hangs. (We're running this on a 12-core machine.)

A sketch of the kind of call we're making is below. (NB: we've also exchanged a couple of emails on this issue with @tjd2002.)
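Concretely, the failing runs look roughly like this (a minimal sketch: the extractor, dataset path, and detect_sign are placeholders, and the exact entry point may differ by version - we call ms4alg.sort, but the pip-package wrapper takes the same parameters):

```python
import spikeextractors as se
import ml_ms4alg

# Placeholder dataset; in our case a 3-4 hour recording.
recording = se.MdaRecordingExtractor('/path/to/dataset')

sorting = ml_ms4alg.mountainsort4(
    recording=recording,
    detect_sign=-1,          # placeholder
    adjacency_radius=150,    # 100 completes; 150 hangs
    detect_threshold=1,      # 2 completes; 1 hangs
    detect_interval=5,
    num_workers=12,          # 1 completes; 12 hangs (12-core machine)
)
```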
We haven't titrated the size, but the cutoff sits somewhere between 1 hour and 3 hours.
Above that, the hang is mostly reliable, but there has been the odd time (invariably using what we've taken to thinking of as the "easy" parameter set described above) when a longer file hasn't hung.
Have you checked what the memory usage is during this process (all of those changes would increase it), both in terms of available RAM and in terms of temporary disk space?
Both before and during the hang, each thread is using between 100MB and 2GB of memory. The machine has 64GB of memory, and in total it never rises above about 50% in use. This doesn't appear to vary significantly between the two file sizes.
MountainSort does appear to be using a significant volume of temporary disk space - on the order of 4-5x the size of the file being sorted.
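For anyone who wants to reproduce the measurement, this is roughly how to watch it (a sketch assuming psutil; the pid, temp directory, and polling interval are whatever fits your setup):

```python
import shutil
import time

import psutil

def watch(pid, tmp_dir='/tmp', interval=60):
    """Poll the sorter's process tree: worker count, total RSS, free temp space."""
    root = psutil.Process(pid)
    while root.is_running():
        try:
            procs = [root] + root.children(recursive=True)
            rss_gib = sum(p.memory_info().rss for p in procs) / 2**30
        except psutil.NoSuchProcess:
            break  # a worker (or the sorter itself) exited mid-poll
        free_gib = shutil.disk_usage(tmp_dir).free / 2**30
        print(f'{len(procs)} processes, RSS {rss_gib:.1f} GiB, '
              f'{free_gib:.1f} GiB free in {tmp_dir}')
        time.sleep(interval)
```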
@gfetterman was this ever resolved for you?
Hi, recently I ran into the same "doing nothing" problem, even with the toy data. I'm using MS4 with spikeinterface and installed it with pip (ml-ms4alg==0.3.2). When I stop the process via the keyboard, I can see from the traceback that it's been stuck in the pool, even though I set num_workers to 1.
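One low-effort way to see where a hung run is stuck without killing it (a sketch using the standard library's faulthandler; the signal choice is arbitrary):

```python
import faulthandler
import signal

# Call this once before starting the sort. Afterwards, running
#   kill -USR1 <pid>
# from another shell prints the Python stack of every thread in this
# process without terminating it.
faulthandler.register(signal.SIGUSR1)
```

Note that the pool workers are separate processes, so each would need its own handler; alternatively, a tool like py-spy (`py-spy dump --pid <pid>`) can attach to any of them after the fact.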
Just want to chime in and say that I've encountered the same problem. I have an old conda env that works properly, but when I tried to create a new one, it gets stuck as early as the PCA step. I enclose two environment files for reference:
Working env: working.txt
Not working env: notworking.txt
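For anyone comparing the two, a quick way to list the packages that differ (a sketch assuming each file contains one name==version spec per line; adjust for the actual export format):

```python
def read_specs(path):
    """Return the set of non-comment, non-blank lines in an env/requirements file."""
    with open(path) as f:
        return {line.strip() for line in f
                if line.strip() and not line.startswith('#')}

working = read_specs('working.txt')
broken = read_specs('notworking.txt')

print('only in working env:', *sorted(working - broken), sep='\n  ')
print('only in not-working env:', *sorted(broken - working), sep='\n  ')
```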