ma-compbio / Higashi

single-cell Hi-C, scHi-C, Hi-C, 3D genome, nuclear organization, hypergraph
MIT License

Higashi stuck on training at higashi_model.train_for_imputation_nbr_0() on SLURM system #49

Open GMFranceschini opened 7 months ago

GMFranceschini commented 7 months ago

Good morning, asking once more for your input on deploying Higashi! I have set up a test run that runs efficiently on my local NVIDIA GPU (chr21, 1 Mb windows). Everything works on my laptop (16 GB VRAM, Intel i7). However, as I move the computation to our SLURM cluster, the process gets stuck at higashi_model.train_for_imputation_nbr_0() like this:

Preparing for imputation...
100%|██████████| 1/1 [00:00<00:00, 3305.20it/s]
100%|██████████| 1/1 [00:00<00:00, 10591.68it/s]
pass_pseudo_id False
pass_pseudo_id False
Second stage training

[ Epoch 0 of 45 ]
 - (Training) :   0%|          | 0/1000

I have replicated my local environment on the cluster (higashi_env.txt). I also contacted our cluster support, but it is proving very hard to understand what is happening, especially since the process hangs and no error is thrown. I am using 16 CPUs per task and have access to 40 GB of VRAM, so resource-wise this should not be a problem. I checked that torch is working correctly, and torch.cuda.is_available() returns True.

As I mentioned in a previous issue, I had to force chosen_id=0 in the get_free_gpu() instances throughout the code in order to prevent unexpected behavior when checking available memory on multiple GPUs. I installed Higashi and Fast-Higashi by cloning the repos and running pip install . from the repo root. Please let me know if you have any idea how I could debug this. I appreciate your responses a lot, as I look forward to having imputed matrices with Higashi!
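For reference, a possible alternative to patching get_free_gpu() is to expose only one GPU to the process via CUDA_VISIBLE_DEVICES, set before torch initializes CUDA; I have not verified that this removes the need for the chosen_id=0 change, but a sketch would be:

# Sketch of an alternative to hard-coding chosen_id=0: expose a single GPU to the
# process so there is only one device for get_free_gpu() to consider. This must
# run (or be exported in the SLURM script) before torch touches CUDA.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())  # expected to report 1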

EDIT: our HPC expert suggested this function might be the stuck point, but I haven't checked myself https://github.com/ma-compbio/Higashi/blob/1333de29ac1d808906d81409176c7dbd0cf2558f/higashi/Higashi_wrapper.py#L236
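For anyone trying to narrow this down: a low-effort way to see where the main process is blocked, without attaching a debugger on the compute node, is to enable faulthandler's periodic stack dump at the top of the run script (a generic sketch, nothing Higashi-specific):

# Debugging sketch: every 5 minutes, dump the stack of every thread to stderr so a
# silent hang leaves a trace in the SLURM output file.
import faulthandler
faulthandler.dump_traceback_later(timeout=300, repeat=True)

# ... then the usual Higashi calls, e.g.
# higashi_model.train_for_imputation_nbr_0()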

Accompany0313 commented 7 months ago

Hi, have you solved your problem? I have the same problem.

Thanks!

GMFranceschini commented 7 months ago

Unfortunately, it has not been solved so far.

ruochiz commented 7 months ago

For debugging purposes: if you set OMP_NUM_THREADS=1 in the environment, does that change the observed behavior? I agree that this multiprocessing + global-variable function is probably the stuck point.
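If exporting the variable in the SLURM script is inconvenient, the same effect can be obtained at the very top of the Python script, since the OpenMP runtime reads it when numpy/torch initialize (a minimal sketch):

# Sketch: limit OpenMP threads from inside the script; only effective if it runs
# before numpy / torch are imported.
import os
os.environ["OMP_NUM_THREADS"] = "1"

import torch  # imported after the variable is set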

GMFranceschini commented 7 months ago

So, in the environment, I do:

export OMP_NUM_THREADS=1

Then I run the script, but it gets stuck as before, with no additional info. I also tried:

train_pool = ProcessPoolExecutor(max_workers=1)

But I have the same problem. I know that some iterations run to the end of one_threads_generate_neg(), because I added a print right before the return. With a single process, a couple of iterations complete, and then it hangs.

I am currently trying to use pdb to get a more precise idea of where the problem is.
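Besides pdb, another option for inspecting a hung batch job without an interactive session is to register a faulthandler signal handler and trigger it from outside (a sketch; SIGUSR1 is an arbitrary choice):

# Sketch: dump every thread's stack when the process receives SIGUSR1, so a hung
# SLURM job can be inspected from outside without killing it.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)

# From a login node:  scancel --signal=USR1 <jobid>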

This is what I get when I do a keyboard interrupt with CTRL+C (I don't know if this is useful):

^CTraceback (most recent call last):
  File "/work/FAC/FBM/DBC/gciriell/default/gianmarco/schic/Higashi/scripts/WGD_FH.py", line 37, in <module>
    higashi_model.train_for_imputation_nbr_0()
  File "/work/FAC/FBM/DBC/gciriell/default/gianmarco/tools/mamba_root/envs/higashi_local/lib/python3.9/site-packages/higashi/Higashi_wrapper.py", line 1336, in train_for_imputation_nbr_0
    self.train_for_imputation_no_nbr()
  File "/work/FAC/FBM/DBC/gciriell/default/gianmarco/tools/mamba_root/envs/higashi_local/lib/python3.9/site-packages/higashi/Higashi_wrapper.py", line 1394, in train_for_imputation_no_nbr
    self.train(
  File "/work/FAC/FBM/DBC/gciriell/default/gianmarco/tools/mamba_root/envs/higashi_local/lib/python3.9/site-packages/higashi/Higashi_wrapper.py", line 1099, in train
    bce_loss, mse_loss, train_accu, auc1, auc2, str1, str2, train_pool, train_p_list = self.train_epoch(
  File "/work/FAC/FBM/DBC/gciriell/default/gianmarco/tools/mamba_root/envs/higashi_local/lib/python3.9/site-packages/higashi/Higashi_wrapper.py", line 897, in train_epoch
    for p in as_completed(train_p_list):
  File "/work/FAC/FBM/DBC/gciriell/default/gianmarco/tools/mamba_root/envs/higashi_local/lib/python3.9/concurrent/futures/_base.py", line 245, in as_completed
    waiter.event.wait(wait_timeout)
  File "/work/FAC/FBM/DBC/gciriell/default/gianmarco/tools/mamba_root/envs/higashi_local/lib/python3.9/threading.py", line 581, in wait
    signaled = self._cond.wait(timeout)
  File "/work/FAC/FBM/DBC/gciriell/default/gianmarco/tools/mamba_root/envs/higashi_local/lib/python3.9/threading.py", line 312, in wait
    waiter.acquire()
KeyboardInterrupt
^CException ignored in: <module 'threading' from '/work/FAC/FBM/DBC/gciriell/default/gianmarco/tools/mamba_root/envs/higashi_local/lib/python3.9/threading.py'>
Traceback (most recent call last):
  File "/work/FAC/FBM/DBC/gciriell/default/gianmarco/tools/mamba_root/envs/higashi_local/lib/python3.9/threading.py", line 1447, in _shutdown
    atexit_call()
  File "/work/FAC/FBM/DBC/gciriell/default/gianmarco/tools/mamba_root/envs/higashi_local/lib/python3.9/concurrent/futures/process.py", line 95, in _python_exit
    t.join()
  File "/work/FAC/FBM/DBC/gciriell/default/gianmarco/tools/mamba_root/envs/higashi_local/lib/python3.9/threading.py", line 1060, in join
    self._wait_for_tstate_lock()
  File "/work/FAC/FBM/DBC/gciriell/default/gianmarco/tools/mamba_root/envs/higashi_local/lib/python3.9/threading.py", line 1080, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
KeyboardInterrupt:
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/work/FAC/FBM/DBC/gciriell/default/gianmarco/tools/mamba_root/envs/higashi_local/lib/python3.9/multiprocessing/popen_fork.py", line 27, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
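For what it's worth, the traceback shows the main thread blocked in as_completed() (Higashi_wrapper.py line 897), i.e. waiting for futures from the worker pool that never complete; the second part is just the pool's atexit handler joining its management thread. That points at the worker processes themselves being stuck or dead. One way to check, from another shell on the same node, is to inspect the state of the child processes (a sketch, assuming the third-party psutil package is available):

# Sketch (assumes the third-party psutil package): list the child processes of the
# hung job and report whether they are running, sleeping, or zombies.
import sys
import psutil

pid = int(sys.argv[1])                     # PID of the main (hung) Python process
for child in psutil.Process(pid).children(recursive=True):
    print(child.pid, child.status(), " ".join(child.cmdline()[:3]))
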
ruochiz commented 7 months ago

That's very strange... because those three parameters would have default values anyway...

Accompany0313 commented 7 months ago

There's a problem: I can't run it now without setting those three parameters. I feel like there's something wrong with my environment.

GMFranceschini commented 7 months ago

If I can help you identify the problem somehow, please let me know. I'm happy to put some time into it.