GMFranceschini opened this issue 7 months ago
Hi, have you solved your problem? I had the same issue.
Thanks!
Unfortunately, it has not been solved so far.
For debugging purposes, if you set OMP_NUM_THREADS=1 in the environment, does that change the observed behavior? I agree that this multiprocessing + global variable function is probably where it gets stuck.
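If useful, here is a quick way to confirm the variable is actually visible inside the worker processes (just a standalone sketch, not part of Higashi):

import os
from concurrent.futures import ProcessPoolExecutor

def report_env(_):
    # What the OpenMP/BLAS layer will see inside a worker process.
    return os.environ.get("OMP_NUM_THREADS")

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as pool:
        print(list(pool.map(report_env, range(2))))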
So, in the environment, I do:
export OMP_NUM_THREADS=1
Then, I run the script, but it gets stuck as before with no additional info. Also, I tried to do:
train_pool = ProcessPoolExecutor(max_workers=1)
But I have the same problem.
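One isolation step I may try next (a generic sketch of the pattern, not Higashi's actual call signature) is to bypass the pool entirely and call the work function in-process; if the direct call also hangs, the problem is inside the worker itself rather than in the executor/pickling layer:

from concurrent.futures import ProcessPoolExecutor

def work(chunk):
    # Stand-in for the real per-chunk job (e.g. negative-sample generation).
    return sum(chunk)

def run(chunks, use_pool):
    if use_pool:
        with ProcessPoolExecutor(max_workers=1) as pool:
            return list(pool.map(work, chunks))
    # In-process path: no fork, no pickling, no inter-process queues.
    return [work(c) for c in chunks]

if __name__ == "__main__":
    chunks = [[1, 2], [3, 4]]
    print(run(chunks, use_pool=False))  # flip to True to compare the two paths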
I know that some iterations run until the end of one_threads_generate_neg(), because I added a print right before the return. In the case of a single process, a couple of iterations are complete, and then it hangs. I am currently trying to use pdb to get a more precise idea of where the problem is.
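In parallel, I am considering registering faulthandler on a signal (standard library only; py-spy dump --pid <pid> would give similar information without touching the script), so that I can dump every thread's stack while the process is hung, e.g. with kill -USR1 <pid> from another shell on the compute node. A minimal sketch:

import faulthandler
import signal

# Dump the stack of every thread when the process receives SIGUSR1,
# e.g. `kill -USR1 <pid>` from another shell on the compute node.
faulthandler.register(signal.SIGUSR1)

# ... the rest of the script (Higashi setup and training) stays unchanged ...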
This is what I get when I do a keyboard interrupt with CTRL+C (I don't know if this is useful):
^CTraceback (most recent call last):
File "/work/FAC/FBM/DBC/gciriell/default/gianmarco/schic/Higashi/scripts/WGD_FH.py", line 37, in <module>
higashi_model.train_for_imputation_nbr_0()
File "/work/FAC/FBM/DBC/gciriell/default/gianmarco/tools/mamba_root/envs/higashi_local/lib/python3.9/site-packages/higashi/Higashi_wrapper.py", line 1336, in train_for_imputation_nbr_0
self.train_for_imputation_no_nbr()
File "/work/FAC/FBM/DBC/gciriell/default/gianmarco/tools/mamba_root/envs/higashi_local/lib/python3.9/site-packages/higashi/Higashi_wrapper.py", line 1394, in train_for_imputation_no_nbr
self.train(
File "/work/FAC/FBM/DBC/gciriell/default/gianmarco/tools/mamba_root/envs/higashi_local/lib/python3.9/site-packages/higashi/Higashi_wrapper.py", line 1099, in train
bce_loss, mse_loss, train_accu, auc1, auc2, str1, str2, train_pool, train_p_list = self.train_epoch(
File "/work/FAC/FBM/DBC/gciriell/default/gianmarco/tools/mamba_root/envs/higashi_local/lib/python3.9/site-packages/higashi/Higashi_wrapper.py", line 897, in train_epoch
for p in as_completed(train_p_list):
File "/work/FAC/FBM/DBC/gciriell/default/gianmarco/tools/mamba_root/envs/higashi_local/lib/python3.9/concurrent/futures/_base.py", line 245, in as_completed
waiter.event.wait(wait_timeout)
File "/work/FAC/FBM/DBC/gciriell/default/gianmarco/tools/mamba_root/envs/higashi_local/lib/python3.9/threading.py", line 581, in wait
signaled = self._cond.wait(timeout)
File "/work/FAC/FBM/DBC/gciriell/default/gianmarco/tools/mamba_root/envs/higashi_local/lib/python3.9/threading.py", line 312, in wait
waiter.acquire()
KeyboardInterrupt
^CException ignored in: <module 'threading' from '/work/FAC/FBM/DBC/gciriell/default/gianmarco/tools/mamba_root/envs/higashi_local/lib/python3.9/threading.py'>
Traceback (most recent call last):
File "/work/FAC/FBM/DBC/gciriell/default/gianmarco/tools/mamba_root/envs/higashi_local/lib/python3.9/threading.py", line 1447, in _shutdown
atexit_call()
File "/work/FAC/FBM/DBC/gciriell/default/gianmarco/tools/mamba_root/envs/higashi_local/lib/python3.9/concurrent/futures/process.py", line 95, in _python_exit
t.join()
File "/work/FAC/FBM/DBC/gciriell/default/gianmarco/tools/mamba_root/envs/higashi_local/lib/python3.9/threading.py", line 1060, in join
self._wait_for_tstate_lock()
File "/work/FAC/FBM/DBC/gciriell/default/gianmarco/tools/mamba_root/envs/higashi_local/lib/python3.9/threading.py", line 1080, in _wait_for_tstate_lock
if lock.acquire(block, timeout):
KeyboardInterrupt:
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/work/FAC/FBM/DBC/gciriell/default/gianmarco/tools/mamba_root/envs/higashi_local/lib/python3.9/multiprocessing/popen_fork.py", line 27, in poll
pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
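One more thing I might try (a self-contained sketch of the pattern, not Higashi's actual code): passing a finite timeout to as_completed, which would turn the silent hang into a visible TimeoutError and confirm that the main process is simply waiting on futures that never complete:

from concurrent.futures import ProcessPoolExecutor, TimeoutError, as_completed

def work(x):
    return x * x

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=1) as pool:
        futures = [pool.submit(work, i) for i in range(4)]
        try:
            # A finite timeout turns a silent hang into a visible exception.
            for f in as_completed(futures, timeout=600):
                print(f.result())
        except TimeoutError:
            print("some workers did not finish within 10 minutes")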
That's very strange... because these three parameters would have default values anyway...
There's a problem: I can't run it now without setting those three parameters. I feel like there's something wrong with my environment.
If I can help you identify the problem somehow, please let me know. I'm happy to put some time into it.
Good morning, asking once more for your input on deploying Higashi! I have set up a test run that I can efficiently run locally on my NVIDIA GPU (chr21, 1 Mb windows). Everything works on my laptop (16 GB VRAM, Intel i7). However, as I move the computation to our SLURM cluster, the process is stuck at
higashi_model.train_for_imputation_nbr_0()
I have replicated my local environment on the cluster (higashi_env.txt). I also contacted our cluster support, but it seems very hard to understand what is happening here, especially given that the process hangs and no error is thrown. I am using 16 CPUs per task and have access to 40 GB of VRAM, so resource-wise this should not be a problem. I checked that torch is working correctly, and torch.cuda.is_available() returns True.
As I mentioned in a previous issue, I had to force chosen_id=0 in the get_free_gpu() instances throughout the code in order to prevent unexpected behavior when checking available memory on multiple GPUs. I installed Higashi and Fast-Higashi by cloning the repo and running pip install .
Please let me know if you have any idea how I could debug this. I appreciate your responses a lot, as I look forward to having imputed matrices with Higashi!
EDIT: our HPC expert suggested this function might be the stuck point, but I haven't checked myself: https://github.com/ma-compbio/Higashi/blob/1333de29ac1d808906d81409176c7dbd0cf2558f/higashi/Higashi_wrapper.py#L236
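As for the GPU selection part, an alternative I am considering instead of editing get_free_gpu() (just a sketch; it assumes the SLURM allocation exposes the GPU as device 0, and SLURM may already set this variable for the job) is to pin the visible device from the environment before torch is imported:

import os

# Hypothetical alternative to editing get_free_gpu(): expose only one device,
# so whichever index the code picks can only map to the allocated GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch  # imported after the variable is set

print(torch.cuda.is_available(), torch.cuda.device_count())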
. Please let me know if you have any idea how I could debug this. I appreciate your responses a lot as I look forward to having imputed matrices with Higashi!EDIT: our HPC expert suggested this function might be the stuck point, but I haven't checked myself https://github.com/ma-compbio/Higashi/blob/1333de29ac1d808906d81409176c7dbd0cf2558f/higashi/Higashi_wrapper.py#L236