ma-compbio / Higashi

single-cell Hi-C, scHi-C, Hi-C, 3D genome, nuclear organization, hypergraph
MIT License

Higashi stops at train_for_imputation_nbr_0 with both the API and the CLI. #18

Closed. EddieLv closed this issue 2 years ago.

EddieLv commented 2 years ago

Hi, ruochi! My Higashi run works smoothly until it reaches the training-for-imputation step, where it stops with no error or warning. (screenshots: 2022-05-03 20-03-28, 2022-05-03 20-03-42) And here is the environment for my Higashi experiment (Package / Version):


asciitree 0.3.3 asttokens 2.0.5 attrs 21.4.0 backcall 0.2.0 bleach 5.0.0 bokeh 3.0.0.dev5 brotlipy 0.7.0 certifi 2021.10.8 cffi 1.15.0 charset-normalizer 2.0.4 click 8.1.2 cooler 0.8.11 cryptography 36.0.0 cycler 0.11.0 Cython 3.0.0a10 cytoolz 0.10.1 debugpy 1.5.1 decorator 5.1.1 defusedxml 0.7.1 dill 0.3.4 entrypoints 0.4 executing 0.8.3 fastjsonschema 2.15.3 fbpca 1.0 fonttools 4.33.2 h5py 3.6.0 higashi 0.1.0a0 idna 3.3 importlib-metadata 4.11.3 importlib-resources 5.7.1 ipykernel 6.9.1 ipython 8.2.0 ipython-genutils 0.2.0 ipywidgets 7.7.0 jedi 0.18.1 Jinja2 3.1.1 joblib 1.1.0 jsonschema 4.4.0 jupyter-client 7.2.2 jupyter-core 4.9.2 jupyterlab-widgets 1.1.0 kiwisolver 1.4.2 llvmlite 0.38.0 MarkupSafe 2.0.1 matplotlib 3.5.1 matplotlib-inline 0.1.2 mistune 0.8.4 mkl-fft 1.3.1 mkl-random 1.2.2 mkl-service 2.4.0 multiprocess 0.70.12.2 nbconvert 5.6.1 nbformat 5.3.0 nest-asyncio 1.5.5 notebook 5.7.11 numba 0.55.1 numpy 1.21.5 packaging 21.3 pandas 1.3.4 pandocfilters 1.5.0 parso 0.8.3 pexpect 4.8.0 pickleshare 0.7.5 Pillow 9.0.1 pip 21.2.4 prometheus-client 0.14.1 prompt-toolkit 3.0.20 ptyprocess 0.7.0 pure-eval 0.2.2 pycparser 2.21 pyfaidx 0.6.4 Pygments 2.11.2 pynndescent 0.5.6 pyOpenSSL 22.0.0 pypairix 0.3.7 pyparsing 3.0.8 pyrsistent 0.18.0 PySocks 1.7.1 python-dateutil 2.8.2 pytz 2022.1 PyYAML 6.0 pyzmq 22.3.0 requests 2.27.1 scikit-learn 1.0.2 scipy 1.7.3 seaborn 0.11.2 Send2Trash 1.8.0 setuptools 61.2.0 simplejson 3.17.6 six 1.16.0 stack-data 0.2.0 terminado 0.13.3 testpath 0.6.0 threadpoolctl 3.1.0 toolz 0.11.2 torch 1.11.0 torchaudio 0.11.0 torchvision 0.12.0 tornado 6.1 tqdm 4.64.0 traitlets 5.1.1 typing_extensions 4.1.1 umap-learn 0.5.3 urllib3 1.26.9 wcwidth 0.2.5 webencodings 0.5.1 wheel 0.37.1 widgetsnbextension 3.6.0 xyzservices 2022.4.0 zipp 3.8.0

ruochiz commented 2 years ago

Thank you for your interest in Higashi! When using it through the CLI mode, did it just hang like this (stuck at 0% without any error), or did it quit with an error message? If it's the former, could you attach the log from when you kill the process (Ctrl+C), so that I can try to figure out which process is hanging? Thanks!

EddieLv commented 2 years ago

This is the error, or do you need the complete log file?

EddieLv commented 2 years ago

Hi, ruochi. I wonder if the bug is related to the PyTorch version, or whether I actually did not install Higashi through git successfully?

ruochiz commented 2 years ago

I don't think it has to do with the torch version, as 1.11.0 is something I have tested on. The deadlock seems to be triggered by the multiprocessing part. I will run some tests on my end. Meanwhile, could you share the config JSON file you created for this run? Thanks.

EddieLv commented 2 years ago

{ "config_name": "Cere-24-20220416", "data_dir": "/media/biogenger/D/Projects/CZP/Cere-24-20220416/7_higashi_input", "input_format": "higashi_v1", "structured": "true", "temp_dir": "/media/biogenger/D/Projects/CZP/Cere-24-20220416/8_higashi_out", "genome_reference_path": "/media/biogenger/D/Projects/CZP/Cere-24-20220416/GRCm39.chr.sizes.txt", "cytoband_path": "/media/biogenger/D/Projects/CZP/Cere-24-20220416/GRCm39_cytoband.txt", "chrom_list": ["chr1","chr2","chr3","chr4","chr5","chr6","chr7","chr8","chr9","chr10","chr11","chr12","chr13","chr14","chr15","chr16","chr17","chr18","chr19"], "resolution": 1000000, "resolution_cell": 1000000, "local_transfer_range": 1, "dimensions": 64, "loss_mode": "zinb", "rank_thres": 1, "embedding_epoch": 80, "no_nbr_epoch": 80, "with_nbr_epoch": 60, "embedding_name": "Cere-24-20220416_zinb", "impute_list": ["chr1","chr2","chr3","chr4","chr5","chr6","chr7","chr8","chr9","chr10","chr11","chr12","chr13","chr14","chr15","chr16","chr17","chr18","chr19"], "minimum_distance": 1000000, "maximum_distance": -1, "neighbor_num": 5, "cpu_num": -1, "gpu_num": 0, "UMAP_params": {"n_neighbors": 20} }

EddieLv commented 2 years ago

And my python version is 3.9.0. :)

ruochiz commented 2 years ago

Hi, I just updated the code base (specifically the main_cell.py file). Could you set cpu_num to 1 and run Higashi through the CLI (python higashi/main_cell.py -c ../...JSON -s 2)? The -s 2 flag makes sure the program starts at the training-for-imputation step, and setting cpu_num = 1 in the JSON file disables multiprocessing. Let's see whether there is any error without multiprocessing. If it hangs again, please interrupt it and attach the logs. Thanks.
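A minimal sketch of the two changes being asked for (the config path below is a placeholder for the user's own JSON file, not a fixed name):

```bash
# 1) In the config JSON, set "cpu_num": 1 to disable multiprocessing
#    for training-batch generation.
# 2) Then resume from the training-for-imputation step via the CLI:
python higashi/main_cell.py -c /path/to/config.JSON -s 2
```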

EddieLv commented 2 years ago

It seems to work, ruochi.

```
  0%|          | 0/24 [00:00<?, ?it/s]
100%|██████████| 24/24 [00:00<00:00, 412554.49it/s]
  0%|          | 0/24 [00:00<?, ?it/s]
100%|██████████| 24/24 [00:00<00:00, 521571.48it/s]
  0%|          | 0/24 [00:00<?, ?it/s]
 25%|██▌       | 6/24 [00:00<00:00, 57.71it/s]
100%|██████████| 24/24 [00:00<00:00, 123.71it/s]
  0%|          | 0/24 [00:00<?, ?it/s]
100%|██████████| 24/24 [00:00<00:00, 759.27it/s]
```

But what if I want to use multiple CPUs?

EddieLv commented 2 years ago

And when I test it with cpu_num: -1, the same error occurs.

ruochiz commented 2 years ago

That's... unexpected... cpu_num = 1 is only meant for debugging; I thought the error would persist, since it's just easier to debug without multiprocessing. What if you use cpu_num: 2 or cpu_num: 3? Does that trigger the error?

EddieLv commented 2 years ago

Yeah... I tried cpu_num = 2 and 8, and both trigger the same error, but cpu_num = 1 works.

ruochiz commented 2 years ago

Let me try to run the code on my CPU server and get back to you. If cpu_num = 1 works, then it has nothing to do with the data itself. I have something in mind that I suspect might be the reason, though. Will get back with more details.

EddieLv commented 2 years ago

I found that it actually created multiple processes, but the processes seemed to be sleeping. (screenshot: 2022-05-05 10-02-05)

EddieLv commented 2 years ago

Hi, ruochi. Has the issue been solved?

ruochiz commented 2 years ago

Sorry for the late reply; I was on a trip. I tested it on the CPU machine I have (Linux), and the multiprocessing seems to be working fine. I am planning to test it on a Windows PC as well; configuring the environment takes a while since I have never used that PC to run Python programs before. I will post an update later.

EddieLv commented 2 years ago

Hi, ruochi. My computer is Linux as well, so I wonder whether I actually did not install Higashi successfully? Recently I ran into some more problems.

1. When I set cpu_num = 1 and run the CLI, the .err file contains:

```
  0%|          | 0/19 [00:00<?, ?it/s]
100%|██████████| 19/19 [00:00<00:00, 520861.28it/s]
  0%|          | 0/19 [00:00<?, ?it/s]
100%|██████████| 19/19 [00:00<00:00, 664098.13it/s]
Traceback (most recent call last):
  File "main_cell.py", line 1328, in <module>
    checkpoint = torch.load(save_path+"_stage1", map_location=current_device)
  File "/home/biogenger/miniconda3/envs/higashi/lib/python3.7/site-packages/torch/serialization.py", line 699, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/biogenger/miniconda3/envs/higashi/lib/python3.7/site-packages/torch/serialization.py", line 231, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/biogenger/miniconda3/envs/higashi/lib/python3.7/site-packages/torch/serialization.py", line 212, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/media/biogenger/D/Projects/CZP/Cere-24-20220416/8_higashi_out/model/model.chkpt_stage1'
```

and the relevant part of the .log file is:

```
layer_norm1.weight True torch.Size([64])
layer_norm1.bias True torch.Size([64])
layer_norm2.weight True torch.Size([64])
layer_norm2.bias True torch.Size([64])
extra_proba.w_stack.0.weight True torch.Size([4, 41])
extra_proba.w_stack.0.bias True torch.Size([4])
extra_proba.w_stack.1.weight True torch.Size([1, 4])
extra_proba.w_stack.1.bias True torch.Size([1])
extra_proba2.w_stack.0.weight True torch.Size([4, 41])
extra_proba2.w_stack.0.bias True torch.Size([4])
extra_proba2.w_stack.1.weight True torch.Size([1, 4])
extra_proba2.w_stack.1.bias True torch.Size([1])
extra_proba3.w_stack.0.weight True torch.Size([4, 41])
extra_proba3.w_stack.0.bias True torch.Size([4])
extra_proba3.w_stack.1.weight True torch.Size([1, 4])
extra_proba3.w_stack.1.bias True torch.Size([1])
attribute_dict_embedding.weight False torch.Size([4826, 20])
params to be trained 738082
initializing data generator
initializing data generator
```

2. When I set cpu_num = 1 in a Jupyter notebook (screenshots: 2022-05-12 08-41-42, 2022-05-12 08-41-53), it seems Higashi broke during imputation? I'm confused about the error involving str and int, because the imputation had already been running for a while.

ruochiz commented 2 years ago

These two are triggered by different reasons. The first one is caused by there being no stage-1 model trained for that JSON. If you haven't trained the model before when using the CLI mode, you should run python main_cell.py -c xxx -s 1 instead of -s 2.
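In other words, for a fresh run through the CLI the stages are executed in order, so the stage-1 checkpoint exists before -s 2 is used. A minimal sketch (the config path is a placeholder; the checkpoint location is taken from the traceback above):

```bash
# Stage 1 training; per the traceback above, this is what produces the
# model.chkpt_stage1 checkpoint under <temp_dir>/model
python main_cell.py -c /path/to/config.JSON -s 1

# Stage 2: training for imputation, which loads the stage-1 checkpoint
python main_cell.py -c /path/to/config.JSON -s 2
```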

For the second one, the error is triggered by the cytoband file you provided containing a str in the "start" column. Could you attach your cytoband file here for reference? I can push a fix soon to make the code more robust when it encounters a str in the "start" column, but it would be helpful to see why there is a str there in the first place.

EddieLv commented 2 years ago

OK, here is my cytoband file. GRCm39_cytoband.txt

ruochiz commented 2 years ago

Ah, I see: it's because the first line, #chrom chromStart chromEnd ..., is interpreted as content rather than as a header. Delete that first line and the code should be fine. The cytoband file I downloaded from UCSC doesn't contain a header, which is why I assumed there wouldn't be one by default. I can add some code to make sure the program ignores lines that start with #.
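This is not Higashi's actual parsing code, but as an illustration of that kind of fix: pandas can skip such lines via its comment argument, so a header line beginning with # is never parsed as data (the column names below are assumptions based on the standard UCSC cytoband layout):

```python
import pandas as pd

# Read a cytoband table while skipping any line that starts with '#',
# so a "#chrom chromStart chromEnd ..." header is not treated as a record.
cytoband = pd.read_csv(
    "GRCm39_cytoband.txt",
    sep="\t",
    header=None,
    comment="#",
    names=["chrom", "chromStart", "chromEnd", "name", "gieStain"],
)
print(cytoband.head())
```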

EddieLv commented 2 years ago

OK, thanks!

ruochiz commented 2 years ago

I just added some code to support a new parameter in the JSON file. If you set "cpu_num_torch": -1 but "cpu_num": 1, the code will still use multiprocessing for the PyTorch training, but only one CPU process for generating training batches. This is a temporary solution and is not as optimized as the original version, but since I cannot replicate the error on my end, I would have to guess what triggers it, which could take a while.
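For reference, the relevant fields in the config JSON under this workaround would look roughly like the following (all other fields stay as in the config posted earlier in this thread):

```json
{
  "cpu_num": 1,
  "cpu_num_torch": -1
}
```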

I will close this issue for now, but if I have more updates, I will post them here.