Closed by JMBokhorst 3 years ago
@JMBokhorst This is certainly new to me. Can you detail your run settings? In the meantime, can you also check the number of files opened by the processes, or the number of hanging processes, as the error seems to have propagated from the OS:
OSError: [Errno 24] Too many open files: '/tmp/tmpxrmts9vn'
There is an output log file too, so please also attach it for reference.
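A minimal sketch of how one could check the open file descriptors on Linux from Python; the helper and PID here are just illustrative, not part of hover_net:

```python
import os

# Hypothetical helper (Linux only): count the file descriptors currently open
# by a process via /proc. The PID used here is just the current process.
def count_open_fds(pid: int) -> int:
    return len(os.listdir(f"/proc/{pid}/fd"))

print(count_open_fds(os.getpid()))
```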
@JMBokhorst
Can you confirm whether you are running one script, or are you processing multiple WSIs in parallel?
@JMBokhorst could you pull down the PR and check if that fixes the issue?
@simongraham and @vqdang,
Thanks for the quick response. I will check out the PR now and see if it fixes the issue.
I'm trying to run the script on a folder containing a single ndpi image. I use this command (based on the run_wsi.sh script):
python3.7 run_infer.py --gpu='0,1' --nr_types=6 --type_info_path=type_info.json --batch_size=64 --model_mode=fast --model_path=/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/hovernet_fast_pannuke_type_tf2pytorch.tar --nr_inference_workers=8 --nr_post_proc_workers=16 wsi --input_dir=/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/test_image/ --output_dir=/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/result/ --save_thumb --save_mask
Below is the output of the debug log; is this the log file you are referring to?
|2021-01-06|11:46:06.636| [INFO] ................ Process: TB_S02_P005_C0001_L15_A15
|2021-01-06|11:46:11.858| [INFO] ................ WARNING: No mask found, generating mask via thresholding at 1.25x!
|2021-01-06|11:46:23.762| [INFO] ........ Preparing Input Output Placement: 17.12366568017751
|2021-01-06|13:06:15.182| [ERROR] Crash
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 779, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/usr/local/lib/python3.7/multiprocessing/queues.py", line 113, in get
return _ForkingPickler.loads(res)
File "/usr/local/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 282, in rebuild_storage_fd
fd = df.detach()
File "/usr/local/lib/python3.7/multiprocessing/resource_sharer.py", line 58, in detach
return reduction.recv_handle(conn)
File "/usr/local/lib/python3.7/multiprocessing/reduction.py", line 185, in recv_handle
return recvfds(s, 1)[0]
File "/usr/local/lib/python3.7/multiprocessing/reduction.py", line 161, in recvfds
len(ancdata))
RuntimeError: received 0 items of ancdata
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 803, in _try_get_data
fs = [tempfile.NamedTemporaryFile() for i in range(fds_limit_margin)]
File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 803, in <listcomp>
fs = [tempfile.NamedTemporaryFile() for i in range(fds_limit_margin)]
File "/usr/local/lib/python3.7/tempfile.py", line 547, in NamedTemporaryFile
(fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
File "/usr/local/lib/python3.7/tempfile.py", line 258, in _mkstemp_inner
fd = _os.open(file, flags, 0o600)
OSError: [Errno 24] Too many open files: '/tmp/tmpxrmts9vn'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/hover_net-master/infer/wsi.py", line 746, in process_wsi_list
self.process_single_file(wsi_path, msk_path, self.output_dir)
File "/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/hover_net-master/infer/wsi.py", line 550, in process_single_file
self.__get_raw_prediction(chunk_info_list, patch_info_list)
File "/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/hover_net-master/infer/wsi.py", line 374, in __get_raw_prediction
chunk_patch_info_list[:, 0, 0], pbar_desc
File "/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/hover_net-master/infer/wsi.py", line 287, in __run_model
for batch_idx, batch_data in enumerate(dataloader):
File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
data = self._next_data()
File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 974, in _next_data
idx, data = self._get_data()
File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 941, in _get_data
success, data = self._try_get_data()
File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 807, in _try_get_data
"Too many open files. Communication with the"
RuntimeError: Too many open files. Communication with the workers is no longer possible. Please increase the limit using `ulimit -n` in the shell or change the sharing strategy by calling `torch.multiprocessing.set_sharing_strategy('file_system')` at the beginning of your code
Unfortunately, I get the same error with the PR. Since it is happening at the same point, I have put a breakpoint at the point it crashes. I will let you know when I have more information.
@JMBokhorst Could you try this? Also, what is the size of your WSI?
import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')
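A sketch of both workarounds that the error message itself suggests, placed near the top of the entry script before any DataLoader is created; the limit handling is illustrative, not code from the repo:

```python
import resource
import torch.multiprocessing

# 1) Raise the soft open-file limit up to the hard limit
#    (shell equivalent: `ulimit -n`).
_, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

# 2) Let PyTorch workers share tensors through the file system instead of
#    passing file descriptors between processes.
torch.multiprocessing.set_sharing_strategy("file_system")
```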
@vqdang, yes, I'm going to try it now.
The image has these dimensions (x, y): 86016, 105728
That also didn't fix the issue. I'm now trying with 0 inference workers to disable multiprocessing, to see if I get a clearer error.
I have set the inference workers to 0 and I don't get an error while processing the slide. However, the script hangs as soon as it is done processing the final chunk; no error, no progress, nothing.
The error does seem to be related to the inference workers. I will play a bit with the settings and report back :)
Are you on Windows or Linux? Could you also check with --nr_post_proc_workers 0 to turn off multithreading for the post-processing?
To speed up the process, you can replace this portion to read the cached memmap, and comment this line out to prevent rerunning the prediction:
https://github.com/vqdang/hover_net/blob/4978aa5e578c2e32982a3d54197270360630dc4a/infer/wsi.py#L550
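A rough sketch of what reading the cached prediction back could look like; the path and filename are assumptions based on the pred_map.npy cache file mentioned later in this thread, not necessarily the exact code to drop in:

```python
import numpy as np

# Load the cached raw prediction as a read-only memory map instead of
# re-running inference. Path/filename are placeholders.
pred_map = np.load("/path/to/cache/pred_map.npy", mmap_mode="r")
print(pred_map.shape, pred_map.dtype)
```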
I'm running on Linux. I re-created the conda env. Now, with --nr_post_proc_workers set to 0, it seems to be working when I just have a very small mask (only 1 tile big). When I use the entire tissue mask I'm running into some memory issues. It is always fine until the final part of phase 1. Could you tell me roughly how much memory it should use? I have 32GB of memory, but maybe that isn't enough.
It is always fine until the final part of phase 1.
Rather than a memory issue, this may be due to a dead thread causing the processes to hang. Can you supply the log?
As for the memory: because you are on Linux, you can add more swap to avoid OOM problems. Our internal testing was done on WSIs up to 100k x 100k on a system with 128GB RAM and 128GB swap. The memory usage will of course scale with the WSI size, but it should not affect the post-processing phase much. The memory at post-processing scales mostly with the number of workers, because we copy a tile (which should be small, 2048 by default) https://github.com/vqdang/hover_net/blob/08456787d033d4bb1deff478313a8e305805845d/run_infer.py#L66 from the mmap back to RAM per worker (1 worker keeps memory for 1 tile, 8 workers keep 8). Could you use a TCGA sample for testing so that we can replicate any further problems on our side?
The inference phase should use more memory, and it may also consume a lot of hard drive space depending on the WSI size.
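A back-of-envelope sketch of that per-worker scaling; only the 2048 tile size comes from the run_infer.py default linked above, the channel count and dtype are assumptions for illustration:

```python
# Rough per-worker memory estimate for copying one prediction tile back to RAM.
tile = 2048              # default tile size from run_infer.py
channels = 4             # assumed channel count, for illustration only
bytes_per_value = 4      # assumed float32
per_worker = tile * tile * channels * bytes_per_value
print(f"~{per_worker / 1e6:.0f} MB per worker, ~{8 * per_worker / 1e6:.0f} MB for 8 workers")
```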
I will try with a slide from TCGA; if you want, I can try with one of the slides that you used as well.
I have added the debug log below. The process is killed during runtime, and at that point I saw only 200MB of memory left. The memory usage starts below 8GB but increases steadily as the first phase nears completion. I will try to increase the swap area. The image is relatively big (100K x 200K), so that might be an issue.
|2021-01-15|15:13:38.395| [INFO] ........ Preparing Input Output Placement: 7.505539770999803
|2021-01-15|15:15:57.582| [ERROR] Crash
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 872, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/usr/local/lib/python3.8/multiprocessing/queues.py", line 107, in get
if not self._poll(timeout):
File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
r = wait([self], timeout)
File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 931, in wait
ready = selector.select(timeout)
File "/usr/local/lib/python3.8/selectors.py", line 415, in select
fd_event_list = self._selector.poll(timeout)
File "/usr/local/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 1022) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/mnt/maindisk/phd/hover/hover_net/infer/wsi.py", line 746, in process_wsi_list
self.process_single_file(wsi_path, msk_path, self.output_dir)
File "/mnt/maindisk/phd/hover/hover_net/infer/wsi.py", line 550, in process_single_file
self.__get_raw_prediction(chunk_info_list, patch_info_list)
File "/mnt/maindisk/phd/hover/hover_net/infer/wsi.py", line 373, in __get_raw_prediction
patch_output_list = self.__run_model(
File "/mnt/maindisk/phd/hover/hover_net/infer/wsi.py", line 287, in __run_model
for batch_idx, batch_data in enumerate(dataloader):
File "/usr/local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
data = self._next_data()
File "/usr/local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1068, in _next_data
idx, data = self._get_data()
File "/usr/local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1034, in _get_data
success, data = self._try_get_data()
File "/usr/local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 885, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 1022) exited unexpectedly
Just to confirm, can you also say whether you are using the exact environment instructions that we specify in the README?
We will send a couple of links to TCGA slides to test. One big and one small.
Thanks for hanging in there - I’m sure we will sort it ASAP 😊
Yes; I created a new conda environment and installed from the pip requirements file. In addition, I had to install pytorch and openslide-python.
I sourced a computer with more memory (64GB) and there the post-processing also seems to work correctly and finish without any issues :) Let me know if I can help by running it on more slides!
Thanks for the help and quick responses! :)
PS: I saw that you added MRXS support to the WSI file handler, thanks!
I only noticed that I also needed to change line 732 of the infer/wsi.py file for the MRXS support:
old: wsi_path_list = glob.glob(self.input_dir + "/*")
new: wsi_path_list = glob.glob(self.input_dir)
Now I call the script with /path/to/images/*.*. I needed to do this so the image folder of the MRXS file isn't added to the process list separately.
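For illustration, the difference the change makes; paths and folder layout are made up:

```python
import glob

# With the original "/*" glob, the .mrxs companion data folder is picked up as a
# separate entry; passing the pattern yourself lets you match only the slide files.
old_list = glob.glob("/path/to/images" + "/*")   # matches slide.mrxs AND the slide/ data folder
new_list = glob.glob("/path/to/images/*.*")      # e.g. called with --input_dir='/path/to/images/*.*'
print(old_list, new_list)
```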
Please try testing with this WSI - it is quite small
I will try it out now; I'll do a test with and without multi-proc and keep an eye on the memory usage.
That seems to run correctly, with and without multi-proc. With multi-proc the memory usage is roughly 10GB in total; without, it's +/- 2.5GB.
Okay thanks for this - we will look into it this week and get back to you.
I ran into the same issue. When I executed run_infer.py, it threw a RuntimeError at the same place; the error message is just like the one above. When I set --nr_inference_workers=0, the error disappeared, but it takes a long time to process a WSI file and run_infer.py seems to use a lot of memory.
Thanks @Mengflz - looking into it
@JMBokhorst @Mengflz
Can you please confirm that inference is okay and it is always post processing that crashes.
Initially, I had some issues with inference itself, but those were solved after I re-created the environment. After the update, inference always ran well.
Post-processing only crashed when I didn't have enough memory. Reducing the number of post_proc_workers or increasing the (physical) memory solves the issue.
PS: I did have to add pytorch and openslide-python to the requirements.
@JMBokhorst @Mengflz
Can you please confirm that inference is okay and it is always post processing that crashes.
For me, it is not the post-processing stage but the inference stage that crashes. After I set nr_inference_workers=0, the problem was solved.
@Mengflz @JMBokhorst Can you guys share with us your system specs?
PS: I did have to add pytorch and openslide-python to the requirements.
So your initial manual installation of pytorch didn't work?
@Mengflz Did you get OOM in the crash log, or did you run out of file descriptors like JMBokhorst?
@Mengflz @JMBokhorst Can you guys share with us your system specs?
CPU: i7-7700K, MEM: 32GB, GPU: 1080Ti, HDD: 500GB SSD, OS: Ubuntu 20.04
@Mengflz @JMBokhorst Can you guys share with us your system specs?
PS: I did have to add pytorch and openslide-python to the requirements.
So your initial manual installation of pytorch didn't work?
No, I didn't have a manual installation. I created a new conda environment and ran pip install -r requirements.txt, but that doesn't install pytorch, so it crashed until I added pytorch to the requirements. I just wanted to let you know that those packages are missing from the requirements :)
I run this script on a server. CPU: Intel(R) Xeon(R) Platinum 8165 @ 2.30GHz, MEM: 378GB, GPU: Tesla K80, OS: Ubuntu 18.04.2
I got a crash from running out of file pointers. The error message on the terminal is just like: "RuntimeError: Too many open files. Communication with the workers is no longer possible. Please increase the limit using ulimit -n in the shell or change the sharing strategy by calling torch.multiprocessing.set_sharing_strategy('file_system') at the beginning of your code"
@JMBokhorst just to let you know, in the README we state to install the libraries as:
conda env create -f environment.yml
conda activate hovernet
pip install torch==1.6.0 torchvision==0.7.0
The first line uses requirements.txt, but we do not install with pip install -r requirements.txt. We did not add pytorch on purpose; it should be installed as recommended above.
Please can you test with this WSI. We are trying to find a publicly available slide where your code fails so that we can try and reproduce the issue. We dug out a large slide to try and ensure it fails on your side.
It crashed like before. Here is the error message.
Brilliant - really appreciate your help with this. We will try to repro on our side with this slide :)
@JMBokhorst just to let you know, in the README we state to install the libraries as:
conda env create -f environment.yml
conda activate hovernet
pip install torch==1.6.0 torchvision==0.7.0
The first line uses requirements.txt, but we do not install with pip install -r requirements.txt. We did not add pytorch on purpose; it should be installed as recommended above.
Thanks! Did you also include openslide-python in the readme or requirements?
Please can you test with this WSI. We are trying to find a publicly available slide where your code fails so that we can try and reproduce the issue. We dug out a large slide to try and ensure it fails on your side.
I will try with this slide and will let you know. Do you want me to test with all multi-proc set to zero or with the default values?
Did you also include openslide-python in the readme or requirements?
conda env create -f environment.yml will create an environment called hovernet using the environment.yml file. You will see that openslide is included in the yml file. The requirements.txt is also called from the .yml file. We had issues when including PyTorch within the yml and requirements, hence why we installed it afterwards with pip.
I will try with this slide and will let you know. Do you want me to test with all multi-proc set to zero or with the default values?
Thank you - really do appreciate your help with this. If you could test with both settings and report back that would be super useful.
For this image, I could leave all settings as suggested by the run_wsi.sh script (so with multi-proc), with the exception of the batch size. Now I'm wondering if the crash on my side has anything to do with the spatial resolution of the image or the image format. The TCGA link contained two SVS files; I stopped it after the script finished the first image and was halfway through the second. Let me know if you want me to run the second image as well.
Below are some stats of the process. The interesting one is Maximum resident set size, as this is the maximum amount of memory used by the process. In this case, it's 38GB.
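For reference, one way to read a process's peak memory from within Python itself; on Linux ru_maxrss is reported in kilobytes (the stats above presumably came from something like /usr/bin/time -v):

```python
import resource

# Peak resident set size of the current process, converted from KB to GB.
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"Maximum resident set size: {peak_kb / 1024**2:.1f} GB")
```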
Hi @JMBokhorst
The link should have contained just one WSI, named:
TCGA-NJ-A4YI-01Z-00-DX1.C4111D01-27BF-486F-8E0E-C8053DB16133.svs
Please can you confirm this? I am not sure exactly why you have found 2 svs files at this link. Posting the link here again, just in case you used the wrong one :)
So, from your previous experiment, it seemed that the small WSI we sent a link to ran successfully. Therefore, you may potentially be running out of space as the pred_map.npy and pred_inst.npy memory maps are being generated. Please can you confirm that you have plenty of space at the location that you have specified as the cache in run_infer.py; this can be specified using the --cache_dir argument. Please keep track of the sizes of both pred_map.npy and pred_inst.npy when running the code and compare them to the space available at the cache location. This week we will be making a few changes to the code to reduce the memory footprint of these two files.
@Mengflz can you please do the same? If it is failing during post-processing, then potentially pred_inst.npy, which is being generated during post-processing, doesn't have enough space at the cache location.
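A small sketch of how that tracking could be done; the cache directory here is a placeholder for whatever you pass as the cache location, and the apparent vs on-disk distinction matters because the memory maps can be sparse:

```python
import os
import shutil

cache_dir = "/path/to/cache"  # placeholder for the cache location
for name in ("pred_map.npy", "pred_inst.npy"):
    path = os.path.join(cache_dir, name)
    if os.path.exists(path):
        apparent = os.path.getsize(path)          # apparent file size
        on_disk = os.stat(path).st_blocks * 512   # blocks actually allocated
        print(f"{name}: apparent {apparent / 1e9:.1f} GB, on disk {on_disk / 1e9:.1f} GB")
print(f"free at cache: {shutil.disk_usage(cache_dir).free / 1e9:.1f} GB")
```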
For me, it doesn't crash during the post-processing stage, but during the inference stage. I tried this TCGA WSI again and tracked the related file sizes, and found that the two storage commands I used report different sizes. It crashed as usual, and the error message is the same.
@Mengflz @JMBokhorst I ran into similar problems and refactored the inference script a bit. You can find my refactored version of the script in #104 . Note that I'm currently awaiting feedback from the project's maintainers on my changes, but it may help you if you are stuck. Happy to hear your feedback as well.
Just to provide my last comment here. After adding #107, I have realized that Python will pickle the data to a file to transfer it across parallel processes https://stackoverflow.com/questions/44747145/writing-to-shared-memory-in-python-is-very-slow , hence the /tmp files we saw above. So if anyone has problems, I think remounting /tmp onto volatile RAM or cleaning out your /tmp folder will clear this up. Will close this down for now.
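A quick way to check where those worker temp files land and how much room is left there; this is just standard-library calls, not hover_net code:

```python
import shutil
import tempfile

# Where Python will write temporary files, and the free space at that location;
# if /tmp is small or full, that would explain the tmpxxxx errors above.
tmp_dir = tempfile.gettempdir()
print(f"{tmp_dir}: {shutil.disk_usage(tmp_dir).free / 1e9:.1f} GB free")
```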
Hi all,
I have tried to run the PyTorch version after I initially tried the TensorFlow version. I ran the inference script in wsi mode with an ndpi image. It starts correctly, but mid-way through the process I got this error:
Do you know why this error might occur?
I'm running on an Ubuntu 20 machine with a conda env containing the requirements.