vqdang / hover_net

Simultaneous Nuclear Instance Segmentation and Classification in H&E Histology Images.
MIT License

Run inference script crashes #79

Closed JMBokhorst closed 3 years ago

JMBokhorst commented 3 years ago

Hi all,

I have tried to run the PyTorch version after I initially tried the TensorFlow version. I tried to run the inference script in wsi mode with an ndpi image. It starts correctly, but mid-way through the process I got this error:

Process Chunk 48/99:  61%|#############5        | 35/57 [02:19<01:11,  3.23s/it]|2021-01-06|13:06:15.182| [ERROR] Crash
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 779, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/local/lib/python3.7/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/usr/local/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 282, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/local/lib/python3.7/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/usr/local/lib/python3.7/multiprocessing/reduction.py", line 185, in recv_handle
    return recvfds(s, 1)[0]
  File "/usr/local/lib/python3.7/multiprocessing/reduction.py", line 161, in recvfds
    len(ancdata))
RuntimeError: received 0 items of ancdata

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 803, in _try_get_data
    fs = [tempfile.NamedTemporaryFile() for i in range(fds_limit_margin)]
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 803, in <listcomp>
    fs = [tempfile.NamedTemporaryFile() for i in range(fds_limit_margin)]
  File "/usr/local/lib/python3.7/tempfile.py", line 547, in NamedTemporaryFile
    (fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "/usr/local/lib/python3.7/tempfile.py", line 258, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
OSError: [Errno 24] Too many open files: '/tmp/tmpxrmts9vn'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/hover_net-master/infer/wsi.py", line 746, in process_wsi_list
    self.process_single_file(wsi_path, msk_path, self.output_dir)
  File "/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/hover_net-master/infer/wsi.py", line 550, in process_single_file
    self.__get_raw_prediction(chunk_info_list, patch_info_list)
  File "/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/hover_net-master/infer/wsi.py", line 374, in __get_raw_prediction
    chunk_patch_info_list[:, 0, 0], pbar_desc
  File "/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/hover_net-master/infer/wsi.py", line 287, in __run_model
    for batch_idx, batch_data in enumerate(dataloader):
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 974, in _next_data
    idx, data = self._get_data()
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 941, in _get_data
    success, data = self._try_get_data()
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 807, in _try_get_data
    "Too many open files. Communication with the"
RuntimeError: Too many open files. Communication with the workers is no longer possible. Please increase the limit using `ulimit -n` in the shell or change the sharing strategy by calling `torch.multiprocessing.set_sharing_strategy('file_system')` at the beginning of your code
Process Chunk 48/99:  61%|#############5        | 35/57 [02:19<01:27,  4.00s/it]
/usr/local/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))

Do you know why this error might occur?

Running on an Ubuntu 20 machine that has a conda env with the requirements.

vqdang commented 3 years ago

@JMBokhorst This is certainly very new to me. Can you detail your running settings? In the meantime, can you also check the number of files opened by the processes, or the number of hanging processes, as the error seems to propagate up from the OS.

OSError: [Errno 24] Too many open files: '/tmp/tmpxrmts9vn'

There is an output log file too, so please also attach it for reference.
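
A quick way to check the open-file side of this on Linux, sketched with standard-library calls only (the /proc paths are Linux-specific; nothing here is specific to hover_net):

import os
import resource

# Per-process limit on open file descriptors (what `ulimit -n` reports).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"fd limit: soft={soft}, hard={hard}")

# Descriptors currently held by this process; for the inference process,
# look at /proc/<pid>/fd instead of /proc/self/fd.
print("open fds:", len(os.listdir("/proc/self/fd")))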

simongraham commented 3 years ago

@JMBokhorst

Can you confirm whether you are running a single script or processing multiple WSIs in parallel?

vqdang commented 3 years ago

@JMBokhorst could you pull down the PR and check if that fixes the issue?

JMBokhorst commented 3 years ago

@simongraham and @vqdang,

Thanks for the quick response. I will check out the PR now and see if it fixes the issue.

I'm trying to run the script on a folder containing a single ndpi image. I use this command (based on the run_wsi.sh script):

python3.7 run_infer.py \
    --gpu='0,1' \
    --nr_types=6 \
    --type_info_path=type_info.json \
    --batch_size=64 \
    --model_mode=fast \
    --model_path=/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/hovernet_fast_pannuke_type_tf2pytorch.tar \
    --nr_inference_workers=8 \
    --nr_post_proc_workers=16 \
    wsi \
    --input_dir=/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/test_image/ \
    --output_dir=/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/result/ \
    --save_thumb \
    --save_mask

Below is the output of the debug log; is this the log file you are referring to?

|2021-01-06|11:46:06.636| [INFO] ................ Process: TB_S02_P005_C0001_L15_A15
|2021-01-06|11:46:11.858| [INFO] ................ WARNING: No mask found, generating mask via thresholding at 1.25x!
|2021-01-06|11:46:23.762| [INFO] ........ Preparing Input Output Placement: 17.12366568017751
|2021-01-06|13:06:15.182| [ERROR] Crash
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 779, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/local/lib/python3.7/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/usr/local/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 282, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/local/lib/python3.7/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/usr/local/lib/python3.7/multiprocessing/reduction.py", line 185, in recv_handle
    return recvfds(s, 1)[0]
  File "/usr/local/lib/python3.7/multiprocessing/reduction.py", line 161, in recvfds
    len(ancdata))
RuntimeError: received 0 items of ancdata

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 803, in _try_get_data
    fs = [tempfile.NamedTemporaryFile() for i in range(fds_limit_margin)]
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 803, in <listcomp>
    fs = [tempfile.NamedTemporaryFile() for i in range(fds_limit_margin)]
  File "/usr/local/lib/python3.7/tempfile.py", line 547, in NamedTemporaryFile
    (fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "/usr/local/lib/python3.7/tempfile.py", line 258, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
OSError: [Errno 24] Too many open files: '/tmp/tmpxrmts9vn'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/hover_net-master/infer/wsi.py", line 746, in process_wsi_list
    self.process_single_file(wsi_path, msk_path, self.output_dir)
  File "/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/hover_net-master/infer/wsi.py", line 550, in process_single_file
    self.__get_raw_prediction(chunk_info_list, patch_info_list)
  File "/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/hover_net-master/infer/wsi.py", line 374, in __get_raw_prediction
    chunk_patch_info_list[:, 0, 0], pbar_desc
  File "/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/hover_net-master/infer/wsi.py", line 287, in __run_model
    for batch_idx, batch_data in enumerate(dataloader):
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 974, in _next_data
    idx, data = self._get_data()
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 941, in _get_data
    success, data = self._try_get_data()
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 807, in _try_get_data
    "Too many open files. Communication with the"
RuntimeError: Too many open files. Communication with the workers is no longer possible. Please increase the limit using `ulimit -n` in the shell or change the sharing strategy by calling `torch.multiprocessing.set_sharing_strategy('file_system')` at the beginning of your code

JMBokhorst commented 3 years ago

Unfortunately, I get the same error with the PR. Since it is happening at the same point, I have put a breakpoint at the point it crashes. I will let you know when I have more information.

vqdang commented 3 years ago

@JMBokhorst Could you try this? Also, what is the size of your WSI?

import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')
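
If changing the sharing strategy alone does not help, the same error message also suggests raising the open-file limit; a minimal sketch of doing that from within Python instead of `ulimit -n` (Linux, raises only the soft limit, capped at the hard limit):

import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
# Raise the soft limit towards the hard limit before DataLoader workers start.
new_soft = 4096 if hard == resource.RLIM_INFINITY else min(4096, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (max(soft, new_soft), hard))
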
JMBokhorst commented 3 years ago

@vqdang, yes, I'm going to try it now.

The image has these dimensions (x, y): 86016, 105728

JMBokhorst commented 3 years ago

That also didn't fix the issue. I'm now trying with 0 inference workers to disable multiprocessing and see if I get a clearer error.

JMBokhorst commented 3 years ago

With the inference workers set to 0, I don't get an error while processing the slide. However, the script hangs as soon as it is done processing the final chunk: no error, no progress, nothing.

The error does seem to be related to the inference workers. I will play a bit with the settings and report back :)

vqdang commented 3 years ago

Are you on Windows or Linux? Could you also check with --nr_post_proc_workers 0 to turn off multithreading for post-processing?

https://github.com/vqdang/hover_net/blob/4978aa5e578c2e32982a3d54197270360630dc4a/infer/wsi.py#L529-L534

You can replace this portion to read the cached memmap, and comment out the following line to prevent rerunning the prediction and speed up the process.

https://github.com/vqdang/hover_net/blob/4978aa5e578c2e32982a3d54197270360630dc4a/infer/wsi.py#L550
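
As a rough illustration of that idea (the file name and cache location below are assumptions; pred_map.npy under the cache directory is the raw-prediction memmap referred to later in this thread):

import numpy as np

# Open the cached raw predictions lazily instead of re-running the model.
pred_map = np.load("/path/to/cache_dir/pred_map.npy", mmap_mode="r")
print(pred_map.shape, pred_map.dtype)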

JMBokhorst commented 3 years ago

I'm running on Linux. I re-created the conda env. Now, with --nr_post_proc_workers set to 0, it seems to be working when I just have a very small mask (only 1 tile big). When I use the entire tissue mask I'm running into some memory issues. It is always fine until the final part of phase 1. Could you guys tell me roughly how much memory it should use? I have 32GB of memory, but maybe that isn't enough.

vqdang commented 3 years ago

It is always fine until the final part of phase 1.

Rather than a memory issue, this may be due to a dead thread causing the processes to hang. Can you supply the log?

As for memory, since you are using Linux, you can add more swap to avoid OOM problems. Our internal testing was done with up to 100k x 100k WSIs on a system with 128GB RAM and 128GB swap. Memory usage will of course scale with the WSI size, but it should not matter much in the post-processing phase. Memory during post-processing scales mostly with the number of workers, because we copy a tile (which should be small, 2048 by default) https://github.com/vqdang/hover_net/blob/08456787d033d4bb1deff478313a8e305805845d/run_infer.py#L66 from the mmap back into RAM per worker (1 worker keeps memory for 1 tile, 8 workers keep 8). Could you use a TCGA sample for testing so that we can replicate any further problems on our side?

The inference phase should use more memory, and may also consume a lot of hard drive space depending on the WSI size.
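
As a back-of-the-envelope illustration of that scaling (the channel count and dtype below are assumptions, not the exact values used by the code):

# Rough per-worker memory for one tile copied from the memmap back into RAM.
tile_size = 2048      # default tile edge length exposed by run_infer.py
channels = 4          # assumed number of prediction channels
bytes_per_value = 4   # assumed float32

per_worker_mb = tile_size * tile_size * channels * bytes_per_value / 1e6
print(f"~{per_worker_mb:.0f} MB per worker, ~{8 * per_worker_mb:.0f} MB with 8 workers")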

JMBokhorst commented 3 years ago

I will try with a slide from TCGA; if you want, I can also try with one of the slides that you used.

I have added the debug log below. The process is killed during runtime, and at that point I saw only 200MB of memory left. Memory usage starts below 8GB but increases steadily as the first phase nears completion. I will try to increase the swap area. The image is relatively big (100K x 200K), so that might be an issue.

|2021-01-15|15:13:38.395| [INFO] ........ Preparing Input Output Placement: 7.505539770999803
|2021-01-15|15:15:57.582| [ERROR] Crash
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 872, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/local/lib/python3.8/multiprocessing/queues.py", line 107, in get
    if not self._poll(timeout):
  File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
    r = wait([self], timeout)
  File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/usr/local/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/usr/local/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 1022) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/maindisk/phd/hover/hover_net/infer/wsi.py", line 746, in process_wsi_list
    self.process_single_file(wsi_path, msk_path, self.output_dir)
  File "/mnt/maindisk/phd/hover/hover_net/infer/wsi.py", line 550, in process_single_file
    self.__get_raw_prediction(chunk_info_list, patch_info_list)
  File "/mnt/maindisk/phd/hover/hover_net/infer/wsi.py", line 373, in __get_raw_prediction
    patch_output_list = self.__run_model(
  File "/mnt/maindisk/phd/hover/hover_net/infer/wsi.py", line 287, in __run_model
    for batch_idx, batch_data in enumerate(dataloader):
  File "/usr/local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1068, in _next_data
    idx, data = self._get_data()
  File "/usr/local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1034, in _get_data
    success, data = self._try_get_data()
  File "/usr/local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 885, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 1022) exited unexpectedly
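
The "Bus error ... out of shared memory" above usually means /dev/shm has filled up; a quick check, assuming Linux:

import shutil

# DataLoader workers exchange tensors through shared memory backed by /dev/shm.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
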
simongraham commented 3 years ago

Just to confirm, can you also say whether you are using the exact environment instructions that we specify in the README?

We will send a couple of links to TCGA slides to test. One big and one small.

Thanks for hanging in there - I’m sure we will sort it ASAP 😊

JMBokhorst commented 3 years ago

Yes, or rather: I created a new conda environment and installed the pip requirements file. In addition, I had to install PyTorch and openslide-python.

I sourced a computer with more memory (64GB), and there the post-processing also seems to work correctly and finish without any issues :) Let me know if I can help by running it on more slides!

Thanks for the help and quick responses! :)

JMBokhorst commented 3 years ago

PS: I saw that you added MRXS support to the WSI file handler, thanks!

I only noticed that I also needed to change line 732 of infer/wsi.py for the MRXS support:
old: wsi_path_list = glob.glob(self.input_dir + "/*")
new: wsi_path_list = glob.glob(self.input_dir)

Now I call the script with /path/to/images/*.*. I needed to do this so the image folder of the MRXS file isn't added to the process list separately.
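
An alternative sketch that keeps the original folder-style argument, assuming the MRXS companion data sits in a same-named directory next to the .mrxs file, is to filter the glob down to regular files:

import glob
import os

input_dir = "/path/to/images"  # stands in for self.input_dir in infer/wsi.py

# Keep only regular files so the MRXS companion folder is not queued as a slide.
wsi_path_list = [p for p in glob.glob(input_dir + "/*") if os.path.isfile(p)]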

simongraham commented 3 years ago

Please try testing with this WSI - it is quite small

JMBokhorst commented 3 years ago

I will try it out now; I'll do a test with and without multi-proc and keep an eye on the memory usage.

JMBokhorst commented 3 years ago

That seems to run correctly, both with and without multi-proc. With multi-proc, memory usage is roughly 10GB in total; without, around 2.5GB.

simongraham commented 3 years ago

Okay thanks for this - we will look into it this week and get back to you.

Mengflz commented 3 years ago

I ran into the same issue. When I executed run_infer.py, it threw a RuntimeError at the same place; the error message is just like the one above. When I set --nr_inference_workers=0, the error disappeared, but it takes a long time to process a WSI file and run_infer.py seems to use a lot of memory.

simongraham commented 3 years ago

Thanks @Mengflz - looking into it

simongraham commented 3 years ago

@JMBokhorst @Mengflz

Can you please confirm that inference is okay and that it is always post-processing that crashes?

JMBokhorst commented 3 years ago

Initially, I had some issues with inference itself, but that was solved after I re-created the environment. After the update, inference always ran well.

Post-processing only crashed when I didn't have enough memory. Reducing the number of post_proc_workers or increasing the physical memory solves the issue.

PS. I did have to add PyTorch and openslide-python to the requirements.

Mengflz commented 3 years ago

@JMBokhorst @Mengflz

Can you please confirm that inference is okay and that it is always post-processing that crashes?

For me, it is not the post-processing stage but the inference stage that crashes. After I set nr_inference_workers=0, the problem was solved.

vqdang commented 3 years ago

@Mengflz @JMBokhorst Can you guys share with us your system specs?

PS. I did have to add PyTorch and openslide-python to the requirements.

So your initial manual installation of PyTorch didn't work?

@Mengflz Did you get an OOM in the crash log, or did you run out of file pointers like JMBokhorst?

JMBokhorst commented 3 years ago

@Mengflz @JMBokhorst Can you guys share with us your system specs?

CPU: i7-7700K, MEM: 32GB, GPU: 1080Ti, HDD: 500GB SSD, OS: Ubuntu 20.04

JMBokhorst commented 3 years ago

@Mengflz @JMBokhorst Can you guys share with us your system specs?

PS. I did have to add PyTorch and openslide-python to the requirements.

So your initial manual installation of PyTorch didn't work?

No, I didn't have a manual installation. I created a new conda environment and ran pip install -r requirements.txt, but that doesn't install PyTorch, so it crashed until I added PyTorch to the requirements. Just wanted to let you know that those packages are missing from the requirements :)

Mengflz commented 3 years ago

I ran this script on a server: CPU: Intel(R) Xeon(R) Platinum 8165 CPU @ 2.30GHz, MEM: 378GB, GPU: Tesla K80, OS: Ubuntu 18.04.2

I hit the crash from running out of file pointers. The error message on the terminal was just like "RuntimeError: Too many open files. Communication with the workers is no longer possible. Please increase the limit using ulimit -n in the shell or change the sharing strategy by calling torch.multiprocessing.set_sharing_strategy('file_system') at the beginning of your code".

simongraham commented 3 years ago

@JMBokhorst just to let you know, in the README we state to install the libraries as:

conda env create -f environment.yml
conda activate hovernet
pip install torch==1.6.0 torchvision==0.7.0

The first line uses the requirements.txt, but we do not install with pip install -r requirements.txt. We left PyTorch out on purpose; it should be installed as recommended above.

simongraham commented 3 years ago

Please can you test with this WSI? We are trying to find a publicly available slide where the code fails on your side so that we can try to reproduce the issue. We dug out a large slide to make it more likely to fail on your side too.

Mengflz commented 3 years ago

It crashed like before; the error message is shown in the attached screenshot.

simongraham commented 3 years ago

Brilliant - really appreciate your help with this. We will try to repro on our side with this slide :)

JMBokhorst commented 3 years ago

@JMBokhorst just to let you know, in the README we state to install the libraries as:

conda env create -f environment.yml
conda activate hovernet
pip install torch==1.6.0 torchvision==0.7.0

The first line uses the requirements.txt, but we do not install with pip install -r requirements.txt. We left PyTorch out on purpose; it should be installed as recommended above.

Thanks! Did you also include openslide-python in the readme or requirements?

Please can you test with this WSI? We are trying to find a publicly available slide where the code fails on your side so that we can try to reproduce the issue. We dug out a large slide to make it more likely to fail on your side too.

I will try with this slide and will let you know. Do you want me to test with all multi-proc set to zero or with the default values?

simongraham commented 3 years ago

Did you also include openslide-python in the readme or requirements?

conda env create -f environment.yml will create an environment called hovernet using the environment.yml file. You will see that openslide is included in the yml file. The requirements.txt is also called from the .yml file. We had issues when including PyTorch within the yml and requirements, which is why we install it afterwards with pip.

I will try with this slide and will let you know. Do you want me to test with all multi-proc set to zero or with the default values?

Thank you - really do appreciate your help with this. If you could test with both settings and report back that would be super useful.

JMBokhorst commented 3 years ago

For this image, I could leave all settings as suggested by the run_wsi.sh script (so with multi-proc), with the exception of the batch size. Now I'm wondering if the crash on my side has anything to do with the spatial resolution of the image or the image format. The TCGA link contained two SVS files; I stopped after the script finished with the first image and was halfway through the second. Let me know if you want me to run the second image as well.

Below are some stats of the process. The interesting one is Maximum resident set size as this is the maximum amount of memory used by the process. In this case, it's 38GB.

[Screenshot of the process stats, taken 2021-02-01 at 21:49]

simongraham commented 3 years ago

Hi @JMBokhorst

The link should have contained only one WSI, named:

TCGA-NJ-A4YI-01Z-00-DX1.C4111D01-27BF-486F-8E0E-C8053DB16133.svs

Please can you confirm this? I am not sure exactly why you have found 2 svs files at this link. Posting the link here again, just in case you used the wrong one :)

So from your previous experiment, it seems that the small WSI we sent a link to ran successfully. Therefore, you may potentially be running out of space as the pred_map.npy and pred_inst.npy memory maps are being generated. Please can you confirm that you have plenty of space at the location that you have specified as the cache in run_infer.py? This can be specified using the --cache_dir argument. Please keep track of the sizes of both pred_map.npy and pred_inst.npy when running the code and compare them to the space available at the cache location. This week we will be making a few changes to the code to reduce the memory footprint of these two files.

@Mengflz can you please do the same? If it is failing during post-processing, then potentially pred_inst.npy, which is generated during post-processing, doesn't have enough space at the cache location.
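
To get a feel for whether the cache location is large enough, here is a rough size estimate for pred_map.npy based on the WSI dimensions mentioned earlier in the thread (the channel count and dtype are assumptions; the real values depend on the model mode):

import shutil

width, height = 86016, 105728   # WSI dimensions at the processing resolution
channels = 4                    # assumed number of prediction channels
bytes_per_value = 4             # assumed float32

est_gb = width * height * channels * bytes_per_value / 1e9
free_gb = shutil.disk_usage("/path/to/cache_dir").free / 1e9  # hypothetical --cache_dir
print(f"estimated pred_map.npy: ~{est_gb:.0f} GB; free space at cache: {free_gb:.0f} GB")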

Mengflz commented 3 years ago

For me, it doesn't crash during the post-processing stage, but during the inference stage. I tried this TCGA WSI again and tracked the related file sizes, shown by two storage view commands in the attached screenshots. It crashed as usual; the error message is the same.

jjhbw commented 3 years ago

@Mengflz @JMBokhorst I ran into similar problems and refactored the inference script a bit. You can find my refactored version of the script in #104. Note that I'm currently awaiting feedback from the project's maintainers on my changes, but it may help you if you are stuck. Happy to hear your feedback as well.

vqdang commented 3 years ago

Just to provide my last comment here. After adding #107, I realized that Python will pickle the data to files in order to transfer it across parallel processes (https://stackoverflow.com/questions/44747145/writing-to-shared-memory-in-python-is-very-slow), hence the /tmp temp files we saw above. So if anyone has problems, I think remounting /tmp onto volatile RAM or refreshing your /tmp folder will clear this up. Will close this down for now.
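
A quick way to see whether /tmp is already backed by RAM (tmpfs) before deciding to remount it, assuming Linux:

# Print the filesystem type mounted at /tmp (tmpfs means it already lives in RAM).
with open("/proc/mounts") as f:
    for line in f:
        device, mountpoint, fstype = line.split()[:3]
        if mountpoint == "/tmp":
            print(f"/tmp is mounted as {fstype} from {device}")

If nothing is printed, /tmp is simply part of the root filesystem and lives on disk.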