Process hangs in CPU detection step #6

Closed sjaenick closed 2 years ago

sjaenick commented 2 years ago

Python 3.8.5, ribodetector 0.2.3 (installed via pip) on Ubuntu 20.04 LTS, invoked on a public dataset. The process hangs after a few seconds with no CPU consumption at all, and it can't be cancelled via Ctrl-C (it has to be killed instead).

ribodetector_cpu \
  -l 100 -t 10 \
  -e norrna \
  -i ../SRR3569371/SRR3569371_1.fastq ../SRR3569371/SRR3569371_2.fastq \
  -o read1.fq read2.fq

When invoked with python -m trace --trace, it seems to get stuck in the CPU detection step:

detect_cpu.py(71):             cd, self.config['state_file'][model_file_ext]).replace('.pth', '.onnx')
detect_cpu.py(70):         self.model_file = os.path.join(
detect_cpu.py(74):         so = onnxruntime.SessionOptions()
detect_cpu.py(77):         so.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
detect_cpu.py(79):         self.model = onnxruntime.InferenceSession(self.model_file, so)
 --- modulename: onnxruntime_inference_collection, funcname: __init__
onnxruntime_inference_collection.py(315):         Session.__init__(self)
 --- modulename: onnxruntime_inference_collection, funcname: __init__
onnxruntime_inference_collection.py(104):         self._sess = None
onnxruntime_inference_collection.py(105):         self._enable_fallback = True
onnxruntime_inference_collection.py(317):         if isinstance(path_or_bytes, str):
onnxruntime_inference_collection.py(318):             self._model_path = path_or_bytes
onnxruntime_inference_collection.py(319):             self._model_bytes = None
onnxruntime_inference_collection.py(326):         self._sess_options = sess_options
onnxruntime_inference_collection.py(327):         self._sess_options_initial = sess_options
onnxruntime_inference_collection.py(328):         self._enable_fallback = True
onnxruntime_inference_collection.py(329):         self._read_config_from_model = os.environ.get('ORT_LOAD_CONFIG_FROM_MODEL') == '1'
 --- modulename: _collections_abc, funcname: get
_collections_abc.py(659):         try:
_collections_abc.py(660):             return self[key]
 --- modulename: os, funcname: __getitem__
os.py(671):         try:
os.py(672):             value = self._data[self.encodekey(key)]
 --- modulename: os, funcname: encode
os.py(749):             if not isinstance(value, str):
os.py(751):             return value.encode(encoding, 'surrogateescape')
os.py(673):         except KeyError:
os.py(675):             raise KeyError(key) from None
_collections_abc.py(661):         except KeyError:
_collections_abc.py(662):             return default
onnxruntime_inference_collection.py(332):         disabled_optimizers = kwargs['disabled_optimizers'] if 'disabled_optimizers' in kwargs else None
onnxruntime_inference_collection.py(334):         try:
onnxruntime_inference_collection.py(335):             self._create_inference_session(providers, provider_options, disabled_optimizers)
 --- modulename: onnxruntime_inference_collection, funcname: _create_inference_session
onnxruntime_inference_collection.py(347):         available_providers = C.get_available_providers()
onnxruntime_inference_collection.py(350):         if 'TensorrtExecutionProvider' in available_providers:
onnxruntime_inference_collection.py(353):             self._fallback_providers = ['CPUExecutionProvider']
onnxruntime_inference_collection.py(356):         providers, provider_options = check_and_normalize_provider_args(providers,
onnxruntime_inference_collection.py(357):                                                                         provider_options,
onnxruntime_inference_collection.py(358):                                                                         available_providers)
onnxruntime_inference_collection.py(356):         providers, provider_options = check_and_normalize_provider_args(providers,
 --- modulename: onnxruntime_inference_collection, funcname: check_and_normalize_provider_args
onnxruntime_inference_collection.py(48):     if providers is None:
onnxruntime_inference_collection.py(49):         return [], []
onnxruntime_inference_collection.py(359):         if providers == [] and len(available_providers) > 1:
onnxruntime_inference_collection.py(366):         session_options = self._sess_options if self._sess_options else C.get_default_session_options()
onnxruntime_inference_collection.py(367):         if self._model_path:
onnxruntime_inference_collection.py(368):             sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)

I added some debugging output to verify self.model_file, which correctly points to the ribodetector_600k_variable_len70_101_epoch47.onnx file.

Any ideas?
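
For hangs like this one, where Ctrl-C has no effect, one option (not something that was tried in this thread) is the standard-library faulthandler module, which can dump the Python stack of every thread after a timeout. The sketch below is only an illustration: `run_with_watchdog` and `slow_call` are hypothetical names, and the timeout values are arbitrary. Note that faulthandler shows Python frames only, so a hang inside native code would appear as the Python line that called into it.

```python
import faulthandler
import sys
import time


def run_with_watchdog(func, timeout_s=30.0, file=None):
    """Run func(); if it takes longer than timeout_s, dump all thread stacks."""
    out = sys.stderr if file is None else file
    # Schedule a traceback dump; exit=False keeps the process alive afterwards.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=out)
    try:
        return func()
    finally:
        # Cancel the pending dump once the call has returned (or raised).
        faulthandler.cancel_dump_traceback_later()


def slow_call():
    # Hypothetical stand-in for the call that hangs,
    # e.g. the creation of the inference session.
    time.sleep(0.1)
    return "done"


if __name__ == "__main__":
    print(run_with_watchdog(slow_call, timeout_s=5.0))
```

If the wrapped call never returns, the stack dump still appears once the timeout expires, which is usually enough to see which Python line the process is stuck on.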

sjaenick commented 2 years ago

It seems to be CPU-related somehow:

Hangs on:

model name      : AMD EPYC 7742 64-Core Processor

Seems to work:

model name      : Intel(R) Xeon(R) CPU E5-4627 v4 @ 2.60GHz

dawnmy commented 2 years ago

Thank you for reporting this issue. This is weird: I have tested it on my workstation with an AMD Ryzen CPU without any issue, and AMD EPYC and Ryzen both use the Zen microarchitecture. I will investigate whether this is a bug in onnxruntime.

sjaenick commented 2 years ago

Thanks - let me know if I can do anything to narrow this down (but be aware I barely know any Python).

sjaenick commented 2 years ago

Ok, I reinstalled onnxruntime via pip (which also updated some other packages) and now it works. Feel free to close this issue.

dawnmy commented 2 years ago

That is great to hear. Could you run pip list in the environment where you installed RiboDetector? Then I can pin the versions of the required packages that worked for you when building the RiboDetector package for pip.
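
When the full pip list output is long, a shorter report of only the relevant packages may be enough. The following is a minimal sketch using the standard-library importlib.metadata module (available from Python 3.8); `installed_versions` is a hypothetical helper, and the package names passed to it are assumptions about what matters here:

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version or None} for the given distribution names."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


if __name__ == "__main__":
    # The package names below are assumptions about what is relevant here.
    wanted = ["ribodetector", "onnxruntime", "numpy"]
    for name, version in installed_versions(wanted).items():
        print(f"{name}=={version}" if version else f"{name}: not installed")
```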

sjaenick commented 2 years ago

Not a virtual environment, so the list is a little bit longer..

dawnmy commented 2 years ago

Thank you for sharing the package version list. I will update the package soon.

dawnmy commented 2 years ago

Hi. I updated the dependency versions in the repo but I haven't updated it in pip. You can install it with:

conda create -n ribodetector_0.2.4 python=3.8
conda activate ribodetector_0.2.4
git clone https://github.com/hzi-bifo/RiboDetector.git
cd RiboDetector
pip install .

Hope this update will work without any issue.

dawnmy commented 2 years ago

I will close this issue as it seems to be solved.

dawnmy commented 2 years ago

@sjaenick Could you provide more details about how you solved this issue? It seems the other open issue, #9, is related to this one. The multiprocessing in CPU mode has a compatibility issue with SLURM that makes the process hang.

sjaenick commented 2 years ago

Nothing but pip3 install --force-reinstall onnxruntime

karl-az commented 2 years ago

Hi, I have a similar issue (also related to onnxruntime), but it results in a different error:

Traceback (most recent call last):
  File "/scratch/test/nextflow_work/conda/main-797bbd9e1c938f49c3dcfd3e78f623bf/bin/ribodetector_cpu", line 10, in <module>
  File "/scratch/test/nextflow_work/conda/main-797bbd9e1c938f49c3dcfd3e78f623bf/lib/python3.9/site-packages/ribodetector/detect_cpu.py", line 526, in main
  File "/scratch/test/nextflow_work/conda/main-797bbd9e1c938f49c3dcfd3e78f623bf/lib/python3.9/site-packages/ribodetector/detect_cpu.py", line 79, in load_model
    self.model = onnxruntime.InferenceSession(self.model_file, so)
  File "/scratch/test/nextflow_work/conda/main-797bbd9e1c938f49c3dcfd3e78f623bf/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 335, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/scratch/test/nextflow_work/conda/main-797bbd9e1c938f49c3dcfd3e78f623bf/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 368, in _create_inference_session
    sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
RuntimeError: /home/conda/feedstock_root/build_artifacts/onnxruntime_1639384799973/work/onnxruntime/core/platform/posix/env.cc:183 onnxruntime::{anonymous}::PosixThread::PosixThread(const char*, int, unsigned int (*)(int, Eigen::ThreadPoolInterface*), Eigen::ThreadPoolInterface*, const onnxruntime::ThreadOptions&) pthread_setaffinity_np failed, error code: 0 error msg:

Let me know if this goes into a separate issue.

I installed version 0.2.3 through bioconda together with 1.10.0 of onnxruntime (build py39h15e0acf_2).

The command works when I run it locally, but fails when submitting it to SLURM. For me it is also different CPUs, but both are from Intel.

I found https://github.com/microsoft/onnxruntime/issues/8313, which hints that at least my error could have something to do with sleeping CPUs.

dawnmy commented 2 years ago

@karl-az I think these, including #9, are all the same issue, related to the compatibility of onnxruntime with SLURM. RiboDetector works fine with other job schedulers, e.g. PBS and SGE. I opened an issue in the onnxruntime repo a few days ago: https://github.com/microsoft/onnxruntime/issues/10736. I hope I can get some clues there. I can update the code to let RiboDetector run on SLURM, but then only 2 CPUs can be fully utilized (details are in the issue in the onnxruntime repo).

sjaenick commented 2 years ago

For completeness, I can confirm that the run that just got stuck also was performed within an (interactive) SLURM job.

dawnmy commented 2 years ago

> For completeness, I can confirm that the run that just got stuck also was performed within an (interactive) SLURM job.

Yes, I tried this as well with an interactive srun job, and the job froze. If you run it independently of SLURM, i.e. directly over ssh, everything is fine. I assume this SLURM-related issue is only present in CPU mode; GPU mode should be fine.

karl-az commented 2 years ago

I'm able to resolve this by uncommenting this row: https://github.com/hzi-bifo/RiboDetector/blob/a49a05400ea5103fcd4781de3d967b514ce60654/ribodetector/detect_cpu.py#L75 I tried both 1 and 2 for this setting, and both execute nicely through SLURM, utilizing 5 cores. (L76 is not needed.)
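
For readers who do not want to open the link: judging from the rest of this thread, the setting in question is presumably intra_op_num_threads on the onnxruntime session options. A minimal sketch of pinning it is shown below. `build_session_options` is a hypothetical helper rather than RiboDetector code, and the onnxruntime module is passed in as an argument only so that the helper can be tried without onnxruntime installed.

```python
def build_session_options(ort, intra_op_threads=1):
    """Return SessionOptions with a fixed intra-op thread count.

    `ort` is the imported onnxruntime module (passed in by the caller).
    """
    so = ort.SessionOptions()
    # 0 lets onnxruntime choose the thread count; a positive value pins it.
    so.intra_op_num_threads = intra_op_threads
    so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    return so


if __name__ == "__main__":
    try:
        import onnxruntime  # third-party; assumed to be installed
    except ImportError:
        print("onnxruntime is not installed")
    else:
        options = build_session_options(onnxruntime, intra_op_threads=1)
        # "model.onnx" is a placeholder path, not the RiboDetector model:
        # session = onnxruntime.InferenceSession("model.onnx", options)
        print(options.intra_op_num_threads)
```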

As I understand the definition of intra_op_num_threads (http://www.xavierdupre.fr/app/onnxruntime/helpsphinx/api_summary.html) it defines the number of threads per worker. It defaults to 0, which allows onnxruntime to auto-detect. My suspicion is that the auto-detect somehow steps outside the SLURM "sandbox", trying to use unreserved cores and that comes across as trying to activate "sleeping" CPUs.
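
One way to probe this suspicion is to compare the number of CPUs on the machine with the number the current process is actually allowed to run on. The sketch below uses only the standard library; `cpu_counts` is a hypothetical helper, and whether SLURM restricts CPU affinity at all depends on how the cluster is configured, so treat the result as a hint rather than proof.

```python
import os


def cpu_counts():
    """Return (logical CPUs on the machine, CPUs this process may run on)."""
    total = os.cpu_count() or 1
    if hasattr(os, "sched_getaffinity"):
        # Linux: the set of CPUs the current process is allowed to use.
        usable = len(os.sched_getaffinity(0))
    else:
        # Platforms without affinity support: fall back to the machine total.
        usable = total
    return total, usable


if __name__ == "__main__":
    total, usable = cpu_counts()
    print(f"machine CPUs: {total}, usable by this process: {usable}")
    if usable < total:
        print("the process is restricted to a subset of the machine's CPUs")
```

If the second number is smaller inside a SLURM job than in a plain ssh session on the same node, a thread pool sized from the machine total would ask for more cores than the job was given.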

dawnmy commented 2 years ago

@karl-az Thank you for sharing your solution. Your assumption that the auto-detect "somehow steps outside the SLURM 'sandbox'" is very reasonable, and I agree with it. I have also tried setting intra_op_num_threads to a non-zero value; details can be found in https://github.com/microsoft/onnxruntime/issues/10736. However, the total CPU load (sum over all processes) was only 200% no matter how many CPUs (-t) I specified. Could you check the total CPU load?

karl-az commented 2 years ago

I see.... Running it locally and checking with htop, I see lower CPU utilization for the different workers: 5x60%, 7x45, and 12x30%. This is with intra_op_num_threads = 1 and lands at roughly 300%. Could the process be bound by something else?

dawnmy commented 2 years ago

> I see.... Running it locally and checking with htop, I see lower CPU utilization for the different workers: 5x60%, 7x45, and 12x30%. This is with intra_op_num_threads = 1 and lands at roughly 300%. Could the process be bound by something else?

I have now figured out why it used only 200% or 300% CPU. If you run a SLURM task without setting --cpus-per-task, it uses the default number of CPUs preconfigured by the admin. If -t is set larger than the default SLURM --cpus-per-task, the CPU load will be lower than expected.

So if you run ribodetector with SLURM, you should set --cpus-per-task to the number of CPUs you want. For example, in interactive mode, start the session with:

srun --qos interactive --cpus-per-task {number of CPUs you need} --threads-per-core 1 --pty /bin/bash

Then run:

ribodetector_cpu -t {number of CPUs you need} ...

The current version of ribodetector needs to be updated, i.e. set intra_op_num_threads = 1. I will update the repo soon. If it is urgent, you can modify intra_op_num_threads yourself.
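
To keep -t in line with the allocation automatically, a wrapper script could read the SLURM_CPUS_PER_TASK environment variable, which SLURM sets when --cpus-per-task is given. The sketch below is an assumption about how such a wrapper might look, not part of RiboDetector; `pick_thread_count` is a hypothetical helper.

```python
import os


def pick_thread_count(requested, env=None):
    """Cap the requested worker count at SLURM's per-task CPU allocation."""
    env = os.environ if env is None else env
    allocated = env.get("SLURM_CPUS_PER_TASK")
    if allocated is None or not allocated.isdigit() or int(allocated) < 1:
        # Not inside SLURM, or --cpus-per-task was not given: keep the request.
        return requested
    return min(requested, int(allocated))


if __name__ == "__main__":
    threads = pick_thread_count(10)
    print(f"ribodetector_cpu -t {threads} ...")
```

With an allocation of 2 CPUs and a request of 10, the helper returns 2, which matches the roughly 200% total load observed above when -t exceeded the allocation.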

dawnmy commented 2 years ago

The latest release v0.2.4 solved this issue. Please update to v0.2.4 with:

pip install ribodetector -U

karl-az commented 2 years ago

Thank you very much! I will pick it up when the release reaches bioconda.

dawnmy commented 2 years ago

It is available on bioconda now.

karl-az commented 2 years ago

Can confirm that it works for me. Thank you!