hzi-bifo / RiboDetector

Accurate and rapid Ribosomal RNA sequences Detector based on deep learning
GNU General Public License v3.0

Process hangs in CPU detection step #6

Closed sjaenick closed 2 years ago

sjaenick commented 2 years ago

Python 3.8.5, ribodetector 0.2.3 (installed via pip) on Ubuntu 20.04 LTS, invoked on a public dataset; the process just hangs after a few seconds with no CPU consumption at all, and cannot be cancelled via Ctrl-C (it needs to be killed instead).

ribodetector_cpu \
  -l 100 -t 10 \
  -e norrna \
  -i ../SRR3569371/SRR3569371_1.fastq ../SRR3569371/SRR3569371_2.fastq \
  -o read1.fq read2.fq

When invoked with python -m trace --trace, it seems to get stuck in the CPU detection step:

detect_cpu.py(71):             cd, self.config['state_file'][model_file_ext]).replace('.pth', '.onnx')
detect_cpu.py(70):         self.model_file = os.path.join(
detect_cpu.py(74):         so = onnxruntime.SessionOptions()
detect_cpu.py(77):         so.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
detect_cpu.py(79):         self.model = onnxruntime.InferenceSession(self.model_file, so)
 --- modulename: onnxruntime_inference_collection, funcname: __init__
onnxruntime_inference_collection.py(315):         Session.__init__(self)
 --- modulename: onnxruntime_inference_collection, funcname: __init__
onnxruntime_inference_collection.py(104):         self._sess = None
onnxruntime_inference_collection.py(105):         self._enable_fallback = True
onnxruntime_inference_collection.py(317):         if isinstance(path_or_bytes, str):
onnxruntime_inference_collection.py(318):             self._model_path = path_or_bytes
onnxruntime_inference_collection.py(319):             self._model_bytes = None
onnxruntime_inference_collection.py(326):         self._sess_options = sess_options
onnxruntime_inference_collection.py(327):         self._sess_options_initial = sess_options
onnxruntime_inference_collection.py(328):         self._enable_fallback = True
onnxruntime_inference_collection.py(329):         self._read_config_from_model = os.environ.get('ORT_LOAD_CONFIG_FROM_MODEL') == '1'
 --- modulename: _collections_abc, funcname: get
_collections_abc.py(659):         try:
_collections_abc.py(660):             return self[key]
 --- modulename: os, funcname: __getitem__
os.py(671):         try:
os.py(672):             value = self._data[self.encodekey(key)]
 --- modulename: os, funcname: encode
os.py(749):             if not isinstance(value, str):
os.py(751):             return value.encode(encoding, 'surrogateescape')
os.py(673):         except KeyError:
os.py(675):             raise KeyError(key) from None
_collections_abc.py(661):         except KeyError:
_collections_abc.py(662):             return default
onnxruntime_inference_collection.py(332):         disabled_optimizers = kwargs['disabled_optimizers'] if 'disabled_optimizers' in kwargs else None
onnxruntime_inference_collection.py(334):         try:
onnxruntime_inference_collection.py(335):             self._create_inference_session(providers, provider_options, disabled_optimizers)
 --- modulename: onnxruntime_inference_collection, funcname: _create_inference_session
onnxruntime_inference_collection.py(347):         available_providers = C.get_available_providers()
onnxruntime_inference_collection.py(350):         if 'TensorrtExecutionProvider' in available_providers:
onnxruntime_inference_collection.py(353):             self._fallback_providers = ['CPUExecutionProvider']
onnxruntime_inference_collection.py(356):         providers, provider_options = check_and_normalize_provider_args(providers,
onnxruntime_inference_collection.py(357):                                                                         provider_options,
onnxruntime_inference_collection.py(358):                                                                         available_providers)
onnxruntime_inference_collection.py(356):         providers, provider_options = check_and_normalize_provider_args(providers,
 --- modulename: onnxruntime_inference_collection, funcname: check_and_normalize_provider_args
onnxruntime_inference_collection.py(48):     if providers is None:
onnxruntime_inference_collection.py(49):         return [], []
onnxruntime_inference_collection.py(359):         if providers == [] and len(available_providers) > 1:
onnxruntime_inference_collection.py(366):         session_options = self._sess_options if self._sess_options else C.get_default_session_options()
onnxruntime_inference_collection.py(367):         if self._model_path:
onnxruntime_inference_collection.py(368):             sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)

I added some debugging output to verify self.model_file, which correctly points to the ribodetector_600k_variable_len70_101_epoch47.onnx file.

Any ideas?

sjaenick commented 2 years ago

It seems to be CPU-related somehow:

Hangs on:

model name      : AMD EPYC 7742 64-Core Processor

Seems to work:

model name      : Intel(R) Xeon(R) CPU E5-4627 v4 @ 2.60GHz

dawnmy commented 2 years ago

Thank you for reporting this issue. This is weird. I have tested it on my workstation with an AMD Ryzen without any issue, and AMD EPYC and Ryzen both use the Zen microarchitecture. I will investigate whether this is a bug in onnxruntime.

sjaenick commented 2 years ago

Thanks - let me know if I can do anything to narrow this down (but be aware I barely know any Python).

sjaenick commented 2 years ago

Ok, I reinstalled onnxruntime via pip (which also updated some other packages) and now it works. Feel free to close this issue.

dawnmy commented 2 years ago

That is great to hear. Could you run pip list in the environment where you installed RiboDetector? Then I can specify the versions of the required packages that worked for you when building the RiboDetector package for pip.

sjaenick commented 2 years ago

It is not a virtual environment, so the list is a little bit longer...

Package                          Version
-------------------------------- -------------------
absl-py                          0.13.0
antismash                        5.1.2
argcomplete                      1.12.3
argh                             0.26.2
astunparse                       1.6.3
bagit                            1.7.0
bcbio-gff                        0.6.6
bertax                           0.1
biom-format                      2.1.8
biopython                        1.78
BUSCO                            5.2.2
CacheControl                     0.11.7
cachetools                       4.2.2
certifi                          2020.12.5
chardet                          4.0.0
checkm-genome                    1.1.3
click                            7.1.2
CMSeq                            1.0.1
coloredlogs                      15.0
concoct                          1.1.0
cwltool                          3.0.20201203173111
cycler                           0.10.0
Cython                           0.29.21
decorator                        4.4.2
DendroPy                         4.4.0
flatbuffers                      2.0
flye                             2.8.3
future                           0.18.2
gast                             0.4.0
gffutils                         0.10.1
google-auth                      1.32.1
google-auth-oauthlib             0.4.4
google-pasta                     0.2.0
grpcio                           1.34.1
h5py                             3.1.0
helperlibs                       0.2.1
humanfriendly                    9.1
idna                             2.10
isodate                          0.6.0
Jinja2                           2.11.2
joblib                           0.16.0
Keras                            2.4.3
keras-bert                       0.88.0
keras-embed-sim                  0.9.0
keras-layer-normalization        0.15.0
keras-multi-head                 0.28.0
keras-nightly                    2.5.0.dev2021032900
keras-pos-embd                   0.12.0
keras-position-wise-feed-forward 0.7.0
Keras-Preprocessing              1.1.2
keras-self-attention             0.50.0
keras-transformer                0.39.0
kiwisolver                       1.2.0
lockfile                         0.12.2
lxml                             4.6.2
Markdown                         3.3.4
MarkupSafe                       2.0.1
matplotlib                       3.3.1
MetaPhlAn                        3.0
mistune                          0.8.4
mypy-extensions                  0.4.3
networkx                         2.5
nose                             1.3.7
numpy                            1.22.2
oauthlib                         3.1.1
onnxruntime                      1.10.0
opt-einsum                       3.3.0
pandas                           1.1.2
PhyloPhlAn                       3.0.0
Pillow                           7.2.0
pip                              22.0.3
protobuf                         3.19.4
prov                             1.5.1
psutil                           5.8.0
pyasn1                           0.4.8
pyasn1-modules                   0.2.8
pydot                            1.4.1
pyfaidx                          0.6.2
pyparsing                        2.4.7
pysam                            0.16.0.1
pyScss                           1.3.7
pysvg-py3                        0.2.2.post3
python-dateutil                  2.8.1
python-igraph                    0.9.7
pytz                             2020.1
PyYAML                           5.4.1
rdflib                           4.2.2
rdflib-jsonld                    0.5.0
requests                         2.25.1
requests-oauthlib                1.3.0
ribodetector                     0.2.3
rsa                              4.7.2
ruamel.yaml                      0.16.5
schema-salad                     7.0.20201119201711
scikit-learn                     0.23.2
scipy                            1.5.2
seaborn                          0.11.0
sepp                             4.5.1
setuptools                       51.1.1
shellescape                      3.4.1
simplejson                       3.17.5
six                              1.15.0
tensorboard                      2.5.0
tensorboard-data-server          0.6.1
tensorboard-plugin-wit           1.8.0
tensorflow                       2.5.0
tensorflow-estimator             2.5.0
termcolor                        1.1.0
texttable                        1.6.4
threadpoolctl                    2.1.0
torch                            1.7.1
tqdm                             4.62.3
typing-extensions                3.7.4.3
urllib3                          1.26.2
Werkzeug                         2.0.1
wheel                            0.35.1
wrapt                            1.12.1
dawnmy commented 2 years ago

Thank you for sharing the package version list. I will update the package soon.

dawnmy commented 2 years ago

Hi. I updated the dependency versions in the repo, but I haven't pushed the update to pip yet. You can install it with:

conda create -n ribodetector_0.2.4 python=3.8
conda activate ribodetector_0.2.4
git clone https://github.com/hzi-bifo/RiboDetector.git
cd RiboDetector
pip install .

Hope this update works without any issue.

dawnmy commented 2 years ago

I will close this issue as it seems to be solved.

dawnmy commented 2 years ago

@sjaenick Could you provide more details about how you solved this issue? The other open issue #9 seems to be related to this one: the multiprocessing in CPU mode has a compatibility issue with SLURM which hangs/freezes the process.

sjaenick commented 2 years ago

Nothing but pip3 install --force-reinstall onnxruntime

karl-az commented 2 years ago

Hi, I have a similar issue (also related to onnxruntime), but it results in a different error:

Traceback (most recent call last):
  File "/scratch/test/nextflow_work/conda/main-797bbd9e1c938f49c3dcfd3e78f623bf/bin/ribodetector_cpu", line 10, in <module>
    sys.exit(main())
  File "/scratch/test/nextflow_work/conda/main-797bbd9e1c938f49c3dcfd3e78f623bf/lib/python3.9/site-packages/ribodetector/detect_cpu.py", line 526, in main
    seq_pred.load_model()
  File "/scratch/test/nextflow_work/conda/main-797bbd9e1c938f49c3dcfd3e78f623bf/lib/python3.9/site-packages/ribodetector/detect_cpu.py", line 79, in load_model
    self.model = onnxruntime.InferenceSession(self.model_file, so)
  File "/scratch/test/nextflow_work/conda/main-797bbd9e1c938f49c3dcfd3e78f623bf/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 335, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/scratch/test/nextflow_work/conda/main-797bbd9e1c938f49c3dcfd3e78f623bf/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 368, in _create_inference_session
    sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
RuntimeError: /home/conda/feedstock_root/build_artifacts/onnxruntime_1639384799973/work/onnxruntime/core/platform/posix/env.cc:183 onnxruntime::{anonymous}::PosixThread::PosixThread(const char*, int, unsigned int (*)(int, Eigen::ThreadPoolInterface*), Eigen::ThreadPoolInterface*, const onnxruntime::ThreadOptions&) pthread_setaffinity_np failed, error code: 0 error msg:

Let me know if this goes into a separate issue.

I installed version 0.2.3 through bioconda together with 1.10.0 of onnxruntime (build py39h15e0acf_2).

The command works when I run it locally, but fails when submitted to SLURM. For me it is also different CPUs, but both are from Intel.

I found https://github.com/microsoft/onnxruntime/issues/8313, which hints that at least my error could have something to do with sleeping CPUs.
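Not part of the original report, but a quick stdlib-only probe can show whether the SLURM job's CPU affinity mask differs from the node's CPU count, which is the situation the linked issue suggests trips up onnxruntime's thread setup (a sketch under that assumption):

```python
import os

# Logical CPUs on the node vs. CPUs this process may actually be
# scheduled on. Under SLURM, sched_getaffinity(0) reflects the
# affinity mask derived from --cpus-per-task; a large gap between
# the two numbers is the condition suspected to break onnxruntime's
# thread-pool auto-detection.
total = os.cpu_count()
allowed = len(os.sched_getaffinity(0)) if hasattr(os, "sched_getaffinity") else total

print(f"logical CPUs on node : {total}")
print(f"CPUs allowed for job : {allowed}")
```

Running this inside the failing SLURM job and comparing it with a local run would confirm or rule out the affinity hypothesis.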

dawnmy commented 2 years ago

@karl-az I think these, including #9, are all the same issue related to onnxruntime's SLURM compatibility. RiboDetector works fine with other task management systems, e.g. PBS and SGE. I opened an issue in the onnxruntime repo a few days ago: https://github.com/microsoft/onnxruntime/issues/10736. I hope I can get some clues there. I can update the code to let RiboDetector run on SLURM, but then only 2 CPUs can be fully utilized (details can be seen in the onnxruntime issue).

sjaenick commented 2 years ago

For completeness, I can confirm that the run that just got stuck also was performed within an (interactive) SLURM job.

dawnmy commented 2 years ago

For completeness, I can confirm that the run that just got stuck also was performed within an (interactive) SLURM job.

Yes, I tried this as well with an srun interactive job and the job froze. If you run it independently of SLURM, i.e. directly over ssh, everything is fine. I assume this SLURM-related issue is only present in CPU mode; GPU mode should be fine.

karl-az commented 2 years ago

I'm able to resolve this by uncommenting this row: https://github.com/hzi-bifo/RiboDetector/blob/a49a05400ea5103fcd4781de3d967b514ce60654/ribodetector/detect_cpu.py#L75 I tried with either 1 or 2 for this setting and both execute nicely through SLURM, utilizing 5 cores. (L76 is not needed)

As I understand the definition of intra_op_num_threads (http://www.xavierdupre.fr/app/onnxruntime/helpsphinx/api_summary.html), it defines the number of threads per worker. It defaults to 0, which lets onnxruntime auto-detect. My suspicion is that the auto-detection somehow steps outside the SLURM "sandbox", trying to use unreserved cores, and that comes across as trying to activate "sleeping" CPUs.

dawnmy commented 2 years ago

@karl-az Thank you for sharing your solution. Yes, you are right: your assumption that the auto-detection "somehow steps outside the SLURM sandbox" is very reasonable, and I agree with it. I have also tried setting a non-zero value for intra_op_num_threads; details can be found in https://github.com/microsoft/onnxruntime/issues/10736. However, the total CPU load (summed over all processes) was only 200% no matter how many CPUs (-t) I specified. Could you check the total CPU load?

karl-az commented 2 years ago

I see... Running it locally and checking with htop, I see lower CPU utilization for the different workers: 5x60%, 7x45%, and 12x30%. This is with intra_op_num_threads = 1 and lands at roughly 300% in total. Could the process be bound by something else?

dawnmy commented 2 years ago

I see... Running it locally and checking with htop, I see lower CPU utilization for the different workers: 5x60%, 7x45%, and 12x30%. This is with intra_op_num_threads = 1 and lands at roughly 300% in total. Could the process be bound by something else?

Now I have figured out why it used only 200% or 300% CPU. If you run a SLURM task without setting --cpus-per-task, it will use the default number of CPUs preconfigured by the admin. If -t is set larger than the default SLURM --cpus-per-task, the CPU load will be lower than expected.

So if you run ribodetector with SLURM, you should set --cpus-per-task to what you want. E.g., for interactive mode, start the session with:

srun --qos interactive --cpus-per-task {number of CPUs you need} --threads-per-core 1 --pty /bin/bash

Then run:

ribodetector_cpu -t {number of CPUs you need} ...

The current version of ribodetector needs to be updated, i.e. to set intra_op_num_threads = 1. I will update the repo soon. If it is urgent, you can modify intra_op_num_threads yourself.

dawnmy commented 2 years ago

The latest release v0.2.4 solves this issue. Please update to v0.2.4 with:

pip install ribodetector -U

karl-az commented 2 years ago

Thank you very much! I will pick it up when the release reaches bioconda.

dawnmy commented 2 years ago

It is available on bioconda now.

karl-az commented 2 years ago

Can confirm that it works for me. Thank you!