Closed sjaenick closed 2 years ago
It seems to be CPU-related somehow:
Hangs on:
model name : AMD EPYC 7742 64-Core Processor
Seems to work:
model name : Intel(R) Xeon(R) CPU E5-4627 v4 @ 2.60GHz
Thank you for reporting this issue. This is weird. I have tested it on my workstation with AMD Ryzen without any issue. AMD EPYC and Ryzen both use the Zen microarchitecture. Will investigate whether this is a bug in onnxruntime.
Thanks - let me know if I can do anything to narrow this down (but be aware I barely know any Python).
Ok, I reinstalled onnxruntime via pip (which also updated some other packages) and now it works. Feel free to close this issue.
That is great to hear. Could you run pip list in the environment where you installed RiboDetector? Then I can specify the versions of the required packages that worked for you when building the RiboDetector package for pip.
Not a virtual environment, so the list is a little bit longer:
Package Version
-------------------------------- -------------------
absl-py 0.13.0
antismash 5.1.2
argcomplete 1.12.3
argh 0.26.2
astunparse 1.6.3
bagit 1.7.0
bcbio-gff 0.6.6
bertax 0.1
biom-format 2.1.8
biopython 1.78
BUSCO 5.2.2
CacheControl 0.11.7
cachetools 4.2.2
certifi 2020.12.5
chardet 4.0.0
checkm-genome 1.1.3
click 7.1.2
CMSeq 1.0.1
coloredlogs 15.0
concoct 1.1.0
cwltool 3.0.20201203173111
cycler 0.10.0
Cython 0.29.21
decorator 4.4.2
DendroPy 4.4.0
flatbuffers 2.0
flye 2.8.3
future 0.18.2
gast 0.4.0
gffutils 0.10.1
google-auth 1.32.1
google-auth-oauthlib 0.4.4
google-pasta 0.2.0
grpcio 1.34.1
h5py 3.1.0
helperlibs 0.2.1
humanfriendly 9.1
idna 2.10
isodate 0.6.0
Jinja2 2.11.2
joblib 0.16.0
Keras 2.4.3
keras-bert 0.88.0
keras-embed-sim 0.9.0
keras-layer-normalization 0.15.0
keras-multi-head 0.28.0
keras-nightly 2.5.0.dev2021032900
keras-pos-embd 0.12.0
keras-position-wise-feed-forward 0.7.0
Keras-Preprocessing 1.1.2
keras-self-attention 0.50.0
keras-transformer 0.39.0
kiwisolver 1.2.0
lockfile 0.12.2
lxml 4.6.2
Markdown 3.3.4
MarkupSafe 2.0.1
matplotlib 3.3.1
MetaPhlAn 3.0
mistune 0.8.4
mypy-extensions 0.4.3
networkx 2.5
nose 1.3.7
numpy 1.22.2
oauthlib 3.1.1
onnxruntime 1.10.0
opt-einsum 3.3.0
pandas 1.1.2
PhyloPhlAn 3.0.0
Pillow 7.2.0
pip 22.0.3
protobuf 3.19.4
prov 1.5.1
psutil 5.8.0
pyasn1 0.4.8
pyasn1-modules 0.2.8
pydot 1.4.1
pyfaidx 0.6.2
pyparsing 2.4.7
pysam 0.16.0.1
pyScss 1.3.7
pysvg-py3 0.2.2.post3
python-dateutil 2.8.1
python-igraph 0.9.7
pytz 2020.1
PyYAML 5.4.1
rdflib 4.2.2
rdflib-jsonld 0.5.0
requests 2.25.1
requests-oauthlib 1.3.0
ribodetector 0.2.3
rsa 4.7.2
ruamel.yaml 0.16.5
schema-salad 7.0.20201119201711
scikit-learn 0.23.2
scipy 1.5.2
seaborn 0.11.0
sepp 4.5.1
setuptools 51.1.1
shellescape 3.4.1
simplejson 3.17.5
six 1.15.0
tensorboard 2.5.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.0
tensorflow 2.5.0
tensorflow-estimator 2.5.0
termcolor 1.1.0
texttable 1.6.4
threadpoolctl 2.1.0
torch 1.7.1
tqdm 4.62.3
typing-extensions 3.7.4.3
urllib3 1.26.2
Werkzeug 2.0.1
wheel 0.35.1
wrapt 1.12.1
Thank you for sharing the package version list. I will update the package soon.
Hi. I updated the dependency versions in the repo but I haven't updated it in pip. You can install it with:
conda create -n ribodetector_0.2.4 python=3.8
conda activate ribodetector_0.2.4
git clone https://github.com/hzi-bifo/RiboDetector.git
cd RiboDetector
pip install .
Hope this update will work without any issue.
I will close this issue as it seems to be solved.
@sjaenick Could you provide more details about how you solved this issue? It seems the other open issue #9 is related to this one. The multiprocessing in CPU mode has a compatibility issue with SLURM which hangs/freezes the process.
Nothing but pip3 install --force-reinstall onnxruntime
Hi, I have a similar issue (also related to onnxruntime), but it results in a different error:
Traceback (most recent call last):
File "/scratch/test/nextflow_work/conda/main-797bbd9e1c938f49c3dcfd3e78f623bf/bin/ribodetector_cpu", line 10, in <module>
sys.exit(main())
File "/scratch/test/nextflow_work/conda/main-797bbd9e1c938f49c3dcfd3e78f623bf/lib/python3.9/site-packages/ribodetector/detect_cpu.py", line 526, in main
seq_pred.load_model()
File "/scratch/test/nextflow_work/conda/main-797bbd9e1c938f49c3dcfd3e78f623bf/lib/python3.9/site-packages/ribodetector/detect_cpu.py", line 79, in load_model
self.model = onnxruntime.InferenceSession(self.model_file, so)
File "/scratch/test/nextflow_work/conda/main-797bbd9e1c938f49c3dcfd3e78f623bf/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 335, in __init__
self._create_inference_session(providers, provider_options, disabled_optimizers)
File "/scratch/test/nextflow_work/conda/main-797bbd9e1c938f49c3dcfd3e78f623bf/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 368, in _create_inference_session
sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
RuntimeError: /home/conda/feedstock_root/build_artifacts/onnxruntime_1639384799973/work/onnxruntime/core/platform/posix/env.cc:183 onnxruntime::{anonymous}::PosixThread::PosixThread(const char*, int, unsigned int (*)(int, Eigen::ThreadPoolInterface*), Eigen::ThreadPoolInterface*, const onnxruntime::ThreadOptions&) pthread_setaffinity_np failed, error code: 0 error msg:
Let me know if this goes into a separate issue.
I installed version 0.2.3 through bioconda together with 1.10.0 of onnxruntime (build py39h15e0acf_2).
The command works when I run it locally, but fails when submitting it to SLURM. In my case the CPUs also differ, but both are from Intel.
I found this: https://github.com/microsoft/onnxruntime/issues/8313, which hints that my error, at least, could have something to do with sleeping CPUs.
@karl-az I think these, including #9, are all the same issue related to onnxruntime SLURM compatibility. RiboDetector works fine with other task management systems, e.g. PBS and SGE. I opened an issue in the onnxruntime repo a few days ago: https://github.com/microsoft/onnxruntime/issues/10736. I hope I can get some clue there. I can update the code to let RiboDetector run on SLURM, but then only 2 CPUs can be fully utilized (details can be seen in the issue in the onnxruntime repo).
For completeness, I can confirm that the run that just got stuck also was performed within an (interactive) SLURM job.
Yes, I tried this as well with an interactive srun job, and the job froze. If you run it independently of SLURM, i.e. directly over SSH, everything is fine. I assume this SLURM-related issue is only present in CPU mode; GPU mode should be fine.
I was able to resolve this by uncommenting this row: https://github.com/hzi-bifo/RiboDetector/blob/a49a05400ea5103fcd4781de3d967b514ce60654/ribodetector/detect_cpu.py#L75 I tried with either 1 or 2 for this setting and both execute nicely through SLURM, utilizing 5 cores. (Line 76 is not needed.)
As I understand the definition of intra_op_num_threads (http://www.xavierdupre.fr/app/onnxruntime/helpsphinx/api_summary.html), it defines the number of threads per worker. It defaults to 0, which allows onnxruntime to auto-detect. My suspicion is that the auto-detection somehow steps outside the SLURM "sandbox", trying to use unreserved cores, and that comes across as trying to activate "sleeping" CPUs.
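The suspected mismatch can be observed with Python's standard library alone: os.cpu_count() reports every core on the machine, while os.sched_getaffinity(0) (Linux only) reports only the cores the current process is actually allowed to use, i.e. what a SLURM allocation grants. A minimal sketch of the check:

```python
import os

# All cores the machine reports, regardless of any affinity mask.
total_cores = os.cpu_count()

# Cores the current process may actually run on; under SLURM this
# reflects --cpus-per-task (os.sched_getaffinity is Linux-only).
allowed_cores = len(os.sched_getaffinity(0))

print(f"machine reports {total_cores} cores, process may use {allowed_cores}")

# A thread pool sized from total_cores can try to pin threads to cores
# outside the allowed set, which is one way pthread_setaffinity_np fails.
assert allowed_cores <= total_cores
```

Inside a SLURM job the two numbers typically differ, which would explain why a thread count auto-detected from the machine total misbehaves.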
@karl-az Thank you for sharing your solution. Yes, you are right. Your assumption that the "auto-detect somehow steps outside the SLURM sandbox" is very reasonable; I agree with it. I have also tried setting a non-zero value for intra_op_num_threads; details can be found in https://github.com/microsoft/onnxruntime/issues/10736. However, the total CPU load (sum over all processes) was only 200% no matter how many CPUs (-t) I specified. Could you check the total CPU load?
I see... Running it locally and checking with htop, I see lower CPU utilization for the different workers: 5x60%, 7x45%, and 12x30%. This is with intra_op_num_threads = 1 and lands at roughly 300% in total. Could the process be bound by something else?
Now I have figured out why it used only 200% or 300% CPU. If you run a SLURM task without setting --cpus-per-task, it will use the default number of CPUs preconfigured by the admin. If -t is set larger than the default SLURM --cpus-per-task, the CPU load will be lower than expected.
So if you run ribodetector with SLURM, you should set --cpus-per-task to the number of CPUs you want. For example, for interactive mode, start the session with:
srun --qos interactive --cpus-per-task {number of CPUs you need} --threads-per-core 1 --pty /bin/bash
Then run ribodetector_cpu -t {number of CPUs you need} ... . The current version of ribodetector still needs to be updated, i.e. intra_op_num_threads should be changed to 1. I will update the repo soon. If it is urgent, you can modify intra_op_num_threads yourself.
The latest release v0.2.4 solves this issue. Please update to v0.2.4 with:
pip install ribodetector -U
Thank you very much! I will pick it up when the release reaches bioconda.
It is available on bioconda now.
Can confirm that it works for me. Thank you!
python 3.8.5, ribodetector 0.2.3 (installed via pip), on Ubuntu 20.04 LTS, invoked on a public dataset; process just hangs after a few seconds, no CPU consumption at all, and can't be cancelled via Ctrl-C (needs to be killed instead).
When invoked with python -m trace --trace, it seems to get stuck in the CPU detection step. I added some debugging output to verify self.model_file, which correctly points to the ribodetector_600k_variable_len70_101_epoch47.onnx file. Any ideas?