Update: I tried removing all my cache and re-running everything, but I am still seeing the same error:
CM error: Conflicting branches between version assigned and user specified.!
Update: I tried updating my conda Python env to 3.11.7 to match the system-wide Python. I also removed all the CM cache, but I am still getting the same error.
Hi @zixianwang2022, can you please do cm pull repo and retry? We had to temporarily switch to the dev branch of inference-src because for SCC24 we need to dump 10 images during the accuracy run.
Hello @arjunsuresh, we are seeing a couple of different errors after switching to the dev branch.
We first ran:
cm rm cache --tags=inference,src -f
Then we ran the command below to get a performance estimate; it did not produce any errors.
# Performance Estimation for Offline Scenario
cm run script --tags=run-mlperf,inference,_find-performance,_r4.1-dev,_short,_scc24-base \
--model=sdxl \
--implementation=reference \
--framework=pytorch \
--category=datacenter \
--scenario=Offline \
--execution_mode=test \
--device=rocm \
--quiet \
--precision=float16
Then we ran the actual performance test with the following command:
# Formal run
# Official Implementation
cm run script --tags=run-mlperf,inference,_r4.1-dev,_short,_scc24-base \
--model=sdxl \
--implementation=reference \
--framework=pytorch \
--category=datacenter \
--scenario=Offline \
--execution_mode=test \
--device=rocm \
--quiet --precision=float16
We are seeing the following error
INFO:root:Path to Python: /liqid/miniconda3/envs/mlperf_2/bin/python3
INFO:root:Python version: 3.10.15
INFO:root: * cm run script "get compiler"
INFO:root: ! load /liqid/CM/repos/local/cache/fbd4dfd9d7d34750/cm-cached-state.json
INFO:root: * cm run script "get generic-python-lib _package.dmiparser"
INFO:root: ! load /liqid/CM/repos/local/cache/820dba1ebe2e4ff8/cm-cached-state.json
INFO:root: * cm run script "get cache dir _name.mlperf-inference-sut-descriptions"
INFO:root: ! load /liqid/CM/repos/local/cache/1855764319f64f80/cm-cached-state.json
Generating SUT description file for aqua-pytorch-2.6.0.dev20241110
INFO:root: ! call "postprocess" from /liqid/CM/repos/mlcommons@cm4mlops/script/get-mlperf-inference-sut-description/customize.py
INFO:root:* cm run script "run accuracy mlperf _coco2014"
INFO:root: * cm run script "get python3"
INFO:root: ! load /liqid/CM/repos/local/cache/349b8dcd2512411a/cm-cached-state.json
INFO:root:Path to Python: /liqid/miniconda3/envs/mlperf_2/bin/python3
INFO:root:Python version: 3.10.15
INFO:root: * cm run script "get mlcommons inference src"
INFO:root: ! load /liqid/CM/repos/local/cache/005adf1880f0488d/cm-cached-state.json
INFO:root: * cm run script "get dataset coco2014 original _size.50 _with-sample-ids"
INFO:root: ! load /liqid/CM/repos/local/cache/1553754d81f94d36/cm-cached-state.json
INFO:root: * cm run script "get generic-python-lib _package.ijson"
INFO:root: ! load /liqid/CM/repos/local/cache/cae3dcb4aeb34f7c/cm-cached-state.json
INFO:root: * cm run script "get generic-python-lib _package.numpy"
INFO:root: ! load /liqid/CM/repos/local/cache/8ab7fa8c9c314ea1/cm-cached-state.json
INFO:root: ! cd /liqid/CM/repos/local/cache/7ddeb00397da4c1e
INFO:root: ! call /liqid/CM/repos/mlcommons@cm4mlops/script/process-mlperf-accuracy/run.sh from tmp-run.sh
/liqid/miniconda3/envs/mlperf_2/bin/python3 '/liqid/CM/repos/local/cache/b227a0eae6114495/inference/text_to_image/tools/accuracy_coco.py' --mlperf-accuracy-file '/liqid/CM/repos/local/cache/235eeac641f54534/test_results/aqua-reference-rocm-pytorch-v2.6.0.dev20241110-scc24-base/stable-diffusion-xl/offline/accuracy/mlperf_log_accuracy.json' --caption-path '/liqid/CM/repos/local/cache/b227a0eae6114495/inference/text_to_image/coco2014/captions/captions_source.tsv' --compliance-images-path /liqid/CM/repos/local/cache/235eeac641f54534/test_results/aqua-reference-rocm-pytorch-v2.6.0.dev20241110-scc24-base/stable-diffusion-xl/offline/accuracy/images > '/liqid/CM/repos/local/cache/235eeac641f54534/test_results/aqua-reference-rocm-pytorch-v2.6.0.dev20241110-scc24-base/stable-diffusion-xl/offline/accuracy/accuracy.txt'
Traceback (most recent call last):
File "/liqid/CM/repos/local/cache/b227a0eae6114495/inference/text_to_image/tools/accuracy_coco.py", line 17, in <module>
from fid.fid_score import (
File "/liqid/CM/repos/local/cache/b227a0eae6114495/inference/text_to_image/tools/fid/fid_score.py", line 47, in <module>
sys.path.insert("..", 0)
TypeError: 'str' object cannot be interpreted as an integer
CM error: Portable CM script failed (name = process-mlperf-accuracy, return code = 256)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that it is often a portability issue of a third-party tool or a native script
wrapped and unified by this CM script (automation recipe). Please re-run
this script with --repro flag and report this issue with the original
command line, cm-repro directory and full log here:
https://github.com/mlcommons/cm4mlops/issues
The CM concept is to collaboratively fix such issues inside portable CM scripts
to make existing tools and native scripts more portable, interoperable
and deterministic. Thank you!
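For what it's worth, the TypeError itself comes from a swapped argument order in fid_score.py rather than from our environment: list.insert expects the index first and the item second. A minimal sketch of the failing line versus the usual idiom:

import sys
# What the traceback shows at fid_score.py line 47 (this raises the TypeError):
# sys.path.insert("..", 0)
# The usual idiom: index first, then the path to prepend to the module search path:
sys.path.insert(0, "..")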
Please let me know if you need any more information and/or access to our cluster. Thank you so much for your help!
Update:
It seems the issue was that installing loadgen removed our ROCm build of torch and installed the CUDA build instead. I reset my environment and deleted all the previous Python and inference caches, and it now appears to be running.
However, I am getting a TypeError at the last step, which seems to be the collection of the performance metrics:
INFO:root:Python version: 3.11.7
INFO:root: * cm run script "get compiler"
INFO:root: ! load /liqid/CM/repos/local/cache/fbd4dfd9d7d34750/cm-cached-state.json
INFO:root: * cm run script "get generic-python-lib _package.dmiparser"
INFO:root: ! load /liqid/CM/repos/local/cache/dd4e67ac286f4fcf/cm-cached-state.json
INFO:root: * cm run script "get cache dir _name.mlperf-inference-sut-descriptions"
INFO:root: ! load /liqid/CM/repos/local/cache/1855764319f64f80/cm-cached-state.json
Generating SUT description file for aqua-pytorch-2.6.0.dev20241109
INFO:root: ! call "postprocess" from /liqid/CM/repos/mlcommons@cm4mlops/script/get-mlperf-inference-sut-description/customize.py
SUT: aqua-reference-rocm-pytorch-v2.6.0.dev20241109-scc24-base, model: stable-diffusion-xl, scenario: Offline, target_qps updated as 0.470819
New config stored in /liqid/CM/repos/local/cache/7d7e33340faa4aaf/aqua/reference-implementation/rocm-device/pytorch-framework/framework-version-v2.6.0.dev20241109/scc24-base-config.yaml
Traceback (most recent call last):
File "/liqid/miniconda3/envs/mlperf_1/bin/cm", line 8, in <module>
sys.exit(run())
^^^^^
File "/liqid/miniconda3/envs/mlperf_1/lib/python3.11/site-packages/cmind/cli.py", line 37, in run
r = cm.access(argv, out='con')
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/liqid/miniconda3/envs/mlperf_1/lib/python3.11/site-packages/cmind/core.py", line 609, in access
r = action_addr(i)
^^^^^^^^^^^^^^
File "/liqid/CM/repos/mlcommons@cm4mlops/automation/script/module.py", line 212, in run
r = self._run(i)
^^^^^^^^^^^^
File "/liqid/CM/repos/mlcommons@cm4mlops/automation/script/module.py", line 1477, in _run
r = customize_code.preprocess(ii)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/liqid/CM/repos/mlcommons@cm4mlops/script/run-mlperf-inference-app/customize.py", line 231, in preprocess
r = cm.access(ii)
^^^^^^^^^^^^^
File "/liqid/miniconda3/envs/mlperf_1/lib/python3.11/site-packages/cmind/core.py", line 1297, in access
return cm.access(i)
^^^^^^^^^^^^
File "/liqid/miniconda3/envs/mlperf_1/lib/python3.11/site-packages/cmind/core.py", line 609, in access
r = action_addr(i)
^^^^^^^^^^^^^^
File "/liqid/CM/repos/mlcommons@cm4mlops/automation/script/module.py", line 212, in run
r = self._run(i)
^^^^^^^^^^^^
File "/liqid/CM/repos/mlcommons@cm4mlops/automation/script/module.py", line 1552, in _run
r = prepare_and_run_script_with_postprocessing(run_script_input)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/liqid/CM/repos/mlcommons@cm4mlops/automation/script/module.py", line 4737, in prepare_and_run_script_with_postprocessing
rr = run_postprocess(customize_code, customize_common_input, recursion_spaces, env, state, const,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/liqid/CM/repos/mlcommons@cm4mlops/automation/script/module.py", line 4787, in run_postprocess
r = customize_code.postprocess(ii)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/liqid/CM/repos/mlcommons@cm4mlops/script/app-mlperf-inference/customize.py", line 233, in postprocess
result, valid, power_result = mlperf_utils.get_result_from_log(env['CM_MLPERF_LAST_RELEASE'], model, scenario, output_dir, mode)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/liqid/CM/repos/mlcommons@cm4mlops/script/get-mlperf-inference-utils/mlperf_utils.py", line 23, in get_result_from_log
result_ = checker.get_performance_metric(config, mlperf_model, result_path, scenario, None, None, has_power)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: get_performance_metric() takes 4 positional arguments but 7 were given
(mlperf_1) [ziw081@aqua ~]$
Here is the command I ran:
cm run script --tags=run-mlperf,inference,_r4.1-dev,_short,_scc24-base \
--model=sdxl \
--implementation=reference \
--framework=pytorch \
--category=datacenter \
--scenario=Offline \
--execution_mode=test \
--device=rocm \
--quiet --precision=float16 --env.CM_GET_PLATFORM_DETAILS=no
Hi @zixianwang2022, are you using the latest version of cm4mlops (after a cm pull repo)? The latest code should look like this.
Thanks @arjunsuresh ,
(mlperf_1) [ziw081@aqua ~]$ cm pull repo
=======================================================
Alias: mlcommons@cm4mlops
Local path: /liqid/CM/repos/mlcommons@cm4mlops
git pull
Already up to date.
CM alias for this repository: mlcommons@cm4mlops
=======================================================
Reindexing all CM artifacts. Can take some time ...
Took 0.5 sec.
(mlperf_1) [ziw081@aqua ~]$
We have been using the dev branch, as you described before.
(base) [ziw081@aqua mlcommons@cm4mlops]$ git branch
* dev
main
mlperf-inference
It looks a bit different from the mlperf-inference branch in your link above. Should we use the mlperf-inference branch or the dev branch?
Yes, use the mlperf-inference branch only. The dev branch is for the "inference" repository, which is handled automatically in the workflow.
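If it helps, here is a minimal sketch of switching the local automation repo to that branch, assuming the clone lives at the path shown in the cm pull repo output above:

# Switch the cm4mlops clone to the mlperf-inference branch and reindex
cd /liqid/CM/repos/mlcommons@cm4mlops
git checkout mlperf-inference
git pull
cm pull repo   # pulls the checked-out branch again and reindexes the CM artifacts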
Hello @arjunsuresh ,
We are now able to run with the official inference implementation. However, when we try to use our custom implementation, it fails looking for a dev branch in our custom inference repository on GitHub. We don't have a dev branch in our fork of inference, and it does not seem to use the test branch I specified.
INFO:root:Path to Python: /liqid/miniconda3/envs/mlperf_1/bin/python3
INFO:root:Python version: 3.11.7
INFO:root: * cm run script "get git repo _branch.dev _repo.https://github.com/zixianwang2022/mlperf-scc24"
INFO:root: * cm run script "detect os"
INFO:root: ! cd /liqid/CM/repos/local/cache/7c8b5fe21470423f
INFO:root: ! call /liqid/CM/repos/mlcommons@cm4mlops/script/detect-os/run.sh from tmp-run.sh
INFO:root: ! call "postprocess" from /liqid/CM/repos/mlcommons@cm4mlops/script/detect-os/customize.py
INFO:root: ! cd /liqid/CM/repos/local/cache/7c8b5fe21470423f
INFO:root: ! call /liqid/CM/repos/mlcommons@cm4mlops/script/get-git-repo/run.sh from tmp-run.sh
/liqid/CM/repos/local/cache/7c8b5fe21470423f
rm -rf inference
******************************************************
Current directory: /liqid/CM/repos/local/cache/7c8b5fe21470423f
Cloning mlperf-scc24 from https://github.com/zixianwang2022/mlperf-scc24
git clone -b dev https://github.com/zixianwang2022/mlperf-scc24 --depth 5 inference
Cloning into 'inference'...
warning: Could not find remote branch dev to clone.
fatal: Remote branch dev not found in upstream origin
Cloning into 'inference'...
warning: Could not find remote branch dev to clone.
fatal: Remote branch dev not found in upstream origin
CM error: Portable CM script failed (name = get-git-repo, return code = 256)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that it is often a portability issue of a third-party tool or a native script
wrapped and unified by this CM script (automation recipe). Please re-run
this script with --repro flag and report this issue with the original
command line, cm-repro directory and full log here:
https://github.com/mlcommons/cm4mlops/issues
The CM concept is to collaboratively fix such issues inside portable CM scripts
to make existing tools and native scripts more portable, interoperable
and deterministic. Thank you!
Command to run:
# Custom Implementation
# Formal run
cm run script --tags=run-mlperf,inference,_r4.1-dev,_short,_scc24-base \
--model=sdxl \
--framework=pytorch \
--category=datacenter \
--scenario=Offline \
--execution_mode=test \
--device=rocm \
--quiet --precision=float16 \
--adr.mlperf-implementation.tags=_branch.test,_repo.https://github.com/zixianwang2022/mlperf-scc24 --adr.mlperf-implementation.version=custom --env.CM_GET_PLATFORM_DETAILS=no
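For reference, if the _branch.test override were honored, the underlying clone should be the test-branch equivalent of the command shown in the log above, roughly:

# Expected clone once the branch override is picked up (same form as in the log, only the branch differs)
git clone -b test https://github.com/zixianwang2022/mlperf-scc24 --depth 5 inference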
Hi @zixianwang2022, I just removed the "dev" branch in the workflow. But your forked branch must still be kept up to date with the "dev" branch of mlcommons/inference to pull in the latest changes. If so, you can just do cm pull repo now.
Hi @arjunsuresh, I am not sure why, but it is still giving me the same error. I have deleted all my inference cache and done a cm pull repo. Do you have any clues?
(mlperf_1) [ziw081@aqua ~]$ cm rm cache --tags=inference,src -f
CM error: artifact(s) not found!
(mlperf_1) [ziw081@aqua ~]$ cm rm cache --tags=inference,src -f
CM error: artifact(s) not found!
(mlperf_1) [ziw081@aqua ~]$ cm rm cache --tags=inference -f
CM error: artifact(s) not found!
(mlperf_1) [ziw081@aqua ~]$ cm rm cache --tags=python -f
CM error: artifact(s) not found!
(mlperf_1) [ziw081@aqua ~]$ cm pull repo
=======================================================
Alias: mlcommons@cm4mlops
Local path: /liqid/CM/repos/mlcommons@cm4mlops
git pull
Already up to date.
CM alias for this repository: mlcommons@cm4mlops
=======================================================
Reindexing all CM artifacts. Can take some time ...
Took 0.5 sec.
(mlperf_1) [ziw081@aqua ~]$
INFO:root:Path to Python: /liqid/miniconda3/envs/mlperf_1/bin/python3
INFO:root:Python version: 3.11.7
INFO:root: * cm run script "get git repo _branch.dev _repo.https://github.com/zixianwang2022/mlperf-scc24"
INFO:root: * cm run script "detect os"
INFO:root: ! cd /liqid/CM/repos/local/cache/cd0cb819c5444147
INFO:root: ! call /liqid/CM/repos/mlcommons@cm4mlops/script/detect-os/run.sh from tmp-run.sh
INFO:root: ! call "postprocess" from /liqid/CM/repos/mlcommons@cm4mlops/script/detect-os/customize.py
INFO:root: ! cd /liqid/CM/repos/local/cache/cd0cb819c5444147
INFO:root: ! call /liqid/CM/repos/mlcommons@cm4mlops/script/get-git-repo/run.sh from tmp-run.sh
/liqid/CM/repos/local/cache/cd0cb819c5444147
rm -rf inference
******************************************************
Current directory: /liqid/CM/repos/local/cache/cd0cb819c5444147
Cloning mlperf-scc24 from https://github.com/zixianwang2022/mlperf-scc24
git clone -b dev https://github.com/zixianwang2022/mlperf-scc24 --depth 5 inference
Cloning into 'inference'...
warning: Could not find remote branch dev to clone.
fatal: Remote branch dev not found in upstream origin
Cloning into 'inference'...
warning: Could not find remote branch dev to clone.
fatal: Remote branch dev not found in upstream origin
CM error: Portable CM script failed (name = get-git-repo, return code = 256)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that it is often a portability issue of a third-party tool or a native script
wrapped and unified by this CM script (automation recipe). Please re-run
this script with --repro flag and report this issue with the original
command line, cm-repro directory and full log here:
https://github.com/mlcommons/cm4mlops/issues
The CM concept is to collaboratively fix such issues inside portable CM scripts
to make existing tools and native scripts more portable, interoperable
and deterministic. Thank you!
My run command:
cm run script --tags=run-mlperf,inference,_r4.1-dev,_short,_scc24-base \
--model=sdxl \
--framework=pytorch \
--category=datacenter \
--scenario=Offline \
--execution_mode=test \
--device=rocm \
--quiet --precision=float16 \
--adr.mlperf-implementation.tags=_branch.test,_repo.https://github.com/zixianwang2022/mlperf-scc24 --adr.mlperf-implementation.version=custom --env.CM_GET_PLATFORM_DETAILS=no
Was cm pull repo successful?
Oh wait, I misread your message - cm pull was successful. Is it possible to create a dev branch under my repository that is not the dev branch from the original source code? I deleted all the git init files when I cloned it.
Hello @arjunsuresh, it seems like it is still reading content from the dev branch of my repo instead of the test branch that I specified on the command line. Is that expected?
My bad - it was hardcoded in one more place. Can you please try now?
We are able to run it now! @arjunsuresh
One last question: how do we specify the command-line flags that need to be passed into our custom main.py?
That's great @zixianwang2022. The run command is populated here; you can modify it accordingly. Anything you pass via --env. will be available in the env dictionary inside customize.py.
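Here is a minimal sketch of what that looks like inside a script's customize.py; the flag name CM_MY_CUSTOM_FLAG and the way it is appended to the run command are only placeholders for illustration:

# Hypothetical example: passing --env.CM_MY_CUSTOM_FLAG=5 on the cm command line
# makes the value available as a string in the env dictionary of preprocess().
def preprocess(i):
    env = i['env']
    my_flag = env.get('CM_MY_CUSTOM_FLAG', '')
    if my_flag:
        # Forward it to the benchmark invocation, e.g. your custom main.py
        env['CM_RUN_CMD'] = env.get('CM_RUN_CMD', '') + ' --my-custom-flag ' + my_flag
    return {'return': 0}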
Thank you Arjun!
Hi, I am now running CM on a different system where I have sudo access. I am running the following command, but I get a
Python version: 3.11.10 Conflicting branches between version assigned and user specified.!
error that I never experienced before. My conda env has Python 3.11 and the system-wide Python is also 3.11. We tried the following combinations of system-wide and conda Python versions and got the same error for all of them:
System-wide: 3.9; Conda: 3.10
System-wide: 3.9; Conda: 3.11
System-wide: 3.11; Conda: 3.11
Do you have any clues?
My conda environment and CM are freshly built, and I verified that the benchmark runs manually in inference/text_to_image.
Here is my command to run cm.
Here is the error.
Thank you so much for your patience and for helping us out!