mlcommons / cm4mlops

A collection of portable, reusable and cross-platform automation recipes (CM scripts) to make it easier to build and benchmark AI systems across diverse models, data sets, software and hardware
http://docs.mlcommons.org/cm4mlops/
Apache License 2.0

CM error: Conflicting branches between version assigned and user specified.! #513

Closed · zixianwang2022 closed 1 week ago

zixianwang2022 commented 2 weeks ago

Hi, I am running CM on a different system where I now have sudo access. I am running the following command, but I get a `CM error: Conflicting branches between version assigned and user specified.!` that I have never seen before. My conda env has Python 3.11 and the system-wide Python is also 3.11.

We tried the following combinations of system-wide and conda Python versions, but got the same error for all of them:

- System-wide: 3.9; Conda: 3.10
- System-wide: 3.9; Conda: 3.11
- System-wide: 3.11; Conda: 3.11

Do you have any clues?

My conda environment and CM are both freshly built, and I verified that they run manually in inference/text_to_image.

Here is my command to run cm:

cm run script --tags=run-mlperf,inference,_find-performance,_r4.1-dev,_short,_scc24-base \
   --model=sdxl \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=rocm  \
   --quiet \
    --precision=float16

Here is the error.

Using MLCommons Inference source from '/liqid/CM/repos/local/cache/6d9817f05ce14222/inference'
In find performance mode: using 1 as target_qps
Output Dir: '/liqid/CM/repos/local/cache/3611def590ed444b/test_results/aqua-reference-rocm-pytorch-v2.6.0.dev20241109-scc24-base/stable-diffusion-xl/offline/performance/run_1'
stable-diffusion-xl.Offline.target_qps = 0.05
stable-diffusion-xl.Offline.max_query_count = 10
stable-diffusion-xl.Offline.min_query_count = 10
stable-diffusion-xl.Offline.min_duration = 0
stable-diffusion-xl.Offline.sample_concatenate_permutation = 0

INFO:root:    * cm run script "get loadgen"
INFO:root:      * cm run script "detect os"
INFO:root:             ! cd /liqid/CM/repos/local/cache/1dce25751ee2430e
INFO:root:             ! call /liqid/CM/repos/mlcommons@cm4mlops/script/detect-os/run.sh from tmp-run.sh
INFO:root:             ! call "postprocess" from /liqid/CM/repos/mlcommons@cm4mlops/script/detect-os/customize.py
INFO:root:      * cm run script "get python3"
INFO:root:           ! load /liqid/CM/repos/local/cache/d6dd3766360c4a64/cm-cached-state.json
INFO:root:Path to Python: /liqid/miniconda3/envs/mlperf_1/bin/python3
INFO:root:Python version: 3.11.10
INFO:root:      * cm run script "get mlcommons inference src _branch.dev"
INFO:root:        * cm run script "detect os"
INFO:root:               ! cd /liqid/CM/repos/local/cache/01dee8c558b248e9
INFO:root:               ! call /liqid/CM/repos/mlcommons@cm4mlops/script/detect-os/run.sh from tmp-run.sh
INFO:root:               ! call "postprocess" from /liqid/CM/repos/mlcommons@cm4mlops/script/detect-os/customize.py
INFO:root:        * cm run script "get python3"
INFO:root:             ! load /liqid/CM/repos/local/cache/d6dd3766360c4a64/cm-cached-state.json
INFO:root:Path to Python: /liqid/miniconda3/envs/mlperf_1/bin/python3
INFO:root:Python version: 3.11.10

CM error: Conflicting branches between version assigned and user specified.!
(mlperf_1) [ziw081@aqua ~]$ python --version
Python 3.11.10
(mlperf_1) [ziw081@aqua ~]$ conda deactivate && conda deactivate
[ziw081@aqua ~]$ python --version
Python 3.11.7

Thank you so much for your patience and helping us out!

zixianwang2022 commented 2 weeks ago

Update: I tried removing all my cache and reran everything. However, I am still seeing the same error:

CM error: Conflicting branches between version assigned and user specified.!

Update: I tried updating my conda python env to 3.11.7, matching system-wide's python env. I removed all the CM cache, but still getting the same error.

arjunsuresh commented 2 weeks ago

Hi @zixianwang2022 Can you please do `cm pull repo` and retry? We had to use the `dev` branch of inference-src temporarily, as for SCC24 we need to dump 10 images during the accuracy run.
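
For context, the failing check appears to be a guard of roughly the following shape (a sketch inferred from the error message, not copied from cm4mlops): a pinned version such as r4.1-dev implies a fixed branch of the inference sources, and a user-specified `_branch.<name>` tag that disagrees is rejected rather than silently overridden.

```python
# Sketch of the guard implied by the error message; the function name and
# structure are illustrative, not the actual cm4mlops code.
def resolve_inference_src_branch(branch_from_version, branch_from_user):
    if branch_from_version and branch_from_user \
            and branch_from_version != branch_from_user:
        return {'return': 1,
                'error': 'Conflicting branches between version assigned '
                         'and user specified.!'}
    return {'return': 0, 'branch': branch_from_user or branch_from_version}
```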

zixianwang2022 commented 2 weeks ago

Hello @arjunsuresh, we are seeing a couple more errors after switching to the dev branch.

We ran

cm rm cache --tags=inference,src -f

Then we ran the following command to get a performance estimate. We are not seeing any errors from it:

# Performance Estimation for Offline Scenario
cm run script --tags=run-mlperf,inference,_find-performance,_r4.1-dev,_short,_scc24-base \
   --model=sdxl \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=rocm  \
   --quiet \
    --precision=float16

Then we ran the actual performance test with the following command:

# Formal run
# Official Implementation 
cm run script --tags=run-mlperf,inference,_r4.1-dev,_short,_scc24-base \
   --model=sdxl \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=rocm \
   --quiet --precision=float16

We are seeing the following error:

INFO:root:Path to Python: /liqid/miniconda3/envs/mlperf_2/bin/python3
INFO:root:Python version: 3.10.15
INFO:root:  * cm run script "get compiler"
INFO:root:       ! load /liqid/CM/repos/local/cache/fbd4dfd9d7d34750/cm-cached-state.json
INFO:root:  * cm run script "get generic-python-lib _package.dmiparser"
INFO:root:       ! load /liqid/CM/repos/local/cache/820dba1ebe2e4ff8/cm-cached-state.json
INFO:root:  * cm run script "get cache dir _name.mlperf-inference-sut-descriptions"
INFO:root:       ! load /liqid/CM/repos/local/cache/1855764319f64f80/cm-cached-state.json
Generating SUT description file for aqua-pytorch-2.6.0.dev20241110
INFO:root:       ! call "postprocess" from /liqid/CM/repos/mlcommons@cm4mlops/script/get-mlperf-inference-sut-description/customize.py
INFO:root:* cm run script "run accuracy mlperf _coco2014"
INFO:root:  * cm run script "get python3"
INFO:root:       ! load /liqid/CM/repos/local/cache/349b8dcd2512411a/cm-cached-state.json
INFO:root:Path to Python: /liqid/miniconda3/envs/mlperf_2/bin/python3
INFO:root:Python version: 3.10.15
INFO:root:  * cm run script "get mlcommons inference src"
INFO:root:       ! load /liqid/CM/repos/local/cache/005adf1880f0488d/cm-cached-state.json
INFO:root:  * cm run script "get dataset coco2014 original _size.50 _with-sample-ids"
INFO:root:       ! load /liqid/CM/repos/local/cache/1553754d81f94d36/cm-cached-state.json
INFO:root:  * cm run script "get generic-python-lib _package.ijson"
INFO:root:       ! load /liqid/CM/repos/local/cache/cae3dcb4aeb34f7c/cm-cached-state.json
INFO:root:  * cm run script "get generic-python-lib _package.numpy"
INFO:root:       ! load /liqid/CM/repos/local/cache/8ab7fa8c9c314ea1/cm-cached-state.json
INFO:root:       ! cd /liqid/CM/repos/local/cache/7ddeb00397da4c1e
INFO:root:       ! call /liqid/CM/repos/mlcommons@cm4mlops/script/process-mlperf-accuracy/run.sh from tmp-run.sh
/liqid/miniconda3/envs/mlperf_2/bin/python3 '/liqid/CM/repos/local/cache/b227a0eae6114495/inference/text_to_image/tools/accuracy_coco.py' --mlperf-accuracy-file '/liqid/CM/repos/local/cache/235eeac641f54534/test_results/aqua-reference-rocm-pytorch-v2.6.0.dev20241110-scc24-base/stable-diffusion-xl/offline/accuracy/mlperf_log_accuracy.json' --caption-path '/liqid/CM/repos/local/cache/b227a0eae6114495/inference/text_to_image/coco2014/captions/captions_source.tsv' --compliance-images-path /liqid/CM/repos/local/cache/235eeac641f54534/test_results/aqua-reference-rocm-pytorch-v2.6.0.dev20241110-scc24-base/stable-diffusion-xl/offline/accuracy/images  > '/liqid/CM/repos/local/cache/235eeac641f54534/test_results/aqua-reference-rocm-pytorch-v2.6.0.dev20241110-scc24-base/stable-diffusion-xl/offline/accuracy/accuracy.txt'
Traceback (most recent call last):
  File "/liqid/CM/repos/local/cache/b227a0eae6114495/inference/text_to_image/tools/accuracy_coco.py", line 17, in <module>
    from fid.fid_score import (
  File "/liqid/CM/repos/local/cache/b227a0eae6114495/inference/text_to_image/tools/fid/fid_score.py", line 47, in <module>
    sys.path.insert("..", 0)
TypeError: 'str' object cannot be interpreted as an integer

CM error: Portable CM script failed (name = process-mlperf-accuracy, return code = 256)

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that it is often a portability issue of a third-party tool or a native script
wrapped and unified by this CM script (automation recipe). Please re-run
this script with --repro flag and report this issue with the original
command line, cm-repro directory and full log here:

https://github.com/mlcommons/cm4mlops/issues

The CM concept is to collaboratively fix such issues inside portable CM scripts
to make existing tools and native scripts more portable, interoperable
and deterministic. Thank you!
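
For reference, the TypeError at the top of the trace comes from swapped arguments in fid_score.py: sys.path.insert expects the index first. A minimal sketch of the bug and the fix, assuming nothing else in that file changes:

```python
import sys

# As in the traceback: the string is passed where the integer index belongs,
# so Python raises TypeError: 'str' object cannot be interpreted as an integer.
# sys.path.insert("..", 0)

# Correct argument order: insert the parent directory at position 0.
sys.path.insert(0, "..")
```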

Please let me know if you need any more information and/or access to our cluster. Thank you so much for your help!

zixianwang2022 commented 2 weeks ago

Update:

It seems the issue was that installing loadgen removed our ROCm build of torch and installed the CUDA build instead. I have reset my environment and deleted all the previous Python and inference caches, and it now seems to be running.
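
For anyone hitting the same thing, a quick way to confirm which torch build is active (a small standalone check, nothing CM-specific):

```python
# Check whether the installed torch is the ROCm build or a CUDA wheel that
# silently replaced it during the loadgen installation.
import torch

print(torch.__version__)          # ROCm nightlies carry a +rocm suffix
print(torch.version.hip)          # version string on ROCm builds, None on CUDA builds
print(torch.version.cuda)         # version string on CUDA builds, None on ROCm builds
print(torch.cuda.is_available())  # ROCm devices also report through the cuda API
```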

However, I am getting a TypeError at the last step, which seems to be where the performance metrics are collected:

INFO:root:Python version: 3.11.7
INFO:root:  * cm run script "get compiler"
INFO:root:       ! load /liqid/CM/repos/local/cache/fbd4dfd9d7d34750/cm-cached-state.json
INFO:root:  * cm run script "get generic-python-lib _package.dmiparser"
INFO:root:       ! load /liqid/CM/repos/local/cache/dd4e67ac286f4fcf/cm-cached-state.json
INFO:root:  * cm run script "get cache dir _name.mlperf-inference-sut-descriptions"
INFO:root:       ! load /liqid/CM/repos/local/cache/1855764319f64f80/cm-cached-state.json
Generating SUT description file for aqua-pytorch-2.6.0.dev20241109
INFO:root:       ! call "postprocess" from /liqid/CM/repos/mlcommons@cm4mlops/script/get-mlperf-inference-sut-description/customize.py

SUT: aqua-reference-rocm-pytorch-v2.6.0.dev20241109-scc24-base, model: stable-diffusion-xl, scenario: Offline, target_qps updated as 0.470819
New config stored in /liqid/CM/repos/local/cache/7d7e33340faa4aaf/aqua/reference-implementation/rocm-device/pytorch-framework/framework-version-v2.6.0.dev20241109/scc24-base-config.yaml
Traceback (most recent call last):
  File "/liqid/miniconda3/envs/mlperf_1/bin/cm", line 8, in <module>
    sys.exit(run())
             ^^^^^
  File "/liqid/miniconda3/envs/mlperf_1/lib/python3.11/site-packages/cmind/cli.py", line 37, in run
    r = cm.access(argv, out='con')
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/liqid/miniconda3/envs/mlperf_1/lib/python3.11/site-packages/cmind/core.py", line 609, in access
    r = action_addr(i)
        ^^^^^^^^^^^^^^
  File "/liqid/CM/repos/mlcommons@cm4mlops/automation/script/module.py", line 212, in run
    r = self._run(i)
        ^^^^^^^^^^^^
  File "/liqid/CM/repos/mlcommons@cm4mlops/automation/script/module.py", line 1477, in _run
    r = customize_code.preprocess(ii)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/liqid/CM/repos/mlcommons@cm4mlops/script/run-mlperf-inference-app/customize.py", line 231, in preprocess
    r = cm.access(ii)
        ^^^^^^^^^^^^^
  File "/liqid/miniconda3/envs/mlperf_1/lib/python3.11/site-packages/cmind/core.py", line 1297, in access
    return cm.access(i)
           ^^^^^^^^^^^^
  File "/liqid/miniconda3/envs/mlperf_1/lib/python3.11/site-packages/cmind/core.py", line 609, in access
    r = action_addr(i)
        ^^^^^^^^^^^^^^
  File "/liqid/CM/repos/mlcommons@cm4mlops/automation/script/module.py", line 212, in run
    r = self._run(i)
        ^^^^^^^^^^^^
  File "/liqid/CM/repos/mlcommons@cm4mlops/automation/script/module.py", line 1552, in _run
    r = prepare_and_run_script_with_postprocessing(run_script_input)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/liqid/CM/repos/mlcommons@cm4mlops/automation/script/module.py", line 4737, in prepare_and_run_script_with_postprocessing
    rr = run_postprocess(customize_code, customize_common_input, recursion_spaces, env, state, const,
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/liqid/CM/repos/mlcommons@cm4mlops/automation/script/module.py", line 4787, in run_postprocess
    r = customize_code.postprocess(ii)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/liqid/CM/repos/mlcommons@cm4mlops/script/app-mlperf-inference/customize.py", line 233, in postprocess
    result, valid, power_result = mlperf_utils.get_result_from_log(env['CM_MLPERF_LAST_RELEASE'], model, scenario, output_dir, mode)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/liqid/CM/repos/mlcommons@cm4mlops/script/get-mlperf-inference-utils/mlperf_utils.py", line 23, in get_result_from_log
    result_ = checker.get_performance_metric(config, mlperf_model, result_path, scenario, None, None, has_power)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: get_performance_metric() takes 4 positional arguments but 7 were given
(mlperf_1) [ziw081@aqua ~]$ 
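
The traceback suggests a version skew between the caller in cm4mlops (mlperf_utils.py passes seven positional arguments) and the submission checker in the cached inference source (which still defines four parameters). A minimal reproduction of the mechanism, with stand-in names and values:

```python
# Only the argument counts are taken from the traceback; everything else
# here is a stand-in.
def get_performance_metric(config, model, path, scenario):
    # an older four-parameter signature of the checker
    return 0.0

try:
    # a newer caller passes seven positional arguments
    get_performance_metric({}, "stable-diffusion-xl", "run_1", "Offline",
                           None, None, False)
except TypeError as e:
    print(e)  # takes 4 positional arguments but 7 were given
```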

Here is the command I ran:

cm run script --tags=run-mlperf,inference,_r4.1-dev,_short,_scc24-base \
   --model=sdxl \
   --implementation=reference \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=rocm \
   --quiet --precision=float16    --env.CM_GET_PLATFORM_DETAILS=no
arjunsuresh commented 2 weeks ago

Hi @zixianwang2022, are you using the latest version of cm4mlops (after `cm pull repo`)? Because the latest code should be like this.

zixianwang2022 commented 2 weeks ago

Thanks @arjunsuresh ,

(mlperf_1) [ziw081@aqua ~]$ cm pull repo
=======================================================
Alias:    mlcommons@cm4mlops

Local path: /liqid/CM/repos/mlcommons@cm4mlops

git pull

Already up to date.

CM alias for this repository: mlcommons@cm4mlops
=======================================================

Reindexing all CM artifacts. Can take some time ...
Took 0.5 sec.
(mlperf_1) [ziw081@aqua ~]$ 

We have been using the dev branch as you described before.

(base) [ziw081@aqua mlcommons@cm4mlops]$ git branch
* dev
  main
  mlperf-inference

It looks a bit different from the mlperf-inference branch in your link above.

Should we use the mlperf-inference branch or the dev branch?

arjunsuresh commented 2 weeks ago

Yes, use the mlperf-inference branch only. The dev branch is for the "inference" repository, which is checked out automatically in the workflow.

zixianwang2022 commented 2 weeks ago

Hello @arjunsuresh ,

We are now able to run with the reference inference implementation. However, when we try to use our custom implementation, it errors out looking for a dev branch in our custom inference GitHub repository. We don't have a dev branch in our fork, and it seems to ignore the test branch I specified.

INFO:root:Path to Python: /liqid/miniconda3/envs/mlperf_1/bin/python3
INFO:root:Python version: 3.11.7
INFO:root:      * cm run script "get git repo _branch.dev _repo.https://github.com/zixianwang2022/mlperf-scc24"
INFO:root:        * cm run script "detect os"
INFO:root:               ! cd /liqid/CM/repos/local/cache/7c8b5fe21470423f
INFO:root:               ! call /liqid/CM/repos/mlcommons@cm4mlops/script/detect-os/run.sh from tmp-run.sh
INFO:root:               ! call "postprocess" from /liqid/CM/repos/mlcommons@cm4mlops/script/detect-os/customize.py
INFO:root:             ! cd /liqid/CM/repos/local/cache/7c8b5fe21470423f
INFO:root:             ! call /liqid/CM/repos/mlcommons@cm4mlops/script/get-git-repo/run.sh from tmp-run.sh
/liqid/CM/repos/local/cache/7c8b5fe21470423f
rm -rf inference
******************************************************
Current directory: /liqid/CM/repos/local/cache/7c8b5fe21470423f

Cloning mlperf-scc24 from https://github.com/zixianwang2022/mlperf-scc24

git clone  -b dev https://github.com/zixianwang2022/mlperf-scc24 --depth 5 inference

Cloning into 'inference'...
warning: Could not find remote branch dev to clone.
fatal: Remote branch dev not found in upstream origin
Cloning into 'inference'...
warning: Could not find remote branch dev to clone.
fatal: Remote branch dev not found in upstream origin

CM error: Portable CM script failed (name = get-git-repo, return code = 256)

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that it is often a portability issue of a third-party tool or a native script
wrapped and unified by this CM script (automation recipe). Please re-run
this script with --repro flag and report this issue with the original
command line, cm-repro directory and full log here:

https://github.com/mlcommons/cm4mlops/issues

The CM concept is to collaboratively fix such issues inside portable CM scripts
to make existing tools and native scripts more portable, interoperable
and deterministic. Thank you!
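
A quick way to see why the clone fails is to list the branches that actually exist on the fork (a standalone check; git must be on PATH):

```python
# List remote branches of the fork; 'dev' will be absent, which is exactly
# what makes `git clone -b dev ...` fail.
import subprocess

repo = "https://github.com/zixianwang2022/mlperf-scc24"
out = subprocess.run(["git", "ls-remote", "--heads", repo],
                     capture_output=True, text=True, check=True).stdout
print([line.rsplit("refs/heads/", 1)[1] for line in out.splitlines()])
```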

Command to run:

# Custom Implementation 
# Formal run
cm run script --tags=run-mlperf,inference,_r4.1-dev,_short,_scc24-base \
   --model=sdxl \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=rocm \
   --quiet --precision=float16 \
   --adr.mlperf-implementation.tags=_branch.test,_repo.https://github.com/zixianwang2022/mlperf-scc24 --adr.mlperf-implementation.version=custom  --env.CM_GET_PLATFORM_DETAILS=no
arjunsuresh commented 2 weeks ago

Hi @zixianwang2022 I just removed the "dev" branch in the workflow. But your forked branch must still be up to date with the "dev" branch of mlcommons/inference to pull in the latest changes. If so, you can just do `cm pull repo` now.

zixianwang2022 commented 2 weeks ago

Hi @arjunsuresh, I am not sure why, but it is still giving me the same error. I have deleted all my inference caches and done a `cm pull repo`. Do you have any clues?

(mlperf_1) [ziw081@aqua ~]$ cm rm cache --tags=inference,src -f

CM error: artifact(s) not found!
(mlperf_1) [ziw081@aqua ~]$ cm rm cache --tags=inference,src -f

CM error: artifact(s) not found!
(mlperf_1) [ziw081@aqua ~]$ cm rm cache --tags=inference -f

CM error: artifact(s) not found!
(mlperf_1) [ziw081@aqua ~]$ cm rm cache --tags=python -f

CM error: artifact(s) not found!
(mlperf_1) [ziw081@aqua ~]$ cm pull repo
=======================================================
Alias:    mlcommons@cm4mlops

Local path: /liqid/CM/repos/mlcommons@cm4mlops

git pull

Already up to date.

CM alias for this repository: mlcommons@cm4mlops
=======================================================

Reindexing all CM artifacts. Can take some time ...
Took 0.5 sec.
(mlperf_1) [ziw081@aqua ~]$ 
INFO:root:Path to Python: /liqid/miniconda3/envs/mlperf_1/bin/python3
INFO:root:Python version: 3.11.7
INFO:root:      * cm run script "get git repo _branch.dev _repo.https://github.com/zixianwang2022/mlperf-scc24"
INFO:root:        * cm run script "detect os"
INFO:root:               ! cd /liqid/CM/repos/local/cache/cd0cb819c5444147
INFO:root:               ! call /liqid/CM/repos/mlcommons@cm4mlops/script/detect-os/run.sh from tmp-run.sh
INFO:root:               ! call "postprocess" from /liqid/CM/repos/mlcommons@cm4mlops/script/detect-os/customize.py
INFO:root:             ! cd /liqid/CM/repos/local/cache/cd0cb819c5444147
INFO:root:             ! call /liqid/CM/repos/mlcommons@cm4mlops/script/get-git-repo/run.sh from tmp-run.sh
/liqid/CM/repos/local/cache/cd0cb819c5444147
rm -rf inference
******************************************************
Current directory: /liqid/CM/repos/local/cache/cd0cb819c5444147

Cloning mlperf-scc24 from https://github.com/zixianwang2022/mlperf-scc24

git clone  -b dev https://github.com/zixianwang2022/mlperf-scc24 --depth 5 inference

Cloning into 'inference'...
warning: Could not find remote branch dev to clone.
fatal: Remote branch dev not found in upstream origin
Cloning into 'inference'...
warning: Could not find remote branch dev to clone.
fatal: Remote branch dev not found in upstream origin

CM error: Portable CM script failed (name = get-git-repo, return code = 256)

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that it is often a portability issue of a third-party tool or a native script
wrapped and unified by this CM script (automation recipe). Please re-run
this script with --repro flag and report this issue with the original
command line, cm-repro directory and full log here:

https://github.com/mlcommons/cm4mlops/issues

The CM concept is to collaboratively fix such issues inside portable CM scripts
to make existing tools and native scripts more portable, interoperable
and deterministic. Thank you!

My run command:

cm run script --tags=run-mlperf,inference,_r4.1-dev,_short,_scc24-base \
   --model=sdxl \
   --framework=pytorch \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=rocm \
   --quiet --precision=float16 \
   --adr.mlperf-implementation.tags=_branch.test,_repo.https://github.com/zixianwang2022/mlperf-scc24 --adr.mlperf-implementation.version=custom  --env.CM_GET_PLATFORM_DETAILS=no
arjunsuresh commented 2 weeks ago

Was `cm pull repo` successful?

zixianwang2022 commented 2 weeks ago

Oh wait, I misread your message. `cm pull repo` was successful. Is it possible that it created a dev repo under my directory that is not the dev from the original source code?

I deleted all the git init files when I cloned it.

zixianwang2022 commented 2 weeks ago

Hello @arjunsuresh, it seems like it is still reading content from the dev branch of my repo instead of the test branch that I specified on the command line. Is that expected?

arjunsuresh commented 2 weeks ago

My bad - it was hardcoded in one more place. Can you please try now?

zixianwang2022 commented 2 weeks ago

We are able to run it now! @arjunsuresh

One last question: how do we specify the command line flags that need to be passed into our custom main.py?

arjunsuresh commented 2 weeks ago

That's great @zixianwang2022. The run command is populated here; you can modify it accordingly. Anything you pass via `--env.` will be available in the env dictionary inside customize.py.
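
For illustration, the pattern looks roughly like this. The variable CM_MY_SDXL_FLAG and the env key used to build the run command are hypothetical; only the env-dictionary access follows the customize.py hooks visible in the tracebacks above:

```python
# Sketch of a preprocess hook in an implementation's customize.py.
# CM_MY_SDXL_FLAG is a made-up variable passed as --env.CM_MY_SDXL_FLAG=value.
def preprocess(i):
    env = i['env']
    my_flag = env.get('CM_MY_SDXL_FLAG', '')
    if my_flag:
        # Forward it to the wrapped main.py via whatever run-command
        # variable the script builds (the key name here is illustrative).
        env['CM_RUN_CMD'] = env.get('CM_RUN_CMD', '') + f" --my-flag {my_flag}"
    return {'return': 0}
```

So a run would add e.g. `--env.CM_MY_SDXL_FLAG=8` to the cm run script command line.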

zixianwang2022 commented 2 weeks ago

Thank you Arjun!