mlcommons / cm4mlops

A collection of portable, reusable and cross-platform automation recipes (CM scripts) with a human-friendly interface and minimal dependencies, making it easier to build, run, benchmark and optimize AI, ML and other applications and systems across diverse and continuously changing models, datasets, software and hardware (cloud/edge).
http://docs.mlcommons.org/cm4mlops/
Apache License 2.0

FBGEMM version mismatch on ARM #304

Open ayanchak1508 opened 3 days ago

ayanchak1508 commented 3 days ago

I was trying to run the DLRMv2 benchmark of MLPerf Inference on an ARM server using the instructions here.

I run into an issue when the tool tries to install torchrec==0.3.2: torchrec==0.3.2 requires fbgemm-gpu==0.3.2, but fbgemm-gpu only introduced ARM support starting from v0.5.0: https://download.pytorch.org/whl/cpu/fbgemm-gpu/

I tried two alternate approaches:

  1. Build fbgemm-gpu v0.3.2 from source. This does not work because the build requires AVX-512 support, which ARM processors simply don't have.
  2. Use a newer version of fbgemm-gpu (v0.5.0 or above). The cm tool is inflexible here and keeps searching for v0.3.2.

Previously, I ran the benchmark on ARM without any problems (without using the cm tool) using newer versions of fbgemm-gpu. (Note that I also needed fbgemm-gpu-cpu.)
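
For anyone who wants to double-check which versions actually ship ARM wheels, pip's index subcommand (still marked experimental, but handy) can query the PyTorch CPU index directly; on aarch64 it should only list releases that have compatible wheels:

# lists the fbgemm-gpu versions resolvable from the PyTorch CPU index on this platform
pip index versions fbgemm-gpu --index-url https://download.pytorch.org/whl/cpu/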

Command to reproduce the issue:

cm run script --tags=run-mlperf,inference,_r4.1-dev --model=dlrm-v2-99.9 --implementation=reference --framework=pytorch --category=datacenter --scenario=Server --server_target_qps=10 --execution_mode=valid --device=cpu --quiet --repro

Error message:

ERROR: Could not find a version that satisfies the requirement fbgemm-gpu==0.3.2 (from versions: none)
ERROR: No matching distribution found for fbgemm-gpu==0.3.2

The repro folder and the logfile are present in the attached tarball: cm-repro.tar.gz

arjunsuresh commented 3 days ago

Hi @ayanchak1508 You can just remove the version requirement locally in this file, which should be under $HOME/repos/mlcommons@cm4mlops/script/:

https://github.com/GATEOverflow/cm4mlops/blob/mlperf-inference/script/app-mlperf-inference-mlcommons-python/_cm.yaml#L1129
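
The entry there is a standard CM python-lib dependency; from memory it looks roughly like this (paraphrased, not verbatim, so check the linked line for the exact text):

# in script/app-mlperf-inference-mlcommons-python/_cm.yaml (sketch from memory)
- tags: get,generic-python-lib,_package.fbgemm-gpu
  version: "0.3.2"   # removing this version pin lets pip resolve a newer release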

We never had success using a higher version of fbgemm with the available inference implementation. If you can share the exact versions which worked, we can test them.

ayanchak1508 commented 3 days ago

Thanks for the quick reply! Yes, indeed, after changing the version it seems to be working.

These are the versions (changed from the defaults) that work for me:

fbgemm_gpu==0.8.0+cpu
fbgemm_gpu-cpu==0.8.0
torch==2.4.0
torchrec==0.8.0

I have attached the full requirements.txt file in case it's needed: requirements.txt

I sometimes run into a bus error (core dumped) afterward, but that seems to be more of a memory-capacity issue unrelated to the toolchain/benchmark?

arjunsuresh commented 3 days ago

Thanks a lot @ayanchak1508. Let me check that. This issue might help with the bus error.

arjunsuresh commented 3 days ago

Yes, with pytorch 2.4 we could use fbgemm_gpu==0.8.0 and it worked fine. We have now removed the version dependency in the CM script. You can just do cm pull repo and the change should be picked up.

arjunsuresh commented 3 days ago

Just to add: ulimit=9999 was not enough to run 1000 inputs. I think it'll be incredibly hard to do a full run of 204800 inputs on CPUs using the current reference implementation.

ayanchak1508 commented 2 days ago

Thanks a lot for the quick updates!

I did a fresh, clean setup to see the effects. I have two observations:

  1. pip doesn't automatically know where to find fbgemm-gpu for ARM; it needs to be installed via pip install fbgemm-gpu --index-url https://download.pytorch.org/whl/cpu/ (see the requirements-file variant after this list)
  2. I actually ran into more dependency conflicts this time, and the benchmark started complaining about symbols it couldn't find inside modules (such as ModuleNotFoundError: No module named 'fbgemm_gpu.split_embedding_configs')
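
For item 1, an equivalent way to avoid the manual install step, relying on pip's support for index options inside requirements files, would be to put the override into the requirements file itself so that a plain pip install -r requirements.txt resolves the ARM wheel (a sketch, only lightly tested):

# in requirements.txt: keep PyPI as the primary index, add the PyTorch CPU index
--extra-index-url https://download.pytorch.org/whl/cpu/
fbgemm-gpu==0.8.0+cpu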

I'm not sure if I'm doing anything wrong, but if I create a new virtual environment and use the requirements file I posted earlier, the benchmark runs without problems. Maybe this is an ARM-specific problem?

Regarding the bus error problem, thank you again for the references. Is there any way to use the debug dataset or limit the max inputs, i.e., deviate from the official submission rules in any way? (of course I understand it wouldn't count as a valid submission, but I'm just interested in the model performance)

I guess one possible solution could be to edit the conf file manually, but is there a better way? (Sorry for bringing the bus error into this issue; we can move it to a separate issue if needed)
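
For concreteness, by editing the conf file I mean something like the following in user.conf; the key names are guessed from the loadgen conf format, so I'm not sure this is the intended mechanism:

# cap the run at 100 queries for a quick, non-valid performance probe
dlrm-v2.Server.min_query_count = 100
dlrm-v2.Server.min_duration = 0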

arjunsuresh commented 2 days ago

For 1, maybe the problem is with the .whl file?

"but if I create a new virtual environment and use the requirements file I posted earlier, the benchmark runs without problems."

Is it on the same ARM machine? If so, you can also try the venv for the CM flow, as follows:

cm run script --tags=install,python-venv --name=mlperf
export CM_SCRIPT_EXTRA_CMD="--adr.python.name=mlperf"

For the bus error - what's the available RAM on the system?

ayanchak1508 commented 2 days ago

Sorry, I should have been more specific. The runs are in a clean, empty Docker container (ubuntu:22.04) on an ARM server.

I created two python venvs (in the same container): one for installing packages through the CM-based flow and one for installing packages from the requirements file. I didn't use the command you mentioned, though; I simply created a normal python venv as described here: https://docs.mlcommons.org/inference/install/ and ran the CM commands for the benchmark there. Does the command you mentioned do something more?

For the bus error: the RAM is not too big, about 250 GB (the docker container has no resource constraints). I remember facing a similar problem some time back when I processed the dataset myself, and I had to move to a different machine with 512 GB RAM. So I understand it may not be big enough to run the entire dataset, but it should be fine at least for the debug dataset?

arjunsuresh commented 2 days ago

Thank you.

Yes, the commands are a bit different. CM is a python package, and when you use a venv for CM, CM itself gets installed in that venv. But when you run any workflow using CM, the flow can pick up any available python on the system unless we force one using "cm run script --tags=get,python" and making the appropriate selection. The command I shared is the safer option as long as the name used is new.

Coming to 256GB: it should be good enough. We have comfortably run DLRMv2 in full on 192GB. It even worked on 64GB, but we had to use a lot of swap space.

I believe your problem could be the shm size, since docker is used. Are you explicitly setting the shm size during docker run? We typically set a 32GB shm size for dlrm.
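
For example (the image name is just the ubuntu:22.04 one you mentioned; adapt to your actual run command):

# raise /dev/shm from Docker's 64MB default to 32GB
docker run --shm-size=32g -it ubuntu:22.04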

ayanchak1508 commented 2 days ago

Thank you very much for the clarification!

I did not set the shm size, and the default seems to be 64MB, much smaller than the 32GB you mentioned. I will try it out (both using the command you mentioned and increasing the shm size), and get back to you.

Thanks once again for all the quick help.

arjunsuresh commented 2 days ago

Sure @ayanchak1508 Just a correction to what I said earlier: the 64G system where we had run dlrmv2 used GPUs, not CPUs. On CPUs we could only do a test run of 10 inputs on 192G.

ayanchak1508 commented 2 days ago

Update:

  1. Increasing the shm size to 32G fixes the bus error, thank you! I can now run the benchmark, albeit at a very low qps.
  2. Using the CM venv flow as you described before doesn't help; it runs into the same problems:

ImportError: cannot import name 'DLRM_DCN' from 'torchrec.models.dlrm' (/root/CM/repos/local/cache/b1d060ef5c0c4217/mlperf/lib/python3.10/site-packages/torchrec/models/dlrm.py)
ModuleNotFoundError: No module named 'fbgemm_gpu.split_embedding_configs'

These are the packages it installs in the mlperf venv: current.txt. Doing a diff against the requirements file I posted before, and then manually installing the correct package versions in the mlperf venv, solves the problem:

pip install torch==2.4.0 torchrec==0.8.0
pip uninstall fbgemm-gpu
pip install fbgemm-gpu --index-url https://download.pytorch.org/whl/cpu/

I am not sure why I had to reinstall the same version of fbgemm-gpu, but without the reinstall it runs into the ModuleNotFoundError.
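
As a quick sanity check that a venv is consistent, importing the two symbols that were failing is enough (just a convenience one-liner, nothing benchmark-specific):

# succeeds only if the torch/torchrec/fbgemm_gpu combination is consistent
python -c "from torchrec.models.dlrm import DLRM_DCN; import fbgemm_gpu.split_embedding_configs; print('ok')"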