microsoft / RD-Agent

Research and development (R&D) is crucial for the enhancement of industrial productivity, especially in the AI era, where the core aspects of R&D are mainly focused on data and models. We are committed to automating these high-value generic R&D processes through our open source R&D automation tool RD-Agent, which lets AI drive data-driven AI.
https://rdagent.azurewebsites.net/
MIT License

RD-Agent fin_model fails on 1-2 GPU systems (w/ fix) #499

Open JIBSIL opened 1 day ago

JIBSIL commented 1 day ago

🐛 Bug Description

The default CUDA device ordinal used for RD-Agent model generation is 2, which does not exist on systems with fewer than three GPUs.

To Reproduce

Steps to reproduce the behavior:

  1. Use a system that is GPU-enabled and has fewer than three GPUs
  2. Run rdagent fin_model
  3. You will get an invalid device ordinal error
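
The error in step 3 follows from CUDA device ordinals being zero-based: a machine with N GPUs exposes ordinals 0 through N-1, so an ordinal of 2 needs at least three devices. A minimal sketch of that check is below. The function names are hypothetical and not part of RD-Agent; on a real machine the device count would come from something like `torch.cuda.device_count()`.

```python
def valid_device_ordinals(device_count: int) -> range:
    """CUDA ordinals are zero-based, so N GPUs expose ordinals 0..N-1."""
    return range(device_count)


def check_ordinal(ordinal: int, device_count: int) -> bool:
    """Return True if `ordinal` names a device that exists."""
    return ordinal in valid_device_ordinals(device_count)


# Compare a configured ordinal of 2 against an ordinal of 0
# on machines with one, two, and three GPUs.
for count in (1, 2, 3):
    print(f"{count} GPU(s): ordinal 2 valid={check_ordinal(2, count)}, "
          f"ordinal 0 valid={check_ordinal(0, count)}")
```

Ordinal 0 is valid on every GPU-enabled machine, which is why it is the safer default.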

Expected Behavior

RD-Agent should assume only that the user has at least one GPU, not a minimum of three. The default GPU index should therefore be 0.

Environment

Note: Users can run rdagent collect_info to get system information and paste it directly here.

(rdagent) E:\RD-Agent>rdagent collect_info
2024-11-28 15:48:52.150 | WARNING  | rdagent.oai.llm_utils:<module>:48 - llama is not installed.
2024-11-28 15:48:57.781 | INFO     | rdagent.app.utils.info:sys_info:22 - Name of current operating system: Windows
2024-11-28 15:48:57.789 | INFO     | rdagent.app.utils.info:sys_info:22 - Processor architecture: AMD64
2024-11-28 15:48:57.796 | INFO     | rdagent.app.utils.info:sys_info:22 - System, version, and hardware information: Windows-10-10.0.19045-SP0
2024-11-28 15:48:57.807 | INFO     | rdagent.app.utils.info:sys_info:22 - Version number of the system: 10.0.19045
2024-11-28 15:48:57.814 | INFO     | rdagent.app.utils.info:python_info:29 - Python version: 3.10.15 | packaged by Anaconda, Inc. | (main, Oct  3 2024, 07:22:19) [MSC v.1929 64 bit (AMD64)]
2024-11-28 15:48:58.056 | INFO     | rdagent.app.utils.info:docker_info:39 - Container ID: a33c4ca045838432aa7a8e0a299edbf8cc8d69b4848699412598bf4e3238b23f
2024-11-28 15:48:58.065 | INFO     | rdagent.app.utils.info:docker_info:40 - Container Name: kind_tharp
2024-11-28 15:48:58.085 | INFO     | rdagent.app.utils.info:docker_info:41 - Container Status: exited
2024-11-28 15:48:58.108 | INFO     | rdagent.app.utils.info:docker_info:42 - Image ID used by the container: sha256:f219da361b6969fea8ea5c3c8040db88ca8164bc7629def2f53c4197ca0ff2b9
2024-11-28 15:48:58.164 | INFO     | rdagent.app.utils.info:docker_info:43 - Image tag used by the container: ['local_qlib:latest']
2024-11-28 15:48:58.174 | INFO     | rdagent.app.utils.info:docker_info:44 - Container port mapping: {}
2024-11-28 15:48:58.186 | INFO     | rdagent.app.utils.info:docker_info:45 - Container Label: {'com.nvidia.volumes.needed': 'nvidia_driver', 'org.opencontainers.image.ref.name': 'ubuntu', 'org.opencontainers.image.version': '22.04'}
2024-11-28 15:48:58.449 | INFO     | rdagent.app.utils.info:docker_info:46 - Startup Commands: nvidia-smi
2024-11-28 15:48:58.591 | INFO     | rdagent.app.utils.info:rdagent_info:54 - RD-Agent version: 0.3.0
2024-11-28 15:48:59.777 | INFO     | rdagent.app.utils.info:rdagent_info:76 - Package version: ['pydantic-settings==2.6.1', 'python-Levenshtein==0.26.1', 'scikit-learn==1.5.2', 'filelock==3.16.1', 'loguru==0.7.2', 'fire==0.7.0', 'fuzzywuzzy==0.18.0', 'openai==1.55.3', 'numpy==1.26.4', 'pandas==2.2.3', 'pandarallel==1.6.5', 'matplotlib==3.9.2', 'langchain==0.3.9', 'langchain-community==0.3.8', 'tiktoken==0.8.0', 'pymupdf==1.24.14', 'pypdf==5.1.0', 'azure-ai-formrecognizer==3.3.3', 'tables==3.10.1', 'tree-sitter-python==0.23.4', 'tree-sitter==0.23.2', 'python-dotenv==1.0.1', 'docker==7.1.0', 'streamlit==1.40.2', 'plotly==5.24.1', 'st-theme==1.2.3', 'selenium==4.27.1', 'kaggle==1.6.17', 'nbformat==5.10.4', 'seaborn==0.13.2', 'setuptools-scm==8.1.0']

Additional Notes

Referenced issues: #442 #445

Fix

  1. Browse to git_ignore_folder/
  2. Find the subfolder that contains conf.yaml
  3. On line 65 of conf.yaml, change the value from 2 to 0

Referenced code: https://github.com/microsoft/RD-Agent/blob/main/rdagent/scenarios/qlib/experiment/model_template/conf.yaml#L65
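
For anyone who prefers to script the manual fix, here is a minimal sketch. It assumes the target line ends in `: 2`, and it refuses to edit a line that does not match, since the line number may shift between versions. The helper names and the `GPU` key in the sample text are hypothetical illustrations, not taken from RD-Agent.

```python
import re
from pathlib import Path


def patch_line(text: str, line_no: int, old: str, new: str) -> str:
    """Replace a trailing scalar `old` with `new` on 1-indexed line `line_no`.

    Raises ValueError if that line does not end in `: old`, so the wrong
    line is never edited silently.
    """
    lines = text.splitlines(keepends=True)
    if not 1 <= line_no <= len(lines):
        raise ValueError(f"file has no line {line_no}")
    pattern = re.compile(rf"(:\s*){re.escape(old)}(\s*)$")
    patched, count = pattern.subn(
        lambda m: m.group(1) + new + m.group(2), lines[line_no - 1]
    )
    if count != 1:
        raise ValueError(f"line {line_no} does not end in ': {old}'")
    lines[line_no - 1] = patched
    return "".join(lines)


def patch_file(path: Path, line_no: int = 65, old: str = "2", new: str = "0") -> None:
    """Apply patch_line to a file in place."""
    path.write_text(patch_line(path.read_text(), line_no, old, new))


# Demonstration on an in-memory sample (the key name is an assumption).
sample = "model:\n  kwargs:\n    GPU: 2\n"
print(patch_line(sample, 3, "2", "0"))

# Hypothetical usage against the generated workspace:
# for conf in Path("git_ignore_folder").rglob("conf.yaml"):
#     patch_file(conf)
```

Note that this only patches copies that already exist under git_ignore_folder/. The template referenced above would still need the same change for newly generated workspaces.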

I am reluctant to open a one-line pull request, but if you are a contributor, please clarify whether this is expected behaviour from RD-Agent.

SunsetWolf commented 1 day ago

Thank you for opening this issue. From the information you provided, you are running RD-Agent on Windows, but RD-Agent does not currently support Windows. The platform badge in the README explains this, so we recommend running RD-Agent on Linux.

JIBSIL commented 10 hours ago

Hi, I'm running the Dockerized version of RD-Agent (WSL2/Docker Desktop), though, so the environment the Python code runs in should be Linux-based.
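
One way to confirm which operating system the Python code actually sees is to run a short check inside the environment in question. The sketch below uses only the standard library. The function name is hypothetical, and the `/.dockerenv` test is a common heuristic for detecting a Docker container, not a guarantee.

```python
import platform
from pathlib import Path


def describe_runtime() -> dict:
    """Report the OS as seen by this Python process."""
    return {
        "system": platform.system(),    # e.g. "Linux" or "Windows"
        "release": platform.release(),  # kernel or OS release string
        # Heuristic: Docker usually creates this file inside containers.
        "in_docker": Path("/.dockerenv").exists(),
    }


print(describe_runtime())
```

Running this both on the host and inside the container would show whether the two environments report different systems.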