mlfoundations / dclm

DataComp for Language Models
MIT License

Unable to ray up (part 2) #79

Closed: tonychenxyz closed this issue 1 month ago

tonychenxyz commented 1 month ago

Previously, in issue #69, I was able to run ray up with the following config YAML:

cluster_name: my-cluster
min_workers: 1
max_workers: 10
upscaling_speed: 1.0
docker:
  image: "rayproject/ray:latest"
  container_name: "ray_container"
  pull_before_run: True
setup_commands:
    - wget https://repo.anaconda.com/miniconda/Miniconda3-py310_23.3.1-0-Linux-x86_64.sh -O miniconda.sh
    - bash ~/miniconda.sh -f -b -p /tmp/miniconda3/
    - echo 'export PATH="/tmp/miniconda3/bin/:$PATH"' >> ~/.bashrc
    - pip install --upgrade pip setuptools wheel
    - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp310-cp310-manylinux2014_x86_64.whl"
    - pip install boto3==1.26.90
    - pip install s3fs==2022.11.0
    - pip install psutil
    - pip install pyarrow
    - pip install 'pandas==2.1.4'
    - pip install git+https://github.com/mlfoundations/open_lm.git
    - git clone https://github.com/mlfoundations/dclm.git
provider:
    type: aws
    region: us-west-2
    cache_stopped_nodes: False

But now, running ray up with the same config gives the following error:

Collecting torchmetrics<0.10.0,>=0.7.0 (from mosaicml->open_lm==0.0.34)
  Using cached torchmetrics-0.9.3-py3-none-any.whl.metadata (17 kB)
Collecting mosaicml (from open_lm==0.0.34)
  Using cached mosaicml-0.12.0-py3-none-any.whl.metadata (27 kB)
WARNING: Ignoring version 0.12.0 of mosaicml since it has invalid metadata:
Requested mosaicml from https://files.pythonhosted.org/packages/dc/3a/a36f940ca092403079579726f2bc8df9c0969e0840472ec091b6b2999f32/mosaicml-0.12.0-py3-none-any.whl (from open_lm==0.0.34) has invalid metadata: .* suffix can only be used with `==` or `!=` operators
    mosaicml-streaming (<0.3.*) ; extra == 'all'
                        ~~~~~^
Please use pip<24.1 if you need to use this version.
  Using cached mosaicml-0.11.1-py3-none-any.whl.metadata (27 kB)
WARNING: Ignoring version 0.11.1 of mosaicml since it has invalid metadata:
Requested mosaicml from https://files.pythonhosted.org/packages/76/d8/c9a0fef6d3afd1ece0513b35eeb11af4a8fb546323f653a680c2bb886e95/mosaicml-0.11.1-py3-none-any.whl (from open_lm==0.0.34) has invalid metadata: .* suffix can only be used with `==` or `!=` operators
    mosaicml-streaming (<0.2.*) ; extra == 'all'
                        ~~~~~^
Please use pip<24.1 if you need to use this version.
  Using cached mosaicml-0.11.0-py3-none-any.whl.metadata (27 kB)
WARNING: Ignoring version 0.11.0 of mosaicml since it has invalid metadata:
Requested mosaicml from https://files.pythonhosted.org/packages/8c/12/391990a20e8eefa280a0692be7a7e4f3c281c9fe1ecf0c9566400db4af31/mosaicml-0.11.0-py3-none-any.whl (from open_lm==0.0.34) has invalid metadata: .* suffix can only be used with `==` or `!=` operators
    mosaicml-streaming (<0.2.*) ; extra == 'all'
                        ~~~~~^
Please use pip<24.1 if you need to use this version.
  Using cached mosaicml-0.10.1-py3-none-any.whl.metadata (27 kB)
Collecting torchmetrics<0.8,>=0.7.0 (from mosaicml->open_lm==0.0.34)
  Using cached torchmetrics-0.7.3-py3-none-any.whl.metadata (20 kB)
Collecting mosaicml (from open_lm==0.0.34)
  Using cached mosaicml-0.10.0-py3-none-any.whl.metadata (27 kB)
  Using cached mosaicml-0.9.0-py3-none-any.whl.metadata (27 kB)
Collecting torch-optimizer<0.2,>=0.1.0 (from mosaicml->open_lm==0.0.34)
  Using cached torch_optimizer-0.1.0-py3-none-any.whl.metadata (53 kB)
Collecting mosaicml (from open_lm==0.0.34)
  Using cached mosaicml-0.8.2-py3-none-any.whl.metadata (27 kB)
  Using cached mosaicml-0.8.1-py3-none-any.whl.metadata (27 kB)
  Using cached mosaicml-0.8.0-py3-none-any.whl.metadata (27 kB)
  Using cached mosaicml-0.7.1-py3-none-any.whl.metadata (26 kB)
Collecting pyyaml>=5.1 (from datasets->open_lm==0.0.34)
  Using cached PyYAML-5.4.1.tar.gz (175 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error

  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [48 lines of output]
      running egg_info
      writing lib3/PyYAML.egg-info/PKG-INFO
      writing dependency_links to lib3/PyYAML.egg-info/dependency_links.txt
      writing top-level names to lib3/PyYAML.egg-info/top_level.txt
      Traceback (most recent call last):
        File "/tmp/miniconda3/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
          main()
        File "/tmp/miniconda3/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
        File "/tmp/miniconda3/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
          return hook(config_settings)
        File "/tmp/pip-build-env-8u2pfflm/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 332, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=[])
        File "/tmp/pip-build-env-8u2pfflm/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 302, in _get_build_requires
          self.run_setup()
        File "/tmp/pip-build-env-8u2pfflm/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 318, in run_setup
          exec(code, locals())
        File "<string>", line 271, in <module>
        File "/tmp/pip-build-env-8u2pfflm/overlay/lib/python3.10/site-packages/setuptools/__init__.py", line 117, in setup
          return distutils.core.setup(**attrs)
        File "/tmp/pip-build-env-8u2pfflm/overlay/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 183, in setup
          return run_commands(dist)
        File "/tmp/pip-build-env-8u2pfflm/overlay/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 199, in run_commands
          dist.run_commands()
        File "/tmp/pip-build-env-8u2pfflm/overlay/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 954, in run_commands
          self.run_command(cmd)
        File "/tmp/pip-build-env-8u2pfflm/overlay/lib/python3.10/site-packages/setuptools/dist.py", line 950, in run_command
          super().run_command(command)
        File "/tmp/pip-build-env-8u2pfflm/overlay/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 973, in run_command
          cmd_obj.run()
        File "/tmp/pip-build-env-8u2pfflm/overlay/lib/python3.10/site-packages/setuptools/command/egg_info.py", line 311, in run
          self.find_sources()
        File "/tmp/pip-build-env-8u2pfflm/overlay/lib/python3.10/site-packages/setuptools/command/egg_info.py", line 319, in find_sources
          mm.run()
        File "/tmp/pip-build-env-8u2pfflm/overlay/lib/python3.10/site-packages/setuptools/command/egg_info.py", line 540, in run
          self.add_defaults()
        File "/tmp/pip-build-env-8u2pfflm/overlay/lib/python3.10/site-packages/setuptools/command/egg_info.py", line 578, in add_defaults
          sdist.add_defaults(self)
        File "/tmp/pip-build-env-8u2pfflm/overlay/lib/python3.10/site-packages/setuptools/command/sdist.py", line 108, in add_defaults
          super().add_defaults()
        File "/tmp/pip-build-env-8u2pfflm/overlay/lib/python3.10/site-packages/setuptools/_distutils/command/sdist.py", line 238, in add_defaults
          self._add_defaults_ext()
        File "/tmp/pip-build-env-8u2pfflm/overlay/lib/python3.10/site-packages/setuptools/_distutils/command/sdist.py", line 323, in _add_defaults_ext
          self.filelist.extend(build_ext.get_source_files())
        File "<string>", line 201, in get_source_files
        File "/tmp/pip-build-env-8u2pfflm/overlay/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 107, in __getattr__
          raise AttributeError(attr)
      AttributeError: cython_sources
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
Shared connection to 35.82.33.80 closed.
  New status: update-failed
  !!!
  Full traceback: Traceback (most recent call last):
  File "/shared/share_mala/conda_envs/dclm/lib/python3.10/site-packages/ray/autoscaler/_private/updater.py", line 166, in run
    self.do_update()
  File "/shared/share_mala/conda_envs/dclm/lib/python3.10/site-packages/ray/autoscaler/_private/updater.py", line 490, in do_update
    self.cmd_runner.run(cmd, run_env="auto")
  File "/shared/share_mala/conda_envs/dclm/lib/python3.10/site-packages/ray/autoscaler/_private/command_runner.py", line 493, in run
    return self.ssh_command_runner.run(
  File "/shared/share_mala/conda_envs/dclm/lib/python3.10/site-packages/ray/autoscaler/_private/command_runner.py", line 383, in run
    return self._run_helper(final_cmd, with_output, exit_on_fail, silent=silent)
  File "/shared/share_mala/conda_envs/dclm/lib/python3.10/site-packages/ray/autoscaler/_private/command_runner.py", line 298, in _run_helper
    raise click.ClickException(fail_msg) from None
click.exceptions.ClickException: SSH command failed.

  Error message: SSH command failed.
  !!!

  Failed to setup head node.
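
The repeated pip warnings in the log above name one possible workaround themselves: "Please use pip<24.1 if you need to use this version." A minimal sketch of what that could look like in the setup_commands of the config above, assuming the only change is to pin pip when upgrading it. Whether this alone avoids the PyYAML 5.4.1 source build is an assumption and is not confirmed anywhere in this thread.

```yaml
setup_commands:
    # ... Miniconda install and PATH export unchanged ...
    # Assumption: keep pip below 24.1, as the warning suggests, so that the
    # older mosaicml metadata ("<0.3.*") is not rejected during resolution.
    - pip install --upgrade "pip<24.1" setuptools wheel
    # ... remaining pip install / git clone commands unchanged ...
```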

andrewsiah commented 1 month ago

Hi there, likewise I got the error above.

I think it can be traced to a PyYAML issue: https://github.com/yaml/pyyaml/issues/724

A related suggestion on Stack Overflow is to update awscli, since it has a PyYAML issue: https://stackoverflow.com/questions/76868274/build-failed-with-aws-ebcli-on-python-3-11-4

https://github.com/aws/aws-cli/issues/8036#issuecomment-1638544754

But that didn't fix things for me.

Here is my ray config file:

cluster_name: andrew2-cluster
min_workers: 1
max_workers: 10
upscaling_speed: 1.0
docker:
  image: "rayproject/ray:latest"
  container_name: "ray_container"
  pull_before_run: True
setup_commands:
  - sudo apt update
  - sudo apt install cmake build-essential
  - sudo apt install g++-9
  - sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-9 90
  - wget https://repo.anaconda.com/miniconda/Miniconda3-py310_23.3.1-0-Linux-x86_64.sh -O miniconda.sh
  - bash ~/miniconda.sh -f -b -p /tmp/miniconda3/
  - echo 'export PATH="/tmp/miniconda3/bin/:$PATH"' >> ~/.bashrc
  - pip install --upgrade pip setuptools wheel
  - pip install --force-reinstall -v "PyYAML==6.0.1" --no-build-isolation
  - pip install awscli --no-build-isolation
  - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp310-cp310-manylinux2014_x86_64.whl"
  - pip install boto3==1.26.90
  - pip install s3fs==2022.11.0
  - pip install psutil
  - pip install pyarrow
  - pip install 'pandas==2.1.4'
  - pip install fasttext
  - pip install git+https://github.com/mlfoundations/open_lm.git
  - git clone https://github.com/mlfoundations/dclm.git
provider:
    type: aws
    region: us-west-2
    cache_stopped_nodes: False
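
For reference, the workaround most often quoted in the PyYAML issue linked above is to install a Cython release older than 3.0 and then build PyYAML 5.4.1 without build isolation, since the "AttributeError: cython_sources" failure comes from building that old source distribution. A rough sketch of how that might be added to the setup_commands above; the pins below are assumptions based on that discussion and have not been tested on this cluster.

```yaml
setup_commands:
  # ... apt and Miniconda commands unchanged ...
  - pip install --upgrade pip setuptools wheel
  # Assumption: pre-install Cython < 3 so the PyYAML 5.4.1 sdist can build,
  # then build it against the already-installed packages.
  - pip install "cython<3.0.0" wheel
  - pip install "pyyaml==5.4.1" --no-build-isolation
  # ... remaining pip install / git clone commands unchanged ...
```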

GeorgiosSmyrnis commented 1 month ago

Hi @tonychenxyz, @andrewsiah,

I looked into this and made some modifications to the YAML file, and I have a few variants in which the packages install properly. Here is the config that I used. Can you try it after making the account-specific edits that I marked in the comments?

cluster_name: test-processing
max_workers: 2
upscaling_speed: 1.0
available_node_types:
    ray.head.default:
        resources: {}
        node_config:
            ImageId: ami-0c5cce1d70efb41f5
            InstanceType: i4i.4xlarge
            IamInstanceProfile:
                # Replace 000000000000 with your 12-digit AWS account ID
                Arn: arn:aws:iam::000000000000:instance-profile/ray-autoscaler-v1
    ray.worker.default:
        min_workers: 2
        max_workers: 2
        node_config:
            ImageId: ami-0c5cce1d70efb41f5
            InstanceType: i4i.4xlarge
            IamInstanceProfile:
                # Replace 000000000000 with your 12-digit AWS account ID
                Arn: arn:aws:iam::000000000000:instance-profile/ray-autoscaler-v1

provider:
    type: aws
    region: us-west-2
    cache_stopped_nodes: False

setup_commands:
    - sudo mkfs -t xfs /dev/nvme1n1
    - sudo mount /dev/nvme1n1 /tmp
    - sudo chown -R $USER /tmp
    - sudo chmod -R 777 /tmp
    - wget https://repo.anaconda.com/miniconda/Miniconda3-py310_23.3.1-0-Linux-x86_64.sh -O miniconda.sh
    - bash ~/miniconda.sh -f -b -p /tmp/miniconda3/
    - echo 'export PATH="/tmp/miniconda3/bin/:$PATH"' >> ~/.bashrc
    # Include your AWS CREDS here
    - echo 'export AWS_ACCESS_KEY_ID=' >> ~/.bashrc
    - echo 'export AWS_SECRET_ACCESS_KEY=' >> ~/.bashrc
    - pip install --upgrade pip setuptools wheel
    - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp310-cp310-manylinux2014_x86_64.whl"
    - pip install boto3==1.26.90
    - pip install s3fs==2022.11.0
    - pip install psutil
    - pip install pysimdjson
    - pip install pyarrow
    - git clone https://github.com/mlfoundations/dclm.git
    - pip install -r dclm/requirements.txt
    - cd dclm && python3 setup.py install

jeffreywpli commented 1 month ago

Hi @tonychenxyz, @andrewsiah, just checking in: were you able to resolve your issue?

andrewsiah commented 1 month ago

Hey, yep, thanks for the help!