modelscope / data-juicer

A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️ 🍸 🍹 🍷
Apache License 2.0

Unable to process data with ray executor on Kubernetes Ray Cluster: #345

Closed Fatima-0SA closed 2 weeks ago

Fatima-0SA commented 1 month ago

Before Reporting

Search before reporting

OS

Ubuntu

Installation Method

from source, for distributed processing

Data-Juicer Version

v0.2.0

Python Version

3.10.14

Describe the bug

I have deployed a RayCluster on Kubernetes within a VM (Ray version 2.31.0) and forwarded the head node's ports.

I am trying to run the process_on_ray demo from a local Docker container that has data-juicer installed. For processing on Ray, I have only changed the Ray address from 'auto' to ray_address: 'ray://host.docker.internal:10001' and reduced the processing steps in the demo to include only average_line_length_filter.

Running the above, the connection initializes successfully and I can see the jobs scheduled in the Ray dashboard. However, when the executor reads the JSON file as a dataset and tries to get the dataset's columns, I face RuntimeError: Global node is not initialized. The same happens for any kind of processing step I want to perform on the dataset object.
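For context, the point of failure corresponds roughly to the following, run over the Ray Client connection (a minimal sketch approximating what the executor does while loading the dataset, not the actual data-juicer code):

import ray
import ray.data as rd

# Connect through Ray Client, as in the demo config.
ray.init('ray://host.docker.internal:10001')

# Reading the demo JSON as a Ray dataset and listing its columns is where
# the RuntimeError shows up.
dataset = rd.read_json('./data_juicer/demos/process_on_ray/data/demo-dataset.json')
print(dataset.schema().names)  # -> RuntimeError: Global node is not initialized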

Any insight into the global node initialization error? Or is there any specific configuration to be done when deploying the RayCluster?

To Reproduce

  1. Deploy a RayCluster on Kubernetes
  2. Forward the head pod's port 10001
  3. Create a Docker container that has data-juicer installed
  4. Change the process_on_ray configuration to match the config below
  5. From the Docker container, run: python data_juicer/tools/process_data.py --config data_juicer/demos/process_on_ray/configs/demo.yaml

Configs

# Process config example for dataset

# global parameters
project_name: 'ray-demo'
executor_type: 'ray'
dataset_path: './data_juicer/demos/process_on_ray/data/demo-dataset.json'  # path to your dataset directory or file
ray_address: 'ray://host.docker.internal:10001'  # change to your ray cluster address, e.g., ray://<hostname>:<port>
export_path: './outputs/demo/demo-processed'

# process schedule
# a list of several process operators with their arguments
process:
  - average_line_length_filter:
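For reference, the demo entry point consumes this config roughly as follows (a simplified sketch; I'm assuming data_juicer.config.init_configs and data_juicer.core.ray_executor.RayExecutor here, and the actual names in v0.2.0 may differ slightly):

# Simplified sketch of what tools/process_data.py is expected to do with this
# config; assumed API, not verified against v0.2.0.
from data_juicer.config import init_configs
from data_juicer.core.ray_executor import RayExecutor

cfg = init_configs()             # parses --config demo.yaml plus CLI overrides
if cfg.executor_type == 'ray':
    executor = RayExecutor(cfg)  # connects to ray_address and loads dataset_path
    executor.run()               # applies the ops under `process`, writes export_path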

Logs

No response

Screenshots

[five screenshots attached]

Additional

No response

drcege commented 1 month ago

@pan-x-c Perhaps related https://github.com/ray-project/ray/issues/41333

Fatima-0SA commented 1 month ago

When I downgraded to ray==2.10.0 (to match the version specified in /data_juicer/environments/dist_requires.txt), I started to receive another error related to GCS:

2024-07-09 10:25:49 | INFO     | data_juicer.core.ray_executor:113 - Loading dataset with Ray...
[2024-07-09 10:26:05,833 E 4398 4398] gcs_rpc_client.h:212: Failed to connect to GCS at address XX.XXX.X.X:6379 within 5 seconds.
[2024-07-09 10:27:00,872 E 4398 4519] gcs_rpc_client.h:554: Failed to connect to GCS within 60 seconds. GCS may have been killed. It's either GCS is terminated by `ray stop` or is killed unexpectedly. If it is killed unexpectedly, see the log file gcs_server.out. https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure. The program will terminate.
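Side note: Ray Client generally requires the client and the cluster to run matching Ray and Python versions, so a 2.10.0 client against a 2.31.0 cluster may fail on its own. A minimal way to compare the two sides (a sketch, reusing the same ray:// address as above):

import sys
import ray

ray.init('ray://host.docker.internal:10001')

# Versions on the client side (inside the Docker container).
print('client :', ray.__version__, sys.version.split()[0])

# Versions actually running on the cluster.
@ray.remote
def versions():
    import sys
    import ray
    return ray.__version__, sys.version.split()[0]

print('cluster:', ray.get(versions.remote()))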
pan-x-c commented 1 month ago

It seems to be caused by a Ray configuration error in Kubernetes (e.g., the XX.XXX.X.X address). Please check the ray_address field in your YAML configuration. I can't reproduce this problem on a regular cluster.

Fatima-0SA commented 1 month ago

> It seems to be caused by a Ray configuration error in Kubernetes (e.g., the XX.XXX.X.X address). Please check the ray_address field in your YAML configuration. I can't reproduce this problem on a regular cluster.

The XX.XXX.X.X is the local node IP of the head node, which can be used when submitting jobs with the Ray Jobs CLI. However, for the process_on_ray demo I am submitting jobs with Ray Client by specifying ray_address: 'ray://host.docker.internal:10001' in the YAML configuration.

I have tried running simple scripts for reading/writing data on the cluster with the same Ray address, and that works. But I would like to use data-juicer's ray_executor.py for the extra data processing it provides.

pan-x-c commented 1 month ago

Does this error only occur in the Ray mode of data-juicer? You can try the following code to check your ray cluster.

import ray
ray.init('ray://host.docker.internal:10001')

If the error still occurs, something is wrong with your Ray cluster. You may need to seek help from the Ray community.

Fatima-0SA commented 1 month ago

> Does this error only occur in the Ray mode of data-juicer? You can try the following code to check your ray cluster.
>
> import ray
> ray.init('ray://host.docker.internal:10001')
>
> If the error still occurs, something is wrong with your Ray cluster. You may need to seek help from the Ray community.

Yes, I can confirm this works fine with my Ray cluster; the problem arises only in the Ray mode of data-juicer.

pan-x-c commented 1 month ago

The error occurs when loading the dataset, can you try the following code?

import ray
import ray.data as rd

ray.init('ray://host.docker.internal:10001')
dataset = rd.read_json(<your dataset path>)

Fatima-0SA commented 1 month ago

> The error occurs when loading the dataset, can you try the following code?
>
> import ray
> import ray.data as rd
>
> ray.init('ray://host.docker.internal:10001')
> dataset = rd.read_json(<your dataset path>)

This causes a file-not-found error (the path is not visible to the RayCluster), since the dataset path is local to the Docker container from which I am submitting jobs to the RayCluster. Please see below: [screenshot attached]

So instead, I am adding the runtime_env argument to upload my local working directory to the head node, as well as adding the @ray.remote decorator right above the read_json() and write_json() functions:

import ray
import ray.data as rd
from loguru import logger

logger.info('Initiating RayCluster Connection:')
# Upload the local working directory (which contains the dataset) to the cluster.
ray.init('ray://host.docker.internal:10001',
         runtime_env={"working_dir": "/opt/airflow/dags"})

@ray.remote
def read_json():
    # Runs on the cluster, so the uploaded working directory is visible there.
    dataset = rd.read_json('demo-dataset.jsonl')
    return dataset

@ray.remote
def write_json(dataset):
    # Write to the local filesystem of the node executing this task.
    dataset.write_json('local:///tmp/processed_data')

logger.info('Trying to Read JSON Remotely...')
json_data = read_json.remote()  # returns an ObjectRef immediately
logger.info('Completed Reading JSON!')

logger.info('Trying to Write JSON Remotely...')
# ray.get blocks until the remote write has actually finished.
ray.get(write_json.remote(json_data))
logger.info('Completed Writing JSON!')

Output can be seen on the Ray head node under /tmp/processed_data: [screenshot attached]

pan-x-c commented 1 month ago

Currently, data-juicer's Ray mode requires the data to be stored either on a shared file system (in that case, use the path directly) or at the same path on all machines (in that case, prefix the path with local://).
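Concretely, in Ray Data terms (a sketch with hypothetical paths; the dataset_path in the YAML takes the same two forms):

import ray
import ray.data as rd

ray.init('ray://host.docker.internal:10001')

# Case 1: the dataset lives on a shared filesystem mounted identically on all
# nodes -- pass the plain path (hypothetical mount point).
ds = rd.read_json('/mnt/shared/demo-dataset.jsonl')

# Case 2: the file has been copied to the same local path on every node --
# prefix the path with local:// (hypothetical path).
ds = rd.read_json('local:///data/demo-dataset.jsonl')

print(ds.schema().names)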

Fatima-0SA commented 1 month ago

Reading a local file with "local://" added to the path is not working properly; I keep facing a FileNotFound error, even though I copied the file to the same path on all of the cluster pods plus the scheduling container. Passing the data path without "local://" raises the same error I opened the issue with: RuntimeError: Global node is not initialized.

Similarly, the global node initialization error is encountered when trying to read from shared storage (Azure Blob Storage); full log for your reference: [screenshot attached]

pan-x-c commented 1 month ago

Since I can't reproduce the error, I can only make the corresponding modifications based on the code above. You can try the code in PR #348 to confirm whether the problem has been solved.

Fatima-0SA commented 1 month ago

Thanks @pan-x-c, I have checked out PR #348, but I am still receiving the same exception: [screenshots attached]

github-actions[bot] commented 2 weeks ago

This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add new comments, otherwise this issue will be closed in 3 days.

github-actions[bot] commented 2 weeks ago

Close this stale issue.