modelscope / data-juicer

A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️ 🍸 🍹 🍷
Apache License 2.0

Unable to process data with ray executor on Kubernetes Ray Cluster: #345

Closed Fatima-0SA closed 2 weeks ago

Fatima-0SA commented 1 month ago

Before Reporting

Search before reporting

OS

Ubuntu

Installation Method

from source, for distributed processing

Data-Juicer Version

v0.2.0

Python Version

3.10.14

Describe the bug

I have deployed a RayCluster on Kubernetes within a VM (Ray version 2.31.0) and forwarded the head node's ports.

I am trying to run the process_on_ray demo from a local Docker container that has data-juicer installed. For processing on Ray, I have only changed the Ray address from 'auto' to ray_address: 'ray://host.docker.internal:10001' and reduced the processing steps in the demo to include only average_line_length_filter.

Running the above, the connection initializes successfully and I can see the jobs scheduled in the Ray dashboard. However, when the executor reads the JSON file as a dataset and tries to get the dataset's columns, I face RuntimeError: Global node is not initialized. The same happens for any kind of processing step I want to perform on the dataset object.
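For context, the point of failure corresponds roughly to the following, run over the Ray Client connection (a minimal sketch approximating what the executor does while loading the dataset, not the actual data-juicer code):

import ray
import ray.data as rd

# Connect through Ray Client, as in the demo config.
ray.init('ray://host.docker.internal:10001')

# Reading the demo JSON as a Ray dataset and listing its columns is where
# the RuntimeError shows up.
dataset = rd.read_json('./data_juicer/demos/process_on_ray/data/demo-dataset.json')
print(dataset.schema().names)  # -> RuntimeError: Global node is not initialized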

Any insight into the global node initialization error? Or is there any specific configuration to be done when deploying the RayCluster?

To Reproduce

  1. Deploy a RayCluster on Kubernetes
  2. Forward the head pod's port 10001
  3. Create a Docker container that has data-juicer installed
  4. Change the process_on_ray configuration to match the config below
  5. From the Docker container, run: python data_juicer/tools/process_data.py --config data_juicer/demos/process_on_ray/configs/demo.yaml

Configs

# Process config example for dataset

# global parameters
project_name: 'ray-demo'
executor_type: 'ray'
dataset_path: './data_juicer/demos/process_on_ray/data/demo-dataset.json'  # path to your dataset directory or file
ray_address: 'ray://host.docker.internal:10001'  # change to your ray cluster address, e.g., ray://<hostname>:<port>
export_path: './outputs/demo/demo-processed'

# process schedule
# a list of several process operators with their arguments
process:
  - average_line_length_filter:
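For reference, the demo entry point consumes this config roughly as follows (a simplified sketch; I'm assuming data_juicer.config.init_configs and data_juicer.core.ray_executor.RayExecutor here, and the actual names in v0.2.0 may differ slightly):

# Simplified sketch of what tools/process_data.py is expected to do with this
# config; assumed API, not verified against v0.2.0.
from data_juicer.config import init_configs
from data_juicer.core.ray_executor import RayExecutor

cfg = init_configs()             # parses --config demo.yaml plus CLI overrides
if cfg.executor_type == 'ray':
    executor = RayExecutor(cfg)  # connects to ray_address and loads dataset_path
    executor.run()               # applies the ops under `process`, writes export_path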

Logs

No response

Screenshots

[five screenshots attached]

Additional

No response

drcege commented 1 month ago

@pan-x-c Perhaps related https://github.com/ray-project/ray/issues/41333

Fatima-0SA commented 1 month ago

When I downgraded to ray==2.10.0 (to match the version specified in /data_juicer/environments/dist_requires.txt), I started to receive another error related to GCS:

2024-07-09 10:25:49 | INFO     | data_juicer.core.ray_executor:113 - Loading dataset with Ray...
[2024-07-09 10:26:05,833 E 4398 4398] gcs_rpc_client.h:212: Failed to connect to GCS at address XX.XXX.X.X:6379 within 5 seconds.
[2024-07-09 10:27:00,872 E 4398 4519] gcs_rpc_client.h:554: Failed to connect to GCS within 60 seconds. GCS may have been killed. It's either GCS is terminated by `ray stop` or is killed unexpectedly. If it is killed unexpectedly, see the log file gcs_server.out. https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure. The program will terminate.
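Side note: Ray Client generally requires the client and the cluster to run matching Ray and Python versions, so a 2.10.0 client against a 2.31.0 cluster may fail on its own. A minimal way to compare the two sides (a sketch, reusing the same ray:// address as above):

import sys
import ray

ray.init('ray://host.docker.internal:10001')

# Versions on the client side (inside the Docker container).
print('client :', ray.__version__, sys.version.split()[0])

# Versions actually running on the cluster.
@ray.remote
def versions():
    import sys
    import ray
    return ray.__version__, sys.version.split()[0]

print('cluster:', ray.get(versions.remote()))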
pan-x-c commented 1 month ago

It seems to be caused by a Ray configuration error in Kubernetes (e.g., the XX.XXX.X.X address). Please check the ray_address field in your YAML configuration. I can't reproduce this problem on a regular cluster.

Fatima-0SA commented 1 month ago

> It seems to be caused by a Ray configuration error in Kubernetes (e.g., the XX.XXX.X.X address). Please check the ray_address field in your YAML configuration. I can't reproduce this problem on a regular cluster.

The XX.XXX.X.X is the local node IP of the head node, which can be used when submitting jobs with the Ray Jobs CLI. However, for the process_on_ray demo I am submitting jobs with Ray Client by specifying ray_address: 'ray://host.docker.internal:10001' in the YAML configuration.

I have tried running simple scripts for reading/writing data on the cluster with the same Ray address, and that works. But I would like to use data-juicer's ray_executor.py for the extra data processing it provides.

pan-x-c commented 1 month ago

Does this error only occur in the Ray mode of data-juicer? You can try the following code to check your ray cluster.

import ray
ray.init('ray://host.docker.internal:10001')

If the error still occurs, something is wrong with your Ray cluster. You may need to seek help from the Ray community.

Fatima-0SA commented 1 month ago

> Does this error only occur in the Ray mode of data-juicer? You can try the following code to check your ray cluster.
>
> import ray
> ray.init('ray://host.docker.internal:10001')
>
> If the error still occurs, something is wrong with your Ray cluster. You may need to seek help from the Ray community.

Yes, I can confirm this works fine with my Ray cluster; the problem arises only in the Ray mode of data-juicer.

pan-x-c commented 1 month ago

The error occurs when loading the dataset, can you try the following code?

import ray
import ray.data as rd

ray.init('ray://host.docker.internal:10001')
dataset = rd.read_json(<your dataset path>)

Fatima-0SA commented 1 month ago

> The error occurs when loading the dataset, can you try the following code?
>
> import ray
> import ray.data as rd
>
> ray.init('ray://host.docker.internal:10001')
> dataset = rd.read_json(<your dataset path>)

This causes a file-not-found error (the path is not visible to the RayCluster), since the dataset path is local to the Docker container from which I am submitting jobs to the RayCluster. Please see below: [screenshot attached]

So instead, I am adding the runtime_env argument to upload my local working directory to the head node, as well as adding the @ray.remote decorator right above the read_json() and write_json() functions:

import ray
import ray.data as rd
from loguru import logger

logger.info('Initiating RayCluster Connection:')
# Upload the local working directory (which contains the dataset) to the cluster.
ray.init('ray://host.docker.internal:10001',
         runtime_env={"working_dir": "/opt/airflow/dags"})

@ray.remote
def read_json():
    # Runs on the cluster, so the uploaded working directory is visible there.
    dataset = rd.read_json('demo-dataset.jsonl')
    return dataset

@ray.remote
def write_json(dataset):
    # Write to the local filesystem of the node executing this task.
    dataset.write_json('local:///tmp/processed_data')

logger.info('Trying to Read JSON Remotely...')
json_data = read_json.remote()  # returns an ObjectRef immediately
logger.info('Completed Reading JSON!')

logger.info('Trying to Write JSON Remotely...')
# ray.get blocks until the remote write has actually finished.
ray.get(write_json.remote(json_data))
logger.info('Completed Writing JSON!')

Output can be seen on the Ray head node under /tmp/processed_data: [screenshot attached]

pan-x-c commented 1 month ago

Currently, data-juicer's Ray mode requires the data to be stored either on a shared file system (in that case, use the path directly) or at the same path on all machines (in that case, prefix the path with local://).
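Concretely, in Ray Data terms (a sketch with hypothetical paths; the dataset_path in the YAML takes the same two forms):

import ray
import ray.data as rd

ray.init('ray://host.docker.internal:10001')

# Case 1: the dataset lives on a shared filesystem mounted identically on all
# nodes -- pass the plain path (hypothetical mount point).
ds = rd.read_json('/mnt/shared/demo-dataset.jsonl')

# Case 2: the file has been copied to the same local path on every node --
# prefix the path with local:// (hypothetical path).
ds = rd.read_json('local:///data/demo-dataset.jsonl')

print(ds.schema().names)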

Fatima-0SA commented 1 month ago

Reading a local file with "local://" added to the path is not working properly; I keep facing a FileNotFound error, even though I copied the file to the same path on all of the cluster pods plus the scheduling container. Passing the data path without "local://" raises the same error I opened the issue with: RuntimeError: Global node is not initialized.

Similarly, the global node initialization error is encountered when trying to read from shared storage (Azure Blob Storage); full log for your reference: [screenshot attached]

pan-x-c commented 1 month ago

Since I can't reproduce the error, I can only make the corresponding modifications based on the code above. You can try the code in PR #348 to confirm whether the problem has been solved.

Fatima-0SA commented 1 month ago

Thanks @pan-x-c, I have checked out PR #348, but I am still receiving the same exception: [screenshots attached]

github-actions[bot] commented 2 weeks ago

This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add new comments, otherwise this issue will be closed in 3 days.

github-actions[bot] commented 2 weeks ago

Close this stale issue.