[tune][Bug] Worker doesn't sync the logs to HDFS at the given interval

Nithanaroy commented 2 years ago

Search before asking

[X] I searched the issues and found no similar issues.

Ray Component

Ray Tune, Ray Clusters

Issue Severity

Medium: It contributes to significant difficulty in completing my task, but I can work around it and get it resolved.

What happened + What you expected to happen

from ray import tune
from ray.tune import SyncConfig

tune.run(
    ...
    checkpoint_freq=0,
    sync_config=SyncConfig(
        # upload_dir should exist and will not be created on the fly
        upload_dir="hdfs://" + hdfs_sync_dir,
        sync_on_checkpoint=True,
        # sync interval in seconds
        sync_period=60,
    ),
)

This configuration does not sync the data from the workers to HDFS every 60s. It does, however, sync the logdir from the head node every 60s. At the end of the experiment, it pushes all data from the workers to HDFS as requested.
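One way to confirm this symptom while a trial is still running is to compare the modification times of files in the upload directory against the sync period. The following is a minimal sketch and not part of Ray: `find_stale_paths`, `grace_factor`, and the example paths are hypothetical, and collecting the listing itself (for example from an HDFS directory listing) is left to the caller.

```python
import time
from typing import Dict, List, Optional


def find_stale_paths(
    last_modified: Dict[str, float],
    sync_period_s: float = 60.0,
    grace_factor: float = 2.0,
    now: Optional[float] = None,
) -> List[str]:
    """Return paths whose last modification is older than the allowed window.

    last_modified maps a path in the upload directory to its modification
    time as a Unix timestamp. How that mapping is collected is up to the
    caller; this helper only does the comparison.
    """
    if now is None:
        now = time.time()
    # Allow some slack beyond one sync period before calling a path stale.
    cutoff = now - sync_period_s * grace_factor
    return sorted(path for path, mtime in last_modified.items() if mtime < cutoff)


# Hypothetical listing taken while the experiment is still running.
now = 1_000_000.0
listing = {
    "experiment/experiment_state.json": now - 30,    # written from the head node
    "experiment/trial_0001/result.json": now - 900,  # produced on a worker
    "experiment/trial_0002/result.json": now - 45,
}
stale = find_stale_paths(listing, sync_period_s=60, now=now)
print(stale)
```

If worker output only reaches HDFS at the end of the experiment, paths under the worker trial directories would be expected to show up as stale during the run, while files written from the head node would not.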

Versions / Dependencies

Version 1.10.0 of Ray and Tune everywhere

Reproduction script

Unfortunately, I don't know of any open-source way to reproduce a multi-worker problem like this. I started the head node with ray start --head, connected a number of workers to it with ray start --address=..., and then used Tune to launch an experiment.

Anything else

I'm happy to jump into a debug session if that is easier for you.

Are you willing to submit a PR?

[ ] Yes I am willing to submit a PR!

Nithanaroy commented 2 years ago

@krfricke I did not find anything related to HDFS sync failures in the /tmp/ray/session_latest/logs/ directory while the trial was running.

krfricke commented 2 years ago

We've updated the syncing logic to use pyarrow for syncing instead - please let us know if this resolved the problem.

Nithanaroy commented 2 years ago

That's great, @krfricke. How do I get this change? I can wait for the next release if that's easier.

richardliaw commented 2 years ago

You can try it out with pip install --pre -U ray; the change should be in the 2.0 release :)

yiwei00000 commented 1 year ago

Hi, what settings are required for Ray to read from HDFS? I look forward to your reply, thank you.
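The thread does not answer this question. As a hedged pointer only: if syncing goes through pyarrow as described above, HDFS access presumably depends on pyarrow's HDFS support, which according to the pyarrow documentation relies on libhdfs and on the environment variables JAVA_HOME, HADOOP_HOME, and CLASSPATH (with ARROW_LIBHDFS_DIR as an optional override for the location of libhdfs). The sketch below only checks that those variables are set in the current process; `missing_hdfs_env` and the example paths are hypothetical and not part of Ray or pyarrow.

```python
import os
from typing import List, Mapping, Optional

# Variables that pyarrow's HDFS support is documented to rely on.
REQUIRED_VARS = ("JAVA_HOME", "HADOOP_HOME", "CLASSPATH")


def missing_hdfs_env(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    if env is None:
        env = os.environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Hypothetical environment in which CLASSPATH has not been exported.
example_env = {"JAVA_HOME": "/opt/java", "HADOOP_HOME": "/opt/hadoop"}
print(missing_hdfs_env(example_env))
```

A check like this would presumably need to pass on every node that performs syncing, not only on the head node.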
