project-codeflare / codeflare

Simplifying the definition and execution, scaling and deployment of pipelines on the cloud.
https://codeflare.dev
Apache License 2.0

CodeFlare resiliency tool: initial commit #37

Status: Closed (JainTwinkle closed this 2 years ago)

JainTwinkle commented 3 years ago

What does this PR do?

This is the first step towards improving resiliency and performance in Ray without modifying its source code. This PR adds a new tool that makes it easier to configure a Ray cluster. The tool fetches and parses Ray configuration options and generates resiliency profiles (e.g., strict, relaxed, recommended). We are currently choosing the configuration options for each resiliency profile manually, by evaluating them on various Ray workloads. We'll update this PR accordingly.
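To illustrate the parse-then-generate flow described above, here is a minimal sketch, not the tool's actual code. It assumes the configuration options arrive as text lines of the form `RAY_CONFIG(type, name, default)`. The option names (`example_task_retries`, `example_enable_checkpoint`), the override values, and the helpers `parse_options` and `build_profile` are all hypothetical, since the real per-profile values are still being decided.

```python
import json
import re

# Assumed input format: one "RAY_CONFIG(type, name, default)" entry per line.
CONFIG_PATTERN = re.compile(
    r"RAY_CONFIG\(\s*([\w:]+)\s*,\s*(\w+)\s*,\s*([^)]+?)\s*\)"
)

# Hypothetical per-profile overrides; the option names and values are
# placeholders, not settings taken from this PR.
PROFILE_OVERRIDES = {
    "strict": {"example_task_retries": "5"},
    "relaxed": {"example_task_retries": "0"},
    "recommended": {},
}


def parse_options(text):
    """Return {name: {"type": ..., "default": ...}} for each matched entry."""
    options = {}
    for ctype, name, default in CONFIG_PATTERN.findall(text):
        options[name] = {"type": ctype, "default": default}
    return options


def build_profile(options, profile):
    """Start from the parsed defaults and apply the profile's overrides."""
    if profile not in PROFILE_OVERRIDES:
        raise ValueError(f"unknown profile: {profile}")
    values = {name: meta["default"] for name, meta in options.items()}
    for name, value in PROFILE_OVERRIDES[profile].items():
        if name not in values:
            raise KeyError(f"override for unknown option: {name}")
        values[name] = value
    return values


# Made-up sample input used only to exercise the sketch.
SAMPLE = """
RAY_CONFIG(int64_t, example_task_retries, 3)
RAY_CONFIG(bool, example_enable_checkpoint, true)
"""

if __name__ == "__main__":
    parsed = parse_options(SAMPLE)
    print(json.dumps(build_profile(parsed, "strict"), indent=2))
```

Keeping the overrides in a plain dictionary means a profile can be revised after each workload evaluation without touching the parsing code.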

Description of Changes

The changes in this PR are currently independent of the main CodeFlare code. We intend to put this tool in a new folder called utils in the CodeFlare root directory.

JainTwinkle commented 2 years ago

@chcost could we assign someone to review this PR?

Thanks!

raghukiran1224 commented 2 years ago

@JainTwinkle please let me know about the comment above.