Ray project suggestions

pcmoritz commented 5 years ago

Ray is a community-driven project. We love to learn about existing use cases and how we can help to make Ray more useful. To facilitate that, we would like to create a community-maintained list of project suggestions that can help future contributors decide on what to work on. To facilitate the discussion, here is a preliminary list.

If you are interested in working on any of these or have more suggestions for projects, please comment on this thread, open issues or post to our mailing list!
Please also check out https://github.com/ray-project/ray/issues

Ray core

These are suggestions to improve Ray's distributed execution engine.

Improve task submission overhead: Profile the current task submission overhead, identify bottlenecks and speed it up. This could for example be done by batching tasks submissions.
Fuzzing for Ray: Automatically uncover bugs and race-condition in Ray using fuzzing.
Code coverage: Track the code coverage of Ray tests (and improve it).
Simplified microservices: Make it very easy to develop and deploy microservices with Ray (automatically creating REST/GRPC interfaces for actors, making it possible to support dockerized actors)
C++ frontend: We already have a frontend for Python and Java. This project entails adding a frontend for C++, which could be useful for certain performance-critical applications like allreduce.
Distributed GC: An alternative design for object eviction. Instead of doing LRU eviction on a per-node basis, implement a more global policy that tracks object usage and frees objects if they are not needed any more.
Actor migration: Make it possible to transfer actors from one node to another. This will enable preemption of nodes with actors.
Operator or task fusion: Speed up the execution of Ray programs by automatically fusing together operators or tasks, e.g. for streaming.

Libraries

These are projects related to libraries.

Data and model parallel trainig: Develop a library to do data and/or model parallel training on Ray
Tune:
- Optimize training for TPUs
- Add new architecture search algorithms
- Add components for diagnostics during training
- Warm-starting a model
- Exposing a first-class tracker/logger
RLlib:
- Add new models (e.g. Deep ResNets or Transformers)
- Add new tasks/examples (e.g. language based ones)
- Improve PyTorch support
- TensorFlow 2.0 support
Modin:
- Automatically pick number of partitions
- Speed up operators with Apache Arrow
Streaming: Implement more operators and improve the performance of https://github.com/ray-project/ray/pull/4126.

Applications

This is a list of interesting applications that can be developed on top of Ray. They can serve as examples for others on how to use Ray or evolve into libraries in the future.

Web-crawling: Use Ray to extract information from the web (e.g. a search index) by crawling web-pages.
Training a language model: Extract training data from the web and train a language model with it, see also https://openai.com/blog/better-language-models/.

Development

These are projects that will make it easier to do development with Ray.

Integration with Debuggers: Make it easy to do remote debugging or actors and tasks in Ray.
Integration with IDEs: E.g. write a plug in for VSCode that integrates with the graphical debugger, shows the task timeline or lets users easily start/stop clusters and update their code.

gehring commented 5 years ago

@pcmoritz This is a ~~create~~ great idea! It might be good to include a project for supporting TF 2.0 (related #4134)

pcmoritz commented 5 years ago

@pcmoritz This is a create idea! It might be good to include a project for supporting TF 2.0 (related #4134)

Thanks, added it!

raulchen commented 5 years ago

Thanks for summarizing and posting this!

I'd like to share some experiences and work from Ant.

Improve task submission overhead: Big +1 for this. We're also planning on profiling and improving Ray's performance. One thing we already did is perf metrics, #4246 is first PR and other PRs should also come soon. Besides this, there're a lot of other things that need to be built. E.g., distributed tracing, profiling CPU/memory usage, etc.

Code coverage: I personally did some research before about adopting codecov.io. The amount of work should be fine. Maybe someone from Ant can work on this.

Distributed GC: Months ago, we discussed about Batch GC. And we're now prototyping this idea. Other than Batch, do we have a better solution (e.g., automatic distributed GC) at this moment?

Other than the above, I think "Custom task/actor scheduling policy" will also be very useful. E.g., Streaming system needs this feature to collocate actors.

jarlva commented 5 years ago

Cross-pollinate project with tensorflow/agents. Hi All, tensorflow/agents is "A library for Reinforcement Learning in TensorFlow." It's in active development and also compatible with TF 2.0 I believe that cooperation between the the two projects will rip lots of fruits. Thoughts?

gravitywp commented 5 years ago

make actor's method support async function to improve concurrency.

federicofontana commented 5 years ago

Visualizations: All metrics are saved as scalar under the same tab in tensorboard, that's it (no histograms, no graphs, no HParams for tune). It would be nice to add more visualizations, for example:

Use different tab names for different metrics (e.g. loss, hyperparameters, ...)
Add computation graph in tensorboard (this might only require to pass graph=tf.get_default_graph() when instancing FileWriter here.
Add L2 norm of network weights and/or gradients in tensorboard to diagnose training issues.
Use tensorboardX to visualize non-tensorflow models.
Support tensorboard-HParams to visualize tune results #4528

jodusan commented 5 years ago

Would a PR that adds beholder tensorboard plugin be of any interest?

alokgogate commented 5 years ago

Would it be possible to add a progress bar page like Spark? This would be make it easy to track the status of any job that has been deployed on the ray cluster?

bionicles commented 5 years ago

NEAT / HyperNEAT algorithms might benefit from Ray scaling

drozzy commented 5 years ago

Not sure why you need "Simplified microservices" - seems like a lot of extra work for little payoff. There are already great languages like Erlang/Elixir for that. I would be really interested to see more straightforward distributed-RL setup - maybe more standardized k8s approach (currently worker/head nodes have to be setup manually, which includes a lot of setup of libraries to make sure that all software is there).

kivo360 commented 5 years ago

@drozzy I personally use a microservice for better object permanence and to keep the ray code separate from the rest of the codebase. It allows big projects to be worked on without creating a massive monolith (separate repos + containers = godsend).

Also, question. What's the status of putting a transformer inside of the model? I found something on Github that seems to be a meta-learner for RLLib, maybe you can take what they did?

richardliaw commented 5 years ago

I found something on Github that seems to be a meta-learner for RLLib, maybe you can take what they did?

@kivo360 what do you mean? can you share a link?

kivo360 commented 5 years ago

@richardliaw my bad, proofreading error. "Someone on GitHub has a meta-learner RL model. Maybe we can take what they created and turn it into a default."

gravitywp commented 5 years ago

Not sure why you need "Simplified microservices" - seems like a lot of extra work for little payoff. There are already great languages like Erlang/Elixir for that. I would be really interested to see more straightforward distributed-RL setup - maybe more standardized k8s approach (currently worker/head nodes have to be setup manually, which includes a lot of setup of libraries to make sure that all software is there).

@drozzy Maybe Ray could provide more flexibility than normal microservice since Ray support fine-grained task will be able to do function-level scaling, and you even can write a whole distributed application in one Ray project(orchestrate a branch of components on different nodes). Personally, I think it would be great to have this feature.

kuonangzhe commented 5 years ago

Are we going to support kubeflow?

anooprh commented 4 years ago

Are we going to support kubeflow?

I second supporting kubeflow as well. kubernetes is the most popular cluster management system and leveraging kubeflow + kubernetes would make it easy for folks to leverage their existing cluster to use ray.

josjo80 commented 4 years ago

I would like to propose supporting self-play algorithms like AlphaGo, AlphaZero, or MuZero. The following article provides pseudo-code for a MuZero implementation.

robertnishihara commented 4 years ago

@kuonangzhe @anooprh can you say more about what the ideal integration/API would look like? Thanks!

mstrofbass commented 4 years ago

I can't be the only one who would find this useful so maybe I'm just too unfamiliar with Ray to know how to accomplish the same thing, but I'd simply like the ability to "disable" ray.

A lot of times when I'm debugging I end up removing the decorator and calling the method I'm debugging directly, which also requires changing how function parameters are handled and the output from the function calls (e.g., can't use ray.get() anymore).

It would be nice if I could use a config option to essentially tell Ray to not do any of the fancy stuff and just basically do normal synchronous processing (i.e., implement a passthrough mechanism).

A potential use case for this, feasibility unknown, would be facilitating usage on Windows. In theory, you could have a Windows build that implements this passthrough mechanism so that Windows users can at least run the same code even if they don't get the benefits. I presume this would be easier to implement than implementing the full functionality.

I'm getting a buddy that runs Windows to help me on a project that uses Ray. He doesn't actually need the benefits of Ray to do his thing, but it would be great if he could simply run the code as is.

richardliaw commented 4 years ago

I think one way of achieving this is via ray.init(num_cpus=1). The other way of achieving this should be ray.init(local_mode=True), though I think there are a few small known bugs with that option.

On Wed, Jan 8, 2020 at 5:34 PM mstrofbass notifications@github.com wrote:

I can't be the only one who would find this useful so maybe I'm just too unfamiliar with Ray to know how to accomplish the same thing, but I'd simply like the ability to "disable" ray.

A lot of times when I'm debugging I end up removing the decorator and calling the method I'm debugging directly, which also requires changing how function parameters are handled and the output from the function calls (e.g., can't use ray.get() anymore).

It would be nice if I could use a config option to essentially tell Ray to not do any of the fancy stuff and just basically do normal synchronous processing.

A potential use case for this, feasibility unknown, would be facilitating usage on Windows. In theory, you could have a Windows build that implements this passthrough mechanism so that Windows users can at least run the same code even if they don't get the benefits. I presume this would be easier to implement than implementing the full functionality.

I'm getting a buddy that runs Windows to help me on a project that uses Ray. He doesn't actually need the benefits of Ray to do his thing, but it would be great if he could simply run the code as is.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ray-project/ray/issues/4417?email_source=notifications&email_token=ABCRZZMS2ZITG3EJ7YKEBEDQ4Z5ILA5CNFSM4G7TZ2R2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEIOTLOA#issuecomment-572339640, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABCRZZOPIJVGO7BTKDSK5PDQ4Z5ILANCNFSM4G7TZ2RQ .

mstrofbass commented 4 years ago

I think one way of achieving this is via ray.init(num_cpus=1). The other way of achieving this should be ray.init(local_mode=True), though I think there are a few small known bugs with that option. …

It looks like local_mode is it!

So we can update my request to basically be: would it be possible to get a Windows build that implements local_mode quicker than a Windows build that does everything?

bek0s commented 4 years ago

Hi all,

First of all, thank you for this great framework. I was wondering if there is a dask.distributed.Client-equivalent in Ray.

The reason for asking about this is the following scenario: Imagine that you have a supercomputer with two different kinds of nodes (CPU-based, and GPU-based), and the administrators have setup two separate ray servers for each kind of node.

I am developing an astrophysics code which evaluates a series of galaxy models simultaneously. The software provides the users with the option to choose what kind of hardware they want to evaluate each of their models on (CPU, GPU, etc). Imagine now that a user wants to (simultaneously) evaluate some models on the CPU and some on the GPU, in the environment like the one described above. I would like to be able to connect to two different ray servers and perform my calculations simultaneously.

To my limited knowledge, Ray doesn't support this because the connection to the server is a global state in the framework. Is that true? Are there any plans to support simultaneous connections to different Ray servers?

Thank you for your time.

robertnishihara commented 4 years ago

@bek0s, I see, so the thing you want to do is to have an application that submits different tasks to different Ray clusters, right?

There are two parts to this.

Allowing you to have a Python script connect to a Ray cluster from outside of the Ray cluster. We are planning to do this, but it isn't implemented yet. There was some preliminary work on this a while back in https://github.com/ray-project/ray/pull/2478, but it never got merged.
Allowing you to connect to multiple clusters. The current Ray API is not designed for this, because you call ray.put() and f.remote() and things like that which don't specify a cluster. It's certainly possible to implement something like this. Right now, the preferred way to do this in Ray is to have a single cluster with CPUs and GPUs and to specify in the application whether the tasks should use CPUs or GPUs.

bek0s commented 4 years ago

Hi @robertnishihara,

I really appreciate the quick response. It is good to know that the features I would like to see are not impossible to implement due to some fundamental limitation of Ray. Indeed, my use case is quite unusual, and I think the current Ray API design should suffice for most cases. Nevertheless, any future developments related to the above-mentioned features will be more than welcome! :)

Thanks again!

bionicles commented 4 years ago

one critical thing missing from Ray versus Multiprocessing is Queues and Pools, just a simple API to set up an endless loop like this:

Envs (Pool) -> Observations (Queue) -> Agents (Pool) -> Actions (Queue) forever

This Pool, Queue, Pool, Queue motif takes no time in multiprocessing but it's unclear in ray and often just hangs with no error messages or anything. That's bread and butter basic stuff for a distributed systems framework, but it's not stable reliable benefit for Ray users. Just imagine a Kanban board. It's really an async pipe of pools and queues

Even just making logs requires stack overflow to find some function to build a logger on all the workers.

Most of the intro to ray docs are oversimplified to the point they aren't useful; for example, the functions in the examples take no arguments so it's not immediately clear to a new user that you're supposed to do x.y.remote(ARGS)

Also, this library is huge, complicated, and the dependencies are huge and complicated, to the point I'm concerned about adopting Ray, it's literally 380,000+ lines black box beast, not saying it could be done better, but it could definitely be a lot simpler, and that would make maintenance a lot easier. Simplicity is a key benefit of good software, and Ray's core API seems simple, but the implementation is complicated and that holds it back

richardliaw commented 4 years ago

Thanks a bunch for the feedback @bionicles! BTW a question about dependencies - what would be ideal here? Reducing extraneous dependencies in a slimmed-down core install? Moving away from the monorepo?

stale[bot] commented 3 years ago

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the 14 days, the issue will be closed!

If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

stale[bot] commented 3 years ago

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

ray-project / ray