Closed spirali closed 6 years ago
Based on the feedback from the mentioned Reddit discussion, our long-term goals, and internal discussion, this is the list of issues to work on, with their [priority], (assignee) and their sub-tasks.
Requested by several people in the discussion, seems like a good idea anyway. For now with Capnp.
DataStore API with direct calls. (@spirali) [high]
Multiple options, priorities may vary. (@spirali)
Pythonize the client API.
rain start
and running on OpenStack, Exoscale, AWS. Does not have to be a part of CI (even for running locally). Depends on / part of #37.
Replace capnp RPC and the current monitoring dashboard HTTP API with a common protocol. Part of #11 (more discussion there) but specific to the public API.
Lower priority, best driven by real use-cases. Ideas: numpy subtasks, C++/Rust subworkers.
utils/bench/simple_task_scaling.py
. The results as of 0.2 are here.

First, I start with my todo list as it looked before the Reddit post:
This is the list of items that was already in our long-term goals, but whose priority we should reconsider.
Implement additional subworkers. It should be relatively easy to implement a simple library for e.g. Rust and C++ that provides the basic subworker interface and makes it simple to use Rust/C++ code in Rain. There are some open questions about how exactly the API should look, but preparing an initial prototype should be no problem. I think the only real question is whether we should wait for the decision about the new RPC protocol. The good thing is that worker<->subworker communication is quite separated from the rest (because it refers to local objects; e.g. it does not even use the DataStore API, hence we do not have to wait for the DataStore revamp).
Implement additional clients. It seems that having non-Python clients is more popular than we expected. As far as I know, we did not discuss this option much; however, creating a working prototype should be relatively easy. The question about waiting for the new RPC protocol is more serious here, though.
Is a "Python subworker as a library" necessary? I have the feeling that for each environment where we can transfer a function from the client to a subworker in a reasonable (and portable) way, we should do it that way. The overhead of transferring a function is minimal (it is done only once) and the flexibility is huge. I consider building a "fixed" subworker a kind of side-step for cases where there is no such option (C++/Rust [?*]).
I can imagine some scenarios where a Python worker could be useful:
Also, the built-in pytask subworker can be trivially implemented as one such subworker task (with a bit of unpacking logic), so it is not much more work.
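To sketch the "one such subworker task" idea above (the names here are illustrative, not Rain's actual API): a subworker only needs a single generic task that unpacks a client-shipped function and applies it to its inputs.

```python
import pickle  # in practice Rain's Python client uses cloudpickle
               # so that arbitrary (non-module-level) functions work too

# Hypothetical sketch, not Rain's actual API: the generic task that a
# pytask-style subworker boils down to.
def run_shipped_task(payload: bytes, *inputs):
    """Unpack a function serialized on the client side and apply it."""
    fn = pickle.loads(payload)
    return fn(*inputs)

# Client side: serialize the function once; every task invocation reuses it.
payload = pickle.dumps(sorted)
print(run_shipped_task(payload, [3, 1, 2]))  # [1, 2, 3]
```

The point is that the "unpacking logic" is a few lines, so a fixed Python subworker is largely a special case of the generic one.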
I think that even if you have a stable pipeline, sending it to the subworkers costs nothing (especially compared to the fact that you are using Python), while setting up the infrastructure with your own subworkers is always more painful (admin work, changes & updates).
PyPy actually works right now: there is no problem cloudpickling an object in CPython and unpickling it in PyPy. Capnp works in PyPy.
Cloudpickle transports Python bytecode, not binary code, so if you have a library installed on both ends, there is no ABI problem.
However, I see now that it can be useful to define tasks that may be called e.g. from Java client where cloudpickle is not easily accessible.
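The bytecode point above can be illustrated with the standard library alone: `marshal` can serialize a function's code object, which is a minimal sketch of the same idea cloudpickle implements more robustly (cloudpickle additionally handles globals, closures, defaults, etc.). Like cloudpickle payloads, this is tied to the interpreter's bytecode format, but not to any platform ABI.

```python
import marshal
import types

def double(x):
    return x * 2

# Serialize only the function's bytecode, not native machine code.
payload = marshal.dumps(double.__code__)

# "Receiving end": rebuild a callable from the bytecode alone.
code = marshal.loads(payload)
restored = types.FunctionType(code, globals())
print(restored(21))  # 42
```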
PR #40 implements the replacement of the DataStore API with direct calls.
PR #52 implements Exoscale deployment scripts.
These are the things I think will be useful in the future:
Transitioned to #64 after v0.3 release
We have received a lot of feedback on our Reddit post (https://www.reddit.com/r/rust/comments/89yppv/rain_rust_based_computational_framework/). I think that now is the time to recap our plans and maybe reconsider priorities to reflect the real needs of users. This is meant as a kind of brainstorming whiteboard; each individual task should get its own issue in the end. Also, I would like to focus on a relatively short-term road plan (let us say, things that could be finished within 3 months); maybe we can create a similar post for our long-term plans.
EDIT (gavento): Moved @spirali's TODO to a post below.