ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Discussion on batch Garbage Collection. #2242

Closed guoyuhong closed 4 years ago

guoyuhong commented 6 years ago

Hi @robertnishihara @pcmoritz , we are planning to add a batch Garbage Collection to Ray.

We have a concept called batchId (int64_t) used for garbage collection. For example, one job uses this batchId to generate all of its objectIds and taskIds, and all of these objectIds and taskIds are stored in a Garbage Collection table in GCS, keyed by the batchId. When the job is finished, we can simply pass the batchId to the garbage collector, which looks up the Garbage Collection table in GCS and garbage-collects all the related tasks and objects.

In the current id.h implementation, the lowest 32 bits of an ObjectId are used for the Object Index. We can use the higher 64 bits next to the Object Index as the batchId and add a new GC Table in GCS.

This GC mechanism will help release memory resources in GCS and plasma. What do you think of this code change?

guoyuhong commented 6 years ago

@eric-jj @raulchen @imzhenyu . FYI.

eric-jj commented 6 years ago

We used the task_id (part of the objectID) to GC objects for a task in the Ant Finance internal version, which produced very promising results for plasma object management. I am thinking of migrating the feature to the open source version, which is why we raised this issue. @pcmoritz @robertnishihara , do you have any comments on it?

robertnishihara commented 6 years ago

Can you post a code example showing what this would look like from the application level? Or would this be completely under the hood?

guoyuhong commented 6 years ago

I will post a code change as an example shortly. Thank you, Robert.

robertnishihara commented 6 years ago

Thanks! I have some reservations about this approach and how general it is. It clearly makes sense for at least one of your use cases, but we need to make sure we have a strategy that works in general.

cc @concretevitamin

guoyuhong commented 6 years ago

Here is the code change that our scenario needs: https://github.com/ant-tech-alliance/ray/pull/29 This change does not yet include void gc(Long batchId); in RayApi.java, but our goal is to add it.

guoyuhong commented 6 years ago

On the user side, here is a sample of the user code. The user wants GC to release object memory for finished jobs.

public class RayJobScheduler {
    public void init(JobSchedulerContext jobSchedulerContext) {
        // Set the job id. Do some init work...
    }

    public void scheduleJob(IRayaJobDriver driver) {
        // Set the job id. Do some init work.
        Ray.startBatch(jobId, ...);
    }

    public void finishJob(Long jobId, JobRes jobRes) {
        // Do some cleaning work...
        Ray.endBatch(...);
        if (jobRes.doGc) {
            Ray.gc(jobId);
        }
    }
}

Here, we treat JobId and BatchId as the same concept: a whole job/procedure/experiment that generates many Ray tasks and objects to achieve a goal. Once the job finishes and we have the result, we know that the resources used by the job are no longer needed, so we can release them for later and ongoing job requests.

eric-jj commented 6 years ago

For the Python code, the user application would look like the following. The jobID is optional; it does not affect existing features/code. Only when users want to manage the task and object life cycle themselves do they use start_batch and end_batch to do GC.

@ray.remote
def f():
    time.sleep(1)
    return 1

ray.init()

jobID = ray.start_batch()
results = ray.get([f.remote() for i in range(4)])
ray.end_batch(jobID, need_gc=True)

concretevitamin commented 6 years ago

Hey @guoyuhong @eric-jj - I think we have been conservative in two things

So I think it'd be great to have a deeper discussion on this proposed feature, perhaps in the form of a design doc? @robertnishihara @pcmoritz what do you guys think?

To clarify, we've been working on addressing the GCS memory issue as well. The difference is that our solution is more transparent to the user, while still being possible for the user to call directly should she choose to. Each GC batch is not tied to a job. When it's ready, I'd love to get you guys to play with this feature and see if it meets your needs!

eric-jj commented 6 years ago

@concretevitamin The feature is not for GCS memory; it is for plasma memory management. It is an important feature for our production system: without it, unused objects are not recycled in plasma, and useful objects may be evicted because of the eviction policy. We implemented the feature because we hit this problem when using Ray in our system. Our scenario is graph computation, where many temporary variables are created during the calculation. So we implemented the feature in our internal version, got the expected results, and the system is online now.

I am trying to merge the internal version of Ray with the GitHub version. The first step for me is to deploy the GitHub Ray to our scenario. This feature is a blocker for that; without it we can't apply the GitHub version of Ray to our online system.

@guoyuhong will write a design doc and we can follow up on the detailed design. Yes, the design introduces a top-level API; that is expected. Some scenarios need RAII-like ability, and users can choose whether to use it. I don't think it exposes internal control to the user: how the GC logic is implemented stays transparent to the user, and the backend is free to do nothing even though the user requested it.

What's your plan to resolve the GCS memory issue? Can you provide more details, such as a design doc? Then we can evaluate whether it resolves our problem.

guoyuhong commented 6 years ago

Background

Graph Computing

We have a graph computing scenario based on Ray. Ray can be used as the backend for continuous graph computing requests. The requests have the following characteristics:

  1. In the scenario of graph computing, the number of objects grows exponentially. If unused memory is not released in time, memory becomes a concern.
  2. There are many concurrent requests, and new requests keep arriving all the time.
  3. The computation costs for different requests are different. It is not guaranteed that the finishing order is the same as the creation order.

Currently, the Plasma Store evicts objects according to creation time when memory runs low. However, this mechanism is not enough. For one thing, our scenario needs a large amount of memory, so it doesn't take long to use up the memory with incoming requests. For another, if an object is evicted according to creation time while it is still in use by later tasks, the object has to be reconstructed, which hurts performance. Since we know which requests are finished and won't be used in the future, we can release those resources safely.

Streaming Computing

In its further evolution, Ray will support stream computing. The life cycle of streaming objects is different, and even the best automatic garbage collection may not be enough. We need to provide fine-grained memory management, like RAII, to let users manage object life cycles according to their needs.

Solution

We introduce the concepts of Batch and BatchId (int64_t). A BatchId is an identifier for a set of Ray tasks and Ray objects related to one high-level request/job. After the high-level request/job finishes, a new Ray API lets users release all the resources in GCS and Plasma associated with that BatchId.

Implementation Design

API Level

# APIs:
# This function can be called after Ray.init().
# If this function is called, there is a batch concept in this Ray Driver. 
# If not, Ray will work as normal.
ray.start_batch(batch_id) 

# This function can be called by the users 
# if they think all their resources are not needed in the future.
ray.end_batch(batch_id, need_gc=True)

###########################################################
# Scenario 1: Normal Mode
@ray.remote
def f():
    time.sleep(1)
    return 1

ray.init()
results = ray.get([f.remote() for i in range(4)])

# Scenario 2: Batch Mode
@ray.remote
def f():
    time.sleep(1)
    return 1

ray.init()
batchId = ray.start_batch()
results = ray.get([f.remote() for i in range(4)])
# Some other work...
ray.end_batch(batchId, need_gc=True)

The Java API can be designed in a similar way.

ObjectId Bits

Currently, the lowest 32 bits are used for the Object Index. The higher 128 bits are used for the TaskId. We can further split those 128 bits into two parts and use the highest 64 bits for the BatchId. (figure: proposed ObjectId bit layout)
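
For illustration, here is a minimal Python sketch of the proposed layout, assuming the current 20-byte (160-bit) ObjectId from id.h; the helper names are hypothetical and only show how a batch id could be packed into and recovered from the highest 64 bits.

# Hypothetical helpers for the proposed 160-bit ObjectId layout:
#   [ batch id : 64 ][ remaining task id : 64 ][ object index : 32 ]
BATCH_BITS, TASK_BITS, INDEX_BITS = 64, 64, 32
ID_BYTES = (BATCH_BITS + TASK_BITS + INDEX_BITS) // 8  # 20 bytes

def pack_object_id(batch_id, task_part, object_index):
    value = (batch_id << (TASK_BITS + INDEX_BITS)) \
            | (task_part << INDEX_BITS) | object_index
    return value.to_bytes(ID_BYTES, "big")

def batch_id_of(object_id_bytes):
    # Recover the batch id from the highest 64 bits of a packed id.
    return int.from_bytes(object_id_bytes, "big") >> (TASK_BITS + INDEX_BITS)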

Backend Support

GCS Support

There will be a GCS table recording the ObjectIds and TaskIds under one BatchId when Batch mode is enabled. The table name is "GC:${batchId}". Every objectId and taskId is pushed into this table. After garbage collection is called with a batchId, GCS releases all the objectIds and taskIds recorded in this table.
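
Since the GCS is Redis-backed, the bookkeeping could look roughly like the following sketch; the "GC:${batchId}" key follows the proposal, but the commands are only illustrative and not the actual GCS implementation.

import redis

gcs = redis.StrictRedis(host="localhost", port=6379)

def record_in_batch(batch_id, entry_id):
    # Push an objectId or taskId into the per-batch GC table, e.g. "GC:42".
    gcs.sadd("GC:%d" % batch_id, entry_id)

def gc_batch(batch_id):
    # Release every objectId/taskId recorded under the batch, then drop
    # the GC table itself (whose deletion would notify node managers).
    key = "GC:%d" % batch_id
    for entry_id in gcs.smembers(key):
        gcs.delete(entry_id)
    gcs.delete(key)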

Plasma Support

The node manager will subscribe to the delete message for "GC:${batchId}" from GCS. Once GCS has deleted the garbage collection table, the node manager triggers a garbage collection request to Plasma. Plasma then goes through the objectIds and deletes the objects under the requested batchId.
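
On the plasma side, the per-node cleanup might look like this sketch; it assumes an Arrow plasma client whose delete() call is available (this varies by version) and takes the batch's raw 20-byte object ids as input.

import pyarrow.plasma as plasma  # assumed client; API details vary by version

def gc_plasma(binary_object_ids, socket_name="/tmp/plasma"):
    # What a node manager could do after seeing the "GC:${batchId}" table
    # deleted: drop the batch's objects from the local plasma store.
    client = plasma.connect(socket_name)
    ids = [plasma.ObjectID(b) for b in binary_object_ids]
    client.delete([oid for oid in ids if client.contains(oid)])
    client.disconnect()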

eric-jj commented 6 years ago

@robertnishihara @concretevitamin any comments?

robertnishihara commented 6 years ago

@guoyuhong thanks for the code example! @eric-jj, thanks, I am still thinking through the implications.

Is there a clean way to implement this as a helper utility outside of the backend? E.g., in Python/Java you can track all of the object IDs that are generated between start_batch and end_batch (and use the task table to find tasks that were spawned by those tasks), and then generate new tasks that run on each node and use the plasma client API to evict the relevant IDs.

The advantage of doing it this way is that we can avoid adding functionality to the backend and also avoid adding extra information to the ObjectIDs.
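
A rough sketch of such a helper living entirely outside the backend; the tracker class, the per-node eviction task, and the plasma calls are all hypothetical and only illustrate the shape of the idea.

import ray
import pyarrow.plasma as plasma

@ray.remote
def evict_locally(binary_ids, socket_name="/tmp/plasma"):
    # Hypothetical per-node task: delete the batch's objects from the
    # node-local plasma store (delete() availability varies by version).
    client = plasma.connect(socket_name)
    client.delete([plasma.ObjectID(b) for b in binary_ids])

class BatchTracker:
    """Hypothetical app-level helper: remember the ObjectIDs produced
    between start_batch() and end_batch(), then evict them on each node."""

    def __init__(self):
        self._ids = []

    def track(self, object_ids):
        # Call this on the result of every .remote() made inside the batch.
        self._ids.extend(object_ids)
        return object_ids

    def end_batch(self, num_nodes):
        binary_ids = [oid.binary() for oid in self._ids]  # raw id bytes (older Ray: .id())
        # One eviction task per node; pinning tasks to nodes is left out here.
        ray.get([evict_locally.remote(binary_ids) for _ in range(num_nodes)])
        self._ids = []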

robertnishihara commented 6 years ago

Is it accurate to say that the core issue is that the object store eviction policy is not good enough for some scenarios, like your streaming use case, and that you want to provide application-level knowledge to improve the eviction strategy?

eric-jj commented 6 years ago

@robertnishihara your solution can solve some problems, but it can't meet the requirements.

  1. It can only be done partially as you suggest: we can recycle the objects, but not the tasks, because the taskId is not visible to the application.
  2. It is a general requirement, not specific to one application; that solution would make every application reinvent the wheel in different languages (Python, Java, ...).
  3. Application code can't touch the GCS, so the user would have to create an actor to record the object IDs. That actor becomes very heavy, and when the actor fails over, all the bookkeeping must be redone.
  4. The biggest issue to me is that it makes the code ugly. Users must record every objectID themselves. In most cases, users don't care about the object ID; they just pass it between functions directly.

raulchen commented 6 years ago

Is it accurate to say that the core issue is that the object store eviction policy is not good enough for some scenarios, like your streaming use case, and that you want to provide application-level knowledge to improve the eviction strategy?

@robertnishihara exactly

raulchen commented 6 years ago

I think Ray currently still lacks good resource management. For GCS, its memory is never recycled. For the object store, it simply adopts LRU-based eviction now, which definitely isn't the most effective approach.

Thus, I think we can consider introducing a Resource Manager component into the core. It takes responsibility for managing the life cycle of all resources in the system, including GCS, object store, etc.

I haven't thought about the detailed design of the Resource Manager. But I think having a component that unifies resource management will be great.

Also, I'd like to hear @concretevitamin's plan to address the GCS memory issue.

eric-jj commented 6 years ago

@robertnishihara The proposal is not only about GC management; it is also a starting point for job-level management. When I talked with Ant Finance's AI team, they had strong feature requirements for job-level management, such as observing job status and canceling or re-submitting a job. The job-level concept/definition is a missing piece in the Ray stack. The GC algorithm is important, but job-level resource management is also required. Without it, it is hard to provide a Ray service that many AI engineers can work on together. When a Java application exits, all the objects it created are reclaimed by the OS as well. We need a similar ability: when a job (just like an application in Java) completes its life cycle, the resources are returned.

istoica commented 6 years ago

@guoyuhong thanks for raising these issues. These are great points to discuss.

First, regarding a job API. What are the requirements? Something like Spark submit (https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-submit.html)? What else?

Second, regarding object eviction from the object store, it would be great to understand when LRU is not good enough. For what workload? (It looks to me that for the graph workload LRU should be fine, but maybe I'm missing something.)

Of course, by LRU I mean that objects will be evicted according to the time since they were last accessed (read/write). And if we have workloads for which LRU is bad, one question here is whether configuring the eviction policy per driver/job would be good enough.

Also, related to the above, we need to add the ability to spill objects over to a persistent store. If you guys have some thoughts here please let us know. Also, this would be a great task to collaborate on, or maybe for you to lead.

eric-jj commented 6 years ago

@istoica For the job API, Spark submit is a similar one. Ray should also provide an HTTP server so users can submit, view, and cancel jobs. Another important required feature around jobs is separate resource management for different users, to avoid one user occupying all the resources in a cluster. If that feature (resource isolation) is not a general one, we need an interface in the job management module that lets us apply our internal policy.

For LRU, different jobs' life cycles are different: many small jobs create many temporary variables and exit, while a few large jobs run for a long time. So the objects created by the long-running job get evicted, because their lifetime is longer than that of the objects created by the small jobs.

We did consider creating a new eviction algorithm to resolve the issue; Ray provides the interface, and that would require fewer code changes. But we couldn't find a good solution for our scenarios. LRU is a GC algorithm, and such GC algorithms are no better than Java's GC algorithm: they can't provide scoped resource management like RAII in C++ or using in C#.

If you are concerned about adding a top-level API to Ray, then with job-level support we can add a flag to submit that lets the user decide whether to recycle the objects created in the job. That is more natural for the user and does not expose the details inside the job.

Spilling objects to a persistent store is an option, but it can't resolve the issue, because many temporary variables are useless once a graph-computation job exits. Object inflation can't be avoided in graph computation. In our case, one node accesses its 300 neighbor nodes, and this is done 3 hops deep, which means 300^3 temporary variables are created. After a job exits, we don't want to keep these objects.

In our current thinking, spilling objects to a persistent store also needs job-level support: retired objects in a job would be put to the persistent store as a group, and we would not keep entries for these retired objects in the plasma store. We are also trying to apply Ray graph computation to online calculation scenarios now; performance is very sensitive there, we have a budget of less than 300ms, and large page-outs and page-ins to a persistent store are not acceptable.

imzhenyu commented 6 years ago

@istoica LRU is not enough because the storage unnecessarily caches a lot of objects, which brings significant memory/storage/computation overhead in our long-running service case with small jobs. We therefore introduced the above batch-based approach in our internal version, and it is now running in production.

However, we do find limitations with this approach: it does not allow interactions among different jobs (e.g., requests), and it does not support big jobs nicely (e.g., when the workload does not fit into total memory capacity). We are therefore considering new distributed garbage collection solutions, as mentioned in Tahoe to @robertnishihara etc. We will come back to you later when it is validated in production.

That said, we still think a batch-based approach is valuable, and it would help Ray quickly gain the capability to serve as a long-running job service.

istoica commented 6 years ago

@eric-jj and @imzhenyu thanks for the clarifications. A few things:

1) Regarding Ray submit, if anyone has cycles to submit a github issue with a short product requirement document (PRD) that would be great. We definitely need help here, especially from people like you that understand well the use case.

2) Regarding resource management would per-driver resource management be good enough? As you know, in Ray you can have multiple drivers in the same cluster, so maybe this is enough; just associate a job or a user with a driver. The second question here is what kind of policies we would need to implement: fair sharing, priority? Finally, again it would be great for someone to create a github issue and a PRD.

3) Regarding object eviction policy, there is no question that there are workloads for which LRU won't work well. However, it would be great to get more detailed examples. For instance, I'm still confused about this statement "... objects created by the long run job ... will be deleted because its' life time is longer than the objects' created by the small jobs." LRU will only evict the objects of the long jobs if they were not accessed recently; eviction is not really related to the object's lifetime. If you keep accessing the long job's objects, small jobs won't cause the eviction of these objects. Maybe in your application an object is not accessed for a long time since it was created.

Also, one question here. How useful would an API like ray.unsafe_evict(object_list) be, which forces the eviction of all objects in object_list from the object store? Additionally, we could have ray.unsafe_evict(object_list, recursive=True), which evicts not only the objects in object_list, but also all objects on which these objects depend. For instance, if obj is in object_list, then we evict not only obj, but also all "orphan" objects, i.e., objects whose descendants in the computation graph have all been removed. For example, in the following code:

xid = f.remote(a, b)
# use xid...
ray.unsafe_evict(xid, recursive=True)
# above instruction will cause not only the xid object to be evicted from object store,
# but also "a" and "b"

(I know that the above API might not be as easy to use, but certainly is very flexible.)

An alternative would be to provide an API like ray.pin(object_list), which makes sure that none of the objects in object_list will be removed by LRU.
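
For comparison, the pinning alternative would read something like this; ray.pin and a matching ray.unpin are only proposed names here, not existing APIs:

xid = f.remote(a, b)
ray.pin([xid])      # proposed: exclude xid from LRU eviction
# ... other jobs churn through the object store without evicting xid ...
result = ray.get(xid)
ray.unpin([xid])    # proposed counterpart: xid becomes evictable again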

eric-jj commented 6 years ago

@istoica I will follow up on the Ray Submit issue next week. For per-driver resource management, I have not thought about it carefully, but it should be OK for resolving our problem. For the policy question, priority is required, but there are some other features, like security checks, that probably can't be common to users outside Ant Financial. For the object eviction policy, a function like ray.unsafe_evict with recursive support should be fine for resolving our issue. I am fine with using that API.

istoica commented 6 years ago

Thanks, @eric-jj.

BTW, it would still be great to understand the workloads for which LRU doesn't work well. Maybe there is an issue with the current implementation, rather than an LRU-policy problem?

Also, in addition to evicting all objects of a job when it finishes execution, would pinning an object in the object store be good enough? Arguably pinning would be easier to implement than ray.unsafe_evict(), and also more consistent with memory management patterns in OSes: https://lwn.net/Articles/600502/

eric-jj commented 6 years ago

@istoica Regarding where LRU doesn't work: there are two types of jobs running together. Job type A runs for a long time; job type B runs for a short time and is submitted continually. While type A is still running its first job, type B may have already run 100 times. Plasma ends up fully occupied by objects created by type B, yet 99% of them are useless by then, and when new objects are created, the objects created by type A get evicted from memory. Another reason we want to control GC is that we are trying to apply Ray to the online calculation scenario I mentioned; there the total latency budget is less than 200ms, and we can't afford to do GC while the job is running. I think manually evicting all objects is more reasonable for our case, and it is similar to our current solution running in production. Pinning objects means the user also needs to un-pin objects, and it is hard to predict which objects should be pinned when they are created.

istoica commented 6 years ago

@eric-jj, thanks for the clarification. So I assume that in this case, during B's executions, A won't access its data much.

Let's then design a job API, and free resources under the hood when the job ends. This solution would require no batch API.

Here is a possible API for Python:

# @ray.job assigns (1) a job id which is propagated to all tasks/actors/objects which 
# are in the execution graph rooted at this task, (2) takes some scheduling 
# arguments (e.g., priority), (3) cleans up the objects and task table when the 
# job finishes.  
@ray.job(name = "name", sched_args = sched_arg_list, ...)
def f(...):
    # code of the job here

# if cleanup=True, clean up all state associated with the job when it finishes, including
# its objects in the object store and the associated GCS state; otherwise don't clean
# up any state.
jobId = f.job(..., cleanup=True)
# if you want to wait for finishing the job, or the job returns a result, 
# then just use ray.get()
jobRes = ray.get(jobId)

Alternatively, we can define a class/actor that also has methods to get the job progress. Maybe this is the way to go. Note that under the hood, we could use the batch implementations.

So it seems to me that the most important next steps are to define the job concept (e.g. what are its properties, e.g., name, unique id, scheduling parameters, ability to get a progress report, ...), and the API, including a submit API. Anyone would like to start a PRD on this?

eric-jj commented 6 years ago

@istoica Yes, job-level management makes more sense. We introduced the batch because it changes less code. As I mentioned before, I planned to do it step by step: implement the batch first, then build the job on top of it.
I can ask someone on our side to write the design doc for the job first; then we can finalize the design in the doc and move on to the detailed implementation.

istoica commented 6 years ago

Sounds good @eric-jj. I'll help with the PRD document. It would be great to do it over the next couple of days. Thanks.

pcmoritz commented 6 years ago

A different API I'd propose here is to use context managers for this:

Usage with the with statement:

@ray.remote
def f(something):
    # Do something

with ray.job(name="jobname"):
    x = f.remote(1)
    # Do other things

# The object "x" can now be cleaned up

with ray.job(name="jobname2"):
   y = f.remote(1)
   # Do more

Context managers can also be used without the with statement if needed, see the python documentation: https://docs.python.org/2/reference/datamodel.html#context-managers

The semantics here would be that there is one default job per driver and you can create custom ones that inherit all objects and function definitions from that default job. It is not allowed to pass objects between these inner jobs, so after the context manager is closed on the driver and all tasks from that job have finished, things can be cleaned up.
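
For the case where a with block can't be used (e.g., the scope is opened and closed in different callbacks), the same proposed ray.job object could be driven explicitly through the context-manager protocol; ray.job is still the proposed API, not an existing one:

job = ray.job(name="jobname3")
job.__enter__()                  # open the job scope
try:
    y = f.remote(1)
    # ... more work ...
finally:
    job.__exit__(None, None, None)  # close the scope; its objects can be cleaned up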

eric-jj commented 6 years ago

@pcmoritz yes, the with construct seems more reasonable for this scenario. @guoyuhong please add it to the design doc.

istoica commented 6 years ago

@eric-jj , @guoyuhong , @pcmoritz , @robertnishihara, @ericl here is a draft of the product requirements document for jobs: https://docs.google.com/document/d/1XarOe4QKKcFbHRzy-OH5gGQz25dNBl3tCWDK-9oOyEY/edit#

We should make sure that the design of the "garbage collection" is consistent with jobs. They clearly have overlapping requirements, and we don't want to have multiple APIs or implementations.

Regarding the with API, we need to make sure that it provides the developer with the ability to (1) get a unique ID of the job's execution, (2) specify whether to perform garbage collection, or not, on the job's completion, and (3) get the status of the job's execution (while this is not part of the "must" requirements for now, the question is not whether to do it but when).

raulchen commented 6 years ago

Regarding the with API, we need to make sure that it provides the developer with the ability to (1) get a unique ID of the job's execution, (2) specify whether to perform garbage collection, or not, on the job's completion, and (3) get the status of the job's execution (while this is not part of the "must" requirements for now, the question is not whether to do it but when).

@istoica These requirements can be easily implemented on top of a context manager. The code will be something like this:

with ray.job(name="name", cleanup_on_completion=True, resources={...}) as job: # we may also want to limit resources assigned to this job
    # do something here
    print(job.id(), job.status())

raulchen commented 6 years ago

Another thing to note is that this API requires a global variable to represent the current job id, and having mutable global variables is error prone when the worker program becomes concurrent. With multi-threading, the fix can be as easy as storing the global state in thread locals. But with py3 asyncio, maintaining the state correctly is very hard when multiple jobs run in multiple coroutines in the same thread. Thus, I'd recommend avoiding global state if possible.
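
For what it's worth, Python 3.7's contextvars module gives per-context state that behaves correctly for both threads and asyncio coroutines; here is a minimal sketch of tracking the current job id that way (this is only an illustration, not what Ray does):

import contextvars

# One context variable instead of a mutable module-level global.
_current_job_id = contextvars.ContextVar("ray_current_job_id", default=None)

class job_scope:
    """Sketch of a job scope whose id is visible only inside the scope,
    even with many coroutines interleaved on one thread."""

    def __init__(self, job_id):
        self._job_id = job_id

    def __enter__(self):
        self._token = _current_job_id.set(self._job_id)
        return self

    def __exit__(self, exc_type, exc, tb):
        _current_job_id.reset(self._token)

def current_job_id():
    return _current_job_id.get()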

istoica commented 6 years ago

@raulchen I'm afraid I do not follow your comment.

The ID I was referring to is associated with each job execution and is immutable.

A job is a bunch of code (e.g., a jar) that you can execute multiple times with different inputs and different arguments. The job has a name. Each job execution has a unique ID.

Job vs job execution is similar to remote function vs task, and to remote class vs actor.

raulchen commented 6 years ago

@istoica In my above code example, by with ... as job, I actually meant a running job instance (or a job execution). Actually, the term job sounds more like the dynamic running instance to me. Maybe we can use JobSpec to represent job's specification (code, priority, resources, etc) and Job to represent the running instance?

istoica commented 6 years ago

@eric-jj, I see. This naming could make sense. Maybe leave some comments in the PRD: https://docs.google.com/document/d/1XarOe4QKKcFbHRzy-OH5gGQz25dNBl3tCWDK-9oOyEY/edit# ?

eric-jj commented 6 years ago

@istoica We have added the required features to the doc. From our side there is one Must Have feature - "Ability to specify job timeout." - and two Nice to Have features - "Ability to apply a custom scheduling policy via the cluster manager." and "Ability to query the state of a job and provide an HTTP server on top of the query API."

istoica commented 6 years ago

@eric-jj, thanks. Can you please add some clarifications about the state you would like to query? Is this state defined by the job (e.g., how much data was processed, accuracy for a training job) or some generic state (e.g., how many tasks have executed)? Or both?

eric-jj commented 6 years ago

@istoica We need some general information, not the job's internal status: for example, whether the job is in the queue or has been scheduled, processing time so far, and estimated remaining time.

istoica commented 6 years ago

@eric-jj sounds good. One thing though: how can you estimate the job's remaining time? I think we should have two types of status: (1) generic (i.e., waiting, running, completed), and (2) specific status that can be provided by the developer (e.g., estimated remaining time, progress, accuracy, etc).

raulchen commented 6 years ago

@istoica I haven't thought thoroughly about what data should be in the query API. But I think monitoring is certainly a must for a multi-user system. We can keep the data structure simple and extendable in the initial implementation.

istoica commented 6 years ago

all, today we had a review of the PRD at Berkeley. The latest document reflects the feedback. Also please note a few more comments from the group. I think we are converging. Please take a look and try to address the comments that refer to your part of the proposal. Also, for the Ant Financial people please make sure that the current proposal satisfies all your needs. Thanks.

istoica commented 6 years ago

@raulchen we agree about the status. The question is how to let the user add more job-specific status information. A solution that is quite ugly is to have a generic status() API and another API, internal_status(), where the developer can provide job-specific status.

raulchen commented 6 years ago

@istoica I think this API should only focus on generic status (in system level, e.g., job state, number of tasks/objects, etc). This is used for monitoring job's healthiness and managing jobs.

It will be much more complex to also include application-level status (e.g., training accuracy), because 1) as you mentioned, we'll need to provide an API to let users define their metrics; 2) users need to tell us how to aggregate and display the metrics. For application-level metrics, it makes sense to let users collect and process the metrics by themselves.

eric-jj commented 6 years ago

@istoica We had an internal discussion at Ant Financial. We decided to separate the concepts of the batch API and the APP management API.

Users can submit a Ray APP (a .py file or a jar file) to the Ray cluster, from either the command line or an HTTP portal. The HTTP portal will be set up (running on GCS?) for submitting an APP, viewing APP status, and canceling or re-submitting an APP. Users can assign a priority to an APP (the max priority and max resource usage per user are managed by the cluster master). The scheduling module will schedule the tasks created by the APP according to the priority and resource constraints. Users can also use a custom scheduling policy. Almost all of the APP API is transparent to the user, except the submit command line.

The batch API is used to manage the objects created in a batch scope. A batch scope can also set a timeout for better task management (to avoid deadlock; this is also important for the real-time scenario). The batch scope can be implemented with with (Python) and try (Java). The resources allocated in the batch scope are evicted when the scope exits, unless the user makes them visible manually, as sketched below.
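
A possible shape for that batch scope in Python; ray.batch, the timeout_s argument, and keep() are hypothetical names for the behavior described above, not an agreed API:

# Hypothetical batch scope combining eviction-on-exit with a timeout.
with ray.batch(timeout_s=30) as batch:      # deadline for the whole batch
    ids = [f.remote() for _ in range(4)]    # f as defined earlier in the thread
    batch.keep(ids[0])                      # opt this object out of eviction
    results = ray.get(ids)
# On exit, everything created in the scope except ids[0] is evicted.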

istoica commented 6 years ago

@eric-jj Sounds good. Can you please create the PRDs for job submit and batch? To make it easy, I made two copies of the previous Jobs PRD document (please feel free to modify them so that each of them contains only the relevant information):

I'd be happy to take a pass once you are done. Please let me know.

Also, when we get to the implementation, it would be great for the resource management and scheduling for batches and jobs to share the same mechanisms/implementation or to build on top of each other. Thanks.

eric-jj commented 6 years ago

@istoica The Batch PRD is almost done. For the Job submit PRD, we are still working with the AI team; we expect to finalize it early next week.

istoica commented 6 years ago

@eric-jj the batch PRD looks really good. Just left a few comments. (Sorry, what is your Google e-mail address?)

eric-jj commented 6 years ago

@istoica I have replied to your comments, and my Gmail address is eric.jj@gamil.com

istoica commented 6 years ago

Thanks, @eric-jj. I added a few more comments.