sql-machine-learning / elasticdl

Kubernetes-native Deep Learning Framework

https://elasticdl.org

MIT License

733 stars 113 forks

issues

Set backward passes per step for DistributedOptimizer

#2449 workingloong closed 3 years ago
0
The worker sends the training parameters to the master.

#2448 workingloong closed 3 years ago
0
The master only exits according to the status of workers

#2447 workingloong closed 3 years ago
0
Raise a runtime error if the allreduce operation fails

#2446 workingloong closed 3 years ago
0
The worker sends start and end messages to the master for AllReduce.

#2445 workingloong closed 3 years ago
1
Rearrange worker pod priority

#2444 skydoorkai closed 3 years ago
1
Move grpc_utils.py and log_utils.py to util folder in elasticai_api.

#2443 brightcoder01 closed 3 years ago
2
Add requirements for elasticai_api package.

#2442 brightcoder01 closed 3 years ago
0
Bump version.

#2441 brightcoder01 closed 3 years ago
0
Move DataShardService and AllReduceControllers from elasticdl package to elasticai_api package.

#2440 brightcoder01 closed 3 years ago
0
Set the max seconds to check rendezvous

#2439 workingloong closed 3 years ago
0
Separate master RPC service into master and TrainLoopMaster.

#2438 brightcoder01 closed 3 years ago
0
Remove get_model_version method from master_client because master service doesn't publish this RPC method.

#2437 brightcoder01 closed 3 years ago
0
Set backward_passes_per_step when the worker number changes.

#2436 workingloong closed 3 years ago
0
Move the proto messages about dynamic sharding into elasticai_api.proto.

#2435 brightcoder01 closed 3 years ago
0
Add an argument to disable the thread to check timeout tasks.

#2434 workingloong closed 3 years ago
0
Bump tensorflow from 2.1.2 to 2.4.0 in /elasticdl_preprocessing

#2433 dependabot[bot] closed 3 years ago
1
Bump tensorflow from 2.1.2 to 2.4.0 in /elasticdl

#2432 dependabot[bot] closed 3 years ago
1
Move the common constants into elasticai_api and add elasticai_api in the requirements of elasticdl.

#2431 brightcoder01 closed 3 years ago
0
Add up the completed global batch count.

#2430 workingloong closed 3 years ago
1
Keep the argument names in the torch optimizer the same as in the TF optimizer

#2429 workingloong closed 3 years ago
0
Convert the timeout in seconds from env to int

#2428 workingloong closed 3 years ago
0
Check whether to initialize Horovod periodically according to the timeout of Gloo.

#2427 workingloong closed 3 years ago
0
Fix the bug of not executing zero_grad in the PyTorch DistributedOptimizer

#2426 workingloong closed 3 years ago
0
Use elastic Horovod to run model locally

#2425 workingloong closed 3 years ago
0
Use the max completed time of task to check timeout tasks.

#2424 workingloong closed 3 years ago
0
Log the rank and world size when the worker initializes Horovod

#2423 workingloong closed 3 years ago
0
Initial version of elasticai_api package.

#2422 brightcoder01 closed 3 years ago
0
Add a unit test for the PyTorch DistributedOptimizer

#2421 workingloong closed 3 years ago
0
Master does not exit if there are no workers

#2420 workingloong closed 3 years ago
0
Check whether the task is a valid training task by task type

#2419 workingloong closed 3 years ago
0
Set ENV PYTHONUNBUFFERED=0 in the Dockerfile

#2418 workingloong closed 3 years ago
0
ElasticDL Package Refactor

#2417 brightcoder01 closed 3 years ago
2
Check whether the value is the name of argument or the value of argument.

#2416 workingloong closed 3 years ago
0
A pytorch example to read original images with the custom dataloader

#2415 workingloong closed 3 years ago
0
Elastic AllReduce controller for TensorFlow 1.x

#2414 workingloong closed 3 years ago
0
Refactor the SavedModelExporter using the train_end_task to create the dataset.

#2413 workingloong closed 3 years ago
0
Remove the dependency of the method calls for the component creation.

#2412 brightcoder01 closed 3 years ago
0
Move the object and method about Pod state machine into a separate file.

#2411 brightcoder01 closed 3 years ago
0
Support gpu with type in resource

#2410 workingloong closed 3 years ago
0
Add the DistributedOptimizer for TF1.x for elastic training.

#2409 brightcoder01 closed 3 years ago
0
Set pipefail for the worker command

#2408 workingloong closed 3 years ago
0
Set the default type number of task to NONE

#2407 workingloong closed 3 years ago
0
Set the priority of relaunched worker to the priority of a deleted worker.

#2406 workingloong closed 3 years ago
1
Bump horovod from 0.20.0 to 0.21.0

#2405 brightcoder01 closed 3 years ago
0
Set distributed strategy to AllReduce when PS number is 0

#2404 workingloong closed 3 years ago
0
Set worker number into the env of workers.

#2403 workingloong closed 3 years ago
0
Record the start running time stamp in PodInfo and sort the live workers according to this property for allreduce.

#2402 brightcoder01 closed 3 years ago
0
Compare the process of rank 0 selection between Horovod elastic and elasticdl

#2401 brightcoder01 closed 3 years ago
3
Set the priority of at least 1 worker to high if the priority is a fraction.

#2400 workingloong closed 3 years ago
0
