sql-machine-learning
/
elasticdl
Kubernetes-native Deep Learning Framework
https://elasticdl.org
MIT License
733
stars
113
forks
issues
Set backward passes per step for DistributedOptimizer
#2449
workingloong
closed
3 years ago
0
The worker sends the training parameters to the master.
#2448
workingloong
closed
3 years ago
0
The master only exits according to the status of workers
#2447
workingloong
closed
3 years ago
0
Raise a runtime error if the allreduce operation fails
#2446
workingloong
closed
3 years ago
0
The worker sends start and end message to the master for AllReduce.
#2445
workingloong
closed
3 years ago
1
Rearrange worker pod priority
#2444
skydoorkai
closed
3 years ago
1
Move grpc_utils.py and log_utils.py to util folder in elasticai_api.
#2443
brightcoder01
closed
3 years ago
2
Add requirements for elasticai_api package.
#2442
brightcoder01
closed
3 years ago
0
Bump version.
#2441
brightcoder01
closed
3 years ago
0
Move DataShardService and AllReduceControllers from elasticdl package to elasticai_api package.
#2440
brightcoder01
closed
3 years ago
0
Set the max seconds to check rendezvous
#2439
workingloong
closed
3 years ago
0
Separate master RPC service into master and TrainLoopMaster.
#2438
brightcoder01
closed
3 years ago
0
Remove get_model_version method from master_client because master service doesn't publish this RPC method.
#2437
brightcoder01
closed
3 years ago
0
Set backward_passes_per_step when the worker number changes.
#2436
workingloong
closed
3 years ago
0
Move the proto messages about dynamic sharding into elasticai_api.proto.
#2435
brightcoder01
closed
3 years ago
0
Add an argument to disable the thread to check timeout tasks.
#2434
workingloong
closed
3 years ago
0
Bump tensorflow from 2.1.2 to 2.4.0 in /elasticdl_preprocessing
#2433
dependabot[bot]
closed
3 years ago
1
Bump tensorflow from 2.1.2 to 2.4.0 in /elasticdl
#2432
dependabot[bot]
closed
3 years ago
1
Move the common constants into elasticai_api and add elasticai_api in the requirements of elasticdl.
#2431
brightcoder01
closed
3 years ago
0
Add up the completed global batch count.
#2430
workingloong
closed
3 years ago
1
Keep the name of args in torch optimizer same as TF opt
#2429
workingloong
closed
3 years ago
0
Convert the timeout in seconds from env to int
#2428
workingloong
closed
3 years ago
0
Check whether to initialize Horovod periodically according to the timeout of Gloo.
#2427
workingloong
closed
3 years ago
0
Fix the bug where zero_grad is not executed in the PyTorch DistributedOptimizer
#2426
workingloong
closed
3 years ago
0
Use elastic Horovod to run model locally
#2425
workingloong
closed
3 years ago
0
Use the max completed time of task to check timeout tasks.
#2424
workingloong
closed
3 years ago
0
Log the rank and world size when the worker initializes Horovod
#2423
workingloong
closed
3 years ago
0
Initial version of elasticai_api package.
#2422
brightcoder01
closed
3 years ago
0
Add a unit test for PyTorch DistributedOptimizer
#2421
workingloong
closed
3 years ago
0
Master does not exit if there are no workers
#2420
workingloong
closed
3 years ago
0
Check whether the task is a valid training task by task type
#2419
workingloong
closed
3 years ago
0
Set ENV PYTHONUNBUFFERED=0 in the Dockerfile
#2418
workingloong
closed
3 years ago
0
ElasticDL Package Refactor
#2417
brightcoder01
closed
3 years ago
2
Check whether the value is the name of argument or the value of argument.
#2416
workingloong
closed
3 years ago
0
A pytorch example to read original images with the custom dataloader
#2415
workingloong
closed
3 years ago
0
Elastic AllReduce controller for TensorFlow 1.x
#2414
workingloong
closed
3 years ago
0
Refactor the SavedModelExporter using the train_end_task to create the dataset.
#2413
workingloong
closed
3 years ago
0
Remove the dependency of the method calls for the component creation.
#2412
brightcoder01
closed
3 years ago
0
Move the object and method about Pod state machine into a separate file.
#2411
brightcoder01
closed
3 years ago
0
Support gpu with type in resource
#2410
workingloong
closed
3 years ago
0
Add the DistributedOptimizer for TF1.x for elastic training.
#2409
brightcoder01
closed
3 years ago
0
Set pipefail for the worker command
#2408
workingloong
closed
3 years ago
0
Set the default type number of task to NONE
#2407
workingloong
closed
3 years ago
0
Set the priority of relaunched worker to the priority of a deleted worker.
#2406
workingloong
closed
3 years ago
1
Bump horovod from 0.20.0 to 0.21.0
#2405
brightcoder01
closed
3 years ago
0
Set distributed strategy to AllReduce when PS number is 0
#2404
workingloong
closed
3 years ago
0
Set worker number into the env of workers.
#2403
workingloong
closed
3 years ago
0
Record the start running time stamp in PodInfo and sort the live workers according to this property for allreduce.
#2402
brightcoder01
closed
3 years ago
0
Compare the process of rank 0 selection between Horovod elastic and elasticdl
#2401
brightcoder01
closed
3 years ago
3
Set the priority of at least 1 worker to high if the priority is a fraction.
#2400
workingloong
closed
3 years ago
0