msr-fiddle/pipedream
MIT License
379 stars
117 forks
issues
Can you tell me how to run this code with the RTX 4090D? How should I configure it and run this patch?
#81
hnust-xxq
opened
1 month ago
0
same train_loader but got different loader size
#80
Hyaloid
closed
10 months ago
2
optimizer got an empty parameter list when rank=1
#79
Hyaloid
closed
11 months ago
1
When I was testing the pipedream code with version-updated torch, I encountered the following error (1.1.0 -> 1.11.0):
#78
lengien
opened
1 year ago
9
what is the role of pre_hook_pytorch_latest.patch?
#77
matrix97317
opened
1 year ago
1
Running in docker will give you an error that you can't find a physical address
#76
guanyonglai
opened
1 year ago
1
AttributeError: module 'torch.distributed' has no attribute 'P2POp'
#75
guanyonglai
closed
1 year ago
1
Is there any 2bw code that will run on the native GPU
#74
guanyonglai
closed
1 year ago
1
AttributeError: module 'models.resnet50.resnet50' has no attribute 'model'
#73
guanyonglai
closed
1 year ago
1
CVE-2007-4559 Patch
#72
TrellixVulnTeam
opened
2 years ago
0
Question about PipeDream's optimizer
#71
lllukehuang
opened
2 years ago
0
Question about time complexity of PipeDream-2BW's planner algorithm
#70
ConnollyLeon
opened
2 years ago
0
The arguments of self.start_helper_thread() should be more flexible instead of fixed as int64.
#69
gouchangjiang
opened
2 years ago
0
Supporting T5
#68
gperrotta
closed
1 year ago
0
modify runtime/image_classification/models/resnet50/gpus=2/__init__.py
#67
SHEELE41
opened
3 years ago
0
How is the Double-Buffered Weight Mechanism implemented?
#66
BinhangYuan
opened
3 years ago
0
Is there AllReduce in data parallelism?
#65
Allen-Czyysx
closed
3 years ago
6
GPT2 355m model convergence with 2BW training
#64
nitikasaran68
opened
3 years ago
0
The BLEU score of translation model seems abnormal. The model doesn't seem to train effectively.
#63
njuyexiangyu
opened
3 years ago
0
To run PipeDream_2BW branch without --recompute_step
#62
Shigangli
closed
3 years ago
0
Resource temporarily unavailable
#61
liulixinkerry
closed
3 years ago
0
GPU Peer2Peer communication via --num_ranks_in_server argument
#60
siddharth9820
opened
4 years ago
1
Handling uneven number of batches per replicated instance of a layer
#59
siddharth9820
opened
4 years ago
0
Running a transformer module
#58
oranichu
closed
4 years ago
0
Planner for PipeDream-2BW
#57
nict-wisdom
opened
4 years ago
5
Communication error when training with Pipedream
#56
gudiandian
closed
4 years ago
6
Hanging with [4,3,1] GPU assignment
#55
BestSonny
closed
4 years ago
0
Can the profiler handle dynamic graphs?
#54
rahul003
opened
4 years ago
0
What is the meaning of `antichain` in optimizer_graph_hierarchical.py ?
#53
sergei-mironov
opened
4 years ago
0
What's the latest version of PyTorch supported?
#52
SimonZsx
opened
4 years ago
10
How to expose the "register_pre_hook()" interface?
#51
letian-zhang
opened
4 years ago
1
Actual results did not match the optimizer expectation
#50
nirandaperera
opened
4 years ago
1
Translation demo: Division by zero
#49
sergei-mironov
opened
4 years ago
4
Translation demo: Installation instructions issue. Missing CUDA kernels.
#48
sergei-mironov
closed
4 years ago
3
Batch size and optimizer
#47
nirandaperera
closed
4 years ago
6
Is this the version used in SOSP paper?
#46
nirandaperera
closed
4 years ago
1
Infinite loop problem in convert_graph_to_model.py
#45
jayhpark530
opened
4 years ago
3
Error occurred in profiling
#44
gudiandian
closed
4 years ago
3
Some error about communication
#43
jglicat
closed
10 months ago
4
docker pull error
#42
cnzhanj
closed
4 years ago
1
I can't get the same result as your model after segmentation
#41
ADAM-CT
opened
4 years ago
2
What's the role of optimizer/inference_optimizer_graph.py?
#40
ADAM-CT
closed
4 years ago
1
All ranks are not trained. They are blocked all the time
#39
ADAM-CT
opened
4 years ago
3
"stage_to_depth_map" not found
#38
ADAM-CT
closed
4 years ago
2
How to determine replication factors
#37
ADAM-CT
opened
4 years ago
4
Multi node training
#36
ADAM-CT
opened
4 years ago
6
bandwidth parameter
#35
ADAM-CT
opened
4 years ago
2
Multi-machine distribution problem
#34
ADAM-CT
opened
4 years ago
6
I would like to know what the role of the following driver.py is?
#33
ADAM-CT
opened
4 years ago
2
RuntimeError: [enforce fail at ../third_party/gloo/gloo/transport/tcp/device.cc:127] rp != nullptr. Unable to find address for: dgx-1.ai
#32
ADAM-CT
closed
4 years ago
3