issues
stas00/ml-engineering
Machine Learning Engineering Open Book
https://stasosphere.com/machine-learning/
Creative Commons Attribution Share Alike 4.0 International
10.63k stars
641 forks
[Question] `FSDP` vs `Deepspeed ZeRO3 / ZeRO++`
#66
jeromeku
opened
5 days ago
1
Change "Gbps" to "GBps" to fix a tiny typo in network.README.md
#65
Txxx926
closed
1 week ago
0
grad checkpoint tiny error
#64
baochi0212
closed
1 week ago
4
fix tiny error
#63
baochi0212
closed
1 week ago
0
slurm job array change nodes
#62
ethanhe42
closed
1 week ago
1
slurm job array change nodes
#61
ethanhe42
closed
1 week ago
1
Update GH200 MAMF
#60
yaolu
closed
3 weeks ago
1
Max Achievable TFLOP/s on H100 without warmup
#59
OrenLeung
closed
1 week ago
1
MAMF - GH200
#58
frankschae
closed
3 weeks ago
10
MAMAF + AMD debug
#56
stas00
closed
1 month ago
0
fix table
#55
152334H
closed
1 month ago
1
fix typo
#54
yaolu
closed
2 months ago
1
add citation
#53
stas00
closed
3 months ago
0
Adding another logbook (kinda)
#52
boweiliu
opened
3 months ago
2
Fix in ai-battlefield.md
#50
andy-yangz
closed
4 months ago
1
Fix incorrect Nvidia retired GPU page size mention.
#49
cf-natali
closed
4 months ago
1
Fix a couple formulas rendering.
#48
cf-natali
closed
4 months ago
1
MFU + HFU redux
#47
stas00
closed
4 months ago
2
SWIGLU: clarifications
#46
stas00
closed
4 months ago
4
Question about the right hidden dim when using SwiGLU
#45
Thytu
closed
4 months ago
3
fix bf16 <-> fp16 dtype statement
#44
stas00
closed
5 months ago
0
fix tpu v4 hbm2 bw
#43
stas00
closed
5 months ago
0
fix typo in emulate multi node
#42
Thytu
closed
5 months ago
1
Question about changing precision post training
#41
Thytu
closed
5 months ago
2
TPU v4 has 1,200GB/s of mem bandwidth and not 2,400, right?
#40
rodrigo-f-nogueira
closed
5 months ago
1
Fix broken links.
#39
cf-natali
closed
5 months ago
1
[AI battlefield] Update NVLink bandwidths to uni-directional numbers.
#38
cf-natali
closed
5 months ago
1
ML
#37
lelikdr
closed
5 months ago
0
Add num_processes and num_machines to accelerate launcher
#36
adamlin120
closed
5 months ago
1
[Network] Complete missing sentence
#35
patrickvonplaten
closed
6 months ago
1
[Network] Some typos in the README
#34
patrickvonplaten
closed
6 months ago
1
discuss the solutions to Not fully recovering spikes
#32
pengzhangzhi
closed
6 months ago
7
Update README.md in network chapter, update bandwidth info
#31
kisseternity
closed
6 months ago
1
Conflicting opinions about streaming data from cloud storage?
#30
hacobe
closed
6 months ago
2
Update ai-battlefield.md
#29
findmyway
closed
6 months ago
1
Quarto Site
#28
saforem2
closed
6 months ago
3
Fix single node networking analysis
#27
haidark
closed
6 months ago
1
Update README.md
#26
pitmonticone
closed
6 months ago
1
Reorg 2
#25
stas00
closed
6 months ago
0
Add flash attention to overview
#24
Quentin-Anthony
closed
6 months ago
1
Clarification for gradient memory in mixed precision training
#23
SumanthRH
closed
7 months ago
3
Add cookbook and model co-design refs
#22
Quentin-Anthony
closed
7 months ago
1
restructuring tools
#21
stas00
closed
7 months ago
0
pip install -r build/requirements.txt fails due to github_md_utils
#20
ebowman
closed
7 months ago
3
Fix typo in README.md
#19
nicolapace
closed
7 months ago
1
fix typo
#18
g1y5x3
closed
8 months ago
1
Update emulate-multi-node.md
#17
saforem2
closed
8 months ago
2
Fix typo
#16
pitmonticone
closed
8 months ago
1
Improve folder structure
#15
heyimjonas
closed
6 months ago
3
Update ai-battlefield.md
#14
eryk-mazus
closed
9 months ago
1