pytorch
/
torchtitan
A native PyTorch Library for large model training
BSD 3-Clause "New" or "Revised" License
1.28k
stars
115
forks
issues
RoPE implementation differences
#335
rlrs
closed
1 month ago
7
separate outputs
#334
wconstab
closed
1 month ago
0
[checkpointing] import async checkpoint with pinned memory only when needed
#333
tianyu-l
closed
1 month ago
0
try nvidia-cuda docker img, should clone faster
#332
wconstab
closed
1 month ago
0
try nvidia-cuda
#331
wconstab
closed
1 month ago
0
maybe this works?
#330
wconstab
closed
1 month ago
0
try with faster docker
#329
wconstab
closed
1 month ago
0
fix
#328
wconstab
closed
1 month ago
0
Add 8gpu runner
#327
wconstab
closed
1 month ago
0
fix
#326
wconstab
closed
1 month ago
0
Use torch generic workflow for CI, add ssh, artifacts
#325
wconstab
closed
1 month ago
0
Debug nccl hang
#324
wconstab
closed
1 month ago
0
Update requirements.txt
#323
qiziAI
closed
1 month ago
7
Make Transformer tolerate missing layers for PP
#322
wconstab
closed
1 month ago
2
Refactor freqs_cis slice to be safer for PP
#321
wconstab
closed
1 month ago
0
selective compilation - norm layers only
#320
lessw2020
opened
1 month ago
2
Add support of DDP and CompiledAutograd.
#319
fegin
closed
5 days ago
0
Add Pipeline Parallel (and 2D PP+FSDP) support
#318
wconstab
closed
1 month ago
1
numerical difference for SDPA between non-dtensor vs dtensor, when math attention and fp16 are used
#317
tianyu-l
opened
1 month ago
1
`freqs_cis` in llama model should be a non-persistent buffer
#316
tianyu-l
opened
1 month ago
0
Only include checkpoints that have .metadata written
#315
liangluofb
closed
1 month ago
0
simplify embedding + first transformer block TP
#314
wanchaol
closed
1 month ago
2
Implement async_checkpoint
#313
fegin
closed
1 month ago
1
Question on Model Init
#312
XinDongol
opened
1 month ago
7
add doc for adding custom dataset
#311
lessw2020
opened
1 month ago
0
Custom dataset for llama 3 finetuning
#310
rshah918
closed
1 month ago
2
[Feature] Add fineweb dataset
#309
viai957
closed
1 month ago
1
WIP apply PP manually
#308
wconstab
closed
1 month ago
2
Converting to checkpoint.pd is not working
#307
viai957
closed
1 month ago
5
freezing some part of the model
#306
tianyu-l
opened
2 months ago
0
reload existing llama checkpoints
#305
tianyu-l
opened
2 months ago
10
add config option to only produce tensorboard logs on rank 0
#304
tianyu-l
closed
1 month ago
0
[fused_rmsnorm] Register as a custom operator for tracing
#303
wconstab
closed
2 weeks ago
8
Implement async_checkpoint
#302
fegin
closed
1 month ago
0
[fused_rmsnorm] Avoid querying device inside forward
#301
wconstab
closed
2 weeks ago
1
[fused_rmsnorm] Avoid conditional on dynamic stride
#300
wconstab
closed
2 weeks ago
2
Renamed `bsz` to `bs` for consistency; removed dead code
#299
awgu
closed
2 months ago
0
Remove unnecessary .to() inside model forward
#298
wconstab
closed
2 months ago
0
turn off dynamic shape for torch.compile
#297
wanchaol
closed
2 months ago
0
register fused rmsnorm as pytorch custom op
#296
tianyu-l
opened
2 months ago
0
remove unnecessary install of torchtitan
#295
tianyu-l
closed
2 months ago
0
[wip] differentiate Rstd vs rstd
#294
lessw2020
opened
2 months ago
0
Fix the incorrect step log for profiler after resuming from a checkpoint
#293
fegin
closed
2 months ago
1
[Feature] Add gradient accumulation
#292
XinDongol
opened
2 months ago
7
Make dataloader stateful?
#291
XinDongol
closed
1 month ago
9
Probably shouldn't call `init_weights` in constructor of the model
#290
ad8e
closed
2 months ago
4
Add periodic integration test with signal
#289
gnadathur
closed
2 months ago
0
fix 3d mesh order
#288
wanchaol
closed
2 months ago
0
unify data loading from HF and from disk
#287
tianyu-l
closed
2 months ago
0
Wrong mesh order
#286
ad8e
closed
2 months ago
1