pytorch / torchtitan

A native PyTorch Library for large model training

BSD 3-Clause "New" or "Revised" License

1.28k stars 115 forks

issues

RoPE implementation differences

#335 rlrs closed 1 month ago
7
separate outputs

#334 wconstab closed 1 month ago
0
[checkpointing] import async checkpoint with pinned memory only when needed

#333 tianyu-l closed 1 month ago
0
try nvidia-cuda docker img, should clone faster

#332 wconstab closed 1 month ago
0
try nvidia-cuda

#331 wconstab closed 1 month ago
0
maybe this works?

#330 wconstab closed 1 month ago
0
try with faster docker

#329 wconstab closed 1 month ago
0
fix

#328 wconstab closed 1 month ago
0
Add 8gpu runner

#327 wconstab closed 1 month ago
0
fix

#326 wconstab closed 1 month ago
0
Use torch generic workflow for CI, add ssh, artifacts

#325 wconstab closed 1 month ago
0
Debug nccl hang

#324 wconstab closed 1 month ago
0
Update requirements.txt

#323 qiziAI closed 1 month ago
7
Make Transformer tolerate missing layers for PP

#322 wconstab closed 1 month ago
2
Refactor freqs_cis slice to be safer for PP

#321 wconstab closed 1 month ago
0
selective compilation - norm layers only

#320 lessw2020 opened 1 month ago
2
Add support of DDP and CompiledAutograd.

#319 fegin closed 5 days ago
0
Add Pipeline Parallel (and 2D PP+FSDP) support

#318 wconstab closed 1 month ago
1
numerical difference for SDPA between non-dtensor vs dtensor, when math attention and fp16 are used

#317 tianyu-l opened 1 month ago
1
`freqs_cis` in llama model should be a non-persistent buffer

#316 tianyu-l opened 1 month ago
0
Only include checkpoints that have .metadata written

#315 liangluofb closed 1 month ago
0
simplify embedding + first transformer block TP

#314 wanchaol closed 1 month ago
2
Implement async_checkpoint

#313 fegin closed 1 month ago
1
Question on Model Init

#312 XinDongol opened 1 month ago
7
add doc for adding custom dataset

#311 lessw2020 opened 1 month ago
0
Custom dataset for llama 3 finetuning

#310 rshah918 closed 1 month ago
2
[Feature] Add fineweb dataset

#309 viai957 closed 1 month ago
1
WIP apply PP manually

#308 wconstab closed 1 month ago
2
Converting to checkpoint.pd is not working

#307 viai957 closed 1 month ago
5
freezing some part of the model

#306 tianyu-l opened 2 months ago
0
reload existing llama checkpoints

#305 tianyu-l opened 2 months ago
10
add config option to only produce tensorboard logs on rank 0

#304 tianyu-l closed 1 month ago
0
[fused_rmsnorm] Register as a custom operator for tracing

#303 wconstab closed 2 weeks ago
8
Implement async_checkpoint

#302 fegin closed 1 month ago
0
[fused_rmsnorm] Avoid querying device inside forward

#301 wconstab closed 2 weeks ago
1
[fused_rmsnorm] Avoid conditional on dynamic stride

#300 wconstab closed 2 weeks ago
2
Renamed `bsz` to `bs` for consistency; removed dead code

#299 awgu closed 2 months ago
0
Remove unnecessary .to() inside model forward

#298 wconstab closed 2 months ago
0
turn off dynamic shape for torch.compile

#297 wanchaol closed 2 months ago
0
register fused rmsnorm as pytorch custom op

#296 tianyu-l opened 2 months ago
0
remove unnecessary install of torchtitan

#295 tianyu-l closed 2 months ago
0
[wip] differentiate Rstd vs rstd

#294 lessw2020 opened 2 months ago
0
Fix the incorrect step log for profiler after resuming from a checkpoint

#293 fegin closed 2 months ago
1
[Feature] Add gradient accumulation

#292 XinDongol opened 2 months ago
7
Make dataloader stateful?

#291 XinDongol closed 1 month ago
9
Probably shouldn't call `init_weights` in constructor of the model

#290 ad8e closed 2 months ago
4
Add periodic integration test with signal

#289 gnadathur closed 2 months ago
0
fix 3d mesh order

#288 wanchaol closed 2 months ago
0
unify data loading from HF and from disk

#287 tianyu-l closed 2 months ago
0
Wrong mesh order

#286 ad8e closed 2 months ago
1
