pytorch/torchtitan
A native PyTorch Library for large model training
BSD 3-Clause "New" or "Revised" License · 1.26k stars · 115 forks
Issues (newest first)
#435 · Modifying memory estimation options and minor changes · sanketpurandare · opened 11 hours ago · 0 comments
#434 · Question about custom cuda operators for tensor parallelism · vermouth1992 · opened 21 hours ago · 2 comments
#433 · Enable CP · fegin · opened 1 day ago · 0 comments
#432 · Add support of DDP and experimental CompiledAutograd · fegin · opened 1 day ago · 0 comments
#431 · Question about Pipeline parallelism · vermouth1992 · opened 1 day ago · 3 comments
#430 · by default disable heavy memory profiling · tianyu-l · closed 2 days ago · 3 comments
#429 · Add the option to turn on async-TP · yifuwang · closed 2 days ago · 6 comments
#428 · Add the option to turn on async-TP · yifuwang · closed 3 days ago · 0 comments
#427 · Add load from HF ckpts to FSDP model fails. · MinghaoYan · opened 3 days ago · 0 comments
#426 · Adding integration test for FSDP Memory Tracking and Estimation · sanketpurandare · closed 3 days ago · 0 comments
#425 · Adding FSDP Memory Tracking and Estimation · sanketpurandare · closed 3 days ago · 1 comment
#424 · Synced estimate.py with train.py · sanketpurandare · closed 3 days ago · 1 comment
#423 · Setting device based on local rank is not robust · awgu · closed 3 days ago · 1 comment
#422 · improve memory profiler to not to profile every iteration · tianyu-l · opened 4 days ago · 0 comments
#421 · LoRA fine-tuning weights explosion in FSDP training · MinghaoYan · opened 4 days ago · 10 comments
#420 · Llama models with custom configurations and uploading to Hugging Face · bkchang · opened 4 days ago · 1 comment
#419 · Set `record_shapes=True` for profiler · awgu · closed 4 days ago · 0 comments
#418 · Improved `repeat_kv` eager perf · awgu · closed 3 days ago · 1 comment
#417 · updates here · H-Huang · opened 4 days ago · 0 comments
#416 · WIP change to run a zero-bubble like schedule · wconstab · opened 1 week ago · 0 comments
#415 · NotImplementedError: aten::nonzero: attempted to run this operator with Meta tensors at loss.backward() · MinghaoYan · closed 1 week ago · 2 comments
#414 · whole_model for fp8 · weifengpy · closed 1 week ago · 0 comments
#413 · [DO NOT REVIEW] fsdp fp8-all-gather · weifengpy · opened 1 week ago · 0 comments
#412 · ImportError in LLaMA Training Script · viai957 · opened 1 week ago · 3 comments
#411 · Skip data loading for middle PP ranks · wconstab · closed 1 week ago · 1 comment
#410 · Adding FSDP Memory Tracking and Estimation · sanketpurandare · closed 3 days ago · 0 comments
#409 · DataLoader state is empty for different ranks ? · ahatamiz · opened 1 week ago · 1 comment
#408 · The PyTorch version is incorrect. · Doraemonzzz · closed 1 week ago · 7 comments
#407 · Some testing from me · ad8e · opened 1 week ago · 3 comments
#406 · Prepare train.py for model chunks for pipelining · wconstab · closed 1 week ago · 3 comments
#405 · Will future support include expert parallel and sequence parallel (such as ring attention)? · Doraemonzzz · closed 1 week ago · 1 comment
#404 · enable TritonFusedRMSNorm with local_map annotation · XilunWu · closed 2 weeks ago · 1 comment
#403 · Change debugmodel to have 8 layers · wconstab · closed 1 week ago · 0 comments
#402 · Break down parallelize_llama for inference cases · kwen2501 · closed 2 weeks ago · 0 comments
#401 · SAC API follow ups to restore old behavior · wanchaol · closed 2 weeks ago · 0 comments
#400 · switch to using create_selective_checkpoint_contexts · XilunWu · closed 2 weeks ago · 1 comment
#399 · How to use nsys? · vedantroy · opened 2 weeks ago · 1 comment
#398 · Cosmetic changes to train.py · kwen2501 · closed 2 weeks ago · 0 comments
#397 · Fix SAC BC breaking and renaming to ac_freq · wanchaol · closed 2 weeks ago · 0 comments
#396 · Update unit_test_cpu.yaml with cpu nightly · wanchaol · closed 2 weeks ago · 0 comments
#395 · dump memory snapshot to analyze OOMs · weifengpy · closed 1 week ago · 2 comments
#394 · benchmark perf numbers on H100 GPUs and update performance.md · tianyu-l · opened 2 weeks ago · 0 comments
#393 · enable TP fp8 allgather with PrepareFloat8ModuleInput · wanchaol · closed 2 weeks ago · 0 comments
#392 · update all toml files to use experimental section · wanchaol · closed 2 weeks ago · 0 comments
#391 · del logits=(bs, seq_len, vocab_size) to save 3.9G memory · weifengpy · closed 2 weeks ago · 1 comment
#390 · add the 8-gpu test badge and use correct links for the integration test badges · tianyu-l · closed 2 weeks ago · 0 comments
#389 · fix missing tb logs · tianyu-l · closed 2 weeks ago · 0 comments
#388 · BC fix for ManualPipelineStage import · wanchaol · closed 2 weeks ago · 0 comments
#387 · DeviceMesh BC fix · wanchaol · closed 2 weeks ago · 1 comment
#386 · Abstract out out optimizer params and update foreach calling convention · drisspg · closed 3 weeks ago · 0 comments