microsoft mup issues - Githubissues

microsoft / mup

maximal update parametrization (µP)

https://arxiv.org/abs/2203.03466

MIT License

1.37k stars, 94 forks

issues

CNN utility

#79 JeremyCCHsu closed 1 month ago
0
How to use with SSL methods like DINOv2?

#78 josephcappadona opened 2 months ago
0
MuP for RNNs

#77 norikazu99 opened 2 months ago
0
Not getting perf improvements from muP at ~1.5B scale

#76 gordicaleksa opened 2 months ago
0
fix: adopt mup/Transformers API for torch2.3

#75 emergenz opened 2 months ago
0
MuP for Mamba

#74 norikazu99 opened 4 months ago
0
Refactor: Addressing Sources of User Error

#73 thomasfortin1 opened 5 months ago
1
Support FSDP usage

#72 janEbert opened 5 months ago
1
Increasing coord check for the network output

#71 AkshitaB opened 5 months ago
2
mu parametrization for gated-mlp and group-query attention

#70 ftgreat opened 6 months ago
0
Reproducing Figure 1 using 'examples/Transformer/main.py'

#69 jndean opened 8 months ago
0
coord_check for model that returns loss function directly

#68 ad8e opened 9 months ago
0
Reproducing the validation accuracy vs learning rates curve on ResNet

#67 liulei277 opened 9 months ago
1
Questions for training gpt-2 using mup

#66 jiangjiadi closed 9 months ago
6
add width_mult to optimizer dict

#65 marcobellagente93 opened 11 months ago
0
About Learning rate decay

#64 afcruzs opened 11 months ago
2
Demo notebook

#63 edwardjhu closed 11 months ago
0
Unclear `assert_hidden_size_inf` triggers

#62 dreavjr closed 11 months ago
1
dim_feedforward

#61 dreavjr closed 12 months ago
0
Usage with torch.compile in Pytorch 2?

#60 dreavjr opened 1 year ago
2
FSDP support?

#59 platers opened 1 year ago
3
Interpreting jitter in coordcheck

#58 leenachennuru closed 9 months ago
2
Some questions about the implementation of muP.

#57 lepodl opened 1 year ago
0
µTransfer across batch size && weight decay setting

#56 PanYue2023 opened 1 year ago
0
_rescale_parameters() inconsistent with the paper for the tied embedding scenario?

#55 ofivite opened 1 year ago
2
Is it possible to also scale the depth of the model?

#54 ricomnl opened 1 year ago
5
Once the best HPs have been found, does the final model have to be trained with `mup` or can one just use the found HPs and train the model in a standard way?

#53 ricomnl closed 1 year ago
0
Reproducing the training loss vs learning rates curve on MLP

#52 jhj0411jhj closed 1 year ago
5
Warmup schedule when changing the number of tokens/steps (GPT-3 experiment detail)

#51 sashaDoubov opened 1 year ago
0
Positional Embeddings should be MuReadout parameters ?

#48 codedecde opened 1 year ago
2
Does mup support fine tuning pretrained models

#46 jhj0411jhj closed 1 year ago
2
Embedding Multiplier for Transformer - Clarification

#45 sashaDoubov closed 1 year ago
2
Are Sequentials with list comprehension handled incorrectly?

#43 RobertBaruch opened 1 year ago
2
interpreting coord checks

#42 llucid-97 closed 1 year ago
2
in mlp example: 2 problems

#41 yjjinjie opened 1 year ago
1
Questions on learning schedule and binary classification

#40 FlamingHorizon closed 1 year ago
12
Can base model be larger than target model?

#39 jhj0411jhj closed 1 year ago
3
coord check plot improvements

#38 TevenLeScao closed 1 year ago
1
Allowing users to create their own shapes

#37 TevenLeScao closed 1 year ago
0
Should query layers in self-attention be initialized to 0 in practice?

#36 wang-zerui closed 1 year ago
2
Plot bugfix

#35 TevenLeScao closed 1 year ago
0
fix: dtype for newer torch versions

#33 zanussbaum closed 1 year ago
1
Proper error return in coord_check.py

#32 TevenLeScao closed 1 year ago
1
Finetuning a Pretrained Model Using MuP

#31 zanussbaum closed 1 year ago
3
Issue in reproducing the training loss vs learning rates curve

#30 NicolasWinckler closed 1 year ago
5
Are parameters with no "infinite" dimensions allowed?

#29 callumm-graphcore closed 1 year ago
5
LayerNorm Gain and Bias Multipliers

#28 AWildridge closed 1 year ago
2
MuP Coord Check not Working with Electra Style Model

#27 zanussbaum closed 1 year ago
8
Has MuP been tested on segmentation models?

#26 isdj opened 1 year ago
4
Should `base=None` be used in `set_base_shapes` for model used for tuning?

#25 callumm-graphcore opened 1 year ago
2