microsoft/mup
maximal update parametrization (µP)
https://arxiv.org/abs/2203.03466
MIT License
1.37k stars, 94 forks
issues
CNN utility
#79
JeremyCCHsu
closed
1 month ago
0
How to use with SSL methods like DINOv2?
#78
josephcappadona
opened
2 months ago
0
MuP for RNNs
#77
norikazu99
opened
2 months ago
0
Not getting perf improvements from muP at ~1.5B scale
#76
gordicaleksa
opened
2 months ago
0
fix: adopt mup/Transformers API for torch2.3
#75
emergenz
opened
2 months ago
0
MuP for Mamba
#74
norikazu99
opened
4 months ago
0
Refactor: Addressing Sources of User Error
#73
thomasfortin1
opened
5 months ago
1
Support FSDP usage
#72
janEbert
opened
5 months ago
1
Increasing coord check for the network output
#71
AkshitaB
opened
5 months ago
2
mu parametrization for gated-mlp and group-query attention
#70
ftgreat
opened
6 months ago
0
Reproducing Figure 1 using 'examples/Transformer/main.py'
#69
jndean
opened
8 months ago
0
coord_check for model that returns loss function directly
#68
ad8e
opened
9 months ago
0
Reproducing the validation accuracy vs learning rates curve on ResNet
#67
liulei277
opened
9 months ago
1
Questions for training gpt-2 using mup
#66
jiangjiadi
closed
9 months ago
6
add width_mult to optimizer dict
#65
marcobellagente93
opened
11 months ago
0
About Learning rate decay
#64
afcruzs
opened
11 months ago
2
Demo notebook
#63
edwardjhu
closed
11 months ago
0
Unclear `assert_hidden_size_inf` triggers
#62
dreavjr
closed
11 months ago
1
dim_feedforward
#61
dreavjr
closed
12 months ago
0
Usage with torch.compile in Pytorch 2?
#60
dreavjr
opened
1 year ago
2
FSDP support?
#59
platers
opened
1 year ago
3
Interpreting jitter in coordcheck
#58
leenachennuru
closed
9 months ago
2
Some questions about the implementation of muP.
#57
lepodl
opened
1 year ago
0
µTransfer across batch size && weight decay setting
#56
PanYue2023
opened
1 year ago
0
_rescale_parameters() inconsistent with the paper for the tied embedding scenario?
#55
ofivite
opened
1 year ago
2
Is it possible to also scale the depth of the model?
#54
ricomnl
opened
1 year ago
5
Once the best HPs have been found, does the final model have to be trained with `mup` or can one just use the found HPs and train the model in a standard way?
#53
ricomnl
closed
1 year ago
0
Reproducing the training loss vs learning rates curve on MLP
#52
jhj0411jhj
closed
1 year ago
5
Warmup schedule when changing the number of tokens/steps (GPT-3 experiment detail)
#51
sashaDoubov
opened
1 year ago
0
Positional Embeddings should be MuReadout parameters ?
#48
codedecde
opened
1 year ago
2
Does mup support fine tuning pretrained models
#46
jhj0411jhj
closed
1 year ago
2
Embedding Multiplier for Transformer - Clarification
#45
sashaDoubov
closed
1 year ago
2
Are Sequentials with list comprehension handled incorrectly?
#43
RobertBaruch
opened
1 year ago
2
interpreting coord checks
#42
llucid-97
closed
1 year ago
2
in mlp example: 2 problems
#41
yjjinjie
opened
1 year ago
1
Questions on learning schedule and binary classification
#40
FlamingHorizon
closed
1 year ago
12
Can base model be larger than target model?
#39
jhj0411jhj
closed
1 year ago
3
coord check plot improvements
#38
TevenLeScao
closed
1 year ago
1
Allowing users to create their own shapes
#37
TevenLeScao
closed
1 year ago
0
Should query layers in self-attention be initialized to 0 in practice?
#36
wang-zerui
closed
1 year ago
2
Plot bugfix
#35
TevenLeScao
closed
1 year ago
0
fix: dtype for newer torch versions
#33
zanussbaum
closed
1 year ago
1
Proper error return in coord_check.py
#32
TevenLeScao
closed
1 year ago
1
Finetuning a Pretrained Model Using MuP
#31
zanussbaum
closed
1 year ago
3
Issue in reproducing the training loss vs learning rates curve
#30
NicolasWinckler
closed
1 year ago
5
Are parameters with no "infinite" dimensions allowed?
#29
callumm-graphcore
closed
1 year ago
5
LayerNorm Gain and Bias Multipliers
#28
AWildridge
closed
1 year ago
2
MuP Coord Check not Working with Electra Style Model
#27
zanussbaum
closed
1 year ago
8
Has MuP been tested on segmentation models?
#26
isdj
opened
1 year ago
4
Should `base=None` be used in `set_base_shapes` for model used for tuning?
#25
callumm-graphcore
opened
1 year ago
2