nkotak / 1.58BitNet

Experimental BitNet Implementation
GNU General Public License v3.0

7B+ failure reproduced #1

Closed: complexinteractive closed this issue 7 months ago

complexinteractive commented 8 months ago

In the Reddit thread for this project (https://www.reddit.com/r/LocalLLaMA/comments/1bjjywn/helpserious_discussion_i_tried_my_hand_at_a_158/), @nkotak mentioned being unable to create a 7B "blank" model using new-model-architecture-creation.py. I have also been unable to do so. My attempt returned the following error after ~2 hours. I have included only the very tail end of the log, as it is quite verbose.

Layer 0 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 0 input layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 0 self-attention query projection weight mean: tensor(6.3444e-07, grad_fn=<MeanBackward0>) Layer 0 self-attention key projection weight mean: tensor(-5.1990e-06, grad_fn=<MeanBackward0>) Layer 0 self-attention value projection weight mean: tensor(8.0833e-06, grad_fn=<MeanBackward0>) Layer 0 self-attention output projection weight mean: tensor(3.9145e-06, grad_fn=<MeanBackward0>) Layer 0 post-attention layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 0 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 0 MLP gate projection weight mean: tensor(-7.9704e-07, grad_fn=<MeanBackward0>) Layer 0 MLP down projection weight mean: tensor(1.1298e-06, grad_fn=<MeanBackward0>) Layer 0 MLP up projection weight mean: tensor(3.0832e-06, grad_fn=<MeanBackward0>) Layer 1 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 1 input layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 1 self-attention query projection weight mean: tensor(-7.7646e-06, grad_fn=<MeanBackward0>) Layer 1 self-attention key projection weight mean: tensor(7.4275e-06, grad_fn=<MeanBackward0>) Layer 1 self-attention value projection weight mean: tensor(3.4942e-06, grad_fn=<MeanBackward0>) Layer 1 self-attention output projection weight mean: tensor(5.8609e-06, grad_fn=<MeanBackward0>) Layer 1 post-attention layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 1 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 1 MLP gate projection weight mean: tensor(-6.9366e-07, grad_fn=<MeanBackward0>) Layer 1 MLP down projection weight mean: tensor(-2.6196e-06, grad_fn=<MeanBackward0>) Layer 1 MLP up projection weight mean: tensor(2.4315e-06, grad_fn=<MeanBackward0>) Layer 2 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 2 input layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 2 self-attention query projection weight mean: tensor(3.9997e-07, grad_fn=<MeanBackward0>) Layer 2 self-attention key projection weight mean: tensor(8.1852e-08, grad_fn=<MeanBackward0>) Layer 2 self-attention value projection weight mean: tensor(2.2416e-06, grad_fn=<MeanBackward0>) Layer 2 self-attention output projection weight mean: tensor(-6.1170e-06, grad_fn=<MeanBackward0>) Layer 2 post-attention layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 2 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 2 MLP gate projection weight mean: tensor(2.4827e-07, grad_fn=<MeanBackward0>) Layer 2 MLP down projection weight mean: tensor(3.0198e-06, grad_fn=<MeanBackward0>) Layer 2 MLP up projection weight mean: tensor(5.1411e-06, grad_fn=<MeanBackward0>) Layer 3 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 3 input layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 3 self-attention query projection weight mean: tensor(-2.3415e-06, grad_fn=<MeanBackward0>) Layer 3 self-attention key projection weight mean: tensor(8.4680e-06, grad_fn=<MeanBackward0>) Layer 3 self-attention value projection weight mean: tensor(4.9574e-06, grad_fn=<MeanBackward0>) Layer 3 self-attention output projection weight mean: tensor(2.0827e-07, grad_fn=<MeanBackward0>) Layer 3 post-attention layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 3 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 3 MLP gate 
projection weight mean: tensor(1.2293e-06, grad_fn=<MeanBackward0>) Layer 3 MLP down projection weight mean: tensor(-1.3864e-06, grad_fn=<MeanBackward0>) Layer 3 MLP up projection weight mean: tensor(-7.5944e-06, grad_fn=<MeanBackward0>) Output layer norm weight mean: tensor(1., grad_fn=<MeanBackward0>) Output layer norm bias mean: tensor(0., grad_fn=<MeanBackward0>) Language model head weight mean: tensor(2.5627e-06, grad_fn=<MeanBackward0>) Embedding layer weight mean: tensor(2.7076e-05, grad_fn=<MeanBackward0>) Embedding layer weight std: tensor(1.0001, grad_fn=<StdBackward0>) Layer 0 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 0 input layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 0 self-attention query projection weight mean: tensor(8.0365e-06, grad_fn=<MeanBackward0>) Layer 0 self-attention key projection weight mean: tensor(-9.8522e-06, grad_fn=<MeanBackward0>) Layer 0 self-attention value projection weight mean: tensor(-1.7996e-07, grad_fn=<MeanBackward0>) Layer 0 self-attention output projection weight mean: tensor(7.9116e-06, grad_fn=<MeanBackward0>) Layer 0 post-attention layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 0 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 0 MLP gate projection weight mean: tensor(9.4589e-07, grad_fn=<MeanBackward0>) Layer 0 MLP down projection weight mean: tensor(-1.2719e-06, grad_fn=<MeanBackward0>) Layer 0 MLP up projection weight mean: tensor(7.8258e-09, grad_fn=<MeanBackward0>) Layer 1 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 1 input layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 1 self-attention query projection weight mean: tensor(1.2310e-05, grad_fn=<MeanBackward0>) Layer 1 self-attention key projection weight mean: tensor(-2.0548e-06, grad_fn=<MeanBackward0>) Layer 1 self-attention value projection weight mean: tensor(5.0780e-06, grad_fn=<MeanBackward0>) Layer 1 self-attention output projection weight mean: tensor(7.3589e-07, grad_fn=<MeanBackward0>) Layer 1 post-attention layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 1 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 1 MLP gate projection weight mean: tensor(-3.8465e-06, grad_fn=<MeanBackward0>) Layer 1 MLP down projection weight mean: tensor(4.8067e-07, grad_fn=<MeanBackward0>) Layer 1 MLP up projection weight mean: tensor(2.8580e-06, grad_fn=<MeanBackward0>) Layer 2 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 2 input layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 2 self-attention query projection weight mean: tensor(-9.5641e-06, grad_fn=<MeanBackward0>) Layer 2 self-attention key projection weight mean: tensor(-3.7096e-06, grad_fn=<MeanBackward0>) Layer 2 self-attention value projection weight mean: tensor(-1.5798e-06, grad_fn=<MeanBackward0>) Layer 2 self-attention output projection weight mean: tensor(9.3809e-06, grad_fn=<MeanBackward0>) Layer 2 post-attention layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 2 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 2 MLP gate projection weight mean: tensor(2.6255e-06, grad_fn=<MeanBackward0>) Layer 2 MLP down projection weight mean: tensor(-3.6297e-08, grad_fn=<MeanBackward0>) Layer 2 MLP up projection weight mean: tensor(-2.8717e-06, grad_fn=<MeanBackward0>) Layer 3 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 3 input layernorm bias mean: tensor(0., 
grad_fn=<MeanBackward0>) Layer 3 self-attention query projection weight mean: tensor(8.4573e-06, grad_fn=<MeanBackward0>) Layer 3 self-attention key projection weight mean: tensor(-3.2043e-06, grad_fn=<MeanBackward0>) Layer 3 self-attention value projection weight mean: tensor(-1.0644e-06, grad_fn=<MeanBackward0>) Layer 3 self-attention output projection weight mean: tensor(-1.8972e-06, grad_fn=<MeanBackward0>) Layer 3 post-attention layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 3 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 3 MLP gate projection weight mean: tensor(-1.8230e-06, grad_fn=<MeanBackward0>) Layer 3 MLP down projection weight mean: tensor(1.9104e-06, grad_fn=<MeanBackward0>) Layer 3 MLP up projection weight mean: tensor(2.1429e-07, grad_fn=<MeanBackward0>) Output layer norm weight mean: tensor(1., grad_fn=<MeanBackward0>) Output layer norm bias mean: tensor(0., grad_fn=<MeanBackward0>) Language model head weight mean: tensor(1.5264e-07, grad_fn=<MeanBackward0>) Embedding layer weight mean: tensor(0.0002, grad_fn=<MeanBackward0>) Embedding layer weight std: tensor(1.0001, grad_fn=<StdBackward0>) Layer 0 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 0 input layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 0 self-attention query projection weight mean: tensor(-6.3789e-06, grad_fn=<MeanBackward0>) Layer 0 self-attention key projection weight mean: tensor(-1.9288e-05, grad_fn=<MeanBackward0>) Layer 0 self-attention value projection weight mean: tensor(-1.9442e-05, grad_fn=<MeanBackward0>) Layer 0 self-attention output projection weight mean: tensor(9.3770e-06, grad_fn=<MeanBackward0>) Layer 0 post-attention layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 0 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 0 MLP gate projection weight mean: tensor(-4.3444e-06, grad_fn=<MeanBackward0>) Layer 0 MLP down projection weight mean: tensor(-2.5591e-06, grad_fn=<MeanBackward0>) Layer 0 MLP up projection weight mean: tensor(7.5069e-06, grad_fn=<MeanBackward0>) Layer 1 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 1 input layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 1 self-attention query projection weight mean: tensor(4.2418e-06, grad_fn=<MeanBackward0>) Layer 1 self-attention key projection weight mean: tensor(6.3533e-06, grad_fn=<MeanBackward0>) Layer 1 self-attention value projection weight mean: tensor(-6.1645e-06, grad_fn=<MeanBackward0>) Layer 1 self-attention output projection weight mean: tensor(1.8911e-05, grad_fn=<MeanBackward0>) Layer 1 post-attention layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 1 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 1 MLP gate projection weight mean: tensor(1.1490e-05, grad_fn=<MeanBackward0>) Layer 1 MLP down projection weight mean: tensor(7.2251e-06, grad_fn=<MeanBackward0>) Layer 1 MLP up projection weight mean: tensor(-1.2211e-05, grad_fn=<MeanBackward0>) Layer 2 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 2 input layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 2 self-attention query projection weight mean: tensor(-1.8986e-06, grad_fn=<MeanBackward0>) Layer 2 self-attention key projection weight mean: tensor(-1.5562e-05, grad_fn=<MeanBackward0>) Layer 2 self-attention value projection weight mean: tensor(5.1529e-06, grad_fn=<MeanBackward0>) Layer 2 self-attention output projection 
weight mean: tensor(-4.2312e-06, grad_fn=<MeanBackward0>) Layer 2 post-attention layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 2 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 2 MLP gate projection weight mean: tensor(1.1323e-05, grad_fn=<MeanBackward0>) Layer 2 MLP down projection weight mean: tensor(-1.9680e-06, grad_fn=<MeanBackward0>) Layer 2 MLP up projection weight mean: tensor(6.5049e-08, grad_fn=<MeanBackward0>) Output layer norm weight mean: tensor(1., grad_fn=<MeanBackward0>) Output layer norm bias mean: tensor(0., grad_fn=<MeanBackward0>) Language model head weight mean: tensor(-2.1981e-09, grad_fn=<MeanBackward0>) Embedding layer weight mean: tensor(0.0001, grad_fn=<MeanBackward0>) Embedding layer weight std: tensor(1.0000, grad_fn=<StdBackward0>) Layer 0 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 0 input layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 0 self-attention query projection weight mean: tensor(1.3296e-06, grad_fn=<MeanBackward0>) Layer 0 self-attention key projection weight mean: tensor(-2.5620e-07, grad_fn=<MeanBackward0>) Layer 0 self-attention value projection weight mean: tensor(-9.4922e-07, grad_fn=<MeanBackward0>) Layer 0 self-attention output projection weight mean: tensor(-1.5001e-05, grad_fn=<MeanBackward0>) Layer 0 post-attention layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 0 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 0 MLP gate projection weight mean: tensor(3.1547e-06, grad_fn=<MeanBackward0>) Layer 0 MLP down projection weight mean: tensor(2.8242e-08, grad_fn=<MeanBackward0>) Layer 0 MLP up projection weight mean: tensor(-5.6926e-06, grad_fn=<MeanBackward0>) Layer 1 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 1 input layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 1 self-attention query projection weight mean: tensor(1.3623e-05, grad_fn=<MeanBackward0>) Layer 1 self-attention key projection weight mean: tensor(1.8263e-06, grad_fn=<MeanBackward0>) Layer 1 self-attention value projection weight mean: tensor(-4.5819e-06, grad_fn=<MeanBackward0>) Layer 1 self-attention output projection weight mean: tensor(-3.7219e-06, grad_fn=<MeanBackward0>) Layer 1 post-attention layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 1 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 1 MLP gate projection weight mean: tensor(2.1893e-06, grad_fn=<MeanBackward0>) Layer 1 MLP down projection weight mean: tensor(2.6225e-06, grad_fn=<MeanBackward0>) Layer 1 MLP up projection weight mean: tensor(5.5887e-06, grad_fn=<MeanBackward0>) Layer 2 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 2 input layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 2 self-attention query projection weight mean: tensor(9.1384e-06, grad_fn=<MeanBackward0>) Layer 2 self-attention key projection weight mean: tensor(-2.2934e-06, grad_fn=<MeanBackward0>) Layer 2 self-attention value projection weight mean: tensor(1.6383e-05, grad_fn=<MeanBackward0>) Layer 2 self-attention output projection weight mean: tensor(1.4233e-06, grad_fn=<MeanBackward0>) Layer 2 post-attention layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 2 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 2 MLP gate projection weight mean: tensor(2.1311e-06, grad_fn=<MeanBackward0>) Layer 2 MLP down projection weight mean: 
tensor(-1.0960e-06, grad_fn=<MeanBackward0>) Layer 2 MLP up projection weight mean: tensor(8.1361e-06, grad_fn=<MeanBackward0>) Output layer norm weight mean: tensor(1., grad_fn=<MeanBackward0>) Output layer norm bias mean: tensor(0., grad_fn=<MeanBackward0>) Language model head weight mean: tensor(-2.6515e-06, grad_fn=<MeanBackward0>) Embedding layer weight mean: tensor(0.0002, grad_fn=<MeanBackward0>) Embedding layer weight std: tensor(0.9998, grad_fn=<StdBackward0>) Layer 0 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 0 input layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 0 self-attention query projection weight mean: tensor(-1.0402e-05, grad_fn=<MeanBackward0>) Layer 0 self-attention key projection weight mean: tensor(-2.8146e-07, grad_fn=<MeanBackward0>) Layer 0 self-attention value projection weight mean: tensor(1.0754e-05, grad_fn=<MeanBackward0>) Layer 0 self-attention output projection weight mean: tensor(3.0400e-06, grad_fn=<MeanBackward0>) Layer 0 post-attention layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 0 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 0 MLP gate projection weight mean: tensor(1.0289e-05, grad_fn=<MeanBackward0>) Layer 0 MLP down projection weight mean: tensor(6.1278e-07, grad_fn=<MeanBackward0>) Layer 0 MLP up projection weight mean: tensor(3.1685e-06, grad_fn=<MeanBackward0>) Layer 1 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 1 input layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 1 self-attention query projection weight mean: tensor(1.9902e-06, grad_fn=<MeanBackward0>) Layer 1 self-attention key projection weight mean: tensor(-6.3661e-06, grad_fn=<MeanBackward0>) Layer 1 self-attention value projection weight mean: tensor(-4.2757e-06, grad_fn=<MeanBackward0>) Layer 1 self-attention output projection weight mean: tensor(-7.4312e-06, grad_fn=<MeanBackward0>) Layer 1 post-attention layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 1 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 1 MLP gate projection weight mean: tensor(2.7734e-06, grad_fn=<MeanBackward0>) Layer 1 MLP down projection weight mean: tensor(-7.5877e-07, grad_fn=<MeanBackward0>) Layer 1 MLP up projection weight mean: tensor(-3.2732e-06, grad_fn=<MeanBackward0>) Layer 2 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 2 input layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 2 self-attention query projection weight mean: tensor(-2.3130e-07, grad_fn=<MeanBackward0>) Layer 2 self-attention key projection weight mean: tensor(1.1226e-05, grad_fn=<MeanBackward0>) Layer 2 self-attention value projection weight mean: tensor(9.7487e-06, grad_fn=<MeanBackward0>) Layer 2 self-attention output projection weight mean: tensor(3.8772e-06, grad_fn=<MeanBackward0>) Layer 2 post-attention layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 2 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 2 MLP gate projection weight mean: tensor(3.8567e-06, grad_fn=<MeanBackward0>) Layer 2 MLP down projection weight mean: tensor(-1.8620e-07, grad_fn=<MeanBackward0>) Layer 2 MLP up projection weight mean: tensor(-1.7808e-06, grad_fn=<MeanBackward0>) Output layer norm weight mean: tensor(1., grad_fn=<MeanBackward0>) Output layer norm bias mean: tensor(0., grad_fn=<MeanBackward0>) Language model head weight mean: tensor(2.7037e-06, grad_fn=<MeanBackward0>) Embedding layer weight 
mean: tensor(-8.5091e-06, grad_fn=<MeanBackward0>) Embedding layer weight std: tensor(0.9999, grad_fn=<StdBackward0>) Layer 0 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 0 input layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 0 self-attention query projection weight mean: tensor(6.8064e-06, grad_fn=<MeanBackward0>) Layer 0 self-attention key projection weight mean: tensor(-1.7564e-08, grad_fn=<MeanBackward0>) Layer 0 self-attention value projection weight mean: tensor(1.1495e-05, grad_fn=<MeanBackward0>) Layer 0 self-attention output projection weight mean: tensor(2.6604e-06, grad_fn=<MeanBackward0>) Layer 0 post-attention layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 0 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 0 MLP gate projection weight mean: tensor(4.8161e-06, grad_fn=<MeanBackward0>) Layer 0 MLP down projection weight mean: tensor(-6.3143e-07, grad_fn=<MeanBackward0>) Layer 0 MLP up projection weight mean: tensor(-1.2893e-08, grad_fn=<MeanBackward0>) Layer 1 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 1 input layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 1 self-attention query projection weight mean: tensor(-8.1713e-06, grad_fn=<MeanBackward0>) Layer 1 self-attention key projection weight mean: tensor(4.5801e-06, grad_fn=<MeanBackward0>) Layer 1 self-attention value projection weight mean: tensor(-2.5301e-06, grad_fn=<MeanBackward0>) Layer 1 self-attention output projection weight mean: tensor(-3.1184e-06, grad_fn=<MeanBackward0>) Layer 1 post-attention layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 1 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 1 MLP gate projection weight mean: tensor(8.3843e-07, grad_fn=<MeanBackward0>) Layer 1 MLP down projection weight mean: tensor(1.3369e-06, grad_fn=<MeanBackward0>) Layer 1 MLP up projection weight mean: tensor(1.7616e-06, grad_fn=<MeanBackward0>) Layer 2 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 2 input layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 2 self-attention query projection weight mean: tensor(-2.6783e-06, grad_fn=<MeanBackward0>) Layer 2 self-attention key projection weight mean: tensor(6.8324e-06, grad_fn=<MeanBackward0>) Layer 2 self-attention value projection weight mean: tensor(9.7525e-07, grad_fn=<MeanBackward0>) Layer 2 self-attention output projection weight mean: tensor(4.4364e-06, grad_fn=<MeanBackward0>) Layer 2 post-attention layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 2 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 2 MLP gate projection weight mean: tensor(1.8683e-06, grad_fn=<MeanBackward0>) Layer 2 MLP down projection weight mean: tensor(1.0083e-06, grad_fn=<MeanBackward0>) Layer 2 MLP up projection weight mean: tensor(2.4835e-06, grad_fn=<MeanBackward0>) Output layer norm weight mean: tensor(1., grad_fn=<MeanBackward0>) Output layer norm bias mean: tensor(0., grad_fn=<MeanBackward0>) Language model head weight mean: tensor(-4.3758e-06, grad_fn=<MeanBackward0>) Embedding layer weight mean: tensor(8.0567e-07, grad_fn=<MeanBackward0>) Embedding layer weight std: tensor(1.0000, grad_fn=<StdBackward0>) Layer 0 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 0 input layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 0 self-attention query projection weight mean: tensor(-1.9968e-06, grad_fn=<MeanBackward0>) 
Layer 0 self-attention key projection weight mean: tensor(8.5208e-06, grad_fn=<MeanBackward0>) Layer 0 self-attention value projection weight mean: tensor(-1.9221e-06, grad_fn=<MeanBackward0>) Layer 0 self-attention output projection weight mean: tensor(-5.1271e-06, grad_fn=<MeanBackward0>) Layer 0 post-attention layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 0 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 0 MLP gate projection weight mean: tensor(2.7599e-06, grad_fn=<MeanBackward0>) Layer 0 MLP down projection weight mean: tensor(-1.1643e-07, grad_fn=<MeanBackward0>) Layer 0 MLP up projection weight mean: tensor(1.4720e-06, grad_fn=<MeanBackward0>) Layer 1 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 1 input layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 1 self-attention query projection weight mean: tensor(-1.0681e-06, grad_fn=<MeanBackward0>) Layer 1 self-attention key projection weight mean: tensor(-3.1339e-06, grad_fn=<MeanBackward0>) Layer 1 self-attention value projection weight mean: tensor(-2.6497e-06, grad_fn=<MeanBackward0>) Layer 1 self-attention output projection weight mean: tensor(4.9184e-06, grad_fn=<MeanBackward0>) Layer 1 post-attention layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 1 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 1 MLP gate projection weight mean: tensor(-1.3057e-06, grad_fn=<MeanBackward0>) Layer 1 MLP down projection weight mean: tensor(-4.9504e-07, grad_fn=<MeanBackward0>) Layer 1 MLP up projection weight mean: tensor(9.4933e-07, grad_fn=<MeanBackward0>) Layer 2 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 2 input layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 2 self-attention query projection weight mean: tensor(3.2148e-06, grad_fn=<MeanBackward0>) Layer 2 self-attention key projection weight mean: tensor(-3.6392e-06, grad_fn=<MeanBackward0>) Layer 2 self-attention value projection weight mean: tensor(-1.1596e-05, grad_fn=<MeanBackward0>) Layer 2 self-attention output projection weight mean: tensor(5.1346e-06, grad_fn=<MeanBackward0>) Layer 2 post-attention layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 2 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 2 MLP gate projection weight mean: tensor(3.2044e-06, grad_fn=<MeanBackward0>) Layer 2 MLP down projection weight mean: tensor(1.7624e-06, grad_fn=<MeanBackward0>) Layer 2 MLP up projection weight mean: tensor(3.4363e-06, grad_fn=<MeanBackward0>) Output layer norm weight mean: tensor(1., grad_fn=<MeanBackward0>) Output layer norm bias mean: tensor(0., grad_fn=<MeanBackward0>) Language model head weight mean: tensor(1.0791e-06, grad_fn=<MeanBackward0>) Embedding layer weight mean: tensor(-0.0002, grad_fn=<MeanBackward0>) Embedding layer weight std: tensor(1.0000, grad_fn=<StdBackward0>) Layer 0 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 0 input layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 0 self-attention query projection weight mean: tensor(-1.3601e-05, grad_fn=<MeanBackward0>) Layer 0 self-attention key projection weight mean: tensor(1.0939e-05, grad_fn=<MeanBackward0>) Layer 0 self-attention value projection weight mean: tensor(4.7092e-06, grad_fn=<MeanBackward0>) Layer 0 self-attention output projection weight mean: tensor(-7.1269e-06, grad_fn=<MeanBackward0>) Layer 0 post-attention layernorm weight mean: tensor(1., 
grad_fn=<MeanBackward0>) Layer 0 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 0 MLP gate projection weight mean: tensor(-1.3362e-07, grad_fn=<MeanBackward0>) Layer 0 MLP down projection weight mean: tensor(6.9651e-07, grad_fn=<MeanBackward0>) Layer 0 MLP up projection weight mean: tensor(1.1492e-05, grad_fn=<MeanBackward0>) Layer 1 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 1 input layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 1 self-attention query projection weight mean: tensor(-8.6247e-07, grad_fn=<MeanBackward0>) Layer 1 self-attention key projection weight mean: tensor(1.3821e-05, grad_fn=<MeanBackward0>) Layer 1 self-attention value projection weight mean: tensor(-8.9459e-06, grad_fn=<MeanBackward0>) Layer 1 self-attention output projection weight mean: tensor(-8.3580e-06, grad_fn=<MeanBackward0>) Layer 1 post-attention layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 1 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 1 MLP gate projection weight mean: tensor(1.0843e-05, grad_fn=<MeanBackward0>) Layer 1 MLP down projection weight mean: tensor(1.9032e-06, grad_fn=<MeanBackward0>) Layer 1 MLP up projection weight mean: tensor(-9.3109e-06, grad_fn=<MeanBackward0>) Output layer norm weight mean: tensor(1., grad_fn=<MeanBackward0>) Output layer norm bias mean: tensor(0., grad_fn=<MeanBackward0>) Language model head weight mean: tensor(-3.5536e-06, grad_fn=<MeanBackward0>) Embedding layer weight mean: tensor(0.0003, grad_fn=<MeanBackward0>) Embedding layer weight std: tensor(1.0000, grad_fn=<StdBackward0>) Layer 0 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 0 input layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 0 self-attention query projection weight mean: tensor(2.2450e-07, grad_fn=<MeanBackward0>) Layer 0 self-attention key projection weight mean: tensor(9.5589e-06, grad_fn=<MeanBackward0>) Layer 0 self-attention value projection weight mean: tensor(2.8028e-06, grad_fn=<MeanBackward0>) Layer 0 self-attention output projection weight mean: tensor(-6.9039e-07, grad_fn=<MeanBackward0>) Layer 0 post-attention layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 0 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 0 MLP gate projection weight mean: tensor(-7.4119e-06, grad_fn=<MeanBackward0>) Layer 0 MLP down projection weight mean: tensor(1.4634e-06, grad_fn=<MeanBackward0>) Layer 0 MLP up projection weight mean: tensor(-7.5722e-06, grad_fn=<MeanBackward0>) Layer 1 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 1 input layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 1 self-attention query projection weight mean: tensor(3.9661e-06, grad_fn=<MeanBackward0>) Layer 1 self-attention key projection weight mean: tensor(5.4944e-06, grad_fn=<MeanBackward0>) Layer 1 self-attention value projection weight mean: tensor(1.2131e-05, grad_fn=<MeanBackward0>) Layer 1 self-attention output projection weight mean: tensor(9.8214e-06, grad_fn=<MeanBackward0>) Layer 1 post-attention layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 1 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 1 MLP gate projection weight mean: tensor(-2.6421e-07, grad_fn=<MeanBackward0>) Layer 1 MLP down projection weight mean: tensor(1.2051e-06, grad_fn=<MeanBackward0>) Layer 1 MLP up projection weight mean: tensor(-6.4150e-06, grad_fn=<MeanBackward0>) 
Output layer norm weight mean: tensor(1., grad_fn=<MeanBackward0>) Output layer norm bias mean: tensor(0., grad_fn=<MeanBackward0>) Language model head weight mean: tensor(-1.1546e-06, grad_fn=<MeanBackward0>) Embedding layer weight mean: tensor(-0.0001, grad_fn=<MeanBackward0>) Embedding layer weight std: tensor(0.9999, grad_fn=<StdBackward0>) Layer 0 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 0 input layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 0 self-attention query projection weight mean: tensor(-6.2497e-06, grad_fn=<MeanBackward0>) Layer 0 self-attention key projection weight mean: tensor(-5.6316e-06, grad_fn=<MeanBackward0>) Layer 0 self-attention value projection weight mean: tensor(-4.4034e-06, grad_fn=<MeanBackward0>) Layer 0 self-attention output projection weight mean: tensor(8.2925e-06, grad_fn=<MeanBackward0>) Layer 0 post-attention layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 0 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 0 MLP gate projection weight mean: tensor(1.8532e-06, grad_fn=<MeanBackward0>) Layer 0 MLP down projection weight mean: tensor(1.1149e-06, grad_fn=<MeanBackward0>) Layer 0 MLP up projection weight mean: tensor(1.0081e-06, grad_fn=<MeanBackward0>) Layer 1 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 1 input layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 1 self-attention query projection weight mean: tensor(-1.1461e-05, grad_fn=<MeanBackward0>) Layer 1 self-attention key projection weight mean: tensor(-9.0714e-07, grad_fn=<MeanBackward0>) Layer 1 self-attention value projection weight mean: tensor(5.8455e-06, grad_fn=<MeanBackward0>) Layer 1 self-attention output projection weight mean: tensor(1.0088e-05, grad_fn=<MeanBackward0>) Layer 1 post-attention layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 1 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 1 MLP gate projection weight mean: tensor(9.2171e-08, grad_fn=<MeanBackward0>) Layer 1 MLP down projection weight mean: tensor(-6.5912e-07, grad_fn=<MeanBackward0>) Layer 1 MLP up projection weight mean: tensor(-4.2613e-07, grad_fn=<MeanBackward0>) Output layer norm weight mean: tensor(1., grad_fn=<MeanBackward0>) Output layer norm bias mean: tensor(0., grad_fn=<MeanBackward0>) Language model head weight mean: tensor(-1.5534e-07, grad_fn=<MeanBackward0>) Embedding layer weight mean: tensor(-5.8291e-06, grad_fn=<MeanBackward0>) Embedding layer weight std: tensor(1.0000, grad_fn=<StdBackward0>) Layer 0 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 0 input layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 0 self-attention query projection weight mean: tensor(-5.1636e-06, grad_fn=<MeanBackward0>) Layer 0 self-attention key projection weight mean: tensor(-7.4864e-06, grad_fn=<MeanBackward0>) Layer 0 self-attention value projection weight mean: tensor(-1.1957e-05, grad_fn=<MeanBackward0>) Layer 0 self-attention output projection weight mean: tensor(-1.2247e-06, grad_fn=<MeanBackward0>) Layer 0 post-attention layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 0 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 0 MLP gate projection weight mean: tensor(-1.3543e-06, grad_fn=<MeanBackward0>) Layer 0 MLP down projection weight mean: tensor(9.4276e-07, grad_fn=<MeanBackward0>) Layer 0 MLP up projection weight mean: tensor(-1.1894e-06, grad_fn=<MeanBackward0>) Layer 
1 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 1 input layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 1 self-attention query projection weight mean: tensor(2.7391e-06, grad_fn=<MeanBackward0>) Layer 1 self-attention key projection weight mean: tensor(1.2391e-05, grad_fn=<MeanBackward0>) Layer 1 self-attention value projection weight mean: tensor(-3.0829e-07, grad_fn=<MeanBackward0>) Layer 1 self-attention output projection weight mean: tensor(-4.8317e-06, grad_fn=<MeanBackward0>) Layer 1 post-attention layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 1 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 1 MLP gate projection weight mean: tensor(-6.0593e-06, grad_fn=<MeanBackward0>) Layer 1 MLP down projection weight mean: tensor(-7.3393e-07, grad_fn=<MeanBackward0>) Layer 1 MLP up projection weight mean: tensor(-3.8876e-07, grad_fn=<MeanBackward0>) Output layer norm weight mean: tensor(1., grad_fn=<MeanBackward0>) Output layer norm bias mean: tensor(0., grad_fn=<MeanBackward0>) Language model head weight mean: tensor(-1.2071e-06, grad_fn=<MeanBackward0>) Embedding layer weight mean: tensor(-0.0001, grad_fn=<MeanBackward0>) Embedding layer weight std: tensor(0.9999, grad_fn=<StdBackward0>) Layer 0 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 0 input layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 0 self-attention query projection weight mean: tensor(1.1628e-05, grad_fn=<MeanBackward0>) Layer 0 self-attention key projection weight mean: tensor(4.0839e-06, grad_fn=<MeanBackward0>) Layer 0 self-attention value projection weight mean: tensor(-7.1175e-06, grad_fn=<MeanBackward0>) Layer 0 self-attention output projection weight mean: tensor(8.3189e-06, grad_fn=<MeanBackward0>) Layer 0 post-attention layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 0 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 0 MLP gate projection weight mean: tensor(-5.7158e-06, grad_fn=<MeanBackward0>) Layer 0 MLP down projection weight mean: tensor(2.0278e-06, grad_fn=<MeanBackward0>) Layer 0 MLP up projection weight mean: tensor(6.5929e-06, grad_fn=<MeanBackward0>) Layer 1 input layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 1 input layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 1 self-attention query projection weight mean: tensor(2.8405e-06, grad_fn=<MeanBackward0>) Layer 1 self-attention key projection weight mean: tensor(7.0523e-06, grad_fn=<MeanBackward0>) Layer 1 self-attention value projection weight mean: tensor(-4.6375e-06, grad_fn=<MeanBackward0>) Layer 1 self-attention output projection weight mean: tensor(-1.9025e-06, grad_fn=<MeanBackward0>) Layer 1 post-attention layernorm weight mean: tensor(1., grad_fn=<MeanBackward0>) Layer 1 post-attention layernorm bias mean: tensor(0., grad_fn=<MeanBackward0>) Layer 1 MLP gate projection weight mean: tensor(-8.7956e-07, grad_fn=<MeanBackward0>) Layer 1 MLP down projection weight mean: tensor(-8.0686e-07, grad_fn=<MeanBackward0>) Layer 1 MLP up projection weight mean: tensor(3.7786e-06, grad_fn=<MeanBackward0>) Output layer norm weight mean: tensor(1., grad_fn=<MeanBackward0>) Output layer norm bias mean: tensor(0., grad_fn=<MeanBackward0>) Language model head weight mean: tensor(3.8995e-06, grad_fn=<MeanBackward0>)

Traceback (most recent call last):
  File "/Users/quentin/Development/1.58BitNet/new-model-architecture-creation.py", line 103, in <module>
    model = LlamaModel(config)
            ^^^^^^^^^^^^^^^^^^
  File "/Users/quentin/Development/1.58BitNet/llama_model.py", line 200, in __init__
    self.embed_tokens = QuantizedEmbedding(config.vocab_size, config.hidden_size)
                        ^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'vocab_size'
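For context, the traceback indicates that the config object passed to LlamaModel was None by the time QuantizedEmbedding was constructed, which suggests the model config was never successfully built or loaded. Below is a minimal sketch of a fail-fast guard at that call site; QuantizedEmbedding and the attribute names come from the traceback, everything else is illustrative and not the repo's actual code.

import torch.nn as nn

class QuantizedEmbedding(nn.Embedding):
    # Stand-in for the repo's QuantizedEmbedding; the real class presumably adds quantization.
    pass

class LlamaModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Fail fast with a clear message instead of an AttributeError deep inside __init__.
        if config is None:
            raise ValueError("LlamaModel received config=None; the model config was never built or loaded")
        self.embed_tokens = QuantizedEmbedding(config.vocab_size, config.hidden_size)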

A subsequent attempt to create a 600M-parameter blank model was at least partly successful. The model itself was created, but the process returned the following error:

AssertionError: Quantization error exceeds tolerance

My system: Apple M1 Max, 32 GB RAM.
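For anyone hitting the same assertion: a 1.58-bit (ternary) quantizer in the style of the BitNet b1.58 paper typically scales each weight tensor by its mean absolute value, rounds to {-1, 0, +1}, and then measures how far the dequantized tensor drifts from the original. A rough sketch of that kind of check is below; the tolerance value and function names are illustrative, not this repo's actual code.

import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    # Per-tensor absmean scale, then rounding to the ternary set {-1, 0, +1}.
    scale = w.abs().mean().clamp(min=eps)
    q = (w / scale).round().clamp(-1, 1)
    return q, scale

def check_quantization_error(w: torch.Tensor, tolerance: float = 0.5) -> None:
    q, scale = ternary_quantize(w)
    mean_err = (w - q * scale).abs().mean().item()
    # An assertion of this shape is presumably what produces the reported error message.
    assert mean_err <= tolerance, "Quantization error exceeds tolerance"

check_quantization_error(torch.randn(4096, 4096))  # passes for well-behaved Gaussian weights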

dronesflier commented 8 months ago

I can at least add that I receive the same AssertionError: Quantization error exceeds tolerance error when attempting to create a 350M model.

System specs, if they matter: 32 GB DDR4 system RAM; NVIDIA 3060 12 GB; i7-6700K; 80 GB swap file on an NVMe SSD.

nkotak commented 7 months ago

I just committed new changes to the repo. Please try again with the updated code.

complexinteractive commented 7 months ago

@nkotak I had to abort the generation attempt for a 5B blank model 4.5 hours in, because the architecture-generation process used ~20GB more memory than the last version and the resulting RAM swapping put several terabytes of write wear on my disk. The previous version was much more manageable with my resources, although I suppose that's moot given that it didn't actually work.

nkotak commented 7 months ago

@complexinteractive The last code unfortunately didn't produce output that conforms to BitNet: it was quantizing all the layers in the model, and according to the new paper and some of the comments on Hugging Face, that's not how it's supposed to work.
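To make that concrete, selective quantization usually means swapping only the nn.Linear projections inside the transformer blocks for a BitLinear-style layer, while embeddings and layer norms stay in higher precision. The sketch below shows the mechanism only; BitLinear is a placeholder name, and exactly which modules get excluded should follow the paper and the Hugging Face discussion rather than this example.

import torch.nn as nn

class BitLinear(nn.Linear):
    # Placeholder for a 1.58-bit linear layer; a real implementation would quantize
    # its weights to {-1, 0, +1} (plus a scale) in the forward pass.
    pass

def replace_linear_with_bitlinear(module: nn.Module) -> None:
    # Recursively swap nn.Linear submodules (attention/MLP projections) for BitLinear,
    # leaving nn.Embedding, layer norms, and everything else untouched.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            bit = BitLinear(child.in_features, child.out_features, bias=child.bias is not None)
            bit.weight.data.copy_(child.weight.data)
            if child.bias is not None:
                bit.bias.data.copy_(child.bias.data)
            setattr(module, name, bit)
        else:
            replace_linear_with_bitlinear(child)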

On a consumer machine it will be tough for anything above 3B.
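As a rough sanity check on why, assuming fp32 master weights and ignoring optimizer state, activations, and any temporary buffers created during quantization:

def fp32_weight_gib(n_params: float) -> float:
    # 4 bytes per parameter for fp32 weights alone.
    return n_params * 4 / 2**30

for n in (3e9, 5e9, 7e9):
    print(f"{n / 1e9:.0f}B params -> ~{fp32_weight_gib(n):.0f} GiB of fp32 weights")

# Roughly 11 GiB at 3B, 19 GiB at 5B, and 26 GiB at 7B before any working copies,
# which is consistent with 32 GB machines ending up deep in swap.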

complexinteractive commented 7 months ago

@nkotak thanks for the clarification. Shame to see the memory advantages of the architecture don't extend to training.

nkotak commented 7 months ago

Closing this since this particular issue should be fixed.