karpathy / char-rnn

Multi-layer Recurrent Neural Networks (LSTM, GRU, RNN) for character-level language models in Torch

This usually indicates a bug. #215

Open macmmm81 opened 6 years ago

macmmm81 commented 6 years ago

I'm using Ubuntu 16.04 (amd64) with CUDA on a GTX 970. This happened during training:

th train.lua -data_dir data/xyz -rnn_size 512 -num_layers 3 -dropout 0.5 -batch_size 150
using CUDA on GPU 0...
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 190, val: 10, test: 0
vocab size: 78
creating an lstm with 3 layers
setting forget gate biases to 1 in LSTM layer 1
setting forget gate biases to 1 in LSTM layer 2
setting forget gate biases to 1 in LSTM layer 3
number of parameters in the model: 5454926
cloning rnn
cloning criterion
1/9500 (epoch 0.005), train_loss = 4.37936428, grad/param norm = 8.3750e-01, time/batch = 0.7845s
2/9500 (epoch 0.011), train_loss = 4.27539656, grad/param norm = 9.4067e-01, time/batch = 0.1877s
3/9500 (epoch 0.016), train_loss = 4.23837694, grad/param norm = 9.0943e-01, time/batch = 0.1698s
4/9500 (epoch 0.021), train_loss = 3.66106952, grad/param norm = 7.9275e-01, time/batch = 0.1681s
5/9500 (epoch 0.026), train_loss = 5.62951912, grad/param norm = 2.1572e+00, time/batch = 0.1689s
6/9500 (epoch 0.032), train_loss = 3.60939394, grad/param norm = 6.0602e-01, time/batch = 0.1679s
7/9500 (epoch 0.037), train_loss = 3.53268805, grad/param norm = 7.0176e-01, time/batch = 0.1682s
8/9500 (epoch 0.042), train_loss = 3.53769521, grad/param norm = 7.2233e-01, time/batch = 0.1686s
9/9500 (epoch 0.047), train_loss = 3.48180217, grad/param norm = 4.9382e-01, time/batch = 0.1681s
10/9500 (epoch 0.053), train_loss = 3.39289360, grad/param norm = 2.8195e-01, time/batch = 0.1682s
11/9500 (epoch 0.058), train_loss = 3.39644097, grad/param norm = 3.4993e-01, time/batch = 0.1674s
12/9500 (epoch 0.063), train_loss = 3.41690555, grad/param norm = 3.3962e-01, time/batch = 0.1668s
13/9500 (epoch 0.068), train_loss = 3.38753114, grad/param norm = 3.0437e-01, time/batch = 0.1666s
14/9500 (epoch 0.074), train_loss = 3.34901290, grad/param norm = 1.5515e-01, time/batch = 0.1672s
15/9500 (epoch 0.079), train_loss = 3.35579167, grad/param norm = 1.6236e-01, time/batch = 0.1662s
16/9500 (epoch 0.084), train_loss = 3.35414872, grad/param norm = 1.1574e-01, time/batch = 0.1667s
17/9500 (epoch 0.089), train_loss = 3.30944307, grad/param norm = 1.3277e-01, time/batch = 0.1664s
18/9500 (epoch 0.095), train_loss = 3.31451027, grad/param norm = 1.2377e-01, time/batch = 0.1674s
19/9500 (epoch 0.100), train_loss = 3.34253825, grad/param norm = 1.6845e-01, time/batch = 0.1665s
20/9500 (epoch 0.105), train_loss = 3.37940360, grad/param norm = 2.1698e-01, time/batch = 0.1676s
21/9500 (epoch 0.111), train_loss = 3.34893832, grad/param norm = 2.4687e-01, time/batch = 0.1665s
22/9500 (epoch 0.116), train_loss = 3.31333926, grad/param norm = 1.6044e-01, time/batch = 0.1668s
23/9500 (epoch 0.121), train_loss = 3.31715436, grad/param norm = 1.2095e-01, time/batch = 0.1665s
24/9500 (epoch 0.126), train_loss = 3.29283764, grad/param norm = 8.9565e-02, time/batch = 0.1667s
25/9500 (epoch 0.132), train_loss = 3.30552235, grad/param norm = 9.6753e-02, time/batch = 0.1664s
26/9500 (epoch 0.137), train_loss = 3.31862877, grad/param norm = 9.3730e-02, time/batch = 0.1666s
27/9500 (epoch 0.142), train_loss = 3.31212713, grad/param norm = 1.0088e-01, time/batch = 0.1670s
28/9500 (epoch 0.147), train_loss = 3.30230550, grad/param norm = 1.0366e-01, time/batch = 0.1665s
29/9500 (epoch 0.153), train_loss = 3.34291882, grad/param norm = 1.6979e-01, time/batch = 0.1667s
30/9500 (epoch 0.158), train_loss = 3.31770495, grad/param norm = 1.1363e-01, time/batch = 0.1672s
31/9500 (epoch 0.163), train_loss = 3.32960788, grad/param norm = 1.1513e-01, time/batch = 0.1655s
32/9500 (epoch 0.168), train_loss = 3.29852395, grad/param norm = 9.4206e-02, time/batch = 0.1668s
33/9500 (epoch 0.174), train_loss = 3.30228260, grad/param norm = 8.0267e-02, time/batch = 0.1672s
34/9500 (epoch 0.179), train_loss = 3.29601358, grad/param norm = 8.9820e-02, time/batch = 0.1663s
35/9500 (epoch 0.184), train_loss = 3.30065951, grad/param norm = 9.2121e-02, time/batch = 0.1671s
36/9500 (epoch 0.189), train_loss = 3.30256522, grad/param norm = 1.0573e-01, time/batch = 0.1721s
37/9500 (epoch 0.195), train_loss = 3.28089412, grad/param norm = 1.0760e-01, time/batch = 0.1702s
38/9500 (epoch 0.200), train_loss = 3.29966182, grad/param norm = 1.1570e-01, time/batch = 0.1712s
39/9500 (epoch 0.205), train_loss = 3.30055597, grad/param norm = 1.0634e-01, time/batch = 0.1701s
40/9500 (epoch 0.211), train_loss = 3.29188272, grad/param norm = 1.2511e-01, time/batch = 0.1702s
41/9500 (epoch 0.216), train_loss = 3.34403879, grad/param norm = 1.4596e-01, time/batch = 0.1683s
42/9500 (epoch 0.221), train_loss = 3.32186155, grad/param norm = 1.2661e-01, time/batch = 0.1716s
43/9500 (epoch 0.226), train_loss = 3.29106616, grad/param norm = 9.1877e-02, time/batch = 0.1667s
44/9500 (epoch 0.232), train_loss = 3.28822525, grad/param norm = 8.0090e-02, time/batch = 0.1685s
45/9500 (epoch 0.237), train_loss = 3.27224633, grad/param norm = 5.8864e-02, time/batch = 0.1687s
46/9500 (epoch 0.242), train_loss = 3.30062181, grad/param norm = 5.4615e-02, time/batch = 0.1668s
47/9500 (epoch 0.247), train_loss = 3.27088080, grad/param norm = 5.3451e-02, time/batch = 0.1679s
48/9500 (epoch 0.253), train_loss = 3.29304520, grad/param norm = 6.2469e-02, time/batch = 0.1697s
49/9500 (epoch 0.258), train_loss = 3.26731319, grad/param norm = 7.5385e-02, time/batch = 0.1705s
50/9500 (epoch 0.263), train_loss = 3.29232175, grad/param norm = 7.4970e-02, time/batch = 0.1700s
51/9500 (epoch 0.268), train_loss = 3.27075258, grad/param norm = 7.3532e-02, time/batch = 0.1687s
52/9500 (epoch 0.274), train_loss = 3.28982578, grad/param norm = 6.6603e-02, time/batch = 0.1668s
53/9500 (epoch 0.279), train_loss = 3.27747978, grad/param norm = 5.7288e-02, time/batch = 0.1744s
54/9500 (epoch 0.284), train_loss = 3.27050254, grad/param norm = 4.7779e-02, time/batch = 0.1712s
55/9500 (epoch 0.289), train_loss = 3.24442264, grad/param norm = 6.0415e-02, time/batch = 0.1694s
56/9500 (epoch 0.295), train_loss = 3.27334135, grad/param norm = 7.2149e-02, time/batch = 0.1675s
57/9500 (epoch 0.300), train_loss = 3.30901288, grad/param norm = 5.3891e-02, time/batch = 0.1669s
58/9500 (epoch 0.305), train_loss = 3.28041874, grad/param norm = 4.6167e-02, time/batch = 0.1669s
59/9500 (epoch 0.311), train_loss = 3.29131866, grad/param norm = 6.2633e-02, time/batch = 0.1673s
60/9500 (epoch 0.316), train_loss = 3.27999372, grad/param norm = 5.9514e-02, time/batch = 0.1674s
61/9500 (epoch 0.321), train_loss = 3.26930972, grad/param norm = 5.6056e-02, time/batch = 0.1660s
62/9500 (epoch 0.326), train_loss = 3.29838292, grad/param norm = 5.3868e-02, time/batch = 0.1911s
63/9500 (epoch 0.332), train_loss = 3.32926925, grad/param norm = 4.7501e-02, time/batch = 0.1680s
64/9500 (epoch 0.337), train_loss = 3.33108291, grad/param norm = 6.0609e-02, time/batch = 0.1671s
65/9500 (epoch 0.342), train_loss = 3.30681935, grad/param norm = 9.0101e-02, time/batch = 0.1675s
66/9500 (epoch 0.347), train_loss = 3.27353297, grad/param norm = 9.1639e-02, time/batch = 0.1671s
67/9500 (epoch 0.353), train_loss = 3.27974243, grad/param norm = 8.6865e-02, time/batch = 0.1669s
68/9500 (epoch 0.358), train_loss = 3.27348555, grad/param norm = 8.7908e-02, time/batch = 0.1670s
69/9500 (epoch 0.363), train_loss = 3.27757451, grad/param norm = 1.0084e-01, time/batch = 0.1671s
70/9500 (epoch 0.368), train_loss = 3.24274239, grad/param norm = 1.0522e-01, time/batch = 0.1669s
71/9500 (epoch 0.374), train_loss = 3.29016263, grad/param norm = 1.1343e-01, time/batch = 0.1657s
72/9500 (epoch 0.379), train_loss = 3.28759299, grad/param norm = 7.7749e-02, time/batch = 0.1671s
73/9500 (epoch 0.384), train_loss = 3.27209283, grad/param norm = 8.4547e-02, time/batch = 0.1673s
74/9500 (epoch 0.389), train_loss = 3.29661244, grad/param norm = 8.3727e-02, time/batch = 0.1675s
75/9500 (epoch 0.395), train_loss = 3.30226967, grad/param norm = 7.6405e-02, time/batch = 0.1681s
76/9500 (epoch 0.400), train_loss = 3.27048019, grad/param norm = 6.0030e-02, time/batch = 0.1671s
77/9500 (epoch 0.405), train_loss = 3.27307415, grad/param norm = 6.1161e-02, time/batch = 0.1670s
78/9500 (epoch 0.411), train_loss = 3.30788460, grad/param norm = 4.7150e-02, time/batch = 0.1678s
79/9500 (epoch 0.416), train_loss = 3.23867290, grad/param norm = 5.6643e-02, time/batch = 0.1670s
80/9500 (epoch 0.421), train_loss = 3.24106041, grad/param norm = 6.3663e-02, time/batch = 0.1672s
81/9500 (epoch 0.426), train_loss = 3.29023742, grad/param norm = 6.2061e-02, time/batch = 0.1661s
82/9500 (epoch 0.432), train_loss = 3.28714083, grad/param norm = 8.6508e-02, time/batch = 0.1670s
83/9500 (epoch 0.437), train_loss = 3.24114305, grad/param norm = 8.3357e-02, time/batch = 0.1673s
84/9500 (epoch 0.442), train_loss = 3.23749672, grad/param norm = 6.5806e-02, time/batch = 0.1680s
85/9500 (epoch 0.447), train_loss = 3.26320014, grad/param norm = 5.1585e-02, time/batch = 0.1671s
86/9500 (epoch 0.453), train_loss = 3.25714559, grad/param norm = 5.0439e-02, time/batch = 0.1672s
87/9500 (epoch 0.458), train_loss = 3.28365665, grad/param norm = 5.9689e-02, time/batch = 0.1678s
88/9500 (epoch 0.463), train_loss = 3.30418399, grad/param norm = 5.8412e-02, time/batch = 0.1672s
89/9500 (epoch 0.468), train_loss = 3.27882853, grad/param norm = 7.6653e-02, time/batch = 0.1673s
90/9500 (epoch 0.474), train_loss = 3.30339777, grad/param norm = 7.4193e-02, time/batch = 0.1672s
91/9500 (epoch 0.479), train_loss = 3.26227548, grad/param norm = 8.8452e-02, time/batch = 0.1657s
92/9500 (epoch 0.484), train_loss = 3.27803219, grad/param norm = 6.9058e-02, time/batch = 0.1673s
93/9500 (epoch 0.489), train_loss = 3.29130895, grad/param norm = 5.6408e-02, time/batch = 0.1672s
94/9500 (epoch 0.495), train_loss = 3.27140417, grad/param norm = 6.2443e-02, time/batch = 0.1662s
95/9500 (epoch 0.500), train_loss = 3.29444590, grad/param norm = 7.6994e-02, time/batch = 0.1671s
96/9500 (epoch 0.505), train_loss = 3.27935846, grad/param norm = 7.8517e-02, time/batch = 0.1674s
97/9500 (epoch 0.511), train_loss = 3.30168230, grad/param norm = 6.7420e-02, time/batch = 0.1675s
98/9500 (epoch 0.516), train_loss = 3.28696365, grad/param norm = 6.3729e-02, time/batch = 0.1672s
99/9500 (epoch 0.521), train_loss = 3.27213309, grad/param norm = 5.0801e-02, time/batch = 0.1671s
100/9500 (epoch 0.526), train_loss = 3.31575204, grad/param norm = 5.7709e-02, time/batch = 0.1679s
101/9500 (epoch 0.532), train_loss = 3.33272940, grad/param norm = 6.2874e-02, time/batch = 0.1655s
102/9500 (epoch 0.537), train_loss = 3.28259263, grad/param norm = 4.7357e-02, time/batch = 0.1673s
103/9500 (epoch 0.542), train_loss = 3.21872404, grad/param norm = 5.3481e-02, time/batch = 0.1675s
104/9500 (epoch 0.547), train_loss = 3.28609322, grad/param norm = 5.2550e-02, time/batch = 0.1674s
105/9500 (epoch 0.553), train_loss = 3.25417182, grad/param norm = 6.0065e-02, time/batch = 0.1670s
106/9500 (epoch 0.558), train_loss = 3.25648448, grad/param norm = 6.4893e-02, time/batch = 0.1677s
107/9500 (epoch 0.563), train_loss = 3.27050845, grad/param norm = 9.6045e-02, time/batch = 0.1673s
108/9500 (epoch 0.568), train_loss = 3.28357748, grad/param norm = 7.5897e-02, time/batch = 0.1672s
109/9500 (epoch 0.574), train_loss = 3.25293633, grad/param norm = 6.2269e-02, time/batch = 0.1678s
110/9500 (epoch 0.579), train_loss = 3.24648492, grad/param norm = 6.4532e-02, time/batch = 0.1671s
111/9500 (epoch 0.584), train_loss = 3.24045322, grad/param norm = 7.7416e-02, time/batch = 0.1657s
112/9500 (epoch 0.589), train_loss = 3.44701844, grad/param norm = 3.8647e-01, time/batch = 0.1673s
113/9500 (epoch 0.595), train_loss = 3.37163268, grad/param norm = 1.4331e-01, time/batch = 0.1672s
114/9500 (epoch 0.600), train_loss = 3.30353757, grad/param norm = 8.6149e-02, time/batch = 0.1675s
115/9500 (epoch 0.605), train_loss = 3.26892823, grad/param norm = 6.9791e-02, time/batch = 0.1670s
116/9500 (epoch 0.611), train_loss = 3.23298790, grad/param norm = 5.0945e-02, time/batch = 0.1680s
117/9500 (epoch 0.616), train_loss = 3.29405540, grad/param norm = 7.9216e-02, time/batch = 0.1675s
118/9500 (epoch 0.621), train_loss = 3.33130064, grad/param norm = 6.9648e-02, time/batch = 0.1675s
119/9500 (epoch 0.626), train_loss = 3.26412619, grad/param norm = 8.3202e-02, time/batch = 0.1678s
120/9500 (epoch 0.632), train_loss = 3.23351569, grad/param norm = 7.6535e-02, time/batch = 0.1672s
121/9500 (epoch 0.637), train_loss = 3.24036087, grad/param norm = 1.0496e-01, time/batch = 0.1660s
122/9500 (epoch 0.642), train_loss = 3.26624198, grad/param norm = 1.0500e-01, time/batch = 0.1680s
123/9500 (epoch 0.647), train_loss = 3.24303390, grad/param norm = 7.3783e-02, time/batch = 0.1674s
124/9500 (epoch 0.653), train_loss = 3.26324116, grad/param norm = 8.0746e-02, time/batch = 0.1673s
125/9500 (epoch 0.658), train_loss = 3.22223159, grad/param norm = 6.1081e-02, time/batch = 0.1678s
126/9500 (epoch 0.663), train_loss = 3.21170182, grad/param norm = 6.6848e-02, time/batch = 0.1672s
127/9500 (epoch 0.668), train_loss = 3.23330923, grad/param norm = 7.1474e-02, time/batch = 0.1674s
128/9500 (epoch 0.674), train_loss = 3.23585022, grad/param norm = 9.4587e-02, time/batch = 0.1671s
129/9500 (epoch 0.679), train_loss = 3.23146753, grad/param norm = 9.9437e-02, time/batch = 0.1677s
130/9500 (epoch 0.684), train_loss = 3.20319606, grad/param norm = 6.6245e-02, time/batch = 0.1671s
131/9500 (epoch 0.689), train_loss = 3.19942729, grad/param norm = 7.1896e-02, time/batch = 0.1660s
132/9500 (epoch 0.695), train_loss = 3.20248689, grad/param norm = 8.4354e-02, time/batch = 0.1675s
133/9500 (epoch 0.700), train_loss = 3.17376431, grad/param norm = 7.8533e-02, time/batch = 0.1672s
134/9500 (epoch 0.705), train_loss = 3.17502874, grad/param norm = 8.7593e-02, time/batch = 0.1672s
135/9500 (epoch 0.711), train_loss = 3.17025857, grad/param norm = 9.5012e-02, time/batch = 0.1679s
136/9500 (epoch 0.716), train_loss = 3.14070362, grad/param norm = 1.2444e-01, time/batch = 0.1671s
137/9500 (epoch 0.721), train_loss = 3.15086621, grad/param norm = 1.3916e-01, time/batch = 0.1675s
138/9500 (epoch 0.726), train_loss = 3.13574181, grad/param norm = 1.3700e-01, time/batch = 0.1680s
139/9500 (epoch 0.732), train_loss = 3.12685347, grad/param norm = 1.7680e-01, time/batch = 0.1677s
140/9500 (epoch 0.737), train_loss = 3.20465042, grad/param norm = 1.9024e-01, time/batch = 0.1674s
141/9500 (epoch 0.742), train_loss = 3.23466569, grad/param norm = 1.6682e-01, time/batch = 0.1664s
142/9500 (epoch 0.747), train_loss = 3.16408956, grad/param norm = 1.1285e-01, time/batch = 0.1675s
143/9500 (epoch 0.753), train_loss = 3.23678133, grad/param norm = 1.6472e-01, time/batch = 0.1675s
144/9500 (epoch 0.758), train_loss = 3.15593347, grad/param norm = 1.0136e-01, time/batch = 0.1679s
145/9500 (epoch 0.763), train_loss = 3.07189437, grad/param norm = 6.9025e-02, time/batch = 0.1674s
146/9500 (epoch 0.768), train_loss = 3.04447907, grad/param norm = 8.9196e-02, time/batch = 0.1676s
147/9500 (epoch 0.774), train_loss = 3.05330240, grad/param norm = 1.2684e-01, time/batch = 0.1680s
148/9500 (epoch 0.779), train_loss = 3.05118311, grad/param norm = 1.6778e-01, time/batch = 0.1675s
149/9500 (epoch 0.784), train_loss = 3.05733690, grad/param norm = 2.0024e-01, time/batch = 0.1672s
150/9500 (epoch 0.789), train_loss = 3.02148660, grad/param norm = 1.7343e-01, time/batch = 0.1671s
151/9500 (epoch 0.795), train_loss = 3.00491152, grad/param norm = 1.1277e-01, time/batch = 0.1660s
152/9500 (epoch 0.800), train_loss = 3.21407479, grad/param norm = 5.5699e-01, time/batch = 0.1674s
153/9500 (epoch 0.805), train_loss = 3.32414680, grad/param norm = 3.4037e-01, time/batch = 0.1676s
154/9500 (epoch 0.811), train_loss = 3.27463062, grad/param norm = 2.1281e-01, time/batch = 0.1675s
155/9500 (epoch 0.816), train_loss = 3.06618207, grad/param norm = 9.3634e-02, time/batch = 0.1672s
156/9500 (epoch 0.821), train_loss = 3.02144634, grad/param norm = 6.2442e-02, time/batch = 0.1673s
157/9500 (epoch 0.826), train_loss = 2.95256076, grad/param norm = 9.0091e-02, time/batch = 0.1683s
158/9500 (epoch 0.832), train_loss = 2.98143281, grad/param norm = 1.1424e-01, time/batch = 0.1672s
159/9500 (epoch 0.837), train_loss = 2.94519732, grad/param norm = 8.5923e-02, time/batch = 0.1678s
160/9500 (epoch 0.842), train_loss = 2.94131310, grad/param norm = 1.0222e-01, time/batch = 0.1677s
161/9500 (epoch 0.847), train_loss = 2.93474691, grad/param norm = 9.8161e-02, time/batch = 0.1662s
162/9500 (epoch 0.853), train_loss = 2.90784111, grad/param norm = 1.2394e-01, time/batch = 0.1675s
163/9500 (epoch 0.858), train_loss = 3.07049302, grad/param norm = 4.6730e-01, time/batch = 0.1682s
164/9500 (epoch 0.863), train_loss = 3.03834999, grad/param norm = 2.4975e-01, time/batch = 0.1677s
165/9500 (epoch 0.868), train_loss = 2.93912546, grad/param norm = 1.3055e-01, time/batch = 0.1672s
166/9500 (epoch 0.874), train_loss = 2.90044387, grad/param norm = 1.3705e-01, time/batch = 0.1681s
167/9500 (epoch 0.879), train_loss = 2.86155170, grad/param norm = 9.9814e-02, time/batch = 0.1675s
168/9500 (epoch 0.884), train_loss = 2.85820678, grad/param norm = 1.4036e-01, time/batch = 0.1678s
169/9500 (epoch 0.889), train_loss = 2.85024148, grad/param norm = 1.2539e-01, time/batch = 0.1673s
170/9500 (epoch 0.895), train_loss = 2.84438551, grad/param norm = 1.3375e-01, time/batch = 0.1679s
171/9500 (epoch 0.900), train_loss = 2.82365211, grad/param norm = 1.1715e-01, time/batch = 0.1658s
172/9500 (epoch 0.905), train_loss = 2.82259930, grad/param norm = 1.1469e-01, time/batch = 0.1677s
173/9500 (epoch 0.911), train_loss = 2.79364060, grad/param norm = 1.2573e-01, time/batch = 0.1679s
174/9500 (epoch 0.916), train_loss = 2.81964291, grad/param norm = 2.2931e-01, time/batch = 0.1676s
175/9500 (epoch 0.921), train_loss = 2.89575767, grad/param norm = 3.9404e-01, time/batch = 0.1674s
176/9500 (epoch 0.926), train_loss = 2.85975401, grad/param norm = 2.3007e-01, time/batch = 0.1679s
177/9500 (epoch 0.932), train_loss = 2.84219930, grad/param norm = 3.1699e-01, time/batch = 0.1675s
178/9500 (epoch 0.937), train_loss = 2.90646558, grad/param norm = 1.7903e-01, time/batch = 0.1676s
179/9500 (epoch 0.942), train_loss = 2.79684171, grad/param norm = 1.0125e-01, time/batch = 0.1683s
180/9500 (epoch 0.947), train_loss = 2.79794559, grad/param norm = 1.1338e-01, time/batch = 0.1673s
181/9500 (epoch 0.953), train_loss = 2.77543377, grad/param norm = 9.7725e-02, time/batch = 0.1661s
182/9500 (epoch 0.958), train_loss = 2.73258759, grad/param norm = 8.6316e-02, time/batch = 0.1684s
183/9500 (epoch 0.963), train_loss = 2.72224693, grad/param norm = 7.8525e-02, time/batch = 0.1676s
184/9500 (epoch 0.968), train_loss = 2.69406896, grad/param norm = 1.0477e-01, time/batch = 0.1677s
185/9500 (epoch 0.974), train_loss = 2.72115661, grad/param norm = 2.1522e-01, time/batch = 0.1676s
186/9500 (epoch 0.979), train_loss = 2.83466770, grad/param norm = 4.1144e-01, time/batch = 0.1677s
187/9500 (epoch 0.984), train_loss = 2.80450796, grad/param norm = 2.2144e-01, time/batch = 0.1674s
188/9500 (epoch 0.989), train_loss = 2.73207733, grad/param norm = 1.3759e-01, time/batch = 0.1677s
189/9500 (epoch 0.995), train_loss = 2.70650164, grad/param norm = 1.0955e-01, time/batch = 0.1675s
190/9500 (epoch 1.000), train_loss = 2.71835902, grad/param norm = 1.9360e-01, time/batch = 0.1674s
191/9500 (epoch 1.005), train_loss = 2.80218973, grad/param norm = 1.9382e-01, time/batch = 0.1660s
192/9500 (epoch 1.011), train_loss = 2.76163413, grad/param norm = 1.5825e-01, time/batch = 0.1676s
193/9500 (epoch 1.016), train_loss = 2.72953795, grad/param norm = 1.4570e-01, time/batch = 0.1676s
194/9500 (epoch 1.021), train_loss = 2.65865903, grad/param norm = 1.3027e-01, time/batch = 0.1673s
195/9500 (epoch 1.026), train_loss = 2.65947048, grad/param norm = 1.2876e-01, time/batch = 0.1681s
196/9500 (epoch 1.032), train_loss = 2.63516525, grad/param norm = 1.5726e-01, time/batch = 0.1679s
197/9500 (epoch 1.037), train_loss = 2.65844296, grad/param norm = 1.9885e-01, time/batch = 0.1677s
198/9500 (epoch 1.042), train_loss = 2.66414303, grad/param norm = 2.0874e-01, time/batch = 0.1679s
199/9500 (epoch 1.047), train_loss = 2.66406861, grad/param norm = 1.5419e-01, time/batch = 0.1673s
200/9500 (epoch 1.053), train_loss = 2.61099810, grad/param norm = 1.0986e-01, time/batch = 0.1676s
201/9500 (epoch 1.058), train_loss = 2.60735827, grad/param norm = 1.3129e-01, time/batch = 0.1670s
202/9500 (epoch 1.063), train_loss = 2.64200086, grad/param norm = 1.9102e-01, time/batch = 0.1673s
203/9500 (epoch 1.068), train_loss = 2.71690869, grad/param norm = 2.1875e-01, time/batch = 0.1674s
204/9500 (epoch 1.074), train_loss = 2.72020114, grad/param norm = 2.4086e-01, time/batch = 0.1679s
205/9500 (epoch 1.079), train_loss = 2.70175859, grad/param norm = 2.4647e-01, time/batch = 0.1674s
206/9500 (epoch 1.084), train_loss = 2.70258088, grad/param norm = 2.2883e-01, time/batch = 0.1673s
207/9500 (epoch 1.089), train_loss = 2.64367289, grad/param norm = 1.6997e-01, time/batch = 0.1673s
208/9500 (epoch 1.095), train_loss = 2.58395955, grad/param norm = 1.1052e-01, time/batch = 0.1676s
209/9500 (epoch 1.100), train_loss = 2.57970488, grad/param norm = 8.6612e-02, time/batch = 0.1674s
210/9500 (epoch 1.105), train_loss = 2.57031104, grad/param norm = 7.7425e-02, time/batch = 0.1673s
211/9500 (epoch 1.111), train_loss = 2.51584864, grad/param norm = 8.2444e-02, time/batch = 0.1660s
212/9500 (epoch 1.116), train_loss = 2.54540197, grad/param norm = 1.5164e-01, time/batch = 0.1679s
213/9500 (epoch 1.121), train_loss = 2.65656773, grad/param norm = 1.8741e-01, time/batch = 0.1673s
214/9500 (epoch 1.126), train_loss = 2.64668360, grad/param norm = 1.6577e-01, time/batch = 0.1673s
215/9500 (epoch 1.132), train_loss = 2.60635977, grad/param norm = 1.3135e-01, time/batch = 0.1672s
216/9500 (epoch 1.137), train_loss = 2.55764280, grad/param norm = 1.0156e-01, time/batch = 0.1671s
217/9500 (epoch 1.142), train_loss = 2.52958158, grad/param norm = 1.7944e-01, time/batch = 0.1677s
218/9500 (epoch 1.147), train_loss = 2.66934768, grad/param norm = 4.2714e-01, time/batch = 0.1675s
219/9500 (epoch 1.153), train_loss = 2.78780397, grad/param norm = 3.2570e-01, time/batch = 0.1675s
220/9500 (epoch 1.158), train_loss = 2.61872714, grad/param norm = 1.4087e-01, time/batch = 0.1680s
221/9500 (epoch 1.163), train_loss = 2.57246022, grad/param norm = 1.0808e-01, time/batch = 0.1659s
222/9500 (epoch 1.168), train_loss = 2.55501173, grad/param norm = 1.2637e-01, time/batch = 0.1679s
223/9500 (epoch 1.174), train_loss = 2.56719114, grad/param norm = 1.2544e-01, time/batch = 0.1679s
224/9500 (epoch 1.179), train_loss = 2.57267858, grad/param norm = 1.1674e-01, time/batch = 0.1672s
225/9500 (epoch 1.184), train_loss = 2.52594102, grad/param norm = 1.0010e-01, time/batch = 0.1676s
226/9500 (epoch 1.189), train_loss = 2.50786604, grad/param norm = 9.6824e-02, time/batch = 0.1676s
227/9500 (epoch 1.195), train_loss = 2.47613595, grad/param norm = 8.4033e-02, time/batch = 0.1683s
228/9500 (epoch 1.200), train_loss = 2.47743491, grad/param norm = 7.6428e-02, time/batch = 0.1680s
229/9500 (epoch 1.205), train_loss = 2.44575243, grad/param norm = 8.4128e-02, time/batch = 0.1691s
230/9500 (epoch 1.211), train_loss = 2.46043536, grad/param norm = 1.3657e-01, time/batch = 0.1698s
231/9500 (epoch 1.216), train_loss = 2.54583766, grad/param norm = 2.1191e-01, time/batch = 0.1670s
232/9500 (epoch 1.221), train_loss = 2.55123938, grad/param norm = 2.4174e-01, time/batch = 0.1693s
233/9500 (epoch 1.226), train_loss = 2.55736139, grad/param norm = 1.5332e-01, time/batch = 0.1695s
234/9500 (epoch 1.232), train_loss = 2.51756784, grad/param norm = 8.4411e-02, time/batch = 0.1693s
235/9500 (epoch 1.237), train_loss = 2.49024255, grad/param norm = 8.1373e-02, time/batch = 0.1688s
236/9500 (epoch 1.242), train_loss = 2.51164486, grad/param norm = 9.1590e-02, time/batch = 0.1699s
237/9500 (epoch 1.247), train_loss = 2.43962918, grad/param norm = 9.3577e-02, time/batch = 0.1692s
238/9500 (epoch 1.253), train_loss = 2.47471191, grad/param norm = 1.4283e-01, time/batch = 0.1693s
239/9500 (epoch 1.258), train_loss = 2.52366169, grad/param norm = 2.1839e-01, time/batch = 0.1695s
240/9500 (epoch 1.263), train_loss = 2.49762992, grad/param norm = 2.0596e-01, time/batch = 0.1692s
241/9500 (epoch 1.268), train_loss = 2.48226185, grad/param norm = 1.4365e-01, time/batch = 0.1684s
242/9500 (epoch 1.274), train_loss = 2.46790069, grad/param norm = 1.2924e-01, time/batch = 0.1698s
243/9500 (epoch 1.279), train_loss = 2.46328343, grad/param norm = 1.1304e-01, time/batch = 0.1687s
244/9500 (epoch 1.284), train_loss = 2.46148424, grad/param norm = 1.0504e-01, time/batch = 0.1690s
245/9500 (epoch 1.289), train_loss = 2.46717345, grad/param norm = 1.2091e-01, time/batch = 0.1692s
246/9500 (epoch 1.295), train_loss = 2.45412949, grad/param norm = 1.8271e-01, time/batch = 0.1687s
247/9500 (epoch 1.300), train_loss = 2.53487761, grad/param norm = 1.8479e-01, time/batch = 0.1693s
248/9500 (epoch 1.305), train_loss = 2.50844855, grad/param norm = 1.9040e-01, time/batch = 0.1689s
249/9500 (epoch 1.311), train_loss = 2.54239161, grad/param norm = 1.9795e-01, time/batch = 0.1693s
250/9500 (epoch 1.316), train_loss = 2.47290387, grad/param norm = 1.1617e-01, time/batch = 0.1695s
251/9500 (epoch 1.321), train_loss = 2.43684184, grad/param norm = 1.2325e-01, time/batch = 0.1673s
252/9500 (epoch 1.326), train_loss = 2.48042869, grad/param norm = 1.1871e-01, time/batch = 0.1688s
253/9500 (epoch 1.332), train_loss = 2.42779499, grad/param norm = 1.1851e-01, time/batch = 0.1692s
254/9500 (epoch 1.337), train_loss = 2.41981034, grad/param norm = 1.1964e-01, time/batch = 0.1689s
255/9500 (epoch 1.342), train_loss = 2.40988381, grad/param norm = 1.1759e-01, time/batch = 0.1697s
256/9500 (epoch 1.347), train_loss = 2.42414392, grad/param norm = 1.1517e-01, time/batch = 0.1691s
257/9500 (epoch 1.353), train_loss = 2.41504769, grad/param norm = 9.3772e-02, time/batch = 0.1691s
258/9500 (epoch 1.358), train_loss = 2.38073062, grad/param norm = 9.8476e-02, time/batch = 0.1696s
259/9500 (epoch 1.363), train_loss = 2.41499375, grad/param norm = 1.1082e-01, time/batch = 0.1692s
260/9500 (epoch 1.368), train_loss = 2.41186593, grad/param norm = 1.3049e-01, time/batch = 0.1690s
261/9500 (epoch 1.374), train_loss = 2.42577524, grad/param norm = 1.3320e-01, time/batch = 0.1681s
262/9500 (epoch 1.379), train_loss = 2.37677479, grad/param norm = 1.4066e-01, time/batch = 0.1690s
263/9500 (epoch 1.384), train_loss = 2.46353652, grad/param norm = 2.6515e-01, time/batch = 0.1691s
264/9500 (epoch 1.389), train_loss = 2.50882018, grad/param norm = 1.9886e-01, time/batch = 0.1698s
265/9500 (epoch 1.395), train_loss = 2.44381573, grad/param norm = 1.0059e-01, time/batch = 0.1685s
266/9500 (epoch 1.400), train_loss = 2.42095001, grad/param norm = 8.4894e-02, time/batch = 0.1688s
267/9500 (epoch 1.405), train_loss = 2.40441904, grad/param norm = 9.0272e-02, time/batch = 0.1697s
268/9500 (epoch 1.411), train_loss = 2.41696162, grad/param norm = 1.3976e-01, time/batch = 0.1690s
269/9500 (epoch 1.416), train_loss = 2.46135211, grad/param norm = 1.8336e-01, time/batch = 0.1693s
270/9500 (epoch 1.421), train_loss = 2.49511308, grad/param norm = 1.4453e-01, time/batch = 0.1688s
271/9500 (epoch 1.426), train_loss = 2.38353456, grad/param norm = 8.4670e-02, time/batch = 0.1677s
272/9500 (epoch 1.432), train_loss = 2.38605347, grad/param norm = 8.6230e-02, time/batch = 0.1693s
273/9500 (epoch 1.437), train_loss = 2.38462159, grad/param norm = 9.5548e-02, time/batch = 0.1688s
274/9500 (epoch 1.442), train_loss = 2.34584232, grad/param norm = 1.1090e-01, time/batch = 0.1689s
275/9500 (epoch 1.447), train_loss = 2.38295779, grad/param norm = 1.1715e-01, time/batch = 0.1691s
276/9500 (epoch 1.453), train_loss = 2.38875564, grad/param norm = 1.2981e-01, time/batch = 0.1691s
277/9500 (epoch 1.458), train_loss = 2.35181808, grad/param norm = 1.2308e-01, time/batch = 0.1699s
278/9500 (epoch 1.463), train_loss = 2.38032283, grad/param norm = 9.9003e-02, time/batch = 0.1687s
279/9500 (epoch 1.468), train_loss = 2.35522679, grad/param norm = 8.7687e-02, time/batch = 0.1691s
280/9500 (epoch 1.474), train_loss = 2.37795367, grad/param norm = 1.0974e-01, time/batch = 0.1700s
281/9500 (epoch 1.479), train_loss = 2.38125779, grad/param norm = 1.5710e-01, time/batch = 0.1675s
282/9500 (epoch 1.484), train_loss = 2.40863575, grad/param norm = 1.6572e-01, time/batch = 0.1688s
283/9500 (epoch 1.489), train_loss = 2.38205218, grad/param norm = 1.3572e-01, time/batch = 0.1687s
284/9500 (epoch 1.495), train_loss = 2.36070232, grad/param norm = 1.0946e-01, time/batch = 0.1694s
285/9500 (epoch 1.500), train_loss = 2.35871841, grad/param norm = 1.0983e-01, time/batch = 0.1688s
286/9500 (epoch 1.505), train_loss = 2.41650473, grad/param norm = 1.3359e-01, time/batch = 0.1690s
287/9500 (epoch 1.511), train_loss = 2.38341896, grad/param norm = 1.2712e-01, time/batch = 0.1703s
288/9500 (epoch 1.516), train_loss = 2.36407164, grad/param norm = 1.1777e-01, time/batch = 0.1689s
289/9500 (epoch 1.521), train_loss = 2.35431231, grad/param norm = 1.2503e-01, time/batch = 0.1694s
290/9500 (epoch 1.526), train_loss = 2.37336952, grad/param norm = 1.5647e-01, time/batch = 0.1696s
291/9500 (epoch 1.532), train_loss = 2.39668250, grad/param norm = 1.3588e-01, time/batch = 0.1675s
292/9500 (epoch 1.537), train_loss = 2.34678617, grad/param norm = 1.0435e-01, time/batch = 0.1692s
293/9500 (epoch 1.542), train_loss = 2.31226676, grad/param norm = 9.0857e-02, time/batch = 0.1694s
294/9500 (epoch 1.547), train_loss = 2.31894064, grad/param norm = 9.6763e-02, time/batch = 0.1686s
295/9500 (epoch 1.553), train_loss = 2.33343775, grad/param norm = 9.9626e-02, time/batch = 0.1693s
296/9500 (epoch 1.558), train_loss = 2.29599771, grad/param norm = 1.0849e-01, time/batch = 0.1691s
297/9500 (epoch 1.563), train_loss = 2.30961491, grad/param norm = 1.3189e-01, time/batch = 0.1690s
298/9500 (epoch 1.568), train_loss = 2.33088882, grad/param norm = 1.3767e-01, time/batch = 0.1689s
299/9500 (epoch 1.574), train_loss = 2.32800305, grad/param norm = 1.3919e-01, time/batch = 0.1691s
300/9500 (epoch 1.579), train_loss = 2.35925177, grad/param norm = 1.2985e-01, time/batch = 0.1691s
301/9500 (epoch 1.584), train_loss = 2.33194792, grad/param norm = 1.0772e-01, time/batch = 0.1674s
302/9500 (epoch 1.589), train_loss = 2.30412334, grad/param norm = 1.1887e-01, time/batch = 0.1696s
303/9500 (epoch 1.595), train_loss = 2.35141436, grad/param norm = 1.2923e-01, time/batch = 0.1690s
304/9500 (epoch 1.600), train_loss = 2.34429995, grad/param norm = 1.1307e-01, time/batch = 0.1691s
305/9500 (epoch 1.605), train_loss = 2.28654737, grad/param norm = 9.4583e-02, time/batch = 0.1692s
306/9500 (epoch 1.611), train_loss = 2.28441040, grad/param norm = 8.3021e-02, time/batch = 0.1688s
307/9500 (epoch 1.616), train_loss = 2.26679949, grad/param norm = 8.9095e-02, time/batch = 0.1687s
308/9500 (epoch 1.621), train_loss = 2.29492144, grad/param norm = 1.2933e-01, time/batch = 0.1687s
309/9500 (epoch 1.626), train_loss = 2.30607644, grad/param norm = 1.4136e-01, time/batch = 0.1697s
310/9500 (epoch 1.632), train_loss = 2.29290429, grad/param norm = 1.1710e-01, time/batch = 0.1691s
311/9500 (epoch 1.637), train_loss = 2.26664954, grad/param norm = 8.5261e-02, time/batch = 0.1674s
312/9500 (epoch 1.642), train_loss = 2.25645036, grad/param norm = 6.4060e-02, time/batch = 0.1691s
313/9500 (epoch 1.647), train_loss = 2.24618306, grad/param norm = 6.6768e-02, time/batch = 0.1691s
314/9500 (epoch 1.653), train_loss = 2.27145154, grad/param norm = 8.2980e-02, time/batch = 0.1692s
315/9500 (epoch 1.658), train_loss = 2.29957117, grad/param norm = 1.1043e-01, time/batch = 0.1697s
316/9500 (epoch 1.663), train_loss = 2.35174397, grad/param norm = 1.3420e-01, time/batch = 0.1689s
317/9500 (epoch 1.668), train_loss = 2.34038885, grad/param norm = 1.0814e-01, time/batch = 0.1693s
318/9500 (epoch 1.674), train_loss = 2.30957868, grad/param norm = 7.4915e-02, time/batch = 0.1697s
319/9500 (epoch 1.679), train_loss = 2.30160265, grad/param norm = 6.8551e-02, time/batch = 0.1687s
320/9500 (epoch 1.684), train_loss = 2.26268365, grad/param norm = 7.9227e-02, time/batch = 0.1692s
321/9500 (epoch 1.689), train_loss = 2.28016263, grad/param norm = 1.2030e-01, time/batch = 0.1678s
322/9500 (epoch 1.695), train_loss = 2.31945728, grad/param norm = 1.6381e-01, time/batch = 0.1692s
323/9500 (epoch 1.700), train_loss = 2.32628205, grad/param norm = 1.4136e-01, time/batch = 0.1691s
324/9500 (epoch 1.705), train_loss = 2.29548016, grad/param norm = 9.2443e-02, time/batch = 0.1700s
325/9500 (epoch 1.711), train_loss = 2.25172894, grad/param norm = 8.4460e-02, time/batch = 0.1691s
326/9500 (epoch 1.716), train_loss = 2.26378751, grad/param norm = 9.2974e-02, time/batch = 0.1693s
327/9500 (epoch 1.721), train_loss = 2.22517570, grad/param norm = 9.7820e-02, time/batch = 0.1694s
328/9500 (epoch 1.726), train_loss = 2.22391799, grad/param norm = 8.2037e-02, time/batch = 0.1687s
329/9500 (epoch 1.732), train_loss = 2.20956461, grad/param norm = 7.0078e-02, time/batch = 0.1691s
330/9500 (epoch 1.737), train_loss = 2.23581047, grad/param norm = 6.8547e-02, time/batch = 0.1694s
331/9500 (epoch 1.742), train_loss = 2.22350775, grad/param norm = 8.3162e-02, time/batch = 0.1676s
332/9500 (epoch 1.747), train_loss = 2.25299297, grad/param norm = 1.1441e-01, time/batch = 0.1698s
333/9500 (epoch 1.753), train_loss = 2.26323633, grad/param norm = 1.2287e-01, time/batch = 0.1701s
334/9500 (epoch 1.758), train_loss = 2.26779199, grad/param norm = 9.6329e-02, time/batch = 0.1691s
335/9500 (epoch 1.763), train_loss = 2.19086006, grad/param norm = 8.1474e-02, time/batch = 0.1689s
336/9500 (epoch 1.768), train_loss = 2.21199599, grad/param norm = 9.8913e-02, time/batch = 0.1691s
337/9500 (epoch 1.774), train_loss = 2.23367218, grad/param norm = 1.1997e-01, time/batch = 0.1683s
338/9500 (epoch 1.779), train_loss = 2.31054295, grad/param norm = 1.6884e-01, time/batch = 0.1692s
339/9500 (epoch 1.784), train_loss = 2.34693111, grad/param norm = 1.7690e-01, time/batch = 0.1693s
340/9500 (epoch 1.789), train_loss = 2.27534557, grad/param norm = 1.1911e-01, time/batch = 0.1701s
341/9500 (epoch 1.795), train_loss = 2.25239140, grad/param norm = 1.3636e-01, time/batch = 0.1673s
342/9500 (epoch 1.800), train_loss = 2.24585765, grad/param norm = 1.3846e-01, time/batch = 0.1692s
343/9500 (epoch 1.805), train_loss = 2.17387921, grad/param norm = 9.6608e-02, time/batch = 0.1698s
344/9500 (epoch 1.811), train_loss = 2.20758753, grad/param norm = 8.4712e-02, time/batch = 0.1690s
345/9500 (epoch 1.816), train_loss = 2.16322903, grad/param norm = 8.7759e-02, time/batch = 0.1690s
346/9500 (epoch 1.821), train_loss = 2.19312935, grad/param norm = 8.1304e-02, time/batch = 0.1699s
347/9500 (epoch 1.826), train_loss = 2.19382331, grad/param norm = 9.0339e-02, time/batch = 0.1690s
348/9500 (epoch 1.832), train_loss = 2.19797687, grad/param norm = 9.7263e-02, time/batch = 0.1688s
349/9500 (epoch 1.837), train_loss = 2.19412732, grad/param norm = 7.8655e-02, time/batch = 0.1697s
350/9500 (epoch 1.842), train_loss = 2.18391168, grad/param norm = 6.5144e-02, time/batch = 0.1688s
351/9500 (epoch 1.847), train_loss = 2.17512759, grad/param norm = 6.2824e-02, time/batch = 0.1678s
352/9500 (epoch 1.853), train_loss = 2.16151414, grad/param norm = 6.5225e-02, time/batch = 0.1693s
353/9500 (epoch 1.858), train_loss = 2.18454594, grad/param norm = 7.4765e-02, time/batch = 0.1693s
354/9500 (epoch 1.863), train_loss = 2.16541689, grad/param norm = 8.1228e-02, time/batch = 0.1690s
355/9500 (epoch 1.868), train_loss = 2.16169660, grad/param norm = 1.0157e-01, time/batch = 0.1697s
356/9500 (epoch 1.874), train_loss = 2.22497355, grad/param norm = 1.3158e-01, time/batch = 0.1687s
357/9500 (epoch 1.879), train_loss = 2.20618470, grad/param norm = 1.4374e-01, time/batch = 0.1692s
358/9500 (epoch 1.884), train_loss = 2.21218154, grad/param norm = 1.1912e-01, time/batch = 0.1689s
359/9500 (epoch 1.889), train_loss = 2.17410570, grad/param norm = 7.4946e-02, time/batch = 0.1693s
360/9500 (epoch 1.895), train_loss = 2.13761853, grad/param norm = 6.6347e-02, time/batch = 0.1691s
361/9500 (epoch 1.900), train_loss = 2.12603896, grad/param norm = 6.3467e-02, time/batch = 0.1672s
362/9500 (epoch 1.905), train_loss = 2.15620481, grad/param norm = 6.5260e-02, time/batch = 0.1692s
363/9500 (epoch 1.911), train_loss = 2.11674849, grad/param norm = 6.8263e-02, time/batch = 0.1695s
364/9500 (epoch 1.916), train_loss = 2.13061324, grad/param norm = 8.1959e-02, time/batch = 0.1693s
365/9500 (epoch 1.921), train_loss = 2.12896523, grad/param norm = 8.2872e-02, time/batch = 0.1697s
366/9500 (epoch 1.926), train_loss = 2.14974713, grad/param norm = 8.2680e-02, time/batch = 0.1692s
367/9500 (epoch 1.932), train_loss = 2.14153547, grad/param norm = 7.8719e-02, time/batch = 0.1696s
368/9500 (epoch 1.937), train_loss = 2.15340056, grad/param norm = 7.9807e-02, time/batch = 0.1698s
369/9500 (epoch 1.942), train_loss = 2.15053487, grad/param norm = 7.6839e-02, time/batch = 0.1688s
370/9500 (epoch 1.947), train_loss = 2.15090379, grad/param norm = 8.6387e-02, time/batch = 0.1694s
371/9500 (epoch 1.953), train_loss = 2.17413546, grad/param norm = 9.2852e-02, time/batch = 0.1683s
372/9500 (epoch 1.958), train_loss = 2.13467910, grad/param norm = 9.3191e-02, time/batch = 0.1687s
373/9500 (epoch 1.963), train_loss = 2.15774345, grad/param norm = 1.0283e-01, time/batch = 0.1695s
374/9500 (epoch 1.968), train_loss = 2.15074333, grad/param norm = 1.1219e-01, time/batch = 0.1698s
375/9500 (epoch 1.974), train_loss = 2.13477335, grad/param norm = 9.7862e-02, time/batch = 0.1689s
376/9500 (epoch 1.979), train_loss = 2.12953307, grad/param norm = 6.9019e-02, time/batch = 0.1688s
377/9500 (epoch 1.984), train_loss = 2.08940562, grad/param norm = 6.9708e-02, time/batch = 0.1696s
378/9500 (epoch 1.989), train_loss = 2.14820351, grad/param norm = 7.9948e-02, time/batch = 0.1696s
379/9500 (epoch 1.995), train_loss = 2.13690062, grad/param norm = 7.4366e-02, time/batch = 0.1691s
380/9500 (epoch 2.000), train_loss = 2.12444094, grad/param norm = 6.7229e-02, time/batch = 0.1697s
381/9500 (epoch 2.005), train_loss = 2.19733582, grad/param norm = 6.6125e-02, time/batch = 0.1675s
382/9500 (epoch 2.011), train_loss = 2.11979829, grad/param norm = 7.9547e-02, time/batch = 0.1689s
383/9500 (epoch 2.016), train_loss = 2.14908387, grad/param norm = 1.0015e-01, time/batch = 0.1704s
384/9500 (epoch 2.021), train_loss = 2.16144784, grad/param norm = 1.1224e-01, time/batch = 0.1695s
385/9500 (epoch 2.026), train_loss = 2.20153716, grad/param norm = 1.1283e-01, time/batch = 0.1693s
386/9500 (epoch 2.032), train_loss = 2.16543418, grad/param norm = 1.0081e-01, time/batch = 0.1694s
387/9500 (epoch 2.037), train_loss = 2.14024927, grad/param norm = 8.7125e-02, time/batch = 0.1693s
388/9500 (epoch 2.042), train_loss = 2.09341615, grad/param norm = 7.3131e-02, time/batch = 0.1690s
389/9500 (epoch 2.047), train_loss = 2.11119495, grad/param norm = 7.1604e-02, time/batch = 0.1693s
390/9500 (epoch 2.053), train_loss = 2.10058323, grad/param norm = 6.9514e-02, time/batch = 0.1690s
391/9500 (epoch 2.058), train_loss = 2.10787559, grad/param norm = 5.5379e-02, time/batch = 0.1672s
392/9500 (epoch 2.063), train_loss = 2.03389045, grad/param norm = 5.0330e-02, time/batch = 0.1696s
393/9500 (epoch 2.068), train_loss = 2.05779152, grad/param norm = 5.5753e-02, time/batch = 0.1689s
394/9500 (epoch 2.074), train_loss = 2.09085763, grad/param norm = 5.7103e-02, time/batch = 0.1692s
395/9500 (epoch 2.079), train_loss = 2.07912022, grad/param norm = 6.3371e-02, time/batch = 0.1688s
396/9500 (epoch 2.084), train_loss = 2.12627938, grad/param norm = 6.0436e-02, time/batch = 0.1694s
397/9500 (epoch 2.089), train_loss = 2.10913948, grad/param norm = 7.4728e-02, time/batch = 0.1698s
398/9500 (epoch 2.095), train_loss = 2.09446039, grad/param norm = 1.2840e-01, time/batch = 0.1693s
399/9500 (epoch 2.100), train_loss = 2.15564540, grad/param norm = 1.4231e-01, time/batch = 0.1689s
400/9500 (epoch 2.105), train_loss = 2.13715231, grad/param norm = 1.0472e-01, time/batch = 0.1698s
401/9500 (epoch 2.111), train_loss = 2.04814420, grad/param norm = 7.7260e-02, time/batch = 0.1678s
402/9500 (epoch 2.116), train_loss = 2.06672402, grad/param norm = 6.3045e-02, time/batch = 0.1695s
403/9500 (epoch 2.121), train_loss = 2.05114076, grad/param norm = 5.7376e-02, time/batch = 0.1701s
404/9500 (epoch 2.126), train_loss = 2.08165313, grad/param norm = 5.4455e-02, time/batch = 0.1691s
405/9500 (epoch 2.132), train_loss = 2.07677452, grad/param norm = 5.7086e-02, time/batch = 0.1692s
406/9500 (epoch 2.137), train_loss = 2.10919469, grad/param norm = 6.0713e-02, time/batch = 0.1696s
407/9500 (epoch 2.142), train_loss = 2.06248607, grad/param norm = 6.4338e-02, time/batch = 0.1692s
408/9500 (epoch 2.147), train_loss = 2.06789037, grad/param norm = 7.2860e-02, time/batch = 0.1691s
409/9500 (epoch 2.153), train_loss = 2.07315750, grad/param norm = 7.3565e-02, time/batch = 0.1702s
410/9500 (epoch 2.158), train_loss = 2.06759863, grad/param norm = 6.7623e-02, time/batch = 0.1685s
411/9500 (epoch 2.163), train_loss = 2.07092345, grad/param norm = 8.2426e-02, time/batch = 0.1680s
412/9500 (epoch 2.168), train_loss = 2.05095377, grad/param norm = 7.8403e-02, time/batch = 0.1699s
413/9500 (epoch 2.174), train_loss = 2.02770036, grad/param norm = 8.7164e-02, time/batch = 0.1692s
414/9500 (epoch 2.179), train_loss = 2.11526294, grad/param norm = 1.0210e-01, time/batch = 0.1692s
415/9500 (epoch 2.184), train_loss = 2.06162541, grad/param norm = 1.0046e-01, time/batch = 0.1696s
416/9500 (epoch 2.189), train_loss = 2.07784310, grad/param norm = 8.9228e-02, time/batch = 0.1689s
417/9500 (epoch 2.195), train_loss = 2.03505677, grad/param norm = 6.7889e-02, time/batch = 0.1692s
418/9500 (epoch 2.200), train_loss = 2.07345061, grad/param norm = 7.4336e-02, time/batch = 0.1698s
419/9500 (epoch 2.205), train_loss = 2.03232973, grad/param norm = 8.4112e-02, time/batch = 0.1679s
420/9500 (epoch 2.211), train_loss = 2.05735177, grad/param norm = 8.2435e-02, time/batch = 0.1694s
421/9500 (epoch 2.216), train_loss = 2.03737761, grad/param norm = 7.5686e-02, time/batch = 0.1674s
422/9500 (epoch 2.221), train_loss = 2.03024358, grad/param norm = 6.1652e-02, time/batch = 0.1685s
423/9500 (epoch 2.226), train_loss = 2.03189111, grad/param norm = 5.7557e-02, time/batch = 0.1695s
424/9500 (epoch 2.232), train_loss = 2.06050623, grad/param norm = 6.2893e-02, time/batch = 0.1693s
425/9500 (epoch 2.237), train_loss = 2.02888651, grad/param norm = 7.4358e-02, time/batch = 0.1702s
426/9500 (epoch 2.242), train_loss = 2.06969070, grad/param norm = 8.0966e-02, time/batch = 0.1690s
427/9500 (epoch 2.247), train_loss = 2.06824634, grad/param norm = 9.4836e-02, time/batch = 0.1699s
428/9500 (epoch 2.253), train_loss = 2.06225039, grad/param norm = 9.7613e-02, time/batch = 0.1701s
429/9500 (epoch 2.258), train_loss = 2.07608420, grad/param norm = 7.9984e-02, time/batch = 0.1687s
430/9500 (epoch 2.263), train_loss = 2.03506608, grad/param norm = 7.6970e-02, time/batch = 0.1693s
431/9500 (epoch 2.268), train_loss = 2.00406190, grad/param norm = 6.7939e-02, time/batch = 0.1682s
432/9500 (epoch 2.274), train_loss = 2.00990194, grad/param norm = 5.4428e-02, time/batch = 0.1694s
433/9500 (epoch 2.279), train_loss = 1.99752350, grad/param norm = 4.7198e-02, time/batch = 0.1694s
434/9500 (epoch 2.284), train_loss = 2.04182310, grad/param norm = 4.6068e-02, time/batch = 0.1696s
435/9500 (epoch 2.289), train_loss = 1.99085840, grad/param norm = 5.8305e-02, time/batch = 0.1690s
436/9500 (epoch 2.295), train_loss = 2.01940316, grad/param norm = 7.5548e-02, time/batch = 0.1694s
437/9500 (epoch 2.300), train_loss = 2.03830709, grad/param norm = 7.2778e-02, time/batch = 0.1698s
438/9500 (epoch 2.305), train_loss = 2.04007906, grad/param norm = 6.0774e-02, time/batch = 0.1691s
439/9500 (epoch 2.311), train_loss = 1.99201864, grad/param norm = 6.5663e-02, time/batch = 0.1689s
440/9500 (epoch 2.316), train_loss = 2.03984353, grad/param norm = 7.8163e-02, time/batch = 0.1696s
441/9500 (epoch 2.321), train_loss = 2.03469516, grad/param norm = 8.6206e-02, time/batch = 0.1678s
442/9500 (epoch 2.326), train_loss = 2.05842210, grad/param norm = 8.1727e-02, time/batch = 0.1689s
443/9500 (epoch 2.332), train_loss = 2.03501492, grad/param norm = 7.5565e-02, time/batch = 0.1700s
444/9500 (epoch 2.337), train_loss = 1.99769835, grad/param norm = 6.3255e-02, time/batch = 0.1691s
445/9500 (epoch 2.342), train_loss = 1.96149824, grad/param norm = 5.1880e-02, time/batch = 0.1691s
446/9500 (epoch 2.347), train_loss = 1.95617614, grad/param norm = 4.6940e-02, time/batch = 0.1693s
447/9500 (epoch 2.353), train_loss = 1.98595711, grad/param norm = 5.0623e-02, time/batch = 0.1694s
448/9500 (epoch 2.358), train_loss = 1.96133847, grad/param norm = 5.9436e-02, time/batch = 0.1695s
449/9500 (epoch 2.363), train_loss = 1.97892025, grad/param norm = 7.3154e-02, time/batch = 0.1686s
450/9500 (epoch 2.368), train_loss = 1.99704723, grad/param norm = 8.0311e-02, time/batch = 0.1698s
451/9500 (epoch 2.374), train_loss = 1.99042685, grad/param norm = 7.1340e-02, time/batch = 0.1676s
452/9500 (epoch 2.379), train_loss = 1.97674540, grad/param norm = 5.9390e-02, time/batch = 0.1695s
453/9500 (epoch 2.384), train_loss = 1.99500454, grad/param norm = 6.1047e-02, time/batch = 0.1693s
454/9500 (epoch 2.389), train_loss = 1.97899568, grad/param norm = 6.4092e-02, time/batch = 0.1693s
455/9500 (epoch 2.395), train_loss = 2.00439911, grad/param norm = 6.6185e-02, time/batch = 0.1701s
456/9500 (epoch 2.400), train_loss = 2.00213442, grad/param norm = 6.4996e-02, time/batch = 0.1690s
457/9500 (epoch 2.405), train_loss = 2.00432563, grad/param norm = 5.5966e-02, time/batch = 0.1698s
458/9500 (epoch 2.411), train_loss = 1.99580312, grad/param norm = 5.5648e-02, time/batch = 0.1690s
459/9500 (epoch 2.416), train_loss = 1.98826416, grad/param norm = 5.5029e-02, time/batch = 0.1697s
460/9500 (epoch 2.421), train_loss = 1.99678118, grad/param norm = 5.8621e-02, time/batch = 0.1701s
461/9500 (epoch 2.426), train_loss = 1.97675613, grad/param norm = 6.0451e-02, time/batch = 0.1679s
462/9500 (epoch 2.432), train_loss = 1.99692520, grad/param norm = 7.1479e-02, time/batch = 0.1695s
463/9500 (epoch 2.437), train_loss = 2.00612945, grad/param norm = 6.7842e-02, time/batch = 0.1698s
464/9500 (epoch 2.442), train_loss = 1.97794791, grad/param norm = 5.8713e-02, time/batch = 0.1687s
465/9500 (epoch 2.447), train_loss = 1.96926267, grad/param norm = 5.9240e-02, time/batch = 0.1695s
466/9500 (epoch 2.453), train_loss = 1.98670712, grad/param norm = 6.5272e-02, time/batch = 0.1697s
467/9500 (epoch 2.458), train_loss = 1.96926095, grad/param norm = 7.8357e-02, time/batch = 0.1694s
468/9500 (epoch 2.463), train_loss = 2.01704074, grad/param norm = 9.2227e-02, time/batch = 0.1695s
469/9500 (epoch 2.468), train_loss = 1.98767720, grad/param norm = 9.1166e-02, time/batch = 0.1698s
470/9500 (epoch 2.474), train_loss = 1.99060822, grad/param norm = 7.4611e-02, time/batch = 0.1695s
471/9500 (epoch 2.479), train_loss = 1.96081599, grad/param norm = 6.8998e-02, time/batch = 0.1675s
472/9500 (epoch 2.484), train_loss = 1.96887892, grad/param norm = 5.5681e-02, time/batch = 0.1699s
473/9500 (epoch 2.489), train_loss = 1.97682938, grad/param norm = 4.6408e-02, time/batch = 0.1693s
474/9500 (epoch 2.495), train_loss = 1.95999201, grad/param norm = 4.6819e-02, time/batch = 0.1693s
475/9500 (epoch 2.500), train_loss = 1.94560495, grad/param norm = 5.1971e-02, time/batch = 0.1704s
476/9500 (epoch 2.505), train_loss = 1.91802483, grad/param norm = 5.7015e-02, time/batch = 0.1697s
477/9500 (epoch 2.511), train_loss = 1.97389381, grad/param norm = 6.5461e-02, time/batch = 0.1692s
478/9500 (epoch 2.516), train_loss = 1.97972420, grad/param norm = 6.6176e-02, time/batch = 0.1692s
479/9500 (epoch 2.521), train_loss = 1.98788434, grad/param norm = 7.0251e-02, time/batch = 0.1702s
480/9500 (epoch 2.526), train_loss = 2.01638247, grad/param norm = 8.6020e-02, time/batch = 0.1691s
481/9500 (epoch 2.532), train_loss = 2.04740840, grad/param norm = 7.4106e-02, time/batch = 0.1676s
482/9500 (epoch 2.537), train_loss = 1.97773888, grad/param norm = 5.6706e-02, time/batch = 0.1684s
483/9500 (epoch 2.542), train_loss = 1.94007581, grad/param norm = 6.0240e-02, time/batch = 0.1697s
484/9500 (epoch 2.547), train_loss = 1.95144682, grad/param norm = 5.7884e-02, time/batch = 0.1696s
485/9500 (epoch 2.553), train_loss = 1.93781355, grad/param norm = 5.0974e-02, time/batch = 0.1701s
486/9500 (epoch 2.558), train_loss = 1.89508384, grad/param norm = 4.6867e-02, time/batch = 0.1693s
487/9500 (epoch 2.563), train_loss = 1.91421286, grad/param norm = 5.1458e-02, time/batch = 0.1690s
488/9500 (epoch 2.568), train_loss = 1.94030935, grad/param norm = 6.5399e-02, time/batch = 0.1699s
489/9500 (epoch 2.574), train_loss = 1.95183559, grad/param norm = 7.4217e-02, time/batch = 0.1688s
490/9500 (epoch 2.579), train_loss = 1.92077214, grad/param norm = 6.7566e-02, time/batch = 0.1692s
491/9500 (epoch 2.584), train_loss = 1.93496325, grad/param norm = 5.8619e-02, time/batch = 0.1685s
492/9500 (epoch 2.589), train_loss = 1.89806318, grad/param norm = 5.0039e-02, time/batch = 0.1693s
493/9500 (epoch 2.595), train_loss = 1.92080993, grad/param norm = 5.6614e-02, time/batch = 0.1691s
494/9500 (epoch 2.600), train_loss = 1.94387876, grad/param norm = 5.8456e-02, time/batch = 0.1699s
495/9500 (epoch 2.605), train_loss = 1.93259562, grad/param norm = 6.3409e-02, time/batch = 0.1695s
496/9500 (epoch 2.611), train_loss = 1.95020174, grad/param norm = 6.7488e-02, time/batch = 0.1700s
497/9500 (epoch 2.616), train_loss = 1.94207162, grad/param norm = 6.4943e-02, time/batch = 0.1703s
498/9500 (epoch 2.621), train_loss = 1.93701231, grad/param norm = 5.8499e-02, time/batch = 0.1691s
499/9500 (epoch 2.626), train_loss = 1.90927963, grad/param norm = 5.0635e-02, time/batch = 0.1695s
500/9500 (epoch 2.632), train_loss = 1.90199680, grad/param norm = 4.7572e-02, time/batch = 0.1701s
501/9500 (epoch 2.637), train_loss = 1.87797126, grad/param norm = 4.5660e-02, time/batch = 0.1676s
502/9500 (epoch 2.642), train_loss = 1.92205492, grad/param norm = 5.0526e-02, time/batch = 0.1695s
503/9500 (epoch 2.647), train_loss = 1.93068647, grad/param norm = 5.3872e-02, time/batch = 0.1703s
504/9500 (epoch 2.653), train_loss = 1.92930004, grad/param norm = 5.7283e-02, time/batch = 0.1695s
505/9500 (epoch 2.658), train_loss = 1.92525824, grad/param norm = 6.1745e-02, time/batch = 0.1693s
506/9500 (epoch 2.663), train_loss = 1.93505249, grad/param norm = 6.2759e-02, time/batch = 0.1690s
507/9500 (epoch 2.668), train_loss = 1.92402983, grad/param norm = 5.9564e-02, time/batch = 0.1695s
508/9500 (epoch 2.674), train_loss = 1.96721907, grad/param norm = 5.9207e-02, time/batch = 0.1687s
509/9500 (epoch 2.679), train_loss = 1.95285371, grad/param norm = 5.9355e-02, time/batch = 0.1696s
510/9500 (epoch 2.684), train_loss = 1.91990795, grad/param norm = 5.8127e-02, time/batch = 0.1700s
511/9500 (epoch 2.689), train_loss = 1.92139935, grad/param norm = 5.5599e-02, time/batch = 0.1674s
512/9500 (epoch 2.695), train_loss = 1.90295196, grad/param norm = 4.7926e-02, time/batch = 0.1694s
513/9500 (epoch 2.700), train_loss = 1.90097223, grad/param norm = 4.4270e-02, time/batch = 0.1700s
514/9500 (epoch 2.705), train_loss = 1.91268325, grad/param norm = 4.3302e-02, time/batch = 0.1695s
515/9500 (epoch 2.711), train_loss = 1.93182104, grad/param norm = 4.6179e-02, time/batch = 0.1693s
516/9500 (epoch 2.716), train_loss = 1.91054800, grad/param norm = 5.1577e-02, time/batch = 0.1701s
517/9500 (epoch 2.721), train_loss = 1.87174448, grad/param norm = 6.8137e-02, time/batch = 0.1691s
518/9500 (epoch 2.726), train_loss = 1.90458450, grad/param norm = 7.3484e-02, time/batch = 0.1694s
519/9500 (epoch 2.732), train_loss = 1.88732736, grad/param norm = 7.2377e-02, time/batch = 0.1703s
520/9500 (epoch 2.737), train_loss = 1.94510425, grad/param norm = 6.7922e-02, time/batch = 0.1695s
521/9500 (epoch 2.742), train_loss = 1.88885537, grad/param norm = 6.1867e-02, time/batch = 0.1678s
522/9500 (epoch 2.747), train_loss = 1.90226313, grad/param norm = 6.0708e-02, time/batch = 0.1697s
523/9500 (epoch 2.753), train_loss = 1.89149776, grad/param norm = 4.9649e-02, time/batch = 0.1695s
524/9500 (epoch 2.758), train_loss = 1.91792841, grad/param norm = 4.6505e-02, time/batch = 0.1692s
525/9500 (epoch 2.763), train_loss = 1.85387295, grad/param norm = 4.6291e-02, time/batch = 0.1706s
526/9500 (epoch 2.768), train_loss = 1.87974305, grad/param norm = 4.6410e-02, time/batch = 0.1692s
527/9500 (epoch 2.774), train_loss = 1.87084890, grad/param norm = 5.7665e-02, time/batch = 0.1694s
528/9500 (epoch 2.779), train_loss = 1.90143973, grad/param norm = 5.4203e-02, time/batch = 0.1700s
529/9500 (epoch 2.784), train_loss = 1.86293674, grad/param norm = 5.1877e-02, time/batch = 0.1696s
530/9500 (epoch 2.789), train_loss = 1.87207224, grad/param norm = 5.2311e-02, time/batch = 0.1695s
531/9500 (epoch 2.795), train_loss = 1.87906315, grad/param norm = 5.2316e-02, time/batch = 0.1684s
532/9500 (epoch 2.800), train_loss = 1.87079769, grad/param norm = 4.8031e-02, time/batch = 0.1691s
533/9500 (epoch 2.805), train_loss = 1.81728657, grad/param norm = 4.0500e-02, time/batch = 0.1700s
534/9500 (epoch 2.811), train_loss = 1.85256853, grad/param norm = 4.1415e-02, time/batch = 0.1694s
535/9500 (epoch 2.816), train_loss = 1.83228573, grad/param norm = 4.2823e-02, time/batch = 0.1695s
536/9500 (epoch 2.821), train_loss = 1.84020066, grad/param norm = 4.5536e-02, time/batch = 0.1701s
537/9500 (epoch 2.826), train_loss = 1.87035671, grad/param norm = 4.5981e-02, time/batch = 0.1696s
538/9500 (epoch 2.832), train_loss = 1.85629068, grad/param norm = 4.7473e-02, time/batch = 0.1696s
539/9500 (epoch 2.837), train_loss = 1.85260880, grad/param norm = 4.7161e-02, time/batch = 0.1692s
540/9500 (epoch 2.842), train_loss = 1.87969712, grad/param norm = 5.4972e-02, time/batch = 0.1690s
541/9500 (epoch 2.847), train_loss = 1.91051828, grad/param norm = 6.8706e-02, time/batch = 0.1685s
542/9500 (epoch 2.853), train_loss = 1.90196894, grad/param norm = 7.6208e-02, time/batch = 0.1694s
543/9500 (epoch 2.858), train_loss = 1.88901695, grad/param norm = 5.8855e-02, time/batch = 0.1693s
544/9500 (epoch 2.863), train_loss = 1.87920945, grad/param norm = 4.5858e-02, time/batch = 0.1700s
545/9500 (epoch 2.868), train_loss = 1.84687009, grad/param norm = 4.0043e-02, time/batch = 0.1693s
546/9500 (epoch 2.874), train_loss = 1.83711427, grad/param norm = 3.6636e-02, time/batch = 0.1691s
547/9500 (epoch 2.879), train_loss = 1.81276314, grad/param norm = 3.8621e-02, time/batch = 0.1702s
548/9500 (epoch 2.884), train_loss = 1.85328957, grad/param norm = 4.5395e-02, time/batch = 0.1698s
549/9500 (epoch 2.889), train_loss = 1.85911679, grad/param norm = 5.6605e-02, time/batch = 0.1687s
550/9500 (epoch 2.895), train_loss = 1.84301453, grad/param norm = 5.9792e-02, time/batch = 0.1703s
551/9500 (epoch 2.900), train_loss = 1.85224115, grad/param norm = 6.5236e-02, time/batch = 0.1674s
552/9500 (epoch 2.905), train_loss = 1.90556280, grad/param norm = 6.8136e-02, time/batch = 0.1705s
553/9500 (epoch 2.911), train_loss = 1.82926530, grad/param norm = 5.9384e-02, time/batch = 0.1701s
554/9500 (epoch 2.916), train_loss = 1.83039433, grad/param norm = 4.8793e-02, time/batch = 0.1689s
555/9500 (epoch 2.921), train_loss = 1.82266969, grad/param norm = 4.2242e-02, time/batch = 0.1696s
556/9500 (epoch 2.926), train_loss = 1.84657590, grad/param norm = 4.5730e-02, time/batch = 0.1695s
557/9500 (epoch 2.932), train_loss = 1.84131256, grad/param norm = 4.8838e-02, time/batch = 0.1694s
558/9500 (epoch 2.937), train_loss = 1.86687087, grad/param norm = 5.3601e-02, time/batch = 0.1694s
559/9500 (epoch 2.942), train_loss = 1.86916329, grad/param norm = 6.2592e-02, time/batch = 0.1692s
560/9500 (epoch 2.947), train_loss = 1.87084273, grad/param norm = 5.9372e-02, time/batch = 0.1698s
561/9500 (epoch 2.953), train_loss = 1.87519591, grad/param norm = 4.9984e-02, time/batch = 0.1678s
562/9500 (epoch 2.958), train_loss = 1.85490094, grad/param norm = 4.5317e-02, time/batch = 0.1691s
563/9500 (epoch 2.963), train_loss = 1.84892019, grad/param norm = 4.5259e-02, time/batch = 0.1701s
564/9500 (epoch 2.968), train_loss = 1.84495144, grad/param norm = 4.4997e-02, time/batch = 0.1693s
565/9500 (epoch 2.974), train_loss = 1.83812212, grad/param norm = 4.5136e-02, time/batch = 0.1694s
566/9500 (epoch 2.979), train_loss = 1.85847751, grad/param norm = 5.1556e-02, time/batch = 0.1704s
567/9500 (epoch 2.984), train_loss = 1.83807545, grad/param norm = 5.1985e-02, time/batch = 0.1691s
568/9500 (epoch 2.989), train_loss = 1.86318513, grad/param norm = 4.9326e-02, time/batch = 0.1696s
569/9500 (epoch 2.995), train_loss = 1.87010762, grad/param norm = 4.3792e-02, time/batch = 0.1701s
570/9500 (epoch 3.000), train_loss = 1.86753634, grad/param norm = 4.1799e-02, time/batch = 0.1690s
571/9500 (epoch 3.005), train_loss = 1.95537313, grad/param norm = 4.1294e-02, time/batch = 0.1676s
572/9500 (epoch 3.011), train_loss = 1.85101116, grad/param norm = 4.4363e-02, time/batch = 0.1698s
573/9500 (epoch 3.016), train_loss = 1.87318309, grad/param norm = 4.4513e-02, time/batch = 0.1695s
574/9500 (epoch 3.021), train_loss = 1.83887270, grad/param norm = 4.6123e-02, time/batch = 0.1690s
575/9500 (epoch 3.026), train_loss = 1.86353029, grad/param norm = 4.7851e-02, time/batch = 0.1697s
576/9500 (epoch 3.032), train_loss = 1.86116394, grad/param norm = 4.7900e-02, time/batch = 0.1692s
577/9500 (epoch 3.037), train_loss = 1.81750501, grad/param norm = 4.5996e-02, time/batch = 0.1694s
578/9500 (epoch 3.042), train_loss = 1.81265029, grad/param norm = 4.7966e-02, time/batch = 0.1701s
579/9500 (epoch 3.047), train_loss = 1.83816731, grad/param norm = 4.4847e-02, time/batch = 0.1690s
580/9500 (epoch 3.053), train_loss = 1.83383073, grad/param norm = 4.5210e-02, time/batch = 0.1692s
581/9500 (epoch 3.058), train_loss = 1.85912119, grad/param norm = 4.4638e-02, time/batch = 0.1679s
582/9500 (epoch 3.063), train_loss = 1.80435811, grad/param norm = 5.1897e-02, time/batch = 0.1694s
583/9500 (epoch 3.068), train_loss = 1.81365465, grad/param norm = 6.0729e-02, time/batch = 0.1694s
584/9500 (epoch 3.074), train_loss = 1.85385983, grad/param norm = 7.0284e-02, time/batch = 0.1695s
585/9500 (epoch 3.079), train_loss = 1.84989352, grad/param norm = 6.9419e-02, time/batch = 0.1692s
586/9500 (epoch 3.084), train_loss = 1.90681555, grad/param norm = 6.7327e-02, time/batch = 0.1695s
587/9500 (epoch 3.089), train_loss = 1.88156702, grad/param norm = 6.6523e-02, time/batch = 0.1695s
588/9500 (epoch 3.095), train_loss = 1.82472332, grad/param norm = 5.1840e-02, time/batch = 0.1687s
589/9500 (epoch 3.100), train_loss = 1.82327357, grad/param norm = 4.4981e-02, time/batch = 0.1697s
590/9500 (epoch 3.105), train_loss = 1.86524602, grad/param norm = 4.0010e-02, time/batch = 0.1692s
591/9500 (epoch 3.111), train_loss = 1.77854895, grad/param norm = 3.9045e-02, time/batch = 0.1683s
592/9500 (epoch 3.116), train_loss = 1.79578777, grad/param norm = 4.2107e-02, time/batch = 0.1694s
593/9500 (epoch 3.121), train_loss = 1.81334460, grad/param norm = 4.2988e-02, time/batch = 0.1693s
594/9500 (epoch 3.126), train_loss = 1.83485083, grad/param norm = 4.5583e-02, time/batch = 0.1700s
595/9500 (epoch 3.132), train_loss = 1.84276733, grad/param norm = 4.8822e-02, time/batch = 0.1694s
596/9500 (epoch 3.137), train_loss = 1.87020520, grad/param norm = 4.3081e-02, time/batch = 0.1691s
597/9500 (epoch 3.142), train_loss = 1.80029679, grad/param norm = 3.6300e-02, time/batch = 0.1697s
598/9500 (epoch 3.147), train_loss = 1.78793058, grad/param norm = 4.1272e-02, time/batch = 0.1697s
599/9500 (epoch 3.153), train_loss = 1.81768034, grad/param norm = 4.2506e-02, time/batch = 0.1696s
600/9500 (epoch 3.158), train_loss = 1.80982726, grad/param norm = 3.8675e-02, time/batch = 0.1700s
601/9500 (epoch 3.163), train_loss = 1.83936877, grad/param norm = 4.3528e-02, time/batch = 0.1684s
602/9500 (epoch 3.168), train_loss = 1.80401972, grad/param norm = 4.4485e-02, time/batch = 0.1698s
603/9500 (epoch 3.174), train_loss = 1.79040774, grad/param norm = 4.3550e-02, time/batch = 0.1700s
604/9500 (epoch 3.179), train_loss = 1.86320946, grad/param norm = 4.4685e-02, time/batch = 0.1691s
605/9500 (epoch 3.184), train_loss = 1.80377782, grad/param norm = 4.6752e-02, time/batch = 0.1691s
606/9500 (epoch 3.189), train_loss = 1.83076673, grad/param norm = 4.7664e-02, time/batch = 0.1704s
607/9500 (epoch 3.195), train_loss = 1.79920444, grad/param norm = 4.1736e-02, time/batch = 0.1691s
608/9500 (epoch 3.200), train_loss = 1.80921168, grad/param norm = 4.3235e-02, time/batch = 0.1699s
609/9500 (epoch 3.205), train_loss = 1.78054574, grad/param norm = 4.7414e-02, time/batch = 0.1698s
610/9500 (epoch 3.211), train_loss = 1.81060990, grad/param norm = 4.7518e-02, time/batch = 0.1699s
611/9500 (epoch 3.216), train_loss = 1.81145240, grad/param norm = 4.3713e-02, time/batch = 0.1676s
612/9500 (epoch 3.221), train_loss = 1.80558528, grad/param norm = 3.5515e-02, time/batch = 0.1705s
613/9500 (epoch 3.226), train_loss = 1.81210315, grad/param norm = 3.8165e-02, time/batch = 0.1690s
614/9500 (epoch 3.232), train_loss = 1.82753631, grad/param norm = 4.4103e-02, time/batch = 0.1693s
615/9500 (epoch 3.237), train_loss = 1.81629027, grad/param norm = 4.7005e-02, time/batch = 0.1699s
616/9500 (epoch 3.242), train_loss = 1.83512748, grad/param norm = 4.6471e-02, time/batch = 0.1699s
617/9500 (epoch 3.247), train_loss = 1.81731710, grad/param norm = 4.5593e-02, time/batch = 0.1691s
618/9500 (epoch 3.253), train_loss = 1.80058848, grad/param norm = 4.3190e-02, time/batch = 0.1694s
619/9500 (epoch 3.258), train_loss = 1.81200947, grad/param norm = 4.3211e-02, time/batch = 0.1696s
620/9500 (epoch 3.263), train_loss = 1.77170477, grad/param norm = 4.7676e-02, time/batch = 0.1693s
621/9500 (epoch 3.268), train_loss = 1.76843055, grad/param norm = 5.0132e-02, time/batch = 0.1676s
622/9500 (epoch 3.274), train_loss = 1.80059041, grad/param norm = 5.5416e-02, time/batch = 0.1695s
623/9500 (epoch 3.279), train_loss = 1.81564609, grad/param norm = 5.9781e-02, time/batch = 0.1697s
624/9500 (epoch 3.284), train_loss = 1.83986687, grad/param norm = 6.5311e-02, time/batch = 0.1699s
625/9500 (epoch 3.289), train_loss = 1.78611576, grad/param norm = 6.4258e-02, time/batch = 0.1694s
626/9500 (epoch 3.295), train_loss = 1.78806943, grad/param norm = 4.8388e-02, time/batch = 0.1700s
627/9500 (epoch 3.300), train_loss = 1.81478295, grad/param norm = 3.9739e-02, time/batch = 0.1695s
628/9500 (epoch 3.305), train_loss = 1.80561443, grad/param norm = 3.8853e-02, time/batch = 0.1690s
629/9500 (epoch 3.311), train_loss = 1.76836010, grad/param norm = 4.2556e-02, time/batch = 0.1701s
630/9500 (epoch 3.316), train_loss = 1.80009086, grad/param norm = 4.2463e-02, time/batch = 0.1691s
631/9500 (epoch 3.321), train_loss = 1.80444720, grad/param norm = 4.2388e-02, time/batch = 0.1679s
632/9500 (epoch 3.326), train_loss = 1.82457549, grad/param norm = 4.2321e-02, time/batch = 0.1705s
633/9500 (epoch 3.332), train_loss = 1.80834429, grad/param norm = 3.9830e-02, time/batch = 0.1691s
634/9500 (epoch 3.337), train_loss = 1.78631243, grad/param norm = 3.7436e-02, time/batch = 0.1699s
635/9500 (epoch 3.342), train_loss = 1.76927459, grad/param norm = 3.6327e-02, time/batch = 0.1700s
636/9500 (epoch 3.347), train_loss = 1.76974510, grad/param norm = 3.8331e-02, time/batch = 0.1690s
637/9500 (epoch 3.353), train_loss = 1.78676890, grad/param norm = 4.2768e-02, time/batch = 0.1695s
638/9500 (epoch 3.358), train_loss = 1.76189095, grad/param norm = 4.1431e-02, time/batch = 0.1703s
639/9500 (epoch 3.363), train_loss = 1.75680068, grad/param norm = 4.4314e-02, time/batch = 0.1696s
640/9500 (epoch 3.368), train_loss = 1.77559983, grad/param norm = 5.2606e-02, time/batch = 0.1694s
641/9500 (epoch 3.374), train_loss = 1.80623517, grad/param norm = 5.6592e-02, time/batch = 0.1686s
642/9500 (epoch 3.379), train_loss = 1.78997317, grad/param norm = 4.6973e-02, time/batch = 0.1694s
643/9500 (epoch 3.384), train_loss = 1.77873374, grad/param norm = 3.8688e-02, time/batch = 0.1697s
644/9500 (epoch 3.389), train_loss = 1.77711238, grad/param norm = 3.7824e-02, time/batch = 0.1703s
645/9500 (epoch 3.395), train_loss = 1.77372936, grad/param norm = 3.6224e-02, time/batch = 0.1691s
646/9500 (epoch 3.400), train_loss = 1.80325129, grad/param norm = 3.9344e-02, time/batch = 0.1692s
647/9500 (epoch 3.405), train_loss = 1.80599959, grad/param norm = 3.8560e-02, time/batch = 0.1695s
648/9500 (epoch 3.411), train_loss = 1.79874394, grad/param norm = 3.8413e-02, time/batch = 0.1697s
649/9500 (epoch 3.416), train_loss = 1.79162630, grad/param norm = 3.9938e-02, time/batch = 0.1693s
650/9500 (epoch 3.421), train_loss = 1.80168123, grad/param norm = 3.5983e-02, time/batch = 0.1692s
651/9500 (epoch 3.426), train_loss = 1.76331738, grad/param norm = 3.6479e-02, time/batch = 0.1681s
652/9500 (epoch 3.432), train_loss = 1.77353806, grad/param norm = 3.9183e-02, time/batch = 0.1689s
653/9500 (epoch 3.437), train_loss = 1.80122052, grad/param norm = 4.2918e-02, time/batch = 0.1693s
654/9500 (epoch 3.442), train_loss = 1.80728120, grad/param norm = 4.4306e-02, time/batch = 0.1698s
655/9500 (epoch 3.447), train_loss = 1.77951903, grad/param norm = 4.7183e-02, time/batch = 0.1691s
656/9500 (epoch 3.453), train_loss = 1.78742464, grad/param norm = 4.6887e-02, time/batch = 0.1697s
657/9500 (epoch 3.458), train_loss = 1.76388273, grad/param norm = 4.8532e-02, time/batch = 0.1705s
658/9500 (epoch 3.463), train_loss = 1.80837691, grad/param norm = 4.9809e-02, time/batch = 0.1694s
659/9500 (epoch 3.468), train_loss = 1.78282047, grad/param norm = 5.2433e-02, time/batch = 0.1695s
660/9500 (epoch 3.474), train_loss = 1.79175293, grad/param norm = 5.3311e-02, time/batch = 0.1699s
661/9500 (epoch 3.479), train_loss = 1.76551908, grad/param norm = 4.3421e-02, time/batch = 0.1676s
662/9500 (epoch 3.484), train_loss = 1.78204399, grad/param norm = 3.9790e-02, time/batch = 0.1697s
663/9500 (epoch 3.489), train_loss = 1.79288179, grad/param norm = 3.9606e-02, time/batch = 0.1702s
664/9500 (epoch 3.495), train_loss = 1.79020545, grad/param norm = 4.2361e-02, time/batch = 0.1694s
665/9500 (epoch 3.500), train_loss = 1.76409814, grad/param norm = 4.0439e-02, time/batch = 0.1698s
666/9500 (epoch 3.505), train_loss = 1.75201327, grad/param norm = 4.0866e-02, time/batch = 0.1699s
667/9500 (epoch 3.511), train_loss = 1.80777675, grad/param norm = 4.2955e-02, time/batch = 0.1692s
668/9500 (epoch 3.516), train_loss = 1.78990136, grad/param norm = 3.8255e-02, time/batch = 0.1693s
669/9500 (epoch 3.521), train_loss = 1.75501078, grad/param norm = 3.4115e-02, time/batch = 0.1707s
670/9500 (epoch 3.526), train_loss = 1.77833203, grad/param norm = 3.5530e-02, time/batch = 0.1692s
671/9500 (epoch 3.532), train_loss = 1.81386884, grad/param norm = 3.5689e-02, time/batch = 0.1682s
672/9500 (epoch 3.537), train_loss = 1.78632731, grad/param norm = 3.3735e-02, time/batch = 0.1704s
673/9500 (epoch 3.542), train_loss = 1.75348978, grad/param norm = 3.8147e-02, time/batch = 0.1690s
674/9500 (epoch 3.547), train_loss = 1.77019059, grad/param norm = 3.8867e-02, time/batch = 0.1692s
675/9500 (epoch 3.553), train_loss = 1.74064991, grad/param norm = 3.8122e-02, time/batch = 0.1693s
676/9500 (epoch 3.558), train_loss = 1.73402519, grad/param norm = 3.6855e-02, time/batch = 0.1697s
677/9500 (epoch 3.563), train_loss = 1.73643238, grad/param norm = 3.8880e-02, time/batch = 0.1691s
678/9500 (epoch 3.568), train_loss = 1.76063218, grad/param norm = 3.9919e-02, time/batch = 0.1694s
679/9500 (epoch 3.574), train_loss = 1.76949089, grad/param norm = 4.3115e-02, time/batch = 0.1706s
680/9500 (epoch 3.579), train_loss = 1.74841327, grad/param norm = 4.3680e-02, time/batch = 0.1695s
681/9500 (epoch 3.584), train_loss = 1.77670381, grad/param norm = 4.4488e-02, time/batch = 0.1678s
682/9500 (epoch 3.589), train_loss = 1.72950514, grad/param norm = 4.1484e-02, time/batch = 0.1702s
683/9500 (epoch 3.595), train_loss = 1.75080619, grad/param norm = 4.4720e-02, time/batch = 0.1691s
684/9500 (epoch 3.600), train_loss = 1.76947150, grad/param norm = 4.4697e-02, time/batch = 0.1698s
685/9500 (epoch 3.605), train_loss = 1.75174712, grad/param norm = 4.0703e-02, time/batch = 0.1704s
686/9500 (epoch 3.611), train_loss = 1.75587249, grad/param norm = 3.6475e-02, time/batch = 0.1695s
687/9500 (epoch 3.616), train_loss = 1.75002475, grad/param norm = 3.3846e-02, time/batch = 0.1691s
688/9500 (epoch 3.621), train_loss = 1.75332332, grad/param norm = 3.3035e-02, time/batch = 0.1706s
689/9500 (epoch 3.626), train_loss = 1.71537692, grad/param norm = 3.4892e-02, time/batch = 0.1691s
690/9500 (epoch 3.632), train_loss = 1.76720503, grad/param norm = 3.9488e-02, time/batch = 0.1692s
691/9500 (epoch 3.637), train_loss = 1.71771265, grad/param norm = 4.6831e-02, time/batch = 0.1683s
692/9500 (epoch 3.642), train_loss = 1.77643880, grad/param norm = 5.6509e-02, time/batch = 0.1694s
693/9500 (epoch 3.647), train_loss = 1.78119175, grad/param norm = 6.1088e-02, time/batch = 0.1697s
694/9500 (epoch 3.653), train_loss = 1.78204781, grad/param norm = 5.3512e-02, time/batch = 0.1701s
695/9500 (epoch 3.658), train_loss = 1.73908107, grad/param norm = 3.6930e-02, time/batch = 0.1691s
696/9500 (epoch 3.663), train_loss = 1.74140477, grad/param norm = 3.4874e-02, time/batch = 0.1693s
697/9500 (epoch 3.668), train_loss = 1.73128996, grad/param norm = 3.5438e-02, time/batch = 0.1699s
698/9500 (epoch 3.674), train_loss = 1.77941173, grad/param norm = 3.6272e-02, time/batch = 0.1694s
699/9500 (epoch 3.679), train_loss = 1.77278441, grad/param norm = 3.7084e-02, time/batch = 0.1692s
700/9500 (epoch 3.684), train_loss = 1.76013648, grad/param norm = 4.3550e-02, time/batch = 0.1695s
701/9500 (epoch 3.689), train_loss = 1.76204141, grad/param norm = 4.2121e-02, time/batch = 0.1676s
702/9500 (epoch 3.695), train_loss = 1.73786248, grad/param norm = 3.4464e-02, time/batch = 0.1696s
703/9500 (epoch 3.700), train_loss = 1.74876335, grad/param norm = 3.4461e-02, time/batch = 0.1694s
704/9500 (epoch 3.705), train_loss = 1.76601402, grad/param norm = 3.3359e-02, time/batch = 0.1698s
705/9500 (epoch 3.711), train_loss = 1.77734517, grad/param norm = 3.4287e-02, time/batch = 0.1694s
706/9500 (epoch 3.716), train_loss = 1.75355460, grad/param norm = 3.5843e-02, time/batch = 0.1697s
707/9500 (epoch 3.721), train_loss = 1.69308728, grad/param norm = 3.7051e-02, time/batch = 0.1694s
708/9500 (epoch 3.726), train_loss = 1.72896867, grad/param norm = 3.7128e-02, time/batch = 0.1692s
709/9500 (epoch 3.732), train_loss = 1.70397047, grad/param norm = 3.7675e-02, time/batch = 0.1694s
710/9500 (epoch 3.737), train_loss = 1.77687540, grad/param norm = 3.6867e-02, time/batch = 0.1702s
711/9500 (epoch 3.742), train_loss = 1.71585302, grad/param norm = 3.4215e-02, time/batch = 0.1677s
712/9500 (epoch 3.747), train_loss = 1.69915559, grad/param norm = 3.5681e-02, time/batch = 0.1691s
713/9500 (epoch 3.753), train_loss = 1.72366100, grad/param norm = 3.4640e-02, time/batch = 0.1699s
714/9500 (epoch 3.758), train_loss = 1.74328835, grad/param norm = 3.2972e-02, time/batch = 0.1688s
715/9500 (epoch 3.763), train_loss = 1.70238537, grad/param norm = 3.2979e-02, time/batch = 0.1696s
716/9500 (epoch 3.768), train_loss = 1.72236604, grad/param norm = 3.3106e-02, time/batch = 0.1697s
717/9500 (epoch 3.774), train_loss = 1.72975173, grad/param norm = 4.6342e-02, time/batch = 0.1694s
718/9500 (epoch 3.779), train_loss = 1.77293656, grad/param norm = 5.0032e-02, time/batch = 0.1694s
719/9500 (epoch 3.784), train_loss = 1.73741289, grad/param norm = 4.6442e-02, time/batch = 0.1701s
720/9500 (epoch 3.789), train_loss = 1.74295212, grad/param norm = 4.6059e-02, time/batch = 0.1695s
721/9500 (epoch 3.795), train_loss = 1.75090654, grad/param norm = 4.7042e-02, time/batch = 0.1677s
722/9500 (epoch 3.800), train_loss = 1.73434170, grad/param norm = 4.5166e-02, time/batch = 0.1699s
723/9500 (epoch 3.805), train_loss = 1.68175491, grad/param norm = 3.8617e-02, time/batch = 0.1694s
724/9500 (epoch 3.811), train_loss = 1.68830641, grad/param norm = 3.6323e-02, time/batch = 0.1693s
725/9500 (epoch 3.816), train_loss = 1.68355467, grad/param norm = 3.2484e-02, time/batch = 0.1702s
726/9500 (epoch 3.821), train_loss = 1.69512617, grad/param norm = 3.3447e-02, time/batch = 0.1690s
727/9500 (epoch 3.826), train_loss = 1.71940731, grad/param norm = 3.4875e-02, time/batch = 0.1693s
728/9500 (epoch 3.832), train_loss = 1.71243436, grad/param norm = 3.3385e-02, time/batch = 0.1690s
729/9500 (epoch 3.837), train_loss = 1.70550854, grad/param norm = 3.1973e-02, time/batch = 0.1697s
730/9500 (epoch 3.842), train_loss = 1.71293974, grad/param norm = 3.4717e-02, time/batch = 0.1693s
731/9500 (epoch 3.847), train_loss = 1.74060954, grad/param norm = 3.6998e-02, time/batch = 0.1681s
732/9500 (epoch 3.853), train_loss = 1.72798499, grad/param norm = 3.6935e-02, time/batch = 0.1698s
733/9500 (epoch 3.858), train_loss = 1.70368571, grad/param norm = 3.6158e-02, time/batch = 0.1691s
734/9500 (epoch 3.863), train_loss = 1.73405166, grad/param norm = 3.8619e-02, time/batch = 0.1696s
735/9500 (epoch 3.868), train_loss = 1.70903681, grad/param norm = 4.1595e-02, time/batch = 0.1700s
736/9500 (epoch 3.874), train_loss = 1.71198540, grad/param norm = 4.4412e-02, time/batch = 0.1695s
737/9500 (epoch 3.879), train_loss = 1.69003823, grad/param norm = 4.5561e-02, time/batch = 0.1693s
738/9500 (epoch 3.884), train_loss = 1.71900720, grad/param norm = 4.2549e-02, time/batch = 0.1702s
739/9500 (epoch 3.889), train_loss = 1.72340558, grad/param norm = 4.1286e-02, time/batch = 0.1691s
740/9500 (epoch 3.895), train_loss = 1.67800264, grad/param norm = 4.0967e-02, time/batch = 0.1691s
741/9500 (epoch 3.900), train_loss = 1.68586576, grad/param norm = 4.1443e-02, time/batch = 0.1684s
742/9500 (epoch 3.905), train_loss = 1.73222011, grad/param norm = 4.2077e-02, time/batch = 0.1694s
743/9500 (epoch 3.911), train_loss = 1.68591813, grad/param norm = 4.2611e-02, time/batch = 0.1692s
744/9500 (epoch 3.916), train_loss = 1.68500135, grad/param norm = 4.0575e-02, time/batch = 0.1700s
745/9500 (epoch 3.921), train_loss = 1.67930648, grad/param norm = 3.0952e-02, time/batch = 0.1689s
746/9500 (epoch 3.926), train_loss = 1.70040827, grad/param norm = 3.1877e-02, time/batch = 0.1700s
747/9500 (epoch 3.932), train_loss = 1.68964041, grad/param norm = 3.2007e-02, time/batch = 0.1698s
748/9500 (epoch 3.937), train_loss = 1.72014636, grad/param norm = 3.3790e-02, time/batch = 0.1692s
749/9500 (epoch 3.942), train_loss = 1.71274555, grad/param norm = 3.9030e-02, time/batch = 0.1696s
750/9500 (epoch 3.947), train_loss = 1.73294349, grad/param norm = 3.9198e-02, time/batch = 0.1696s
751/9500 (epoch 3.953), train_loss = 1.73659447, grad/param norm = 4.3977e-02, time/batch = 0.1677s
752/9500 (epoch 3.958), train_loss = 1.73771305, grad/param norm = 4.4042e-02, time/batch = 0.1695s
753/9500 (epoch 3.963), train_loss = 1.71932483, grad/param norm = 3.7578e-02, time/batch = 0.1700s
754/9500 (epoch 3.968), train_loss = 1.69500429, grad/param norm = 3.3688e-02, time/batch = 0.1695s
755/9500 (epoch 3.974), train_loss = 1.70278014, grad/param norm = 3.2687e-02, time/batch = 0.1695s
756/9500 (epoch 3.979), train_loss = 1.71677118, grad/param norm = 3.4197e-02, time/batch = 0.1695s
757/9500 (epoch 3.984), train_loss = 1.69726287, grad/param norm = 3.3633e-02, time/batch = 0.1702s
758/9500 (epoch 3.989), train_loss = 1.72809759, grad/param norm = 3.3131e-02, time/batch = 0.1695s
759/9500 (epoch 3.995), train_loss = 1.72194815, grad/param norm = 3.0292e-02, time/batch = 0.1697s
760/9500 (epoch 4.000), train_loss = 1.73746883, grad/param norm = 3.1133e-02, time/batch = 0.1699s
761/9500 (epoch 4.005), train_loss = 1.84910239, grad/param norm = 3.6232e-02, time/batch = 0.1675s
762/9500 (epoch 4.011), train_loss = 1.72923987, grad/param norm = 4.0027e-02, time/batch = 0.1695s
763/9500 (epoch 4.016), train_loss = 1.75710629, grad/param norm = 3.7150e-02, time/batch = 0.1701s
764/9500 (epoch 4.021), train_loss = 1.70714529, grad/param norm = 3.4849e-02, time/batch = 0.1696s
765/9500 (epoch 4.026), train_loss = 1.74504089, grad/param norm = 3.6244e-02, time/batch = 0.1693s
766/9500 (epoch 4.032), train_loss = 1.73616397, grad/param norm = 3.9313e-02, time/batch = 0.1701s
767/9500 (epoch 4.037), train_loss = 1.68321520, grad/param norm = 3.5906e-02, time/batch = 0.1693s
768/9500 (epoch 4.042), train_loss = 1.68415693, grad/param norm = 3.3436e-02, time/batch = 0.1693s
769/9500 (epoch 4.047), train_loss = 1.70829377, grad/param norm = 3.4259e-02, time/batch = 0.1698s
770/9500 (epoch 4.053), train_loss = 1.70832944, grad/param norm = 3.4296e-02, time/batch = 0.1695s
771/9500 (epoch 4.058), train_loss = 1.73836421, grad/param norm = 3.2955e-02, time/batch = 0.1678s
772/9500 (epoch 4.063), train_loss = 1.67311016, grad/param norm = 3.2127e-02, time/batch = 0.1698s
773/9500 (epoch 4.068), train_loss = 1.68532416, grad/param norm = 3.3327e-02, time/batch = 0.1698s
774/9500 (epoch 4.074), train_loss = 1.71948500, grad/param norm = 3.4482e-02, time/batch = 0.1696s
775/9500 (epoch 4.079), train_loss = 1.69068276, grad/param norm = 3.1240e-02, time/batch = 0.1702s
776/9500 (epoch 4.084), train_loss = 1.75061802, grad/param norm = 3.2074e-02, time/batch = 0.1692s
777/9500 (epoch 4.089), train_loss = 1.73364853, grad/param norm = 3.4344e-02, time/batch = 0.1692s
778/9500 (epoch 4.095), train_loss = 1.69672415, grad/param norm = 3.7365e-02, time/batch = 0.1702s
779/9500 (epoch 4.100), train_loss = 1.69509566, grad/param norm = 3.6946e-02, time/batch = 0.1688s
780/9500 (epoch 4.105), train_loss = 1.74267907, grad/param norm = 3.8598e-02, time/batch = 0.1694s
781/9500 (epoch 4.111), train_loss = 1.66515404, grad/param norm = 4.1729e-02, time/batch = 0.1685s
782/9500 (epoch 4.116), train_loss = 1.68998295, grad/param norm = 4.6062e-02, time/batch = 0.1693s
783/9500 (epoch 4.121), train_loss = 1.69910591, grad/param norm = 4.6570e-02, time/batch = 0.1695s
784/9500 (epoch 4.126), train_loss = 1.71080169, grad/param norm = 4.0959e-02, time/batch = 0.1697s
785/9500 (epoch 4.132), train_loss = 1.70382817, grad/param norm = 3.7693e-02, time/batch = 0.1692s
786/9500 (epoch 4.137), train_loss = 1.73845530, grad/param norm = 3.9224e-02, time/batch = 0.1695s
787/9500 (epoch 4.142), train_loss = 1.69863048, grad/param norm = 4.0038e-02, time/batch = 0.1693s
788/9500 (epoch 4.147), train_loss = 1.69344724, grad/param norm = 3.9740e-02, time/batch = 0.1695s
789/9500 (epoch 4.153), train_loss = 1.71512477, grad/param norm = 3.7185e-02, time/batch = 0.1691s
790/9500 (epoch 4.158), train_loss = 1.69265177, grad/param norm = 3.2910e-02, time/batch = 0.1696s
791/9500 (epoch 4.163), train_loss = 1.70233213, grad/param norm = 3.3815e-02, time/batch = 0.1675s
792/9500 (epoch 4.168), train_loss = 1.66747679, grad/param norm = 3.4603e-02, time/batch = 0.1695s
793/9500 (epoch 4.174), train_loss = 1.67292951, grad/param norm = 3.3639e-02, time/batch = 0.1693s
794/9500 (epoch 4.179), train_loss = 1.73572232, grad/param norm = 3.1533e-02, time/batch = 0.1695s
795/9500 (epoch 4.184), train_loss = 1.68390242, grad/param norm = 3.2948e-02, time/batch = 0.1706s
796/9500 (epoch 4.189), train_loss = 1.70645335, grad/param norm = 3.4675e-02, time/batch = 0.1696s
797/9500 (epoch 4.195), train_loss = 1.68173735, grad/param norm = 3.3376e-02, time/batch = 0.1690s
798/9500 (epoch 4.200), train_loss = 1.70523606, grad/param norm = 3.1556e-02, time/batch = 0.1701s
799/9500 (epoch 4.205), train_loss = 1.65286048, grad/param norm = 2.9314e-02, time/batch = 0.1692s
800/9500 (epoch 4.211), train_loss = 1.70290766, grad/param norm = 3.1852e-02, time/batch = 0.1693s
801/9500 (epoch 4.216), train_loss = 1.68540107, grad/param norm = 3.1607e-02, time/batch = 0.1682s
802/9500 (epoch 4.221), train_loss = 1.69307506, grad/param norm = 3.1378e-02, time/batch = 0.1693s
803/9500 (epoch 4.226), train_loss = 1.70052163, grad/param norm = 3.3597e-02, time/batch = 0.1690s
804/9500 (epoch 4.232), train_loss = 1.70445857, grad/param norm = 3.3929e-02, time/batch = 0.1700s
805/9500 (epoch 4.237), train_loss = 1.68909266, grad/param norm = 3.2400e-02, time/batch = 0.1692s
806/9500 (epoch 4.242), train_loss = 1.69284573, grad/param norm = 3.3771e-02, time/batch = 0.1697s
807/9500 (epoch 4.247), train_loss = 1.70507795, grad/param norm = 3.5486e-02, time/batch = 0.1697s
808/9500 (epoch 4.253), train_loss = 1.68410130, grad/param norm = 3.4042e-02, time/batch = 0.1691s
809/9500 (epoch 4.258), train_loss = 1.69357463, grad/param norm = 3.6804e-02, time/batch = 0.1694s
810/9500 (epoch 4.263), train_loss = 1.65414316, grad/param norm = 3.7683e-02, time/batch = 0.1699s
811/9500 (epoch 4.268), train_loss = 1.64966578, grad/param norm = 3.5812e-02, time/batch = 0.1676s
812/9500 (epoch 4.274), train_loss = 1.66857878, grad/param norm = 3.3379e-02, time/batch = 0.1696s
813/9500 (epoch 4.279), train_loss = 1.68510762, grad/param norm = 3.2478e-02, time/batch = 0.1703s
814/9500 (epoch 4.284), train_loss = 1.70041412, grad/param norm = 3.3275e-02, time/batch = 0.1692s
815/9500 (epoch 4.289), train_loss = 1.65687522, grad/param norm = 3.7639e-02, time/batch = 0.1694s
816/9500 (epoch 4.295), train_loss = 1.67861206, grad/param norm = 3.7732e-02, time/batch = 0.1697s
817/9500 (epoch 4.300), train_loss = 1.70076107, grad/param norm = 3.1151e-02, time/batch = 0.1702s
818/9500 (epoch 4.305), train_loss = 1.69641311, grad/param norm = 3.1220e-02, time/batch = 0.1691s
819/9500 (epoch 4.311), train_loss = 1.66745172, grad/param norm = 3.5026e-02, time/batch = 0.1699s
820/9500 (epoch 4.316), train_loss = 1.69968755, grad/param norm = 3.5276e-02, time/batch = 0.1701s
821/9500 (epoch 4.321), train_loss = 1.69734132, grad/param norm = 3.4001e-02, time/batch = 0.1675s
822/9500 (epoch 4.326), train_loss = 1.69687637, grad/param norm = 3.3011e-02, time/batch = 0.1692s
823/9500 (epoch 4.332), train_loss = 1.68392624, grad/param norm = 3.2806e-02, time/batch = 0.1699s
824/9500 (epoch 4.337), train_loss = 1.68316220, grad/param norm = 3.5902e-02, time/batch = 0.1692s
825/9500 (epoch 4.342), train_loss = 1.67767254, grad/param norm = 3.7929e-02, time/batch = 0.1696s
826/9500 (epoch 4.347), train_loss = 1.66085459, grad/param norm = 3.4367e-02, time/batch = 0.1704s
827/9500 (epoch 4.353), train_loss = 1.67712373, grad/param norm = 3.7813e-02, time/batch = 0.1694s
828/9500 (epoch 4.358), train_loss = 1.64296116, grad/param norm = 3.4517e-02, time/batch = 0.1696s
829/9500 (epoch 4.363), train_loss = 1.62750533, grad/param norm = 3.0746e-02, time/batch = 0.1700s
830/9500 (epoch 4.368), train_loss = 1.66052310, grad/param norm = 3.3042e-02, time/batch = 0.1693s
831/9500 (epoch 4.374), train_loss = 1.66827621, grad/param norm = 3.4641e-02, time/batch = 0.1680s
832/9500 (epoch 4.379), train_loss = 1.66162167, grad/param norm = 3.2866e-02, time/batch = 0.1706s
833/9500 (epoch 4.384), train_loss = 1.67696622, grad/param norm = 3.2193e-02, time/batch = 0.1696s
834/9500 (epoch 4.389), train_loss = 1.66974986, grad/param norm = 3.3135e-02, time/batch = 0.1697s
835/9500 (epoch 4.395), train_loss = 1.66859882, grad/param norm = 3.1017e-02, time/batch = 0.1702s
836/9500 (epoch 4.400), train_loss = 1.69916052, grad/param norm = 3.3128e-02, time/batch = 0.1697s
837/9500 (epoch 4.405), train_loss = 1.71771247, grad/param norm = 3.6049e-02, time/batch = 0.1699s
838/9500 (epoch 4.411), train_loss = 1.68828653, grad/param norm = 3.2667e-02, time/batch = 0.1709s
839/9500 (epoch 4.416), train_loss = 1.69377450, grad/param norm = 3.1784e-02, time/batch = 0.1693s
840/9500 (epoch 4.421), train_loss = 1.71324383, grad/param norm = 3.2364e-02, time/batch = 0.1697s
841/9500 (epoch 4.426), train_loss = 1.67354369, grad/param norm = 3.5824e-02, time/batch = 0.1686s
842/9500 (epoch 4.432), train_loss = 1.66145486, grad/param norm = 3.4614e-02, time/batch = 0.1692s
843/9500 (epoch 4.437), train_loss = 1.69032756, grad/param norm = 3.4409e-02, time/batch = 0.1693s
844/9500 (epoch 4.442), train_loss = 1.69635531, grad/param norm = 3.4377e-02, time/batch = 0.1693s
845/9500 (epoch 4.447), train_loss = 1.67571386, grad/param norm = 3.5496e-02, time/batch = 0.1693s
846/9500 (epoch 4.453), train_loss = 1.68532317, grad/param norm = 3.5646e-02, time/batch = 0.1700s
847/9500 (epoch 4.458), train_loss = 1.66196614, grad/param norm = 3.3800e-02, time/batch = 0.1694s
848/9500 (epoch 4.463), train_loss = 1.70427778, grad/param norm = 3.3331e-02, time/batch = 0.1700s
849/9500 (epoch 4.468), train_loss = 1.67554153, grad/param norm = 3.5593e-02, time/batch = 0.1693s
850/9500 (epoch 4.474), train_loss = 1.67805842, grad/param norm = 3.6657e-02, time/batch = 0.1691s
851/9500 (epoch 4.479), train_loss = 1.66183797, grad/param norm = 3.7103e-02, time/batch = 0.1685s
852/9500 (epoch 4.484), train_loss = 1.68452298, grad/param norm = 3.6480e-02, time/batch = 0.1694s
853/9500 (epoch 4.489), train_loss = 1.69640654, grad/param norm = 3.4951e-02, time/batch = 0.1702s
854/9500 (epoch 4.495), train_loss = 1.70235688, grad/param norm = 3.5472e-02, time/batch = 0.1701s
855/9500 (epoch 4.500), train_loss = 1.66385414, grad/param norm = 3.3043e-02, time/batch = 0.1693s
856/9500 (epoch 4.505), train_loss = 1.63838851, grad/param norm = 3.3034e-02, time/batch = 0.1694s
857/9500 (epoch 4.511), train_loss = 1.69584212, grad/param norm = 3.1342e-02, time/batch = 0.1699s
858/9500 (epoch 4.516), train_loss = 1.65695025, grad/param norm = 2.8823e-02, time/batch = 0.1696s
859/9500 (epoch 4.521), train_loss = 1.65720232, grad/param norm = 2.8859e-02, time/batch = 0.1696s
860/9500 (epoch 4.526), train_loss = 1.67545436, grad/param norm = 2.8094e-02, time/batch = 0.1700s
861/9500 (epoch 4.532), train_loss = 1.70237213, grad/param norm = 2.9727e-02, time/batch = 0.1680s
862/9500 (epoch 4.537), train_loss = 1.68935789, grad/param norm = 3.0745e-02, time/batch = 0.1693s
863/9500 (epoch 4.542), train_loss = 1.65004142, grad/param norm = 3.0606e-02, time/batch = 0.1706s
864/9500 (epoch 4.547), train_loss = 1.67958582, grad/param norm = 3.1302e-02, time/batch = 0.1697s
865/9500 (epoch 4.553), train_loss = 1.65028798, grad/param norm = 3.1387e-02, time/batch = 0.1691s
866/9500 (epoch 4.558), train_loss = 1.64009621, grad/param norm = 3.0035e-02, time/batch = 0.1705s
867/9500 (epoch 4.563), train_loss = 1.64705458, grad/param norm = 3.0802e-02, time/batch = 0.1698s
868/9500 (epoch 4.568), train_loss = 1.65758240, grad/param norm = 3.1543e-02, time/batch = 0.1695s
869/9500 (epoch 4.574), train_loss = 1.67436466, grad/param norm = 3.3443e-02, time/batch = 0.1695s
870/9500 (epoch 4.579), train_loss = 1.65568807, grad/param norm = 3.1547e-02, time/batch = 0.1696s
871/9500 (epoch 4.584), train_loss = 1.66286589, grad/param norm = 3.3907e-02, time/batch = 0.1676s
872/9500 (epoch 4.589), train_loss = 1.63673788, grad/param norm = 3.6569e-02, time/batch = 0.1694s
873/9500 (epoch 4.595), train_loss = 1.65604325, grad/param norm = 4.0041e-02, time/batch = 0.1699s
874/9500 (epoch 4.600), train_loss = 1.66198657, grad/param norm = 4.2221e-02, time/batch = 0.1695s
875/9500 (epoch 4.605), train_loss = 1.64800436, grad/param norm = 3.7810e-02, time/batch = 0.1693s
876/9500 (epoch 4.611), train_loss = 1.66949524, grad/param norm = 3.4123e-02, time/batch = 0.1694s
877/9500 (epoch 4.616), train_loss = 1.66157489, grad/param norm = 2.9860e-02, time/batch = 0.1696s
878/9500 (epoch 4.621), train_loss = 1.65762391, grad/param norm = 2.7941e-02, time/batch = 0.1697s
879/9500 (epoch 4.626), train_loss = 1.62202005, grad/param norm = 2.9629e-02, time/batch = 0.1697s
880/9500 (epoch 4.632), train_loss = 1.67640835, grad/param norm = 3.2587e-02, time/batch = 0.1693s
881/9500 (epoch 4.637), train_loss = 1.61996679, grad/param norm = 3.0431e-02, time/batch = 0.1679s
882/9500 (epoch 4.642), train_loss = 1.66797994, grad/param norm = 3.5335e-02, time/batch = 0.1699s
883/9500 (epoch 4.647), train_loss = 1.67702560, grad/param norm = 3.4762e-02, time/batch = 0.1693s
884/9500 (epoch 4.653), train_loss = 1.67072403, grad/param norm = 3.2903e-02, time/batch = 0.1693s
885/9500 (epoch 4.658), train_loss = 1.65085541, grad/param norm = 3.3333e-02, time/batch = 0.1703s
886/9500 (epoch 4.663), train_loss = 1.66171013, grad/param norm = 3.4721e-02, time/batch = 0.1694s
887/9500 (epoch 4.668), train_loss = 1.65543686, grad/param norm = 3.1265e-02, time/batch = 0.1692s
888/9500 (epoch 4.674), train_loss = 1.67150947, grad/param norm = 2.9587e-02, time/batch = 0.1703s
889/9500 (epoch 4.679), train_loss = 1.66555226, grad/param norm = 2.8626e-02, time/batch = 0.1695s
890/9500 (epoch 4.684), train_loss = 1.65576789, grad/param norm = 3.0017e-02, time/batch = 0.1699s
891/9500 (epoch 4.689), train_loss = 1.66307118, grad/param norm = 3.2335e-02, time/batch = 0.1685s
892/9500 (epoch 4.695), train_loss = 1.64577079, grad/param norm = 3.6755e-02, time/batch = 0.1696s
893/9500 (epoch 4.700), train_loss = 1.67214643, grad/param norm = 4.1275e-02, time/batch = 0.1693s
894/9500 (epoch 4.705), train_loss = 1.68925711, grad/param norm = 4.0973e-02, time/batch = 0.1699s
895/9500 (epoch 4.711), train_loss = 1.70573308, grad/param norm = 3.8978e-02, time/batch = 0.1695s
896/9500 (epoch 4.716), train_loss = 1.66014789, grad/param norm = 3.2731e-02, time/batch = 0.1690s
897/9500 (epoch 4.721), train_loss = 1.60224290, grad/param norm = 3.0528e-02, time/batch = 0.1703s
898/9500 (epoch 4.726), train_loss = 1.64024791, grad/param norm = 2.9035e-02, time/batch = 0.1695s
899/9500 (epoch 4.732), train_loss = 1.62264446, grad/param norm = 2.8376e-02, time/batch = 0.1691s
900/9500 (epoch 4.737), train_loss = 1.68484296, grad/param norm = 2.8212e-02, time/batch = 0.1696s
901/9500 (epoch 4.742), train_loss = 1.62961482, grad/param norm = 2.8651e-02, time/batch = 0.1675s
902/9500 (epoch 4.747), train_loss = 1.60998358, grad/param norm = 3.2463e-02, time/batch = 0.1690s
903/9500 (epoch 4.753), train_loss = 1.64074933, grad/param norm = 3.3213e-02, time/batch = 0.1698s
904/9500 (epoch 4.758), train_loss = 1.66365141, grad/param norm = 3.2918e-02, time/batch = 0.1694s
905/9500 (epoch 4.763), train_loss = 1.62952644, grad/param norm = 3.2209e-02, time/batch = 0.1695s
906/9500 (epoch 4.768), train_loss = 1.64807916, grad/param norm = 2.9828e-02, time/batch = 0.1697s
907/9500 (epoch 4.774), train_loss = 1.63682451, grad/param norm = 3.3154e-02, time/batch = 0.1702s
908/9500 (epoch 4.779), train_loss = 1.65010587, grad/param norm = 3.1245e-02, time/batch = 0.1696s
909/9500 (epoch 4.784), train_loss = 1.62872632, grad/param norm = 3.1844e-02, time/batch = 0.1695s
910/9500 (epoch 4.789), train_loss = 1.63452202, grad/param norm = 3.1856e-02, time/batch = 0.1704s
911/9500 (epoch 4.795), train_loss = 1.63445446, grad/param norm = 2.9030e-02, time/batch = 0.1679s
912/9500 (epoch 4.800), train_loss = 1.63899488, grad/param norm = 2.8729e-02, time/batch = 0.1698s
913/9500 (epoch 4.805), train_loss = 1.60223852, grad/param norm = 2.6382e-02, time/batch = 0.1700s
914/9500 (epoch 4.811), train_loss = 1.60930083, grad/param norm = 2.8544e-02, time/batch = 0.1693s
915/9500 (epoch 4.816), train_loss = 1.60181721, grad/param norm = 2.7536e-02, time/batch = 0.1695s
916/9500 (epoch 4.821), train_loss = 1.60203920, grad/param norm = 2.8407e-02, time/batch = 0.1701s
917/9500 (epoch 4.826), train_loss = 1.63418029, grad/param norm = 2.8586e-02, time/batch = 0.1694s
918/9500 (epoch 4.832), train_loss = 1.62646056, grad/param norm = 2.7871e-02, time/batch = 0.1698s
919/9500 (epoch 4.837), train_loss = 1.61954283, grad/param norm = 2.9192e-02, time/batch = 0.1699s
920/9500 (epoch 4.842), train_loss = 1.64262913, grad/param norm = 3.0302e-02, time/batch = 0.1694s
921/9500 (epoch 4.847), train_loss = 1.66500191, grad/param norm = 3.1456e-02, time/batch = 0.1678s
922/9500 (epoch 4.853), train_loss = 1.64307425, grad/param norm = 3.1235e-02, time/batch = 0.1698s
923/9500 (epoch 4.858), train_loss = 1.61042687, grad/param norm = 3.0643e-02, time/batch = 0.1696s
924/9500 (epoch 4.863), train_loss = 1.63392434, grad/param norm = 3.0235e-02, time/batch = 0.1694s
925/9500 (epoch 4.868), train_loss = 1.61581607, grad/param norm = 2.8926e-02, time/batch = 0.1699s
926/9500 (epoch 4.874), train_loss = 1.60089006, grad/param norm = 2.7676e-02, time/batch = 0.1693s
927/9500 (epoch 4.879), train_loss = 1.59109243, grad/param norm = 2.8219e-02, time/batch = 0.1697s
928/9500 (epoch 4.884), train_loss = 1.62405933, grad/param norm = 2.7545e-02, time/batch = 0.1696s
929/9500 (epoch 4.889), train_loss = 1.62654609, grad/param norm = 3.0146e-02, time/batch = 0.1683s
930/9500 (epoch 4.895), train_loss = 1.60034422, grad/param norm = 2.8618e-02, time/batch = 0.1693s
931/9500 (epoch 4.900), train_loss = 1.59652648, grad/param norm = 2.8139e-02, time/batch = 0.1676s
932/9500 (epoch 4.905), train_loss = 1.64072244, grad/param norm = 2.8822e-02, time/batch = 0.1690s
933/9500 (epoch 4.911), train_loss = 1.59011396, grad/param norm = 2.8430e-02, time/batch = 0.1694s
934/9500 (epoch 4.916), train_loss = 1.60858842, grad/param norm = 2.9285e-02, time/batch = 0.1696s
935/9500 (epoch 4.921), train_loss = 1.60042659, grad/param norm = 2.9119e-02, time/batch = 0.1706s
936/9500 (epoch 4.926), train_loss = 1.63264688, grad/param norm = 3.2313e-02, time/batch = 0.1697s
937/9500 (epoch 4.932), train_loss = 1.63113268, grad/param norm = 3.3992e-02, time/batch = 0.1692s
938/9500 (epoch 4.937), train_loss = 1.65355922, grad/param norm = 3.6723e-02, time/batch = 0.1700s
939/9500 (epoch 4.942), train_loss = 1.64333221, grad/param norm = 4.1211e-02, time/batch = 0.1693s
940/9500 (epoch 4.947), train_loss = 1.66768390, grad/param norm = 3.9564e-02, time/batch = 0.1700s
941/9500 (epoch 4.953), train_loss = 1.65494865, grad/param norm = 3.2262e-02, time/batch = 0.1684s
942/9500 (epoch 4.958), train_loss = 1.66035280, grad/param norm = 3.1301e-02, time/batch = 0.1692s
943/9500 (epoch 4.963), train_loss = 1.64335900, grad/param norm = 3.1343e-02, time/batch = 0.1694s
944/9500 (epoch 4.968), train_loss = 1.61113292, grad/param norm = 2.9278e-02, time/batch = 0.1703s
945/9500 (epoch 4.974), train_loss = 1.62212906, grad/param norm = 2.7581e-02, time/batch = 0.1697s
946/9500 (epoch 4.979), train_loss = 1.63742517, grad/param norm = 2.8057e-02, time/batch = 0.1696s
947/9500 (epoch 4.984), train_loss = 1.62823877, grad/param norm = 2.8806e-02, time/batch = 0.1706s
948/9500 (epoch 4.989), train_loss = 1.64514560, grad/param norm = 2.9481e-02, time/batch = 0.1691s
949/9500 (epoch 4.995), train_loss = 1.65037564, grad/param norm = 2.7383e-02, time/batch = 0.1696s
950/9500 (epoch 5.000), train_loss = 1.66151847, grad/param norm = 2.8288e-02, time/batch = 0.1702s
951/9500 (epoch 5.005), train_loss = 1.77705599, grad/param norm = 3.1189e-02, time/batch = 0.1672s
952/9500 (epoch 5.011), train_loss = 1.64908596, grad/param norm = 3.1772e-02, time/batch = 0.1690s
953/9500 (epoch 5.016), train_loss = 1.66444999, grad/param norm = 3.1093e-02, time/batch = 0.1705s
954/9500 (epoch 5.021), train_loss = 1.62766813, grad/param norm = 2.7987e-02, time/batch = 0.1694s
955/9500 (epoch 5.026), train_loss = 1.66020779, grad/param norm = 2.9121e-02, time/batch = 0.1695s
956/9500 (epoch 5.032), train_loss = 1.65251416, grad/param norm = 3.0783e-02, time/batch = 0.1692s
957/9500 (epoch 5.037), train_loss = 1.60465912, grad/param norm = 2.8084e-02, time/batch = 0.1702s
958/9500 (epoch 5.042), train_loss = 1.61021670, grad/param norm = 2.8620e-02, time/batch = 0.1695s
959/9500 (epoch 5.047), train_loss = 1.62438451, grad/param norm = 2.7550e-02, time/batch = 0.1692s
960/9500 (epoch 5.053), train_loss = 1.63201022, grad/param norm = 2.7815e-02, time/batch = 0.1698s
961/9500 (epoch 5.058), train_loss = 1.64732937, grad/param norm = 2.7494e-02, time/batch = 0.1680s
962/9500 (epoch 5.063), train_loss = 1.59930218, grad/param norm = 2.7709e-02, time/batch = 0.1691s
963/9500 (epoch 5.068), train_loss = 1.59902252, grad/param norm = 2.9308e-02, time/batch = 0.1705s
964/9500 (epoch 5.074), train_loss = 1.63271731, grad/param norm = 2.8591e-02, time/batch = 0.1697s
965/9500 (epoch 5.079), train_loss = 1.62295301, grad/param norm = 2.9227e-02, time/batch = 0.1702s
966/9500 (epoch 5.084), train_loss = 1.67438715, grad/param norm = 2.9675e-02, time/batch = 0.1712s
967/9500 (epoch 5.089), train_loss = 1.67603414, grad/param norm = 3.2029e-02, time/batch = 0.1698s
968/9500 (epoch 5.095), train_loss = 1.61446275, grad/param norm = 3.4218e-02, time/batch = 0.1692s
969/9500 (epoch 5.100), train_loss = 1.62601659, grad/param norm = 3.2989e-02, time/batch = 0.1704s
970/9500 (epoch 5.105), train_loss = 1.67570079, grad/param norm = 3.3527e-02, time/batch = 0.1694s
971/9500 (epoch 5.111), train_loss = 1.58885965, grad/param norm = 3.1664e-02, time/batch = 0.1682s
972/9500 (epoch 5.116), train_loss = 1.62055286, grad/param norm = 3.0688e-02, time/batch = 0.1708s
973/9500 (epoch 5.121), train_loss = 1.62080072, grad/param norm = 3.0555e-02, time/batch = 0.1694s
974/9500 (epoch 5.126), train_loss = 1.63117465, grad/param norm = 2.9566e-02, time/batch = 0.1695s
975/9500 (epoch 5.132), train_loss = 1.62619622, grad/param norm = 2.9124e-02, time/batch = 0.1706s
976/9500 (epoch 5.137), train_loss = 1.65507337, grad/param norm = 2.7847e-02, time/batch = 0.1696s
977/9500 (epoch 5.142), train_loss = 1.60157510, grad/param norm = 2.7896e-02, time/batch = 0.1694s
978/9500 (epoch 5.147), train_loss = 1.60030244, grad/param norm = 3.2202e-02, time/batch = 0.1704s
979/9500 (epoch 5.153), train_loss = 1.65531791, grad/param norm = 3.5065e-02, time/batch = 0.1697s
980/9500 (epoch 5.158), train_loss = 1.63282693, grad/param norm = 3.2780e-02, time/batch = 0.1695s
981/9500 (epoch 5.163), train_loss = 1.62918163, grad/param norm = 3.1312e-02, time/batch = 0.1689s
982/9500 (epoch 5.168), train_loss = 1.59889682, grad/param norm = 2.9411e-02, time/batch = 0.1697s
983/9500 (epoch 5.174), train_loss = 1.59179234, grad/param norm = 2.8294e-02, time/batch = 0.1696s
984/9500 (epoch 5.179), train_loss = 1.67189718, grad/param norm = 2.8097e-02, time/batch = 0.1696s
985/9500 (epoch 5.184), train_loss = 1.61507277, grad/param norm = 2.8062e-02, time/batch = 0.1694s
986/9500 (epoch 5.189), train_loss = 1.63122194, grad/param norm = 2.8934e-02, time/batch = 0.1697s
987/9500 (epoch 5.195), train_loss = 1.61473113, grad/param norm = 2.7248e-02, time/batch = 0.1698s
988/9500 (epoch 5.200), train_loss = 1.62698888, grad/param norm = 2.7496e-02, time/batch = 0.1694s
989/9500 (epoch 5.205), train_loss = 1.59136685, grad/param norm = 2.8788e-02, time/batch = 0.1698s
990/9500 (epoch 5.211), train_loss = 1.62315706, grad/param norm = 2.8934e-02, time/batch = 0.1698s
991/9500 (epoch 5.216), train_loss = 1.61051164, grad/param norm = 2.8938e-02, time/batch = 0.1686s
992/9500 (epoch 5.221), train_loss = 1.62514684, grad/param norm = 2.7790e-02, time/batch = 0.1693s
993/9500 (epoch 5.226), train_loss = 1.61459715, grad/param norm = 2.8618e-02, time/batch = 0.1695s
994/9500 (epoch 5.232), train_loss = 1.63715502, grad/param norm = 2.9638e-02, time/batch = 0.1705s
995/9500 (epoch 5.237), train_loss = 1.60970640, grad/param norm = 2.9066e-02, time/batch = 0.1694s
996/9500 (epoch 5.242), train_loss = 1.62929763, grad/param norm = 3.0512e-02, time/batch = 0.1697s
997/9500 (epoch 5.247), train_loss = 1.62241317, grad/param norm = 2.8995e-02, time/batch = 0.1705s
998/9500 (epoch 5.253), train_loss = 1.61970294, grad/param norm = 2.8253e-02, time/batch = 0.1692s
999/9500 (epoch 5.258), train_loss = 1.64353000, grad/param norm = 2.8038e-02, time/batch = 0.1695s
evaluating loss over split index 2
1/10... 2/10... 3/10... 4/10... 5/10... 6/10... 7/10... 8/10... 9/10... 10/10...
saving checkpoint to cv/lm_lstm_epoch5.26_1.4981.t7
1000/9500 (epoch 5.263), train_loss = 1.58477830, grad/param norm = 2.6291e-02, time/batch = 0.1699s
1001/9500 (epoch 5.268), train_loss = 1.71356881, grad/param norm = 2.7578e-02, time/batch = 0.1680s
1002/9500 (epoch 5.274), train_loss = 1.61021095, grad/param norm = 2.6320e-02, time/batch = 0.1696s
1003/9500 (epoch 5.279), train_loss = 1.63006354, grad/param norm = 2.6899e-02, time/batch = 0.1698s
1004/9500 (epoch 5.284), train_loss = 1.63025910, grad/param norm = 2.8030e-02, time/batch = 0.1693s
1005/9500 (epoch 5.289), train_loss = 1.58297621, grad/param norm = 3.1094e-02, time/batch = 0.1704s
1006/9500 (epoch 5.295), train_loss = 1.60252831, grad/param norm = 3.1455e-02, time/batch = 0.1692s
1007/9500 (epoch 5.300), train_loss = 1.64592949, grad/param norm = 3.1501e-02, time/batch = 0.1691s
1008/9500 (epoch 5.305), train_loss = 1.63222276, grad/param norm = 2.9764e-02, time/batch = 0.1702s
1009/9500 (epoch 5.311), train_loss = 1.59896771, grad/param norm = 3.0159e-02, time/batch = 0.1694s
1010/9500 (epoch 5.316), train_loss = 1.61894051, grad/param norm = 3.1067e-02, time/batch = 0.1698s
1011/9500 (epoch 5.321), train_loss = 1.62694973, grad/param norm = 3.1807e-02, time/batch = 0.1679s
1012/9500 (epoch 5.326), train_loss = 1.62513082, grad/param norm = 3.1709e-02, time/batch = 0.1683s
1013/9500 (epoch 5.332), train_loss = 1.62816175, grad/param norm = 3.1928e-02, time/batch = 0.1693s
1014/9500 (epoch 5.337), train_loss = 1.61013783, grad/param norm = 3.0056e-02, time/batch = 0.1696s
1015/9500 (epoch 5.342), train_loss = 1.60113346, grad/param norm = 2.8828e-02, time/batch = 0.1702s
1016/9500 (epoch 5.347), train_loss = 1.58151573, grad/param norm = 2.9244e-02, time/batch = 0.1693s
1017/9500 (epoch 5.353), train_loss = 1.61923838, grad/param norm = 2.9916e-02, time/batch = 0.1691s
1018/9500 (epoch 5.358), train_loss = 1.57432884, grad/param norm = 2.8124e-02, time/batch = 0.1703s
1019/9500 (epoch 5.363), train_loss = 1.57148386, grad/param norm = 2.9131e-02, time/batch = 0.1693s
1020/9500 (epoch 5.368), train_loss = 1.59019744, grad/param norm = 3.2256e-02, time/batch = 0.1697s
1021/9500 (epoch 5.374), train_loss = 1.60132719, grad/param norm = 3.2881e-02, time/batch = 0.1683s
1022/9500 (epoch 5.379), train_loss = 1.61264314, grad/param norm = 3.2418e-02, time/batch = 0.1698s
1023/9500 (epoch 5.384), train_loss = 1.60490526, grad/param norm = 3.1095e-02, time/batch = 0.1713s
1024/9500 (epoch 5.389), train_loss = 1.61464143, grad/param norm = 2.8774e-02, time/batch = 0.1716s
1025/9500 (epoch 5.395), train_loss = 1.60085696, grad/param norm = 2.6535e-02, time/batch = 0.1706s
1026/9500 (epoch 5.400), train_loss = 1.62712472, grad/param norm = 2.7542e-02, time/batch = 0.1706s
1027/9500 (epoch 5.405), train_loss = 1.63245245, grad/param norm = 2.7683e-02, time/batch = 0.1716s
1028/9500 (epoch 5.411), train_loss = 1.62059371, grad/param norm = 2.6345e-02, time/batch = 0.1710s
1029/9500 (epoch 5.416), train_loss = 1.62355596, grad/param norm = 2.7788e-02, time/batch = 0.1710s
1030/9500 (epoch 5.421), train_loss = 1.63999403, grad/param norm = 2.6580e-02, time/batch = 0.1718s
1031/9500 (epoch 5.426), train_loss = 1.60270127, grad/param norm = 2.7163e-02, time/batch = 0.1678s
1032/9500 (epoch 5.432), train_loss = 1.58458433, grad/param norm = 2.8376e-02, time/batch = 0.1713s
1033/9500 (epoch 5.437), train_loss = 1.62882029, grad/param norm = 2.9634e-02, time/batch = 0.1719s
1034/9500 (epoch 5.442), train_loss = 1.62154590, grad/param norm = 2.7524e-02, time/batch = 0.1711s
1035/9500 (epoch 5.447), train_loss = 1.59972684, grad/param norm = 2.8254e-02, time/batch = 0.1708s
1036/9500 (epoch 5.453), train_loss = 1.63209802, grad/param norm = 2.8794e-02, time/batch = 0.1718s
1037/9500 (epoch 5.458), train_loss = 1.60023825, grad/param norm = 2.7858e-02, time/batch = 0.1709s
1038/9500 (epoch 5.463), train_loss = 1.62562241, grad/param norm = 2.6270e-02, time/batch = 0.1712s
1039/9500 (epoch 5.468), train_loss = 1.60725322, grad/param norm = 2.6803e-02, time/batch = 0.1708s
1040/9500 (epoch 5.474), train_loss = 1.61762009, grad/param norm = 2.6867e-02, time/batch = 0.1709s
1041/9500 (epoch 5.479), train_loss = 1.59191772, grad/param norm = 2.6272e-02, time/batch = 0.1680s
1042/9500 (epoch 5.484), train_loss = 1.60194592, grad/param norm = 2.8042e-02, time/batch = 0.1713s
1043/9500 (epoch 5.489), train_loss = 1.61496329, grad/param norm = 2.7836e-02, time/batch = 0.1714s
1044/9500 (epoch 5.495), train_loss = 1.62394652, grad/param norm = 2.7476e-02, time/batch = 0.1713s
1045/9500 (epoch 5.500), train_loss = 1.60691760, grad/param norm = 2.6228e-02, time/batch = 0.1712s
1046/9500 (epoch 5.505), train_loss = 1.57751009, grad/param norm = 2.8069e-02, time/batch = 0.1710s
1047/9500 (epoch 5.511), train_loss = 1.63151429, grad/param norm = 2.9989e-02, time/batch = 0.1712s
1048/9500 (epoch 5.516), train_loss = 1.61945364, grad/param norm = 2.7706e-02, time/batch = 0.1711s
1049/9500 (epoch 5.521), train_loss = 1.61136587, grad/param norm = 2.6306e-02, time/batch = 0.1717s
1050/9500 (epoch 5.526), train_loss = 1.61016889, grad/param norm = 2.6714e-02, time/batch = 0.1710s
1051/9500 (epoch 5.532), train_loss = 1.64574303, grad/param norm = 2.7422e-02, time/batch = 0.1678s
1052/9500 (epoch 5.537), train_loss = 1.62995796, grad/param norm = 2.8214e-02, time/batch = 0.1714s
1053/9500 (epoch 5.542), train_loss = 1.60001548, grad/param norm = 3.1045e-02, time/batch = 0.1716s
1054/9500 (epoch 5.547), train_loss = 1.61024321, grad/param norm = 3.1720e-02, time/batch = 0.1716s
1055/9500 (epoch 5.553), train_loss = 1.58601625, grad/param norm = 3.0616e-02, time/batch = 0.1718s
1056/9500 (epoch 5.558), train_loss = 1.59756952, grad/param norm = 2.8764e-02, time/batch = 0.1711s
1057/9500 (epoch 5.563), train_loss = 1.58861377, grad/param norm = 3.0921e-02, time/batch = 0.1709s
1058/9500 (epoch 5.568), train_loss = 1.61496736, grad/param norm = 3.4332e-02, time/batch = 0.1720s
1059/9500 (epoch 5.574), train_loss = 1.62116094, grad/param norm = 3.2935e-02, time/batch = 0.1715s
1060/9500 (epoch 5.579), train_loss = 1.58575966, grad/param norm = 3.1127e-02, time/batch = 0.1710s
1061/9500 (epoch 5.584), train_loss = 1.60926470, grad/param norm = 2.7237e-02, time/batch = 0.1685s
1062/9500 (epoch 5.589), train_loss = 1.56545387, grad/param norm = 2.4923e-02, time/batch = 0.1713s
1063/9500 (epoch 5.595), train_loss = 1.57381443, grad/param norm = 2.6798e-02, time/batch = 0.1709s
1064/9500 (epoch 5.600), train_loss = 1.58619361, grad/param norm = 2.8047e-02, time/batch = 0.1717s
1065/9500 (epoch 5.605), train_loss = 1.59352496, grad/param norm = 2.9392e-02, time/batch = 0.1710s
1066/9500 (epoch 5.611), train_loss = 1.60184913, grad/param norm = 2.9223e-02, time/batch = 0.1715s
1067/9500 (epoch 5.616), train_loss = 1.59964491, grad/param norm = 2.9870e-02, time/batch = 0.1717s
1068/9500 (epoch 5.621), train_loss = 1.60653246, grad/param norm = 2.8714e-02, time/batch = 0.1717s
1069/9500 (epoch 5.626), train_loss = 1.56526207, grad/param norm = 2.5508e-02, time/batch = 0.1706s
1070/9500 (epoch 5.632), train_loss = 1.61220764, grad/param norm = 2.7056e-02, time/batch = 0.1712s
1071/9500 (epoch 5.637), train_loss = 1.56570190, grad/param norm = 2.6016e-02, time/batch = 0.1687s
1072/9500 (epoch 5.642), train_loss = 1.60773491, grad/param norm = 2.7178e-02, time/batch = 0.1707s
1073/9500 (epoch 5.647), train_loss = 1.60162445, grad/param norm = 2.6090e-02, time/batch = 0.1710s
1074/9500 (epoch 5.653), train_loss = 1.60375334, grad/param norm = 2.6196e-02, time/batch = 0.1717s
1075/9500 (epoch 5.658), train_loss = 1.57944025, grad/param norm = 2.7137e-02, time/batch = 0.1712s
1076/9500 (epoch 5.663), train_loss = 1.58492365, grad/param norm = 2.9692e-02, time/batch = 0.1709s
1077/9500 (epoch 5.668), train_loss = 1.57500605, grad/param norm = 2.6682e-02, time/batch = 0.1714s
1078/9500 (epoch 5.674), train_loss = 1.61110756, grad/param norm = 2.5913e-02, time/batch = 0.1708s
1079/9500 (epoch 5.679), train_loss = 1.60189772, grad/param norm = 2.7624e-02, time/batch = 0.1708s
1080/9500 (epoch 5.684), train_loss = 1.60074795, grad/param norm = 2.8561e-02, time/batch = 0.1714s
1081/9500 (epoch 5.689), train_loss = 1.60558249, grad/param norm = 2.9131e-02, time/batch = 0.1679s
1082/9500 (epoch 5.695), train_loss = 1.60236510, grad/param norm = 2.9468e-02, time/batch = 0.1708s
1083/9500 (epoch 5.700), train_loss = 1.60292503, grad/param norm = 2.9190e-02, time/batch = 0.1714s
1084/9500 (epoch 5.705), train_loss = 1.63112989, grad/param norm = 2.9433e-02, time/batch = 0.1704s
1085/9500 (epoch 5.711), train_loss = 1.64094972, grad/param norm = 2.9567e-02, time/batch = 0.1711s
1086/9500 (epoch 5.716), train_loss = 1.59924020, grad/param norm = 2.5955e-02, time/batch = 0.1722s
1087/9500 (epoch 5.721), train_loss = 1.54466332, grad/param norm = 2.5541e-02, time/batch = 0.1709s
1088/9500 (epoch 5.726), train_loss = 1.58856917, grad/param norm = 2.6235e-02, time/batch = 0.1710s
1089/9500 (epoch 5.732), train_loss = 1.58076847, grad/param norm = 2.7265e-02, time/batch = 0.1717s
1090/9500 (epoch 5.737), train_loss = 1.61948328, grad/param norm = 2.7359e-02, time/batch = 0.1710s
1091/9500 (epoch 5.742), train_loss = 1.57474860, grad/param norm = 2.6192e-02, time/batch = 0.1678s
1092/9500 (epoch 5.747), train_loss = 1.56139510, grad/param norm = 2.6241e-02, time/batch = 0.1718s
1093/9500 (epoch 5.753), train_loss = 1.58093657, grad/param norm = 2.6137e-02, time/batch = 0.1707s
1094/9500 (epoch 5.758), train_loss = 1.59754069, grad/param norm = 2.7335e-02, time/batch = 0.1712s
1095/9500 (epoch 5.763), train_loss = 1.56909133, grad/param norm = 2.6058e-02, time/batch = 0.1707s
1096/9500 (epoch 5.768), train_loss = 1.58684452, grad/param norm = 2.5606e-02, time/batch = 0.1708s
1097/9500 (epoch 5.774), train_loss = 1.58023751, grad/param norm = 3.1555e-02, time/batch = 0.1714s
1098/9500 (epoch 5.779), train_loss = 1.60079907, grad/param norm = 2.9827e-02, time/batch = 0.1712s
1099/9500 (epoch 5.784), train_loss = 1.59114204, grad/param norm = 2.8857e-02, time/batch = 0.1707s
1100/9500 (epoch 5.789), train_loss = 1.58133795, grad/param norm = 2.7795e-02, time/batch = 0.1712s
1101/9500 (epoch 5.795), train_loss = 1.57104543, grad/param norm = 2.7598e-02, time/batch = 0.1680s
1102/9500 (epoch 5.800), train_loss = 1.59475090, grad/param norm = 2.7435e-02, time/batch = 0.1717s
1103/9500 (epoch 5.805), train_loss = 1.54757036, grad/param norm = 2.4524e-02, time/batch = 0.1707s
1104/9500 (epoch 5.811), train_loss = 1.55600042, grad/param norm = 2.5618e-02, time/batch = 0.1708s
1105/9500 (epoch 5.816), train_loss = 1.54878673, grad/param norm = 2.4830e-02, time/batch = 0.1710s
1106/9500 (epoch 5.821), train_loss = 1.54537115, grad/param norm = 2.5483e-02, time/batch = 0.1708s
1107/9500 (epoch 5.826), train_loss = 1.57810329, grad/param norm = 2.6457e-02, time/batch = 0.1710s
1108/9500 (epoch 5.832), train_loss = 1.57032775, grad/param norm = 2.6506e-02, time/batch = 0.1716s
1109/9500 (epoch 5.837), train_loss = 1.57145359, grad/param norm = 2.8057e-02, time/batch = 0.1711s
1110/9500 (epoch 5.842), train_loss = 1.58273715, grad/param norm = 2.8427e-02, time/batch = 0.1707s
1111/9500 (epoch 5.847), train_loss = 1.58339036, grad/param norm = 2.7252e-02, time/batch = 0.1686s
1112/9500 (epoch 5.853), train_loss = 1.59040709, grad/param norm = 2.6443e-02, time/batch = 0.1714s
1113/9500 (epoch 5.858), train_loss = 1.55637665, grad/param norm = 2.5121e-02, time/batch = 0.1712s
1114/9500 (epoch 5.863), train_loss = 1.58591290, grad/param norm = 2.5833e-02, time/batch = 0.1721s
1115/9500 (epoch 5.868), train_loss = 1.56070707, grad/param norm = 2.7682e-02, time/batch = 0.1707s
1116/9500 (epoch 5.874), train_loss = 1.55226213, grad/param norm = 3.0444e-02, time/batch = 0.1707s
1117/9500 (epoch 5.879), train_loss = 1.54949378, grad/param norm = 2.8687e-02, time/batch = 0.1717s
1118/9500 (epoch 5.884), train_loss = 1.56988122, grad/param norm = 2.6302e-02, time/batch = 0.1710s
1119/9500 (epoch 5.889), train_loss = 1.58171265, grad/param norm = 2.6720e-02, time/batch = 0.1710s
1120/9500 (epoch 5.895), train_loss = 1.53821108, grad/param norm = 2.5877e-02, time/batch = 0.1717s
1121/9500 (epoch 5.900), train_loss = 1.53973559, grad/param norm = 2.5799e-02, time/batch = 0.1679s
1122/9500 (epoch 5.905), train_loss = 1.59484873, grad/param norm = 2.6945e-02, time/batch = 0.1714s
1123/9500 (epoch 5.911), train_loss = 1.53801155, grad/param norm = 2.5474e-02, time/batch = 0.1711s
1124/9500 (epoch 5.916), train_loss = 1.55928077, grad/param norm = 2.6353e-02, time/batch = 0.1700s
1125/9500 (epoch 5.921), train_loss = 1.55370684, grad/param norm = 2.5800e-02, time/batch = 0.1710s
1126/9500 (epoch 5.926), train_loss = 1.58016362, grad/param norm = 2.6609e-02, time/batch = 0.1707s
1127/9500 (epoch 5.932), train_loss = 1.56180913, grad/param norm = 2.5708e-02, time/batch = 0.1720s
1128/9500 (epoch 5.937), train_loss = 1.59221177, grad/param norm = 2.6835e-02, time/batch = 0.1710s
1129/9500 (epoch 5.942), train_loss = 1.57146146, grad/param norm = 2.8855e-02, time/batch = 0.1706s
1130/9500 (epoch 5.947), train_loss = 1.58948232, grad/param norm = 2.7271e-02, time/batch = 0.1714s
1131/9500 (epoch 5.953), train_loss = 1.59540340, grad/param norm = 2.4874e-02, time/batch = 0.1679s
1132/9500 (epoch 5.958), train_loss = 1.59555402, grad/param norm = 2.5575e-02, time/batch = 0.1707s
1133/9500 (epoch 5.963), train_loss = 1.57336476, grad/param norm = 2.5382e-02, time/batch = 0.1714s
1134/9500 (epoch 5.968), train_loss = 1.56566529, grad/param norm = 2.5334e-02, time/batch = 0.1711s
1135/9500 (epoch 5.974), train_loss = 1.56498470, grad/param norm = 2.4828e-02, time/batch = 0.1710s
1136/9500 (epoch 5.979), train_loss = 1.58861147, grad/param norm = 2.7582e-02, time/batch = 0.1718s
1137/9500 (epoch 5.984), train_loss = 1.57496490, grad/param norm = 2.8181e-02, time/batch = 0.1711s
1138/9500 (epoch 5.989), train_loss = 1.59986017, grad/param norm = 2.8200e-02, time/batch = 0.1713s
1139/9500 (epoch 5.995), train_loss = 1.59599957, grad/param norm = 2.8425e-02, time/batch = 0.1720s
1140/9500 (epoch 6.000), train_loss = 1.60858932, grad/param norm = 2.7920e-02, time/batch = 0.1715s
1141/9500 (epoch 6.005), train_loss = 1.73984857, grad/param norm = 3.0110e-02, time/batch = 0.1679s
1142/9500 (epoch 6.011), train_loss = 1.58995754, grad/param norm = 3.1278e-02, time/batch = 0.1718s
1143/9500 (epoch 6.016), train_loss = 1.63107630, grad/param norm = 2.9110e-02, time/batch = 0.1709s
1144/9500 (epoch 6.021), train_loss = 1.56470308, grad/param norm = 2.6084e-02, time/batch = 0.1710s
1145/9500 (epoch 6.026), train_loss = 1.61440370, grad/param norm = 2.6398e-02, time/batch = 0.1717s
1146/9500 (epoch 6.032), train_loss = 1.59324164, grad/param norm = 2.7083e-02, time/batch = 0.1707s
1147/9500 (epoch 6.037), train_loss = 1.55868971, grad/param norm = 2.4698e-02, time/batch = 0.1708s
1148/9500 (epoch 6.042), train_loss = 1.56878395, grad/param norm = 2.5445e-02, time/batch = 0.1716s
1149/9500 (epoch 6.047), train_loss = 1.57619318, grad/param norm = 2.5382e-02, time/batch = 0.1706s
1150/9500 (epoch 6.053), train_loss = 1.58639514, grad/param norm = 2.6805e-02, time/batch = 0.1704s
1151/9500 (epoch 6.058), train_loss = 1.60905317, grad/param norm = 2.7297e-02, time/batch = 0.1693s
1152/9500 (epoch 6.063), train_loss = 1.54873623, grad/param norm = 2.5672e-02, time/batch = 0.1710s
1153/9500 (epoch 6.068), train_loss = 1.55265088, grad/param norm = 2.6312e-02, time/batch = 0.1710s
1154/9500 (epoch 6.074), train_loss = 1.57865137, grad/param norm = 2.8123e-02, time/batch = 0.1714s
1155/9500 (epoch 6.079), train_loss = 1.56338261, grad/param norm = 2.4877e-02, time/batch = 0.1714s
1156/9500 (epoch 6.084), train_loss = 1.62495264, grad/param norm = 2.5598e-02, time/batch = 0.1707s
1157/9500 (epoch 6.089), train_loss = 1.61622963, grad/param norm = 2.7610e-02, time/batch = 0.1712s
1158/9500 (epoch 6.095), train_loss = 1.56827593, grad/param norm = 2.8256e-02, time/batch = 0.1705s
1159/9500 (epoch 6.100), train_loss = 1.58569304, grad/param norm = 2.5937e-02, time/batch = 0.1713s
1160/9500 (epoch 6.105), train_loss = 1.61280133, grad/param norm = 2.6153e-02, time/batch = 0.1711s
1161/9500 (epoch 6.111), train_loss = 1.53331320, grad/param norm = 2.4454e-02, time/batch = 0.1687s
1162/9500 (epoch 6.116), train_loss = 1.55069329, grad/param norm = 2.5777e-02, time/batch = 0.1708s
1163/9500 (epoch 6.121), train_loss = 1.55052384, grad/param norm = 2.6012e-02, time/batch = 0.1709s
1164/9500 (epoch 6.126), train_loss = 1.57263940, grad/param norm = 2.6088e-02, time/batch = 0.1708s
1165/9500 (epoch 6.132), train_loss = 1.58153190, grad/param norm = 2.6584e-02, time/batch = 0.1708s
1166/9500 (epoch 6.137), train_loss = 1.59831739, grad/param norm = 2.5384e-02, time/batch = 0.1706s
1167/9500 (epoch 6.142), train_loss = 1.56047926, grad/param norm = 2.4924e-02, time/batch = 0.1726s
1168/9500 (epoch 6.147), train_loss = 1.54751500, grad/param norm = 2.6674e-02, time/batch = 0.1706s
1169/9500 (epoch 6.153), train_loss = 1.59067149, grad/param norm = 2.6495e-02, time/batch = 0.1713s
1170/9500 (epoch 6.158), train_loss = 1.58233874, grad/param norm = 2.4550e-02, time/batch = 0.1716s
1171/9500 (epoch 6.163), train_loss = 1.58411314, grad/param norm = 2.7291e-02, time/batch = 0.1681s
1172/9500 (epoch 6.168), train_loss = 1.55041063, grad/param norm = 2.7612e-02, time/batch = 0.1712s
1173/9500 (epoch 6.174), train_loss = 1.56140073, grad/param norm = 2.9202e-02, time/batch = 0.1714s
1174/9500 (epoch 6.179), train_loss = 1.62295443, grad/param norm = 2.9280e-02, time/batch = 0.1710s
1175/9500 (epoch 6.184), train_loss = 1.56843423, grad/param norm = 2.9110e-02, time/batch = 0.1709s
1176/9500 (epoch 6.189), train_loss = 1.60055677, grad/param norm = 2.9198e-02, time/batch = 0.1717s
1177/9500 (epoch 6.195), train_loss = 1.57765967, grad/param norm = 2.7748e-02, time/batch = 0.1709s
1178/9500 (epoch 6.200), train_loss = 1.58147513, grad/param norm = 2.7744e-02, time/batch = 0.1708s
1179/9500 (epoch 6.205), train_loss = 1.54784093, grad/param norm = 2.8430e-02, time/batch = 0.1710s
1180/9500 (epoch 6.211), train_loss = 1.58361659, grad/param norm = 2.8003e-02, time/batch = 0.1720s
1181/9500 (epoch 6.216), train_loss = 1.55975837, grad/param norm = 2.7251e-02, time/batch = 0.1677s
1182/9500 (epoch 6.221), train_loss = 1.58656626, grad/param norm = 2.5014e-02, time/batch = 0.1709s
1183/9500 (epoch 6.226), train_loss = 1.56164750, grad/param norm = 2.4552e-02, time/batch = 0.1716s
1184/9500 (epoch 6.232), train_loss = 1.57823615, grad/param norm = 2.5413e-02, time/batch = 0.1712s
1185/9500 (epoch 6.237), train_loss = 1.56140816, grad/param norm = 2.4587e-02, time/batch = 0.1711s
1186/9500 (epoch 6.242), train_loss = 1.58552214, grad/param norm = 2.5522e-02, time/batch = 0.1717s
1187/9500 (epoch 6.247), train_loss = 1.57030070, grad/param norm = 2.5697e-02, time/batch = 0.1713s
1188/9500 (epoch 6.253), train_loss = 1.56635670, grad/param norm = 2.4729e-02, time/batch = 0.1711s
1189/9500 (epoch 6.258), train_loss = 1.57043375, grad/param norm = 2.4269e-02, time/batch = 0.1717s
1190/9500 (epoch 6.263), train_loss = 1.53808373, grad/param norm = 2.4381e-02, time/batch = 0.1707s
1191/9500 (epoch 6.268), train_loss = 1.53994649, grad/param norm = 2.4872e-02, time/batch = 0.1683s
1192/9500 (epoch 6.274), train_loss = 1.55982841, grad/param norm = 2.4594e-02, time/batch = 0.1717s
1193/9500 (epoch 6.279), train_loss = 1.57214562, grad/param norm = 2.5002e-02, time/batch = 0.1709s
1194/9500 (epoch 6.284), train_loss = 1.56773368, grad/param norm = 2.5845e-02, time/batch = 0.1712s
1195/9500 (epoch 6.289), train_loss = 1.54561214, grad/param norm = 2.9292e-02, time/batch = 0.1714s
1196/9500 (epoch 6.295), train_loss = 1.55530141, grad/param norm = 2.8230e-02, time/batch = 0.1710s
1197/9500 (epoch 6.300), train_loss = 1.58661561, grad/param norm = 2.4611e-02, time/batch = 0.1712s
1198/9500 (epoch 6.305), train_loss = 1.57494878, grad/param norm = 2.4050e-02, time/batch = 0.1718s
1199/9500 (epoch 6.311), train_loss = 1.54837640, grad/param norm = 2.4839e-02, time/batch = 0.1708s
1200/9500 (epoch 6.316), train_loss = 1.57026181, grad/param norm = 2.6232e-02, time/batch = 0.1710s
1201/9500 (epoch 6.321), train_loss = 1.58378760, grad/param norm = 2.5740e-02, time/batch = 0.1687s
1202/9500 (epoch 6.326), train_loss = 1.57102396, grad/param norm = 2.5527e-02, time/batch = 0.1711s
1203/9500 (epoch 6.332), train_loss = 1.57766544, grad/param norm = 2.5115e-02, time/batch = 0.1713s
1204/9500 (epoch 6.337), train_loss = 1.56476935, grad/param norm = 2.5721e-02, time/batch = 0.1711s
1205/9500 (epoch 6.342), train_loss = 1.55041074, grad/param norm = 2.5041e-02, time/batch = 0.1706s
1206/9500 (epoch 6.347), train_loss = 1.55436800, grad/param norm = 2.5340e-02, time/batch = 0.1707s
1207/9500 (epoch 6.353), train_loss = 1.58356120, grad/param norm = 2.7574e-02, time/batch = 0.1712s
1208/9500 (epoch 6.358), train_loss = 1.54088088, grad/param norm = 2.5341e-02, time/batch = 0.1709s
1209/9500 (epoch 6.363), train_loss = 1.52250770, grad/param norm = 2.6208e-02, time/batch = 0.1708s
1210/9500 (epoch 6.368), train_loss = 1.55153673, grad/param norm = 2.9915e-02, time/batch = 0.1713s
1211/9500 (epoch 6.374), train_loss = 1.56547657, grad/param norm = 3.1736e-02, time/batch = 0.1680s
1212/9500 (epoch 6.379), train_loss = 1.58293375, grad/param norm = 3.1457e-02, time/batch = 0.1714s
1213/9500 (epoch 6.384), train_loss = 1.55846135, grad/param norm = 2.8564e-02, time/batch = 0.1710s
1214/9500 (epoch 6.389), train_loss = 1.57061131, grad/param norm = 2.6531e-02, time/batch = 0.1718s
1215/9500 (epoch 6.395), train_loss = 1.55018004, grad/param norm = 2.4736e-02, time/batch = 0.1712s
1216/9500 (epoch 6.400), train_loss = 1.59280874, grad/param norm = 2.4181e-02, time/batch = 0.1709s
1217/9500 (epoch 6.405), train_loss = 1.58309348, grad/param norm = 2.4618e-02, time/batch = 0.1712s
1218/9500 (epoch 6.411), train_loss = 1.57834624, grad/param norm = 2.3766e-02, time/batch = 0.1710s
1219/9500 (epoch 6.416), train_loss = 1.56666605, grad/param norm = 2.4147e-02, time/batch = 0.1710s
1220/9500 (epoch 6.421), train_loss = 1.59191423, grad/param norm = 2.3964e-02, time/batch = 0.1717s
1221/9500 (epoch 6.426), train_loss = 1.55453262, grad/param norm = 2.4590e-02, time/batch = 0.1679s
1222/9500 (epoch 6.432), train_loss = 1.54508754, grad/param norm = 2.6057e-02, time/batch = 0.1708s
1223/9500 (epoch 6.437), train_loss = 1.57658995, grad/param norm = 2.6704e-02, time/batch = 0.1721s
1224/9500 (epoch 6.442), train_loss = 1.59198083, grad/param norm = 2.6873e-02, time/batch = 0.1710s
1225/9500 (epoch 6.447), train_loss = 1.55912389, grad/param norm = 2.6679e-02, time/batch = 0.1711s
1226/9500 (epoch 6.453), train_loss = 1.58403724, grad/param norm = 2.6577e-02, time/batch = 0.1713s
1227/9500 (epoch 6.458), train_loss = 1.55439860, grad/param norm = 2.5479e-02, time/batch = 0.1708s
1228/9500 (epoch 6.463), train_loss = 1.58069977, grad/param norm = 2.5863e-02, time/batch = 0.1715s
1229/9500 (epoch 6.468), train_loss = 1.56863065, grad/param norm = 2.6022e-02, time/batch = 0.1710s
1230/9500 (epoch 6.474), train_loss = 1.56931305, grad/param norm = 2.6948e-02, time/batch = 0.1709s
1231/9500 (epoch 6.479), train_loss = 1.55011195, grad/param norm = 2.6195e-02, time/batch = 0.1678s
1232/9500 (epoch 6.484), train_loss = 1.57551002, grad/param norm = 2.6773e-02, time/batch = 0.1718s
1233/9500 (epoch 6.489), train_loss = 1.58969910, grad/param norm = 2.5897e-02, time/batch = 0.1710s
1234/9500 (epoch 6.495), train_loss = 1.58992211, grad/param norm = 2.6129e-02, time/batch = 0.1711s
1235/9500 (epoch 6.500), train_loss = 1.55782776, grad/param norm = 2.4416e-02, time/batch = 0.1713s
1236/9500 (epoch 6.505), train_loss = 1.54011432, grad/param norm = 2.5565e-02, time/batch = 0.1716s
1237/9500 (epoch 6.511), train_loss = 1.58537027, grad/param norm = 2.5914e-02, time/batch = 0.1713s
1238/9500 (epoch 6.516), train_loss = 1.56039585, grad/param norm = 2.4451e-02, time/batch = 0.1709s
1239/9500 (epoch 6.521), train_loss = 1.56065989, grad/param norm = 2.3829e-02, time/batch = 0.1716s
1240/9500 (epoch 6.526), train_loss = 1.55923576, grad/param norm = 2.2959e-02, time/batch = 0.1714s
1241/9500 (epoch 6.532), train_loss = 1.59549124, grad/param norm = 2.4361e-02, time/batch = 0.1679s
1242/9500 (epoch 6.537), train_loss = 1.58152433, grad/param norm = 2.5870e-02, time/batch = 0.1721s
1243/9500 (epoch 6.542), train_loss = 1.55773341, grad/param norm = 2.4856e-02, time/batch = 0.1710s
1244/9500 (epoch 6.547), train_loss = 1.57046490, grad/param norm = 2.4631e-02, time/batch = 0.1713s
1245/9500 (epoch 6.553), train_loss = 1.53541806, grad/param norm = 2.4299e-02, time/batch = 0.1721s
1246/9500 (epoch 6.558), train_loss = 1.54894270, grad/param norm = 2.3665e-02, time/batch = 0.1711s
1247/9500 (epoch 6.563), train_loss = 1.53760676, grad/param norm = 2.3848e-02, time/batch = 0.1710s
1248/9500 (epoch 6.568), train_loss = 1.55027062, grad/param norm = 2.4383e-02, time/batch = 0.1714s
1249/9500 (epoch 6.574), train_loss = nan, grad/param norm = 4.4345e+01, time/batch = 0.1712s
loss is NaN. This usually indicates a bug. Please check the issues page for existing issues, or create a new issue, if none exist. Ideally, please state: your operating system, 32-bit/64-bit, your blas version, cpu/cuda/cl?
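For anyone skimming the log: the only entry that stands out is the last one, where grad/param norm jumps to 4.4345e+01 from roughly 2.4e-02 on the step before and train_loss becomes nan. Below is a rough sketch of how such entries could be picked out of a saved copy of this output. It is not part of char-rnn; the function names and the `spike_factor` threshold are assumptions made up for illustration, and the regex only assumes the line format shown in the log above.

```python
import math
import re

# Matches entries in the format printed in the log above, e.g.
# "1249/9500 (epoch 6.574), train_loss = nan, grad/param norm = 4.4345e+01, time/batch = 0.1712s"
LINE_RE = re.compile(
    r"(\d+)/(\d+) \(epoch ([\d.]+)\), train_loss = (\S+), "
    r"grad/param norm = (\S+), time/batch = ([\d.]+)s"
)


def parse_log(text):
    """Yield (iteration, train_loss, grad_param_norm) for every matching entry."""
    for m in LINE_RE.finditer(text):
        yield int(m.group(1)), float(m.group(4)), float(m.group(5))


def find_anomalies(text, spike_factor=10.0):
    """Return entries whose loss is NaN or whose norm exceeds
    spike_factor times the norm of the previous entry.

    spike_factor is an arbitrary illustrative threshold, not a char-rnn setting.
    """
    flagged = []
    prev_norm = None
    for iteration, loss, norm in parse_log(text):
        spiked = prev_norm is not None and norm > spike_factor * prev_norm
        if math.isnan(loss) or spiked:
            flagged.append((iteration, loss, norm))
        prev_norm = norm
    return flagged


# The last two entries of the log above, copied verbatim.
sample = (
    "1248/9500 (epoch 6.568), train_loss = 1.55027062, "
    "grad/param norm = 2.4383e-02, time/batch = 0.1714s\n"
    "1249/9500 (epoch 6.574), train_loss = nan, "
    "grad/param norm = 4.4345e+01, time/batch = 0.1712s\n"
)
print(find_anomalies(sample))
```

Because the regex is applied with `finditer` rather than line by line, it should also cope with entries that ended up on the same line as other output, such as the `saving checkpoint` message.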