huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

When the loss is 0 during DPO training #1236

Closed Minami-su closed 9 months ago

Minami-su commented 10 months ago
{'loss': 0.3271, 'learning_rate': 2.8571428571428573e-06, 'rewards/chosen': 9.76550579071045, 'rewards/rejected': 5.694394111633301, 'rewards/accuracies': 1.0, 'rewards/margins': 4.071112155914307, 'logps/rejected': -222.74356079101562, 'logps/chosen': -379.7199401855469, 'logits/rejected': -2.2036616802215576, 'logits/chosen': -1.4043893814086914, 'epoch': 0.0}
{'loss': 4.337, 'learning_rate': 5.7142857142857145e-06, 'rewards/chosen': 10.37876033782959, 'rewards/rejected': 14.287623405456543, 'rewards/accuracies': 0.5, 'rewards/margins': -3.908863067626953, 'logps/rejected': -324.311279296875, 'logps/chosen': -444.46240234375, 'logits/rejected': -2.572854995727539, 'logits/chosen': -2.4661190509796143, 'epoch': 0.01}
{'loss': 3.6449, 'learning_rate': 8.571428571428573e-06, 'rewards/chosen': 8.665023803710938, 'rewards/rejected': 12.245894432067871, 'rewards/accuracies': 0.0, 'rewards/margins': -3.5808703899383545, 'logps/rejected': -308.5410461425781, 'logps/chosen': -502.5372619628906, 'logits/rejected': -2.6817407608032227, 'logits/chosen': -2.560394763946533, 'epoch': 0.01}
{'loss': 0.0656, 'learning_rate': 1.1428571428571429e-05, 'rewards/chosen': 9.962109565734863, 'rewards/rejected': 5.9517951011657715, 'rewards/accuracies': 1.0, 'rewards/margins': 4.010313987731934, 'logps/rejected': -259.4195556640625, 'logps/chosen': -387.62890625, 'logits/rejected': -2.6549124717712402, 'logits/chosen': -2.5195915699005127, 'epoch': 0.01}
{'loss': 1.7342, 'learning_rate': 1.4285714285714285e-05, 'rewards/chosen': 5.09570837020874, 'rewards/rejected': 6.3042988777160645, 'rewards/accuracies': 0.5, 'rewards/margins': -1.2085902690887451, 'logps/rejected': -352.33203125, 'logps/chosen': -536.0429077148438, 'logits/rejected': -2.5064265727996826, 'logits/chosen': -2.378230571746826, 'epoch': 0.01}
{'loss': 0.6792, 'learning_rate': 1.7142857142857145e-05, 'rewards/chosen': 9.435712814331055, 'rewards/rejected': 7.611403942108154, 'rewards/accuracies': 0.5, 'rewards/margins': 1.8243086338043213, 'logps/rejected': -268.1984558105469, 'logps/chosen': -379.20538330078125, 'logits/rejected': -2.601071357727051, 'logits/chosen': -2.4120986461639404, 'epoch': 0.02}
{'loss': 2.7966, 'learning_rate': 2e-05, 'rewards/chosen': 9.528672218322754, 'rewards/rejected': 12.068880081176758, 'rewards/accuracies': 0.0, 'rewards/margins': -2.540207862854004, 'logps/rejected': -406.18621826171875, 'logps/chosen': -485.3382873535156, 'logits/rejected': -2.6609809398651123, 'logits/chosen': -2.5862650871276855, 'epoch': 0.02}
{'loss': 1.5068, 'learning_rate': 2.2857142857142858e-05, 'rewards/chosen': 11.086642265319824, 'rewards/rejected': 11.74729061126709, 'rewards/accuracies': 0.5, 'rewards/margins': -0.6606481075286865, 'logps/rejected': -420.277099609375, 'logps/chosen': -536.758544921875, 'logits/rejected': -2.528085708618164, 'logits/chosen': -2.6056628227233887, 'epoch': 0.02}
{'loss': 0.6193, 'learning_rate': 2.5714285714285714e-05, 'rewards/chosen': 5.4949235916137695, 'rewards/rejected': 3.6014909744262695, 'rewards/accuracies': 0.5, 'rewards/margins': 1.8934327363967896, 'logps/rejected': -262.360107421875, 'logps/chosen': -317.42578125, 'logits/rejected': -2.3279213905334473, 'logits/chosen': -2.267106533050537, 'epoch': 0.03}
{'loss': 3.1323, 'learning_rate': 2.857142857142857e-05, 'rewards/chosen': 9.998563766479492, 'rewards/rejected': 12.440605163574219, 'rewards/accuracies': 0.5, 'rewards/margins': -2.4420409202575684, 'logps/rejected': -381.46893310546875, 'logps/chosen': -435.32684326171875, 'logits/rejected': -2.5734753608703613, 'logits/chosen': -2.567666530609131, 'epoch': 0.03}
{'loss': 2.269, 'learning_rate': 3.142857142857143e-05, 'rewards/chosen': 2.9417946338653564, 'rewards/rejected': 5.100345611572266, 'rewards/accuracies': 0.0, 'rewards/margins': -2.158550977706909, 'logps/rejected': -219.24655151367188, 'logps/chosen': -232.14456176757812, 'logits/rejected': -2.4342288970947266, 'logits/chosen': -2.5364158153533936, 'epoch': 0.03}
{'loss': 4.1993, 'learning_rate': 3.428571428571429e-05, 'rewards/chosen': 3.1465797424316406, 'rewards/rejected': 7.279343605041504, 'rewards/accuracies': 0.0, 'rewards/margins': -4.132763862609863, 'logps/rejected': -191.58157348632812, 'logps/chosen': -207.28421020507812, 'logits/rejected': -1.7480744123458862, 'logits/chosen': -2.2305498123168945, 'epoch': 0.03}
{'loss': 0.6287, 'learning_rate': 3.7142857142857143e-05, 'rewards/chosen': 5.093177318572998, 'rewards/rejected': 3.0099830627441406, 'rewards/accuracies': 0.5, 'rewards/margins': 2.0831940174102783, 'logps/rejected': -239.65017700195312, 'logps/chosen': -238.9432373046875, 'logits/rejected': -2.543966293334961, 'logits/chosen': -2.6785264015197754, 'epoch': 0.04}
{'loss': 2.9821, 'learning_rate': 4e-05, 'rewards/chosen': 18.29865074157715, 'rewards/rejected': 21.224306106567383, 'rewards/accuracies': 0.0, 'rewards/margins': -2.925654888153076, 'logps/rejected': -566.0069580078125, 'logps/chosen': -1041.013427734375, 'logits/rejected': -2.271622896194458, 'logits/chosen': -1.9407470226287842, 'epoch': 0.04}
{'loss': 1.1113, 'learning_rate': 4.2857142857142856e-05, 'rewards/chosen': 8.190760612487793, 'rewards/rejected': 7.401647567749023, 'rewards/accuracies': 0.5, 'rewards/margins': 0.7891130447387695, 'logps/rejected': -203.2960205078125, 'logps/chosen': -248.46739196777344, 'logits/rejected': -2.9921154975891113, 'logits/chosen': -2.8795385360717773, 'epoch': 0.04}
{'loss': 0.7147, 'learning_rate': 4.5714285714285716e-05, 'rewards/chosen': 6.4853363037109375, 'rewards/rejected': 6.512063980102539, 'rewards/accuracies': 0.5, 'rewards/margins': -0.02672719955444336, 'logps/rejected': -286.2543640136719, 'logps/chosen': -271.0216369628906, 'logits/rejected': -2.9419686794281006, 'logits/chosen': -2.830944776535034, 'epoch': 0.05}
{'loss': 0.007, 'learning_rate': 4.8571428571428576e-05, 'rewards/chosen': 11.858444213867188, 'rewards/rejected': 6.615283012390137, 'rewards/accuracies': 1.0, 'rewards/margins': 5.243161201477051, 'logps/rejected': -400.72216796875, 'logps/chosen': -643.16552734375, 'logits/rejected': -2.5607786178588867, 'logits/chosen': -2.370227336883545, 'epoch': 0.05}
{'loss': 4.6525, 'learning_rate': 5.142857142857143e-05, 'rewards/chosen': 10.521270751953125, 'rewards/rejected': 14.831465721130371, 'rewards/accuracies': 0.0, 'rewards/margins': -4.310194492340088, 'logps/rejected': -355.18536376953125, 'logps/chosen': -424.28729248046875, 'logits/rejected': -1.7376881837844849, 'logits/chosen': -2.2986490726470947, 'epoch': 0.05}
{'loss': 0.3959, 'learning_rate': 5.428571428571428e-05, 'rewards/chosen': 8.702041625976562, 'rewards/rejected': 6.573768615722656, 'rewards/accuracies': 0.5, 'rewards/margins': 2.128272771835327, 'logps/rejected': -214.32481384277344, 'logps/chosen': -419.6670837402344, 'logits/rejected': -2.9277946949005127, 'logits/chosen': -2.489633798599243, 'epoch': 0.05}
{'loss': 0.1789, 'learning_rate': 5.714285714285714e-05, 'rewards/chosen': 13.2500638961792, 'rewards/rejected': 11.617621421813965, 'rewards/accuracies': 1.0, 'rewards/margins': 1.6324424743652344, 'logps/rejected': -409.19879150390625, 'logps/chosen': -585.9993896484375, 'logits/rejected': -2.337674379348755, 'logits/chosen': -2.106407403945923, 'epoch': 0.06}
{'loss': 1.285, 'learning_rate': 6e-05, 'rewards/chosen': 3.878985643386841, 'rewards/rejected': 4.549983501434326, 'rewards/accuracies': 0.5, 'rewards/margins': -0.6709977388381958, 'logps/rejected': -163.6251678466797, 'logps/chosen': -205.08514404296875, 'logits/rejected': -2.3935039043426514, 'logits/chosen': -2.5867364406585693, 'epoch': 0.06}
{'loss': 0.7075, 'learning_rate': 6.285714285714286e-05, 'rewards/chosen': 7.318753242492676, 'rewards/rejected': 7.251966953277588, 'rewards/accuracies': 0.5, 'rewards/margins': 0.06678664684295654, 'logps/rejected': -330.2303466796875, 'logps/chosen': -401.4999694824219, 'logits/rejected': -2.40046763420105, 'logits/chosen': -2.300901412963867, 'epoch': 0.06}
{'loss': 1.8219, 'learning_rate': 6.571428571428571e-05, 'rewards/chosen': 5.848296165466309, 'rewards/rejected': 5.868965148925781, 'rewards/accuracies': 0.5, 'rewards/margins': -0.020668745040893555, 'logps/rejected': -218.5603485107422, 'logps/chosen': -213.76702880859375, 'logits/rejected': -2.3953349590301514, 'logits/chosen': -2.6317856311798096, 'epoch': 0.07}
{'loss': 0.2256, 'learning_rate': 6.857142857142858e-05, 'rewards/chosen': 12.258187294006348, 'rewards/rejected': 10.34899616241455, 'rewards/accuracies': 1.0, 'rewards/margins': 1.9091908931732178, 'logps/rejected': -286.4475402832031, 'logps/chosen': -384.1056213378906, 'logits/rejected': -2.853313684463501, 'logits/chosen': -2.73045015335083, 'epoch': 0.07}
{'loss': 0.0886, 'learning_rate': 7.142857142857143e-05, 'rewards/chosen': 15.036487579345703, 'rewards/rejected': 11.12469482421875, 'rewards/accuracies': 1.0, 'rewards/margins': 3.911792039871216, 'logps/rejected': -359.0030517578125, 'logps/chosen': -595.6351318359375, 'logits/rejected': -2.4213926792144775, 'logits/chosen': -1.9984757900238037, 'epoch': 0.07}
{'loss': 0.1488, 'learning_rate': 7.428571428571429e-05, 'rewards/chosen': 5.703810214996338, 'rewards/rejected': 3.6077332496643066, 'rewards/accuracies': 1.0, 'rewards/margins': 2.0960769653320312, 'logps/rejected': -170.23516845703125, 'logps/chosen': -205.96189880371094, 'logits/rejected': -2.23213791847229, 'logits/chosen': -2.1666438579559326, 'epoch': 0.07}
{'loss': 0.1593, 'learning_rate': 7.714285714285715e-05, 'rewards/chosen': 9.102004051208496, 'rewards/rejected': 7.335665702819824, 'rewards/accuracies': 1.0, 'rewards/margins': 1.7663383483886719, 'logps/rejected': -240.70584106445312, 'logps/chosen': -452.22998046875, 'logits/rejected': -2.70328950881958, 'logits/chosen': -2.2396998405456543, 'epoch': 0.08}
{'loss': 5.1294, 'learning_rate': 8e-05, 'rewards/chosen': 7.965903282165527, 'rewards/rejected': 11.899565696716309, 'rewards/accuracies': 0.5, 'rewards/margins': -3.9336628913879395, 'logps/rejected': -329.56683349609375, 'logps/chosen': -415.9659729003906, 'logits/rejected': -2.6231846809387207, 'logits/chosen': -2.4714009761810303, 'epoch': 0.08}
{'loss': 0.2517, 'learning_rate': 8.285714285714287e-05, 'rewards/chosen': 15.831401824951172, 'rewards/rejected': 8.378949165344238, 'rewards/accuracies': 1.0, 'rewards/margins': 7.452452659606934, 'logps/rejected': -330.96051025390625, 'logps/chosen': -413.93597412109375, 'logits/rejected': -2.522907257080078, 'logits/chosen': -2.6029484272003174, 'epoch': 0.08}
{'loss': 1.1566, 'learning_rate': 8.571428571428571e-05, 'rewards/chosen': 6.751397132873535, 'rewards/rejected': 5.360540866851807, 'rewards/accuracies': 0.5, 'rewards/margins': 1.3908562660217285, 'logps/rejected': -195.64459228515625, 'logps/chosen': -180.86102294921875, 'logits/rejected': -2.359625816345215, 'logits/chosen': -2.726764440536499, 'epoch': 0.09}
{'loss': 0.0038, 'learning_rate': 8.857142857142857e-05, 'rewards/chosen': 11.328934669494629, 'rewards/rejected': 5.569596290588379, 'rewards/accuracies': 1.0, 'rewards/margins': 5.75933837890625, 'logps/rejected': -307.4915466308594, 'logps/chosen': -361.2106628417969, 'logits/rejected': -3.157242774963379, 'logits/chosen': -2.837930679321289, 'epoch': 0.09}
{'loss': 2.2857, 'learning_rate': 9.142857142857143e-05, 'rewards/chosen': 10.191067695617676, 'rewards/rejected': 10.912897109985352, 'rewards/accuracies': 0.5, 'rewards/margins': -0.7218294143676758, 'logps/rejected': -545.8710327148438, 'logps/chosen': -665.58935546875, 'logits/rejected': -2.3032188415527344, 'logits/chosen': -2.255991220474243, 'epoch': 0.09}
{'loss': 1.1507, 'learning_rate': 9.428571428571429e-05, 'rewards/chosen': 4.873177528381348, 'rewards/rejected': 3.9558022022247314, 'rewards/accuracies': 0.5, 'rewards/margins': 0.9173754453659058, 'logps/rejected': -270.8794860839844, 'logps/chosen': -335.8307189941406, 'logits/rejected': -2.731945037841797, 'logits/chosen': -2.534651756286621, 'epoch': 0.09}
{'loss': 2.1113, 'learning_rate': 9.714285714285715e-05, 'rewards/chosen': 11.037872314453125, 'rewards/rejected': 11.834676742553711, 'rewards/accuracies': 0.5, 'rewards/margins': -0.7968049049377441, 'logps/rejected': -375.4657287597656, 'logps/chosen': -674.8712768554688, 'logits/rejected': -2.398104190826416, 'logits/chosen': -2.0487220287323, 'epoch': 0.1}
{'loss': 0.0076, 'learning_rate': 0.0001, 'rewards/chosen': 9.013261795043945, 'rewards/rejected': 3.828528642654419, 'rewards/accuracies': 1.0, 'rewards/margins': 5.1847333908081055, 'logps/rejected': -297.1522216796875, 'logps/chosen': -286.61737060546875, 'logits/rejected': -2.5126781463623047, 'logits/chosen': -2.7587356567382812, 'epoch': 0.1}
{'loss': 3.9537, 'learning_rate': 9.968152866242038e-05, 'rewards/chosen': 6.393144607543945, 'rewards/rejected': 6.25003719329834, 'rewards/accuracies': 0.5, 'rewards/margins': 0.14310741424560547, 'logps/rejected': -283.6871337890625, 'logps/chosen': -244.9435577392578, 'logits/rejected': -2.2845373153686523, 'logits/chosen': -2.5686628818511963, 'epoch': 0.1}
{'loss': 0.0595, 'learning_rate': 9.936305732484077e-05, 'rewards/chosen': 9.288958549499512, 'rewards/rejected': 5.675914764404297, 'rewards/accuracies': 1.0, 'rewards/margins': 3.613043785095215, 'logps/rejected': -210.5533447265625, 'logps/chosen': -200.42291259765625, 'logits/rejected': -1.293943166732788, 'logits/chosen': -1.6267205476760864, 'epoch': 0.11}
{'loss': 0.0589, 'learning_rate': 9.904458598726115e-05, 'rewards/chosen': 5.602908134460449, 'rewards/rejected': 2.799147129058838, 'rewards/accuracies': 1.0, 'rewards/margins': 2.8037612438201904, 'logps/rejected': -187.88352966308594, 'logps/chosen': -170.78341674804688, 'logits/rejected': -2.072585105895996, 'logits/chosen': -1.884591817855835, 'epoch': 0.11}
{'loss': 0.9505, 'learning_rate': 9.872611464968153e-05, 'rewards/chosen': 5.654599189758301, 'rewards/rejected': 2.2882285118103027, 'rewards/accuracies': 0.5, 'rewards/margins': 3.366370677947998, 'logps/rejected': -289.8052062988281, 'logps/chosen': -241.89151000976562, 'logits/rejected': -2.1928834915161133, 'logits/chosen': -2.242314100265503, 'epoch': 0.11}
{'loss': 0.9561, 'learning_rate': 9.840764331210192e-05, 'rewards/chosen': 2.716583490371704, 'rewards/rejected': 2.9357361793518066, 'rewards/accuracies': 0.5, 'rewards/margins': -0.21915274858474731, 'logps/rejected': -339.14263916015625, 'logps/chosen': -376.89666748046875, 'logits/rejected': -2.4038279056549072, 'logits/chosen': -2.3232498168945312, 'epoch': 0.11}
{'loss': 0.785, 'learning_rate': 9.80891719745223e-05, 'rewards/chosen': 7.414801597595215, 'rewards/rejected': 3.539991855621338, 'rewards/accuracies': 0.5, 'rewards/margins': 3.874809741973877, 'logps/rejected': -372.60009765625, 'logps/chosen': -304.47698974609375, 'logits/rejected': -2.430130958557129, 'logits/chosen': -2.688509464263916, 'epoch': 0.12}
{'loss': 0.0155, 'learning_rate': 9.777070063694268e-05, 'rewards/chosen': 18.055667877197266, 'rewards/rejected': 7.982444763183594, 'rewards/accuracies': 1.0, 'rewards/margins': 10.073222160339355, 'logps/rejected': -517.175537109375, 'logps/chosen': -667.943359375, 'logits/rejected': -2.085599899291992, 'logits/chosen': -1.806464433670044, 'epoch': 0.12}
{'loss': 0.0024, 'learning_rate': 9.745222929936307e-05, 'rewards/chosen': 4.081321716308594, 'rewards/rejected': -3.570204257965088, 'rewards/accuracies': 1.0, 'rewards/margins': 7.65152645111084, 'logps/rejected': -223.76454162597656, 'logps/chosen': -354.561767578125, 'logits/rejected': -2.924449920654297, 'logits/chosen': -2.23237681388855, 'epoch': 0.12}
{'loss': 0.1841, 'learning_rate': 9.713375796178345e-05, 'rewards/chosen': 2.394066572189331, 'rewards/rejected': -2.6640756130218506, 'rewards/accuracies': 1.0, 'rewards/margins': 5.058141708374023, 'logps/rejected': -234.0782470703125, 'logps/chosen': -215.621826171875, 'logits/rejected': -2.2443792819976807, 'logits/chosen': -2.0835461616516113, 'epoch': 0.13}
{'loss': 0.2342, 'learning_rate': 9.681528662420382e-05, 'rewards/chosen': 6.764308452606201, 'rewards/rejected': 4.498918056488037, 'rewards/accuracies': 1.0, 'rewards/margins': 2.265390157699585, 'logps/rejected': -487.94830322265625, 'logps/chosen': -528.35693359375, 'logits/rejected': -2.078345537185669, 'logits/chosen': -2.233227491378784, 'epoch': 0.13}
{'loss': 0.0063, 'learning_rate': 9.649681528662421e-05, 'rewards/chosen': 6.213534832000732, 'rewards/rejected': 0.9645828008651733, 'rewards/accuracies': 1.0, 'rewards/margins': 5.2489519119262695, 'logps/rejected': -296.79168701171875, 'logps/chosen': -303.1146545410156, 'logits/rejected': -2.547497034072876, 'logits/chosen': -2.4778566360473633, 'epoch': 0.13}
{'loss': 0.0024, 'learning_rate': 9.617834394904459e-05, 'rewards/chosen': 8.117159843444824, 'rewards/rejected': -0.3035072088241577, 'rewards/accuracies': 1.0, 'rewards/margins': 8.42066764831543, 'logps/rejected': -420.0975646972656, 'logps/chosen': -411.7033996582031, 'logits/rejected': -2.2457261085510254, 'logits/chosen': -2.2690463066101074, 'epoch': 0.13}
{'loss': 0.0422, 'learning_rate': 9.585987261146497e-05, 'rewards/chosen': 4.635070323944092, 'rewards/rejected': -0.17799071967601776, 'rewards/accuracies': 1.0, 'rewards/margins': 4.813060760498047, 'logps/rejected': -359.7799072265625, 'logps/chosen': -451.5242919921875, 'logits/rejected': -2.3562259674072266, 'logits/chosen': -2.100349187850952, 'epoch': 0.14}
{'loss': 0.0005, 'learning_rate': 9.554140127388536e-05, 'rewards/chosen': 7.082529067993164, 'rewards/rejected': -1.645300269126892, 'rewards/accuracies': 1.0, 'rewards/margins': 8.727828979492188, 'logps/rejected': -567.9530029296875, 'logps/chosen': -881.6746826171875, 'logits/rejected': -2.062581777572632, 'logits/chosen': -1.5478285551071167, 'epoch': 0.14}
{'loss': 0.0, 'learning_rate': 9.522292993630574e-05, 'rewards/chosen': 4.909414291381836, 'rewards/rejected': -6.0071258544921875, 'rewards/accuracies': 1.0, 'rewards/margins': 10.916540145874023, 'logps/rejected': -278.0087585449219, 'logps/chosen': -181.53085327148438, 'logits/rejected': -3.019265651702881, 'logits/chosen': -2.816269874572754, 'epoch': 0.14}
{'loss': 0.0412, 'learning_rate': 9.490445859872612e-05, 'rewards/chosen': 1.5037834644317627, 'rewards/rejected': -4.786923408508301, 'rewards/accuracies': 1.0, 'rewards/margins': 6.290706634521484, 'logps/rejected': -414.8692321777344, 'logps/chosen': -362.024658203125, 'logits/rejected': -2.1436266899108887, 'logits/chosen': -2.0256175994873047, 'epoch': 0.15}
{'loss': 0.013, 'learning_rate': 9.458598726114651e-05, 'rewards/chosen': 1.7205756902694702, 'rewards/rejected': -6.361499309539795, 'rewards/accuracies': 1.0, 'rewards/margins': 8.082075119018555, 'logps/rejected': -277.802490234375, 'logps/chosen': -221.79425048828125, 'logits/rejected': -2.131343126296997, 'logits/chosen': -2.170713424682617, 'epoch': 0.15}
{'loss': 0.0, 'learning_rate': 9.426751592356689e-05, 'rewards/chosen': 11.538044929504395, 'rewards/rejected': 1.1099976301193237, 'rewards/accuracies': 1.0, 'rewards/margins': 10.428047180175781, 'logps/rejected': -542.4000244140625, 'logps/chosen': -589.8695678710938, 'logits/rejected': -1.9713342189788818, 'logits/chosen': -1.994497299194336, 'epoch': 0.15}
{'loss': 0.0, 'learning_rate': 9.394904458598726e-05, 'rewards/chosen': 11.402349472045898, 'rewards/rejected': -2.0969910621643066, 'rewards/accuracies': 1.0, 'rewards/margins': 13.499340057373047, 'logps/rejected': -570.9699096679688, 'logps/chosen': -647.6015014648438, 'logits/rejected': -2.510369300842285, 'logits/chosen': -2.2893569469451904, 'epoch': 0.15}
{'loss': 0.01, 'learning_rate': 9.363057324840766e-05, 'rewards/chosen': 8.363224029541016, 'rewards/rejected': -1.062709093093872, 'rewards/accuracies': 1.0, 'rewards/margins': 9.425932884216309, 'logps/rejected': -457.0020751953125, 'logps/chosen': -459.1802673339844, 'logits/rejected': -2.298748731613159, 'logits/chosen': -2.0860238075256348, 'epoch': 0.16}
{'loss': 0.0031, 'learning_rate': 9.331210191082803e-05, 'rewards/chosen': 15.096826553344727, 'rewards/rejected': 4.709814548492432, 'rewards/accuracies': 1.0, 'rewards/margins': 10.387011528015137, 'logps/rejected': -451.02685546875, 'logps/chosen': -636.78173828125, 'logits/rejected': -2.1717936992645264, 'logits/chosen': -2.0131514072418213, 'epoch': 0.16}
{'loss': 0.0, 'learning_rate': 9.299363057324841e-05, 'rewards/chosen': 6.653864860534668, 'rewards/rejected': -6.6340651512146, 'rewards/accuracies': 1.0, 'rewards/margins': 13.28792953491211, 'logps/rejected': -412.59063720703125, 'logps/chosen': -592.3363647460938, 'logits/rejected': -2.4550583362579346, 'logits/chosen': -2.0615296363830566, 'epoch': 0.16}
{'loss': 0.0086, 'learning_rate': 9.26751592356688e-05, 'rewards/chosen': 3.631244659423828, 'rewards/rejected': -2.255357265472412, 'rewards/accuracies': 1.0, 'rewards/margins': 5.88660192489624, 'logps/rejected': -362.1160888671875, 'logps/chosen': -423.75006103515625, 'logits/rejected': -2.3598544597625732, 'logits/chosen': -2.2748091220855713, 'epoch': 0.17}
{'loss': 0.0001, 'learning_rate': 9.235668789808918e-05, 'rewards/chosen': 16.103487014770508, 'rewards/rejected': 1.120680332183838, 'rewards/accuracies': 1.0, 'rewards/margins': 14.982807159423828, 'logps/rejected': -483.543212890625, 'logps/chosen': -649.7151489257812, 'logits/rejected': -2.2380475997924805, 'logits/chosen': -2.146322727203369, 'epoch': 0.17}
{'loss': 0.0006, 'learning_rate': 9.203821656050956e-05, 'rewards/chosen': 6.942976474761963, 'rewards/rejected': -2.9508578777313232, 'rewards/accuracies': 1.0, 'rewards/margins': 9.893834114074707, 'logps/rejected': -457.2585754394531, 'logps/chosen': -390.50775146484375, 'logits/rejected': -2.1680803298950195, 'logits/chosen': -2.0607855319976807, 'epoch': 0.17}
{'loss': 0.0039, 'learning_rate': 9.171974522292994e-05, 'rewards/chosen': 6.127560615539551, 'rewards/rejected': 0.3439497947692871, 'rewards/accuracies': 1.0, 'rewards/margins': 5.783610820770264, 'logps/rejected': -531.4354858398438, 'logps/chosen': -573.599365234375, 'logits/rejected': -2.354940414428711, 'logits/chosen': -2.4183404445648193, 'epoch': 0.17}
{'loss': 0.0003, 'learning_rate': 9.140127388535033e-05, 'rewards/chosen': 22.300918579101562, 'rewards/rejected': 11.639850616455078, 'rewards/accuracies': 1.0, 'rewards/margins': 10.661067962646484, 'logps/rejected': -533.4765014648438, 'logps/chosen': -533.2408447265625, 'logits/rejected': -2.128840684890747, 'logits/chosen': -2.041933536529541, 'epoch': 0.18}
{'loss': 0.0019, 'learning_rate': 9.10828025477707e-05, 'rewards/chosen': 12.095962524414062, 'rewards/rejected': 4.7656402587890625, 'rewards/accuracies': 1.0, 'rewards/margins': 7.330321788787842, 'logps/rejected': -476.2185974121094, 'logps/chosen': -574.9154052734375, 'logits/rejected': -1.5784603357315063, 'logits/chosen': -1.726388931274414, 'epoch': 0.18}
{'loss': 0.0256, 'learning_rate': 9.076433121019108e-05, 'rewards/chosen': -1.0542480945587158, 'rewards/rejected': -5.564414978027344, 'rewards/accuracies': 1.0, 'rewards/margins': 4.510167121887207, 'logps/rejected': -430.8316650390625, 'logps/chosen': -661.29248046875, 'logits/rejected': -2.0734639167785645, 'logits/chosen': -1.8012316226959229, 'epoch': 0.18}
{'loss': 0.0781, 'learning_rate': 9.044585987261147e-05, 'rewards/chosen': 11.867280960083008, 'rewards/rejected': 3.7106142044067383, 'rewards/accuracies': 1.0, 'rewards/margins': 8.156665802001953, 'logps/rejected': -587.3938598632812, 'logps/chosen': -676.8272094726562, 'logits/rejected': -2.0171477794647217, 'logits/chosen': -2.125857353210449, 'epoch': 0.19}
{'loss': 0.0, 'learning_rate': 9.012738853503185e-05, 'rewards/chosen': 6.553171157836914, 'rewards/rejected': -12.309289932250977, 'rewards/accuracies': 1.0, 'rewards/margins': 18.86246109008789, 'logps/rejected': -324.0303955078125, 'logps/chosen': -166.53079223632812, 'logits/rejected': -1.8363780975341797, 'logits/chosen': -2.2231171131134033, 'epoch': 0.19}
{'loss': 0.0146, 'learning_rate': 8.980891719745223e-05, 'rewards/chosen': -0.5789963006973267, 'rewards/rejected': -7.4599714279174805, 'rewards/accuracies': 1.0, 'rewards/margins': 6.880975246429443, 'logps/rejected': -508.4747314453125, 'logps/chosen': -480.41497802734375, 'logits/rejected': -1.9221413135528564, 'logits/chosen': -2.1281278133392334, 'epoch': 0.19}
{'loss': 0.0007, 'learning_rate': 8.949044585987262e-05, 'rewards/chosen': 13.167899131774902, 'rewards/rejected': -0.47178956866264343, 'rewards/accuracies': 1.0, 'rewards/margins': 13.639688491821289, 'logps/rejected': -602.9678955078125, 'logps/chosen': -782.071044921875, 'logits/rejected': -2.0020408630371094, 'logits/chosen': -1.7491728067398071, 'epoch': 0.19}
{'loss': 0.0, 'learning_rate': 8.9171974522293e-05, 'rewards/chosen': 5.735827922821045, 'rewards/rejected': -6.949836730957031, 'rewards/accuracies': 1.0, 'rewards/margins': 12.685664176940918, 'logps/rejected': -493.12335205078125, 'logps/chosen': -641.8917236328125, 'logits/rejected': -2.116422653198242, 'logits/chosen': -1.9829375743865967, 'epoch': 0.2}
{'loss': 0.0, 'learning_rate': 8.885350318471338e-05, 'rewards/chosen': 4.584412574768066, 'rewards/rejected': -7.9111480712890625, 'rewards/accuracies': 1.0, 'rewards/margins': 12.495560646057129, 'logps/rejected': -401.3614807128906, 'logps/chosen': -558.4058837890625, 'logits/rejected': -2.9850964546203613, 'logits/chosen': -2.1837918758392334, 'epoch': 0.2}
{'loss': 0.528, 'learning_rate': 8.853503184713377e-05, 'rewards/chosen': -5.600427150726318, 'rewards/rejected': -11.665740966796875, 'rewards/accuracies': 0.5, 'rewards/margins': 6.065313816070557, 'logps/rejected': -552.9074096679688, 'logps/chosen': -596.8792724609375, 'logits/rejected': -2.0081567764282227, 'logits/chosen': -1.8167431354522705, 'epoch': 0.2}
{'loss': 0.0004, 'learning_rate': 8.821656050955415e-05, 'rewards/chosen': 3.810685873031616, 'rewards/rejected': -6.753430366516113, 'rewards/accuracies': 1.0, 'rewards/margins': 10.564115524291992, 'logps/rejected': -487.8468017578125, 'logps/chosen': -700.0806274414062, 'logits/rejected': -2.4682936668395996, 'logits/chosen': -2.047358989715576, 'epoch': 0.21}
{'loss': 0.0, 'learning_rate': 8.789808917197452e-05, 'rewards/chosen': 0.28495633602142334, 'rewards/rejected': -10.999829292297363, 'rewards/accuracies': 1.0, 'rewards/margins': 11.284786224365234, 'logps/rejected': -409.185791015625, 'logps/chosen': -375.52545166015625, 'logits/rejected': -2.489328384399414, 'logits/chosen': -2.410038471221924, 'epoch': 0.21}
{'loss': 0.4477, 'learning_rate': 8.757961783439491e-05, 'rewards/chosen': -0.2365570068359375, 'rewards/rejected': -4.436148166656494, 'rewards/accuracies': 0.5, 'rewards/margins': 4.199591636657715, 'logps/rejected': -481.8614807128906, 'logps/chosen': -441.8030700683594, 'logits/rejected': -2.3762691020965576, 'logits/chosen': -2.470979928970337, 'epoch': 0.21}
{'loss': 0.0008, 'learning_rate': 8.726114649681529e-05, 'rewards/chosen': 7.0157880783081055, 'rewards/rejected': -5.299032688140869, 'rewards/accuracies': 1.0, 'rewards/margins': 12.314821243286133, 'logps/rejected': -651.0528564453125, 'logps/chosen': -501.09210205078125, 'logits/rejected': -2.078284502029419, 'logits/chosen': -2.3992843627929688, 'epoch': 0.21}
{'loss': 0.0002, 'learning_rate': 8.694267515923567e-05, 'rewards/chosen': 6.436542987823486, 'rewards/rejected': -3.5933258533477783, 'rewards/accuracies': 1.0, 'rewards/margins': 10.029869079589844, 'logps/rejected': -434.1207580566406, 'logps/chosen': -714.4470825195312, 'logits/rejected': -2.5401058197021484, 'logits/chosen': -1.9881186485290527, 'epoch': 0.22}
{'loss': 0.1231, 'learning_rate': 8.662420382165606e-05, 'rewards/chosen': -4.2540483474731445, 'rewards/rejected': -10.586475372314453, 'rewards/accuracies': 1.0, 'rewards/margins': 6.33242654800415, 'logps/rejected': -340.73974609375, 'logps/chosen': -339.41546630859375, 'logits/rejected': -2.0672171115875244, 'logits/chosen': -1.9914801120758057, 'epoch': 0.22}
{'loss': 0.0173, 'learning_rate': 8.630573248407644e-05, 'rewards/chosen': 7.91414213180542, 'rewards/rejected': -1.9467177391052246, 'rewards/accuracies': 1.0, 'rewards/margins': 9.860860824584961, 'logps/rejected': -543.6546630859375, 'logps/chosen': -502.35858154296875, 'logits/rejected': -2.121232032775879, 'logits/chosen': -2.203777551651001, 'epoch': 0.22}
{'loss': 0.0001, 'learning_rate': 8.598726114649682e-05, 'rewards/chosen': 2.599613904953003, 'rewards/rejected': -10.087872505187988, 'rewards/accuracies': 1.0, 'rewards/margins': 12.68748664855957, 'logps/rejected': -546.5037231445312, 'logps/chosen': -563.8788452148438, 'logits/rejected': -2.19246244430542, 'logits/chosen': -2.1131720542907715, 'epoch': 0.23}
{'loss': 0.2116, 'learning_rate': 8.566878980891721e-05, 'rewards/chosen': -2.047865629196167, 'rewards/rejected': -10.620462417602539, 'rewards/accuracies': 1.0, 'rewards/margins': 8.572596549987793, 'logps/rejected': -394.2046203613281, 'logps/chosen': -502.3536376953125, 'logits/rejected': -2.7269034385681152, 'logits/chosen': -2.26883864402771, 'epoch': 0.23}
{'loss': 0.0003, 'learning_rate': 8.535031847133759e-05, 'rewards/chosen': 6.190768241882324, 'rewards/rejected': -3.404986619949341, 'rewards/accuracies': 1.0, 'rewards/margins': 9.595754623413086, 'logps/rejected': -641.7998657226562, 'logps/chosen': -702.34228515625, 'logits/rejected': -2.201833486557007, 'logits/chosen': -2.1078579425811768, 'epoch': 0.23}
{'loss': 0.0, 'learning_rate': 8.503184713375796e-05, 'rewards/chosen': 8.708703994750977, 'rewards/rejected': -3.3944473266601562, 'rewards/accuracies': 1.0, 'rewards/margins': 12.103151321411133, 'logps/rejected': -531.4444580078125, 'logps/chosen': -559.5379638671875, 'logits/rejected': -2.1188368797302246, 'logits/chosen': -2.153688669204712, 'epoch': 0.23}
{'loss': 0.003, 'learning_rate': 8.471337579617836e-05, 'rewards/chosen': -3.5540595054626465, 'rewards/rejected': -9.3986177444458, 'rewards/accuracies': 1.0, 'rewards/margins': 5.844558238983154, 'logps/rejected': -287.1111755371094, 'logps/chosen': -274.91558837890625, 'logits/rejected': -1.7256306409835815, 'logits/chosen': -1.641602873802185, 'epoch': 0.24}
{'loss': 0.0, 'learning_rate': 8.439490445859873e-05, 'rewards/chosen': 8.544201850891113, 'rewards/rejected': -12.9879150390625, 'rewards/accuracies': 1.0, 'rewards/margins': 21.532115936279297, 'logps/rejected': -756.504150390625, 'logps/chosen': -979.8079833984375, 'logits/rejected': -2.136788845062256, 'logits/chosen': -1.8552265167236328, 'epoch': 0.24}
{'loss': 0.0, 'learning_rate': 8.407643312101911e-05, 'rewards/chosen': -3.2421600818634033, 'rewards/rejected': -17.261249542236328, 'rewards/accuracies': 1.0, 'rewards/margins': 14.019089698791504, 'logps/rejected': -578.3624877929688, 'logps/chosen': -619.171630859375, 'logits/rejected': -2.331512451171875, 'logits/chosen': -2.270779609680176, 'epoch': 0.24}
{'loss': 0.1948, 'learning_rate': 8.37579617834395e-05, 'rewards/chosen': -6.119821071624756, 'rewards/rejected': -13.213452339172363, 'rewards/accuracies': 1.0, 'rewards/margins': 7.093631267547607, 'logps/rejected': -445.884521484375, 'logps/chosen': -588.0732421875, 'logits/rejected': -2.5010712146759033, 'logits/chosen': -2.0529234409332275, 'epoch': 0.25}
{'loss': 0.0, 'learning_rate': 8.343949044585988e-05, 'rewards/chosen': 7.285365104675293, 'rewards/rejected': -5.632093906402588, 'rewards/accuracies': 1.0, 'rewards/margins': 12.917459487915039, 'logps/rejected': -579.6959228515625, 'logps/chosen': -611.3963623046875, 'logits/rejected': -2.2976810932159424, 'logits/chosen': -2.06174635887146, 'epoch': 0.25}
{'loss': 0.0001, 'learning_rate': 8.312101910828026e-05, 'rewards/chosen': 12.761240005493164, 'rewards/rejected': -1.6446411609649658, 'rewards/accuracies': 1.0, 'rewards/margins': 14.40588092803955, 'logps/rejected': -792.9464111328125, 'logps/chosen': -1150.3875732421875, 'logits/rejected': -2.0240719318389893, 'logits/chosen': -1.4666099548339844, 'epoch': 0.25}
{'loss': 0.0202, 'learning_rate': 8.280254777070065e-05, 'rewards/chosen': -0.9631209969520569, 'rewards/rejected': -7.378716945648193, 'rewards/accuracies': 1.0, 'rewards/margins': 6.415596008300781, 'logps/rejected': -310.78717041015625, 'logps/chosen': -272.5062255859375, 'logits/rejected': -2.5301146507263184, 'logits/chosen': -2.464554786682129, 'epoch': 0.26}
{'loss': 0.0031, 'learning_rate': 8.248407643312103e-05, 'rewards/chosen': -0.17886817455291748, 'rewards/rejected': -8.789806365966797, 'rewards/accuracies': 1.0, 'rewards/margins': 8.61093807220459, 'logps/rejected': -344.0855712890625, 'logps/chosen': -468.7261962890625, 'logits/rejected': -2.5983314514160156, 'logits/chosen': -2.0826327800750732, 'epoch': 0.26}
{'loss': 0.0037, 'learning_rate': 8.21656050955414e-05, 'rewards/chosen': 3.2668991088867188, 'rewards/rejected': -2.663745403289795, 'rewards/accuracies': 1.0, 'rewards/margins': 5.930644512176514, 'logps/rejected': -669.637451171875, 'logps/chosen': -656.3309936523438, 'logits/rejected': -1.9372867345809937, 'logits/chosen': -1.9726930856704712, 'epoch': 0.26}
{'loss': 0.0018, 'learning_rate': 8.18471337579618e-05, 'rewards/chosen': 4.168943405151367, 'rewards/rejected': -7.698178291320801, 'rewards/accuracies': 1.0, 'rewards/margins': 11.867120742797852, 'logps/rejected': -382.6067810058594, 'logps/chosen': -430.3105773925781, 'logits/rejected': -2.2398533821105957, 'logits/chosen': -2.079040288925171, 'epoch': 0.26}
{'loss': 0.0, 'learning_rate': 8.152866242038217e-05, 'rewards/chosen': 3.771867513656616, 'rewards/rejected': -8.319348335266113, 'rewards/accuracies': 1.0, 'rewards/margins': 12.091215133666992, 'logps/rejected': -521.0684814453125, 'logps/chosen': -621.0313110351562, 'logits/rejected': -2.348339557647705, 'logits/chosen': -2.127917528152466, 'epoch': 0.27}
{'loss': 0.0, 'learning_rate': 8.121019108280255e-05, 'rewards/chosen': 8.695867538452148, 'rewards/rejected': -5.914694309234619, 'rewards/accuracies': 1.0, 'rewards/margins': 14.610562324523926, 'logps/rejected': -334.9594421386719, 'logps/chosen': -272.35382080078125, 'logits/rejected': -2.170018196105957, 'logits/chosen': -1.22844398021698, 'epoch': 0.27}
{'loss': 0.0015, 'learning_rate': 8.089171974522294e-05, 'rewards/chosen': 4.891061782836914, 'rewards/rejected': -7.2648420333862305, 'rewards/accuracies': 1.0, 'rewards/margins': 12.155903816223145, 'logps/rejected': -568.0234375, 'logps/chosen': -620.33935546875, 'logits/rejected': -2.222231149673462, 'logits/chosen': -1.991237998008728, 'epoch': 0.27}
{'loss': 0.0092, 'learning_rate': 8.057324840764332e-05, 'rewards/chosen': -0.7307358384132385, 'rewards/rejected': -7.670431613922119, 'rewards/accuracies': 1.0, 'rewards/margins': 6.939695835113525, 'logps/rejected': -410.9543151855469, 'logps/chosen': -307.369873046875, 'logits/rejected': -1.5965020656585693, 'logits/chosen': -2.182215929031372, 'epoch': 0.28}
{'loss': 0.0, 'learning_rate': 8.02547770700637e-05, 'rewards/chosen': 8.617658615112305, 'rewards/rejected': -1.6654160022735596, 'rewards/accuracies': 1.0, 'rewards/margins': 10.283075332641602, 'logps/rejected': -417.6541748046875, 'logps/chosen': -344.82342529296875, 'logits/rejected': -1.442132830619812, 'logits/chosen': -1.3633073568344116, 'epoch': 0.28}
{'loss': 0.0018, 'learning_rate': 7.993630573248409e-05, 'rewards/chosen': -2.0329437255859375, 'rewards/rejected': -8.888056755065918, 'rewards/accuracies': 1.0, 'rewards/margins': 6.8551130294799805, 'logps/rejected': -525.4430541992188, 'logps/chosen': -723.8294677734375, 'logits/rejected': -2.1427061557769775, 'logits/chosen': -1.9407159090042114, 'epoch': 0.28}
{'loss': 0.0001, 'learning_rate': 7.961783439490447e-05, 'rewards/chosen': 7.835193634033203, 'rewards/rejected': -4.788774490356445, 'rewards/accuracies': 1.0, 'rewards/margins': 12.623968124389648, 'logps/rejected': -533.3877563476562, 'logps/chosen': -486.5855712890625, 'logits/rejected': -2.3302197456359863, 'logits/chosen': -2.541356325149536, 'epoch': 0.28}
{'loss': 0.0, 'learning_rate': 7.929936305732485e-05, 'rewards/chosen': 6.835894584655762, 'rewards/rejected': -4.862768650054932, 'rewards/accuracies': 1.0, 'rewards/margins': 11.698663711547852, 'logps/rejected': -327.190185546875, 'logps/chosen': -230.20355224609375, 'logits/rejected': -1.2661499977111816, 'logits/chosen': -1.382659673690796, 'epoch': 0.29}
{'loss': 0.0009, 'learning_rate': 7.898089171974524e-05, 'rewards/chosen': 1.7465431690216064, 'rewards/rejected': -8.289178848266602, 'rewards/accuracies': 1.0, 'rewards/margins': 10.035721778869629, 'logps/rejected': -313.01678466796875, 'logps/chosen': -207.72207641601562, 'logits/rejected': -2.3031275272369385, 'logits/chosen': -2.726222515106201, 'epoch': 0.29}
{'loss': 0.0, 'learning_rate': 7.866242038216561e-05, 'rewards/chosen': 4.516917705535889, 'rewards/rejected': -12.495894432067871, 'rewards/accuracies': 1.0, 'rewards/margins': 17.0128116607666, 'logps/rejected': -598.1464233398438, 'logps/chosen': -206.33082580566406, 'logits/rejected': -2.321200132369995, 'logits/chosen': -2.9600765705108643, 'epoch': 0.29}
{'loss': 0.0001, 'learning_rate': 7.834394904458599e-05, 'rewards/chosen': 1.2557830810546875, 'rewards/rejected': -8.264877319335938, 'rewards/accuracies': 1.0, 'rewards/margins': 9.520660400390625, 'logps/rejected': -379.0237731933594, 'logps/chosen': -240.19216918945312, 'logits/rejected': -1.9889600276947021, 'logits/chosen': -2.5409698486328125, 'epoch': 0.3}
{'loss': 0.0003, 'learning_rate': 7.802547770700638e-05, 'rewards/chosen': 0.5373611450195312, 'rewards/rejected': -8.880823135375977, 'rewards/accuracies': 1.0, 'rewards/margins': 9.418184280395508, 'logps/rejected': -327.1832275390625, 'logps/chosen': -234.3138885498047, 'logits/rejected': -2.040424108505249, 'logits/chosen': -2.3902175426483154, 'epoch': 0.3}
{'loss': 0.0, 'learning_rate': 7.770700636942676e-05, 'rewards/chosen': 13.523938179016113, 'rewards/rejected': -5.272146701812744, 'rewards/accuracies': 1.0, 'rewards/margins': 18.796085357666016, 'logps/rejected': -553.846435546875, 'logps/chosen': -711.7606201171875, 'logits/rejected': -2.0233030319213867, 'logits/chosen': -1.7807636260986328, 'epoch': 0.3}
{'loss': 0.0001, 'learning_rate': 7.738853503184714e-05, 'rewards/chosen': 4.248553276062012, 'rewards/rejected': -7.276942253112793, 'rewards/accuracies': 1.0, 'rewards/margins': 11.525495529174805, 'logps/rejected': -310.0194091796875, 'logps/chosen': -225.13946533203125, 'logits/rejected': -2.6456780433654785, 'logits/chosen': -2.8604886531829834, 'epoch': 0.3}
{'loss': 0.0057, 'learning_rate': 7.707006369426753e-05, 'rewards/chosen': 2.013974189758301, 'rewards/rejected': -3.6901824474334717, 'rewards/accuracies': 1.0, 'rewards/margins': 5.704156875610352, 'logps/rejected': -611.52685546875, 'logps/chosen': -615.7352294921875, 'logits/rejected': -2.423436164855957, 'logits/chosen': -2.4863440990448, 'epoch': 0.31}
{'loss': 0.0, 'learning_rate': 7.675159235668791e-05, 'rewards/chosen': 9.995869636535645, 'rewards/rejected': -1.9442719221115112, 'rewards/accuracies': 1.0, 'rewards/margins': 11.940141677856445, 'logps/rejected': -564.6302490234375, 'logps/chosen': -534.4788208007812, 'logits/rejected': -2.1691925525665283, 'logits/chosen': -2.3112056255340576, 'epoch': 0.31}
{'loss': 0.0, 'learning_rate': 7.643312101910829e-05, 'rewards/chosen': 15.142499923706055, 'rewards/rejected': -4.615649223327637, 'rewards/accuracies': 1.0, 'rewards/margins': 19.758148193359375, 'logps/rejected': -557.781494140625, 'logps/chosen': -828.5750122070312, 'logits/rejected': -2.400648355484009, 'logits/chosen': -1.7465407848358154, 'epoch': 0.31}
{'loss': 0.0, 'learning_rate': 7.611464968152868e-05, 'rewards/chosen': 6.6635284423828125, 'rewards/rejected': -10.618587493896484, 'rewards/accuracies': 1.0, 'rewards/margins': 17.282115936279297, 'logps/rejected': -424.56085205078125, 'logps/chosen': -525.23974609375, 'logits/rejected': -2.4785666465759277, 'logits/chosen': -1.940072774887085, 'epoch': 0.32}
{'loss': 0.0002, 'learning_rate': 7.579617834394906e-05, 'rewards/chosen': 5.505762577056885, 'rewards/rejected': -3.1319046020507812, 'rewards/accuracies': 1.0, 'rewards/margins': 8.637666702270508, 'logps/rejected': -370.94403076171875, 'logps/chosen': -247.8173828125, 'logits/rejected': -2.751767158508301, 'logits/chosen': -2.9544081687927246, 'epoch': 0.32}
{'loss': 0.0001, 'learning_rate': 7.547770700636943e-05, 'rewards/chosen': 14.717461585998535, 'rewards/rejected': 0.9699110984802246, 'rewards/accuracies': 1.0, 'rewards/margins': 13.747550964355469, 'logps/rejected': -504.3009033203125, 'logps/chosen': -570.2003784179688, 'logits/rejected': -2.464700222015381, 'logits/chosen': -2.142775535583496, 'epoch': 0.32}
{'loss': 0.0009, 'learning_rate': 7.515923566878981e-05, 'rewards/chosen': -4.388559341430664, 'rewards/rejected': -11.886332511901855, 'rewards/accuracies': 1.0, 'rewards/margins': 7.497773170471191, 'logps/rejected': -406.11334228515625, 'logps/chosen': -499.8855895996094, 'logits/rejected': -2.8632426261901855, 'logits/chosen': -2.5373971462249756, 'epoch': 0.32}
{'loss': 0.0, 'learning_rate': 7.484076433121019e-05, 'rewards/chosen': 6.876800537109375, 'rewards/rejected': -13.536703109741211, 'rewards/accuracies': 1.0, 'rewards/margins': 20.41350555419922, 'logps/rejected': -775.6170654296875, 'logps/chosen': -804.2319946289062, 'logits/rejected': -1.8631192445755005, 'logits/chosen': -1.8307678699493408, 'epoch': 0.33}
{'loss': 0.0001, 'learning_rate': 7.452229299363057e-05, 'rewards/chosen': 1.1546753644943237, 'rewards/rejected': -8.128044128417969, 'rewards/accuracies': 1.0, 'rewards/margins': 9.282719612121582, 'logps/rejected': -315.53045654296875, 'logps/chosen': -242.6407470703125, 'logits/rejected': -2.1546833515167236, 'logits/chosen': -2.229414701461792, 'epoch': 0.33}
{'loss': 0.0, 'learning_rate': 7.420382165605096e-05, 'rewards/chosen': -2.3474152088165283, 'rewards/rejected': -16.085920333862305, 'rewards/accuracies': 1.0, 'rewards/margins': 13.738504409790039, 'logps/rejected': -481.17169189453125, 'logps/chosen': -440.5991516113281, 'logits/rejected': -2.14221453666687, 'logits/chosen': -2.253493309020996, 'epoch': 0.33}
{'loss': 0.0001, 'learning_rate': 7.388535031847134e-05, 'rewards/chosen': 4.643071174621582, 'rewards/rejected': -5.687909126281738, 'rewards/accuracies': 1.0, 'rewards/margins': 10.33098030090332, 'logps/rejected': -519.8165893554688, 'logps/chosen': -545.7567749023438, 'logits/rejected': -2.344327688217163, 'logits/chosen': -2.300497531890869, 'epoch': 0.34}
{'loss': 0.0049, 'learning_rate': 7.356687898089171e-05, 'rewards/chosen': 9.287142753601074, 'rewards/rejected': 0.3394317626953125, 'rewards/accuracies': 1.0, 'rewards/margins': 8.947710990905762, 'logps/rejected': -632.605712890625, 'logps/chosen': -689.6285400390625, 'logits/rejected': -2.044267177581787, 'logits/chosen': -2.1511316299438477, 'epoch': 0.34}
{'loss': 0.0006, 'learning_rate': 7.32484076433121e-05, 'rewards/chosen': 6.710669040679932, 'rewards/rejected': -4.666891574859619, 'rewards/accuracies': 1.0, 'rewards/margins': 11.377561569213867, 'logps/rejected': -456.9814147949219, 'logps/chosen': -792.143310546875, 'logits/rejected': -2.622736692428589, 'logits/chosen': -1.8842990398406982, 'epoch': 0.34}
{'loss': 0.044, 'learning_rate': 7.292993630573248e-05, 'rewards/chosen': -2.6809020042419434, 'rewards/rejected': -9.3916015625, 'rewards/accuracies': 1.0, 'rewards/margins': 6.710699558258057, 'logps/rejected': -834.666015625, 'logps/chosen': -822.3090209960938, 'logits/rejected': -1.9676154851913452, 'logits/chosen': -2.02280855178833, 'epoch': 0.34}
{'loss': 0.0002, 'learning_rate': 7.261146496815286e-05, 'rewards/chosen': 4.605342864990234, 'rewards/rejected': -5.839235305786133, 'rewards/accuracies': 1.0, 'rewards/margins': 10.444578170776367, 'logps/rejected': -344.204833984375, 'logps/chosen': -300.2590637207031, 'logits/rejected': -2.2010810375213623, 'logits/chosen': -2.3597748279571533, 'epoch': 0.35}
{'loss': 0.0033, 'learning_rate': 7.229299363057325e-05, 'rewards/chosen': 7.049083232879639, 'rewards/rejected': 0.7859101295471191, 'rewards/accuracies': 1.0, 'rewards/margins': 6.2631731033325195, 'logps/rejected': -373.1408996582031, 'logps/chosen': -387.8216552734375, 'logits/rejected': -2.7773587703704834, 'logits/chosen': -2.546373128890991, 'epoch': 0.35}
{'loss': 0.0001, 'learning_rate': 7.197452229299363e-05, 'rewards/chosen': 3.1525070667266846, 'rewards/rejected': -6.309135437011719, 'rewards/accuracies': 1.0, 'rewards/margins': 9.46164321899414, 'logps/rejected': -439.84136962890625, 'logps/chosen': -428.28741455078125, 'logits/rejected': -2.3518757820129395, 'logits/chosen': -2.1338038444519043, 'epoch': 0.35}
{'loss': 0.0, 'learning_rate': 7.165605095541401e-05, 'rewards/chosen': 5.651843070983887, 'rewards/rejected': -8.583584785461426, 'rewards/accuracies': 1.0, 'rewards/margins': 14.235427856445312, 'logps/rejected': -597.7108154296875, 'logps/chosen': -584.7315673828125, 'logits/rejected': -2.0361692905426025, 'logits/chosen': -2.281634569168091, 'epoch': 0.36}
{'loss': 0.264, 'learning_rate': 7.13375796178344e-05, 'rewards/chosen': -0.7381348609924316, 'rewards/rejected': -5.893651008605957, 'rewards/accuracies': 1.0, 'rewards/margins': 5.155515670776367, 'logps/rejected': -297.3115234375, 'logps/chosen': -240.31884765625, 'logits/rejected': -1.9514199495315552, 'logits/chosen': -2.0606911182403564, 'epoch': 0.36}
{'loss': 0.0001, 'learning_rate': 7.101910828025478e-05, 'rewards/chosen': 14.760913848876953, 'rewards/rejected': 2.579315185546875, 'rewards/accuracies': 1.0, 'rewards/margins': 12.181598663330078, 'logps/rejected': -610.3318481445312, 'logps/chosen': -763.640869140625, 'logits/rejected': -2.1484599113464355, 'logits/chosen': -1.7372498512268066, 'epoch': 0.36}
{'loss': 0.0002, 'learning_rate': 7.070063694267515e-05, 'rewards/chosen': 5.218622207641602, 'rewards/rejected': -5.029797554016113, 'rewards/accuracies': 1.0, 'rewards/margins': 10.248419761657715, 'logps/rejected': -678.2979736328125, 'logps/chosen': -687.0637817382812, 'logits/rejected': -2.0594353675842285, 'logits/chosen': -2.3021631240844727, 'epoch': 0.36}
{'loss': 0.0, 'learning_rate': 7.038216560509555e-05, 'rewards/chosen': 6.223532199859619, 'rewards/rejected': -11.021018981933594, 'rewards/accuracies': 1.0, 'rewards/margins': 17.244550704956055, 'logps/rejected': -562.710205078125, 'logps/chosen': -420.1396789550781, 'logits/rejected': -1.9861546754837036, 'logits/chosen': -2.372985363006592, 'epoch': 0.37}
{'loss': 2.504, 'learning_rate': 7.006369426751592e-05, 'rewards/chosen': 13.714824676513672, 'rewards/rejected': 8.127889633178711, 'rewards/accuracies': 0.5, 'rewards/margins': 5.586935043334961, 'logps/rejected': -869.4710693359375, 'logps/chosen': -531.1017456054688, 'logits/rejected': -1.6943492889404297, 'logits/chosen': -2.5908071994781494, 'epoch': 0.37}
{'loss': 0.0, 'learning_rate': 6.97452229299363e-05, 'rewards/chosen': 1.4372588396072388, 'rewards/rejected': -10.299063682556152, 'rewards/accuracies': 1.0, 'rewards/margins': 11.736322402954102, 'logps/rejected': -641.1156005859375, 'logps/chosen': -492.5024108886719, 'logits/rejected': -1.9814062118530273, 'logits/chosen': -2.302802085876465, 'epoch': 0.37}
{'loss': 0.5027, 'learning_rate': 6.942675159235669e-05, 'rewards/chosen': 10.608787536621094, 'rewards/rejected': 3.0217623710632324, 'rewards/accuracies': 0.5, 'rewards/margins': 7.587025165557861, 'logps/rejected': -411.4073791503906, 'logps/chosen': -223.78712463378906, 'logits/rejected': -1.973405122756958, 'logits/chosen': -3.2445526123046875, 'epoch': 0.38}
{'loss': 0.0, 'learning_rate': 6.910828025477707e-05, 'rewards/chosen': 4.899528503417969, 'rewards/rejected': -10.966453552246094, 'rewards/accuracies': 1.0, 'rewards/margins': 15.865982055664062, 'logps/rejected': -380.28955078125, 'logps/chosen': -494.12969970703125, 'logits/rejected': -2.6983962059020996, 'logits/chosen': -2.225029706954956, 'epoch': 0.38}
{'loss': 0.0, 'learning_rate': 6.878980891719745e-05, 'rewards/chosen': 11.839324951171875, 'rewards/rejected': -2.685455322265625, 'rewards/accuracies': 1.0, 'rewards/margins': 14.524781227111816, 'logps/rejected': -524.6045532226562, 'logps/chosen': -758.6067504882812, 'logits/rejected': -2.4302048683166504, 'logits/chosen': -1.9575306177139282, 'epoch': 0.38}
{'loss': 0.0, 'learning_rate': 6.847133757961784e-05, 'rewards/chosen': 11.04971694946289, 'rewards/rejected': -3.556483507156372, 'rewards/accuracies': 1.0, 'rewards/margins': 14.606200218200684, 'logps/rejected': -514.3148193359375, 'logps/chosen': -706.5028076171875, 'logits/rejected': -2.2564215660095215, 'logits/chosen': -1.7194455862045288, 'epoch': 0.38}
{'loss': 0.0, 'learning_rate': 6.815286624203822e-05, 'rewards/chosen': 4.524472236633301, 'rewards/rejected': -6.69683837890625, 'rewards/accuracies': 1.0, 'rewards/margins': 11.221311569213867, 'logps/rejected': -765.4683837890625, 'logps/chosen': -817.0052490234375, 'logits/rejected': -1.5738071203231812, 'logits/chosen': -1.8817206621170044, 'epoch': 0.39}
{'loss': 0.0, 'learning_rate': 6.78343949044586e-05, 'rewards/chosen': 11.355279922485352, 'rewards/rejected': -7.511019706726074, 'rewards/accuracies': 1.0, 'rewards/margins': 18.86629867553711, 'logps/rejected': -308.3601989746094, 'logps/chosen': -318.32220458984375, 'logits/rejected': -1.844038963317871, 'logits/chosen': -1.327315092086792, 'epoch': 0.39}
{'loss': 0.0006, 'learning_rate': 6.751592356687899e-05, 'rewards/chosen': 4.843526363372803, 'rewards/rejected': -4.231042385101318, 'rewards/accuracies': 1.0, 'rewards/margins': 9.074568748474121, 'logps/rejected': -313.9354248046875, 'logps/chosen': -491.0647277832031, 'logits/rejected': -2.931710720062256, 'logits/chosen': -2.0411179065704346, 'epoch': 0.39}
{'loss': 0.0, 'learning_rate': 6.719745222929936e-05, 'rewards/chosen': 7.366534233093262, 'rewards/rejected': -9.568004608154297, 'rewards/accuracies': 1.0, 'rewards/margins': 16.934539794921875, 'logps/rejected': -340.0550537109375, 'logps/chosen': -382.52215576171875, 'logits/rejected': -2.957770586013794, 'logits/chosen': -2.667160987854004, 'epoch': 0.4}
{'loss': 0.0, 'learning_rate': 6.687898089171974e-05, 'rewards/chosen': 6.2254180908203125, 'rewards/rejected': -8.35632610321045, 'rewards/accuracies': 1.0, 'rewards/margins': 14.581745147705078, 'logps/rejected': -309.5007629394531, 'logps/chosen': -237.05831909179688, 'logits/rejected': -2.4495480060577393, 'logits/chosen': -1.7007445096969604, 'epoch': 0.4}
{'loss': 0.0, 'learning_rate': 6.656050955414013e-05, 'rewards/chosen': 6.690666198730469, 'rewards/rejected': -6.582212924957275, 'rewards/accuracies': 1.0, 'rewards/margins': 13.272879600524902, 'logps/rejected': -392.0721435546875, 'logps/chosen': -459.28082275390625, 'logits/rejected': -3.1004161834716797, 'logits/chosen': -2.4703783988952637, 'epoch': 0.4}
{'loss': 0.0002, 'learning_rate': 6.624203821656051e-05, 'rewards/chosen': 1.9429658651351929, 'rewards/rejected': -8.731213569641113, 'rewards/accuracies': 1.0, 'rewards/margins': 10.674179077148438, 'logps/rejected': -588.6871337890625, 'logps/chosen': -630.3203125, 'logits/rejected': -1.720841646194458, 'logits/chosen': -1.848907232284546, 'epoch': 0.4}
{'loss': 0.0086, 'learning_rate': 6.592356687898089e-05, 'rewards/chosen': 11.39958381652832, 'rewards/rejected': 0.19260549545288086, 'rewards/accuracies': 1.0, 'rewards/margins': 11.206977844238281, 'logps/rejected': -501.6364440917969, 'logps/chosen': -416.691650390625, 'logits/rejected': -2.5978596210479736, 'logits/chosen': -2.5201587677001953, 'epoch': 0.41}
{'loss': 0.0008, 'learning_rate': 6.560509554140127e-05, 'rewards/chosen': 2.2121033668518066, 'rewards/rejected': -4.89544677734375, 'rewards/accuracies': 1.0, 'rewards/margins': 7.107550144195557, 'logps/rejected': -698.4544677734375, 'logps/chosen': -887.8789672851562, 'logits/rejected': -2.024108648300171, 'logits/chosen': -1.648010492324829, 'epoch': 0.41}
{'loss': 0.0, 'learning_rate': 6.528662420382166e-05, 'rewards/chosen': 4.553328037261963, 'rewards/rejected': -10.840421676635742, 'rewards/accuracies': 1.0, 'rewards/margins': 15.39375114440918, 'logps/rejected': -559.9667358398438, 'logps/chosen': -621.0292358398438, 'logits/rejected': -2.3250973224639893, 'logits/chosen': -2.215346574783325, 'epoch': 0.41}
{'loss': 0.0165, 'learning_rate': 6.496815286624204e-05, 'rewards/chosen': 1.7990188598632812, 'rewards/rejected': -4.829507827758789, 'rewards/accuracies': 1.0, 'rewards/margins': 6.62852668762207, 'logps/rejected': -360.9200744628906, 'logps/chosen': -335.38482666015625, 'logits/rejected': -1.9434758424758911, 'logits/chosen': -1.8661965131759644, 'epoch': 0.42}
{'loss': 0.0, 'learning_rate': 6.464968152866241e-05, 'rewards/chosen': 4.0138421058654785, 'rewards/rejected': -16.281967163085938, 'rewards/accuracies': 1.0, 'rewards/margins': 20.29581069946289, 'logps/rejected': -440.6946716308594, 'logps/chosen': -192.986572265625, 'logits/rejected': -2.7876250743865967, 'logits/chosen': -3.2149627208709717, 'epoch': 0.42}
{'loss': 0.0, 'learning_rate': 6.43312101910828e-05, 'rewards/chosen': 3.59165358543396, 'rewards/rejected': -14.925336837768555, 'rewards/accuracies': 1.0, 'rewards/margins': 18.516990661621094, 'logps/rejected': -699.8783569335938, 'logps/chosen': -620.58349609375, 'logits/rejected': -2.028172492980957, 'logits/chosen': -2.1151368618011475, 'epoch': 0.42}
{'loss': 0.0, 'learning_rate': 6.401273885350318e-05, 'rewards/chosen': 0.22355499863624573, 'rewards/rejected': -16.23183250427246, 'rewards/accuracies': 1.0, 'rewards/margins': 16.455387115478516, 'logps/rejected': -454.0683288574219, 'logps/chosen': -277.88946533203125, 'logits/rejected': -2.1338248252868652, 'logits/chosen': -2.5831775665283203, 'epoch': 0.42}
{'loss': 0.0, 'learning_rate': 6.369426751592356e-05, 'rewards/chosen': 11.403701782226562, 'rewards/rejected': -5.776883125305176, 'rewards/accuracies': 1.0, 'rewards/margins': 17.180583953857422, 'logps/rejected': -456.5188293457031, 'logps/chosen': -469.9004821777344, 'logits/rejected': -2.6314449310302734, 'logits/chosen': -2.316568374633789, 'epoch': 0.43}
{'loss': 0.0581, 'learning_rate': 6.337579617834395e-05, 'rewards/chosen': 3.732957363128662, 'rewards/rejected': -5.013148784637451, 'rewards/accuracies': 1.0, 'rewards/margins': 8.746106147766113, 'logps/rejected': -561.7564697265625, 'logps/chosen': -582.67041015625, 'logits/rejected': -2.456730842590332, 'logits/chosen': -2.449205160140991, 'epoch': 0.43}
{'loss': 0.0, 'learning_rate': 6.305732484076433e-05, 'rewards/chosen': 3.3902480602264404, 'rewards/rejected': -7.9807586669921875, 'rewards/accuracies': 1.0, 'rewards/margins': 11.371006965637207, 'logps/rejected': -521.6826171875, 'logps/chosen': -455.0975341796875, 'logits/rejected': -2.1995739936828613, 'logits/chosen': -2.546550989151001, 'epoch': 0.43}
{'loss': 0.0, 'learning_rate': 6.273885350318471e-05, 'rewards/chosen': -0.40141141414642334, 'rewards/rejected': -18.343461990356445, 'rewards/accuracies': 1.0, 'rewards/margins': 17.94205093383789, 'logps/rejected': -482.68463134765625, 'logps/chosen': -273.01409912109375, 'logits/rejected': -2.092189073562622, 'logits/chosen': -2.8783841133117676, 'epoch': 0.44}
{'loss': 0.0954, 'learning_rate': 6.24203821656051e-05, 'rewards/chosen': 1.710342526435852, 'rewards/rejected': -7.232794284820557, 'rewards/accuracies': 1.0, 'rewards/margins': 8.943137168884277, 'logps/rejected': -623.4529418945312, 'logps/chosen': -697.1466064453125, 'logits/rejected': -1.69803786277771, 'logits/chosen': -1.7117992639541626, 'epoch': 0.44}
{'loss': 0.0002, 'learning_rate': 6.210191082802548e-05, 'rewards/chosen': 2.972622871398926, 'rewards/rejected': -15.696576118469238, 'rewards/accuracies': 1.0, 'rewards/margins': 18.669198989868164, 'logps/rejected': -854.2157592773438, 'logps/chosen': -508.7737731933594, 'logits/rejected': -1.6063309907913208, 'logits/chosen': -2.3155462741851807, 'epoch': 0.44}
{'loss': 0.0, 'learning_rate': 6.178343949044585e-05, 'rewards/chosen': 1.0704132318496704, 'rewards/rejected': -18.702499389648438, 'rewards/accuracies': 1.0, 'rewards/margins': 19.772912979125977, 'logps/rejected': -365.8999938964844, 'logps/chosen': -213.79586791992188, 'logits/rejected': -3.0242433547973633, 'logits/chosen': -3.299490213394165, 'epoch': 0.44}
{'loss': 0.0, 'learning_rate': 6.146496815286625e-05, 'rewards/chosen': -2.3960084915161133, 'rewards/rejected': -17.971952438354492, 'rewards/accuracies': 1.0, 'rewards/margins': 15.575944900512695, 'logps/rejected': -573.156982421875, 'logps/chosen': -474.8975830078125, 'logits/rejected': -2.096188545227051, 'logits/chosen': -2.164766550064087, 'epoch': 0.45}
{'loss': 0.0, 'learning_rate': 6.114649681528662e-05, 'rewards/chosen': 3.426422119140625, 'rewards/rejected': -9.717076301574707, 'rewards/accuracies': 1.0, 'rewards/margins': 13.143497467041016, 'logps/rejected': -496.4832763671875, 'logps/chosen': -321.86077880859375, 'logits/rejected': -1.8379472494125366, 'logits/chosen': -2.716866970062256, 'epoch': 0.45}
{'loss': 0.0, 'learning_rate': 6.082802547770701e-05, 'rewards/chosen': 6.8262038230896, 'rewards/rejected': -14.859349250793457, 'rewards/accuracies': 1.0, 'rewards/margins': 21.6855525970459, 'logps/rejected': -750.718505859375, 'logps/chosen': -682.3629760742188, 'logits/rejected': -2.0546839237213135, 'logits/chosen': -2.2049689292907715, 'epoch': 0.45}
{'loss': 0.0, 'learning_rate': 6.0509554140127386e-05, 'rewards/chosen': 2.7787294387817383, 'rewards/rejected': -8.083473205566406, 'rewards/accuracies': 1.0, 'rewards/margins': 10.862201690673828, 'logps/rejected': -535.084716796875, 'logps/chosen': -583.2127075195312, 'logits/rejected': -2.646955966949463, 'logits/chosen': -2.201284408569336, 'epoch': 0.46}
{'loss': 0.0, 'learning_rate': 6.019108280254777e-05, 'rewards/chosen': 3.876391649246216, 'rewards/rejected': -11.440515518188477, 'rewards/accuracies': 1.0, 'rewards/margins': 15.31690788269043, 'logps/rejected': -783.4051513671875, 'logps/chosen': -781.236083984375, 'logits/rejected': -1.8889861106872559, 'logits/chosen': -2.099642038345337, 'epoch': 0.46}
{'loss': 0.0036, 'learning_rate': 5.9872611464968155e-05, 'rewards/chosen': -1.8177169561386108, 'rewards/rejected': -15.177860260009766, 'rewards/accuracies': 1.0, 'rewards/margins': 13.360142707824707, 'logps/rejected': -438.4035949707031, 'logps/chosen': -375.92718505859375, 'logits/rejected': -2.1358468532562256, 'logits/chosen': -2.046236515045166, 'epoch': 0.46}
{'loss': 0.0, 'learning_rate': 5.955414012738853e-05, 'rewards/chosen': 4.2187347412109375, 'rewards/rejected': -19.058143615722656, 'rewards/accuracies': 1.0, 'rewards/margins': 23.276878356933594, 'logps/rejected': -732.9564208984375, 'logps/chosen': -765.3126220703125, 'logits/rejected': -1.8684414625167847, 'logits/chosen': -1.9312584400177002, 'epoch': 0.46}
{'loss': 0.0, 'learning_rate': 5.923566878980892e-05, 'rewards/chosen': 3.24969482421875, 'rewards/rejected': -11.781109809875488, 'rewards/accuracies': 1.0, 'rewards/margins': 15.030803680419922, 'logps/rejected': -542.1860961914062, 'logps/chosen': -393.7530517578125, 'logits/rejected': -2.25203537940979, 'logits/chosen': -2.4565236568450928, 'epoch': 0.47}
{'loss': 0.0, 'learning_rate': 5.89171974522293e-05, 'rewards/chosen': 1.3560997247695923, 'rewards/rejected': -25.83590316772461, 'rewards/accuracies': 1.0, 'rewards/margins': 27.192005157470703, 'logps/rejected': -737.8590087890625, 'logps/chosen': -595.6890258789062, 'logits/rejected': -2.0568363666534424, 'logits/chosen': -2.0471320152282715, 'epoch': 0.47}
{'loss': 0.0, 'learning_rate': 5.859872611464968e-05, 'rewards/chosen': 2.851179599761963, 'rewards/rejected': -14.86967945098877, 'rewards/accuracies': 1.0, 'rewards/margins': 17.72085952758789, 'logps/rejected': -397.44677734375, 'logps/chosen': -260.23822021484375, 'logits/rejected': -2.041017532348633, 'logits/chosen': -2.2028915882110596, 'epoch': 0.47}
{'loss': 0.0, 'learning_rate': 5.8280254777070065e-05, 'rewards/chosen': 0.00277554988861084, 'rewards/rejected': -16.55206298828125, 'rewards/accuracies': 1.0, 'rewards/margins': 16.554840087890625, 'logps/rejected': -666.5206298828125, 'logps/chosen': -325.22222900390625, 'logits/rejected': -2.114144802093506, 'logits/chosen': -2.7643439769744873, 'epoch': 0.48}
{'loss': 0.0, 'learning_rate': 5.796178343949045e-05, 'rewards/chosen': 8.587669372558594, 'rewards/rejected': -15.744989395141602, 'rewards/accuracies': 1.0, 'rewards/margins': 24.332658767700195, 'logps/rejected': -543.1373901367188, 'logps/chosen': -546.248291015625, 'logits/rejected': -2.4985835552215576, 'logits/chosen': -2.302708864212036, 'epoch': 0.48}
{'loss': 0.0, 'learning_rate': 5.764331210191083e-05, 'rewards/chosen': 1.9416108131408691, 'rewards/rejected': -12.89147663116455, 'rewards/accuracies': 1.0, 'rewards/margins': 14.833087921142578, 'logps/rejected': -513.977294921875, 'logps/chosen': -580.5838623046875, 'logits/rejected': -2.6403989791870117, 'logits/chosen': -2.2291646003723145, 'epoch': 0.48}
{'loss': 0.0, 'learning_rate': 5.732484076433121e-05, 'rewards/chosen': -1.0275239944458008, 'rewards/rejected': -18.848756790161133, 'rewards/accuracies': 1.0, 'rewards/margins': 17.821231842041016, 'logps/rejected': -553.300048828125, 'logps/chosen': -462.2752380371094, 'logits/rejected': -2.4150876998901367, 'logits/chosen': -2.4466655254364014, 'epoch': 0.48}
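For context (an editorial note, not part of the original report): with the default sigmoid loss, DPO computes `-logsigmoid(beta * (reward_margin))`, so once the chosen/rejected margin grows as large as in the logs above, the per-step loss underflows to 0 at four-decimal precision. A minimal sketch of that relationship, using illustrative margin and `beta` values (the actual `beta` used in the run above is not shown):

```python
import math

def dpo_sigmoid_loss(margin: float, beta: float) -> float:
    # -log(sigmoid(beta * margin)) rewritten as log(1 + exp(-beta * margin))
    # for numerical stability when beta * margin is large.
    return math.log1p(math.exp(-beta * margin))

# Margins comparable to the "rewards/margins" values logged above;
# beta values are hypothetical, chosen only to illustrate the trend.
for margin in (5.0, 12.0, 20.0):
    for beta in (0.1, 0.5):
        print(f"margin={margin:5.1f} beta={beta} loss={dpo_sigmoid_loss(margin, beta):.4f}")
```

At `margin=20, beta=0.5` the loss is already below 1e-4, which logs as `'loss': 0.0`; a zero margin gives `log(2) ≈ 0.693`, the loss of an untrained policy.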
Minami-su commented 10 months ago

Here is my code. I wonder whether this is normal:

"""
python Sakura_DPO.py \
    --base_model Qwen-14B-Chat \
    --ref_model Qwen-1_8B-Chat \
    --data-path  Sakurajima_Mai_dpo.json \
    --output_dir Sakurajima_Mai_dpo \
    --num_epochs 1 \
    --batch_size 1 \
    --micro_batch_size 1 \
    --learning_rate 0.0001 \
    --lora_r 16 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --lr_scheduler 'linear' \
    --warmup_ratio 0.1 \
    --cutoff_len 768
##########################
transformers
bitsandbytes
evaluate
peft
transformers_stream_generator
tiktoken
fire
trl
"""
import os
import sys
from typing import List

import fire
import torch
import transformers
#import kosy_transformers
from datasets import load_dataset, Dataset

from transformers import TrainerCallback, TrainingArguments, TrainerState, TrainerControl
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR
from torch.nn import functional as F

from peft import (
    LoraConfig,
    get_peft_model,
    prepare_model_for_kbit_training,
    set_peft_model_state_dict
)

from transformers import LlamaForCausalLM, LlamaTokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import DPOTrainer
import bitsandbytes as bnb
#torch.autograd.set_detect_anomaly(True)
def find_all_linear_names(model):
    """Return the names of all 4-bit linear layers, for use as LoRA target modules."""
    # cls = bnb.nn.Linear8bitLt  # use this instead when loading the model in 8-bit
    cls = bnb.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if 'lm_head' in lora_module_names:  # needed for 16-bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)
#os.environ["TOKENIZERS_PARALLELISM"] = "false"

def train(
    # model/data params
    base_model: str = "", 
    ref_model: str = "", 
    data_path: str = "",
    output_dir: str = "",
    # training hyperparams
    batch_size: int = 128,
    micro_batch_size: int = 8,
    num_epochs: int = 1,
    learning_rate: float = 3e-4,
    cutoff_len: int = 4096,
    val_set_size: int = 0,
    lr_scheduler: str = "cosine",
    warmup_ratio: float = 0.1, 
    # lora hyperparams
    lora_r: int = 16,
    lora_alpha: int = 16,
    lora_dropout: float = 0.05,
    # from peft docs: ["q_proj", "k_proj", "v_proj", "o_proj", "fc_in", "fc_out", "wte", "gate_proj", "down_proj", "up_proj"]
    lora_target_modules: List[str] = ["gate_proj", "down_proj", "up_proj"],
    # llm hyperparams
    train_on_inputs: bool = False,  # if False, masks out inputs in loss
    add_eos_token: bool = False,
    group_by_length: bool = False,  # faster, but produces an odd training loss curve
    # wandb params
    #wandb_project: str = "",
    #wandb_run_name: str = "",
    #wandb_watch: str = "",  # options: false | gradients | all
    #wandb_log_model: str = "",  # options: false | true
    resume_from_checkpoint: str = None,  # either training checkpoint or final adapter
    prompt_template_name: str = "alpaca",
    # NEFTune params
    noise_alpha: int = 5
):
    if int(os.environ.get("LOCAL_RANK", 0)) == 0:
        print(
            f"Params using prompt template {prompt_template_name}:\n"
            f"base_model: {base_model}\n"
            f"data_path: {data_path}\n"
            f"output_dir: {output_dir}\n"
            f"batch_size: {batch_size}\n"
            f"micro_batch_size: {micro_batch_size}\n"
            f"num_epochs: {num_epochs}\n"
            f"learning_rate: {learning_rate}\n"
            f"cutoff_len: {cutoff_len}\n"
            f"val_set_size: {val_set_size}\n"
            f"lr_scheduler: {lr_scheduler}\n"
            f"warmup_ratio: {warmup_ratio}\n"
            f"lora_r: {lora_r}\n"
            f"lora_alpha: {lora_alpha}\n"
            f"lora_dropout: {lora_dropout}\n"
            f"lora_target_modules: {lora_target_modules}\n"
            f"train_on_inputs: {train_on_inputs}\n"
            f"add_eos_token: {add_eos_token}\n"
            f"group_by_length: {group_by_length}\n"
            #f"wandb_project: {wandb_project}\n"
            #f"wandb_run_name: {wandb_run_name}\n"
            #f"wandb_watch: {wandb_watch}\n"
            #f"wandb_log_model: {wandb_log_model}\n"
            f"resume_from_checkpoint: {resume_from_checkpoint or False}\n"
        )
    assert (
        base_model
    ), "Please specify a --base_model, e.g. --base_model='huggyllama/llama-7b'"

    # from huggingface_hub import login
    # login(token='[...your_token...]')

    gradient_accumulation_steps = batch_size // micro_batch_size

    device_map = "auto"
    world_size = int(os.environ.get("WORLD_SIZE", 1))

    ddp = world_size != 1
    if ddp:
        device_map = {"": int(os.environ.get("LOCAL_RANK") or 0)}
        gradient_accumulation_steps = gradient_accumulation_steps // world_size
        print("gradient_accumulation_steps:", gradient_accumulation_steps)
    print("DDP:", ddp)

    # Check if parameter passed or if set within environ
    '''
    use_wandb = len(wandb_project) > 0 or (
        "WANDB_PROJECT" in os.environ and len(os.environ["WANDB_PROJECT"]) > 0
    )
    # Only overwrite environ if wandb param passed
    if len(wandb_project) > 0:
        os.environ["WANDB_PROJECT"] = wandb_project
    if len(wandb_watch) > 0:
        os.environ["WANDB_WATCH"] = wandb_watch
    if len(wandb_log_model) > 0:
        os.environ["WANDB_LOG_MODEL"] = wandb_log_model
    '''

    #model = LlamaForCausalLM.from_pretrained(
    #    base_model,
    #    load_in_8bit=True, # LoRA
    #    #load_in_4bit=True, # QLoRA
    #    torch_dtype=torch.float16,
    #    device_map=device_map)

    # Original
    #tokenizer = LlamaTokenizer.from_pretrained(base_model)

    # 1. Define policy and reference models
    # model = AutoModelForCausalLM.from_pretrained(
    #     base_model, # location of saved SFT model
    #     low_cpu_mem_usage=True,
    #     torch_dtype=torch.float16,
    #     device_map = device_map
    # )
    from accelerate import Accelerator

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
    )
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        trust_remote_code=True,
        quantization_config=bnb_config,
        device_map={"": Accelerator().local_process_index},
    )
    model_ref = AutoModelForCausalLM.from_pretrained(
        ref_model,
        trust_remote_code=True,
        quantization_config=bnb_config,
        device_map={"": Accelerator().local_process_index},
    )

    model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

    #model_ref = AutoModelForCausalLM.from_pretrained(
    #    base_model,  # same model as the main one
    #    low_cpu_mem_usage=True,
    #    torch_dtype=torch.float16,
    #    load_in_4bit=True,
    #    quantization_config=bnb_config
    #)

    tokenizer = AutoTokenizer.from_pretrained(base_model,trust_remote_code=True)

    print(type(model))
    print(model)
    print("length of tokenizer:",len(tokenizer))
    if "qwen" in base_model.lower():
        # Qwen has no standard BOS/EOS/PAD tokens; reuse its ChatML special tokens
        tokenizer.pad_token_id = tokenizer.im_end_id
        tokenizer.bos_token_id = tokenizer.im_start_id
        tokenizer.eos_token_id = tokenizer.im_end_id
    else:
        print("pre-trained model's BOS, EOS and PAD token ids:",
              tokenizer.bos_token_id, tokenizer.eos_token_id, tokenizer.pad_token_id,
              "=> should be 1, 2, None for a LLaMA-style model")
        tokenizer.pad_token_id = 0  # unk; we want this to be different from the eos token
    bos = tokenizer.bos_token_id
    eos = tokenizer.eos_token_id
    pad = tokenizer.pad_token_id
    tokenizer.padding_side = "right"

    # 2. Define dataset
    def return_prompt_and_responses(samples):
        # prompt is left empty on purpose: the full text lives in chosen/rejected
        return {
            "prompt": "",
            "chosen": samples["chosen"],
            "rejected": samples["rejected"],
        }
    #dataset = load_dataset(data_path)
    if data_path.endswith(".json") or data_path.endswith(".jsonl"):
        dataset = load_dataset("json", data_files=data_path)
    else:
        dataset = load_dataset(data_path)
    train_dataset = dataset.map(return_prompt_and_responses)
    # train_dataset = train_dataset.filter(
    #     lambda x: len(x["chosen"]) <= cutoff_len
    #     and len(x["rejected"]) <= cutoff_len
    # )
    train_dataset = train_dataset.map(
        lambda x: {
            # NOTE: [:cutoff_len] truncates characters, not tokens
            "chosen": tokenizer.bos_token + x["chosen"][:cutoff_len] + tokenizer.eos_token,
            "rejected": tokenizer.bos_token + x["rejected"][:cutoff_len] + tokenizer.eos_token,
        }
    )
    train_dataset = train_dataset["train"].shuffle()
    #print(tokenizer.decode(train_dataset))
    print(train_dataset['chosen'][0])
    print(train_dataset['rejected'][0])

    # 3. Define hyperparameters
    training_args = TrainingArguments(
        num_train_epochs= num_epochs,
        per_device_train_batch_size=micro_batch_size,
        #per_device_eval_batch_size=script_args.per_device_eval_batch_size,
        #max_steps=1000,
        logging_steps=1,
        save_steps=50,
        save_total_limit=2,
        gradient_accumulation_steps=gradient_accumulation_steps,
        #gradient_checkpointing=script_args.gradient_checkpointing,
        learning_rate=learning_rate,
        #evaluation_strategy="steps",
        #eval_steps=script_args.eval_steps,
        output_dir=output_dir,
        #report_to=script_args.report_to,
        lr_scheduler_type=lr_scheduler,
        warmup_ratio=warmup_ratio,
        optim='paged_adamw_32bit', # rmsprop
        remove_unused_columns=False,
        run_name="dpo_kyujin",
    )
    modules = find_all_linear_names(model)
    peft_config = LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
        target_modules=modules,
        bias="none",
        task_type="CAUSAL_LM",
    )

    # DPO trainer
    dpo_trainer = DPOTrainer(
        model,
        ref_model = model_ref, #model_ref,
        args=training_args,
        beta=0.1, # fix
        train_dataset=train_dataset,
        #eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        peft_config=peft_config,
    )

    # NOTE: DPOTrainer already holds its own reference to the (PEFT-wrapped)
    # model, so reassigning the local variable here does not affect training.
    if torch.__version__ >= "2" and sys.platform != "win32":
        model = torch.compile(model)

    # train
    dpo_trainer.train()
    dpo_trainer.save_model(output_dir)

    # save
    output_dir = os.path.join(output_dir, "final_checkpoint")
    dpo_trainer.model.save_pretrained(output_dir)

if __name__ == "__main__":
    torch.cuda.empty_cache() 
    fire.Fire(train)
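As a sanity check on the batch-size arithmetic in `train()` above, the same computation in isolation (the `world_size` of 2 is a hypothetical DDP setup):

```python
# Effective global batch size = micro batch * accumulation steps * processes.
batch_size = 128         # target global batch
micro_batch_size = 8     # per-device batch
world_size = 2           # hypothetical number of DDP processes

gradient_accumulation_steps = batch_size // micro_batch_size  # 16
if world_size != 1:
    gradient_accumulation_steps //= world_size                # 8 per process

effective = micro_batch_size * gradient_accumulation_steps * world_size
print(effective)  # 128, matching the requested batch_size
```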
younesbelkada commented 9 months ago

Hmmm not sure what is happening here - @kashif @lewtun @edbeeching is that a common scenario when training a model with DPO?

Minami-su commented 9 months ago

> Hmmm not sure what is happening here - @kashif @lewtun @edbeeching is that a common scenario when training a model with DPO?

Perhaps the learning rate was too high 😂. After lowering it to 1e-6, the loss no longer seems to collapse to 0.
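For reference, the per-example DPO loss is `-log(sigmoid(rewards/margins))` (the logged margins already include the `beta` factor), so a loss pinned at 0 just means the reward margin has blown up, which an oversized learning rate makes easy. A minimal sketch recomputing losses from the margins logged above (batch averaging explains the remaining gap to the logged loss values):

```python
import math

def dpo_loss_from_margin(margin: float) -> float:
    # Numerically stable -log(sigmoid(margin))
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# rewards/margins values from the logged training steps in this issue
for m in (4.071, -3.909, -3.581, 4.010):
    print(round(dpo_loss_from_margin(m), 4))
```

Large positive margins drive this loss toward 0 exponentially fast, so once the policy overshoots (easy at 3e-4), the logged loss flat-lines at 0.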