liucongg / ChatGLM-Finetuning

Fine-tuning for specific downstream tasks based on the ChatGLM-6B, ChatGLM2-6B, and ChatGLM3-6B models, covering Freeze, LoRA, P-tuning, full-parameter fine-tuning, and more.

int4 quantization #12

Open kywen1119 opened 1 year ago

kywen1119 commented 1 year ago

Hello, did you try the quantize(4) quantization from the original repo when training?

liucongg commented 1 year ago

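I have not run experiments with quantize(4) quantization yet. When I ran inference on the original model earlier, the int4 and int8 results were unsatisfactory, so I passed on that approach.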

kywen1119 commented 1 year ago

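Understood, thanks. Could you share the log file from your P-tuning v2 run? I don't have enough GPU memory, so I can only train with quantize(4), but the loss does not go down during training, and I would like to compare the loss values.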

liucongg commented 1 year ago

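> Understood, thanks. Could you share the log file from your P-tuning v2 run? I don't have enough GPU memory, so I can only train with quantize(4), but the loss does not go down during training, and I would like to compare the loss values.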

Here is part of the log. It converges fairly quickly.

[2023-04-05 11:17:02,183] [INFO] [stage_1_and_2.py:1769:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2023-04-05 11:17:06,841] [INFO] [logging.py:75:log_dist] [Rank 0] step=20, skipped=19, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:17:06,853] [INFO] [timer.py:198:stop] epoch=0/micro_step=20/global_step=20, RunningAvgSamplesPerSec=1.2978227400649138, CurrSamplesPerSec=0.4284308101959755, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:2.384765625, global_step:20
[2023-04-05 11:17:21,722] [INFO] [stage_1_and_2.py:1769:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8192.0, reducing to 4096.0
[2023-04-05 11:17:27,446] [INFO] [stage_1_and_2.py:1769:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4096.0, reducing to 2048.0
[2023-04-05 11:17:27,446] [INFO] [logging.py:75:log_dist] [Rank 0] step=30, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:17:27,447] [INFO] [timer.py:198:stop] epoch=0/micro_step=30/global_step=30, RunningAvgSamplesPerSec=1.1588641319976016, CurrSamplesPerSec=1.4833966401414678, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:8.1015625, global_step:30
[2023-04-05 11:17:49,466] [INFO] [logging.py:75:log_dist] [Rank 0] step=40, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:17:49,479] [INFO] [timer.py:198:stop] epoch=0/micro_step=40/global_step=40, RunningAvgSamplesPerSec=1.0803512555915895, CurrSamplesPerSec=0.9160440119164451, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:2.419921875, global_step:40
[2023-04-05 11:18:11,611] [INFO] [logging.py:75:log_dist] [Rank 0] step=50, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:18:11,623] [INFO] [timer.py:198:stop] epoch=0/micro_step=50/global_step=50, RunningAvgSamplesPerSec=1.0380220326264773, CurrSamplesPerSec=0.9072260923019421, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:2.158203125, global_step:50
[2023-04-05 11:18:34,004] [INFO] [logging.py:75:log_dist] [Rank 0] step=60, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:18:34,017] [INFO] [timer.py:198:stop] epoch=0/micro_step=60/global_step=60, RunningAvgSamplesPerSec=1.0098476966213843, CurrSamplesPerSec=0.9122849602867247, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:2.30859375, global_step:60
[2023-04-05 11:18:56,025] [INFO] [logging.py:75:log_dist] [Rank 0] step=70, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:18:56,038] [INFO] [timer.py:198:stop] epoch=0/micro_step=70/global_step=70, RunningAvgSamplesPerSec=0.9935600227575752, CurrSamplesPerSec=0.9108542657366956, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:1.0576171875, global_step:70
[2023-04-05 11:19:18,030] [INFO] [logging.py:75:log_dist] [Rank 0] step=80, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:19:18,043] [INFO] [timer.py:198:stop] epoch=0/micro_step=80/global_step=80, RunningAvgSamplesPerSec=0.9818808053289275, CurrSamplesPerSec=0.9003177500642348, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:1.38671875, global_step:80
[2023-04-05 11:19:40,065] [INFO] [logging.py:75:log_dist] [Rank 0] step=90, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:19:40,078] [INFO] [timer.py:198:stop] epoch=0/micro_step=90/global_step=90, RunningAvgSamplesPerSec=0.9728787966021624, CurrSamplesPerSec=0.9128704975719492, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:1.0146484375, global_step:90
[2023-04-05 11:20:02,123] [INFO] [logging.py:75:log_dist] [Rank 0] step=100, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:20:02,136] [INFO] [timer.py:198:stop] epoch=0/micro_step=100/global_step=100, RunningAvgSamplesPerSec=0.965722554070104, CurrSamplesPerSec=0.898128093247804, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.83642578125, global_step:100
[2023-04-05 11:20:24,209] [INFO] [logging.py:75:log_dist] [Rank 0] step=110, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:20:24,221] [INFO] [timer.py:198:stop] epoch=0/micro_step=110/global_step=110, RunningAvgSamplesPerSec=0.9598564657589548, CurrSamplesPerSec=0.9076261907131945, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.95654296875, global_step:110
[2023-04-05 11:20:46,167] [INFO] [logging.py:75:log_dist] [Rank 0] step=120, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:20:46,180] [INFO] [timer.py:198:stop] epoch=0/micro_step=120/global_step=120, RunningAvgSamplesPerSec=0.9555242728592735, CurrSamplesPerSec=0.9128707955949885, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.63134765625, global_step:120
[2023-04-05 11:21:08,110] [INFO] [logging.py:75:log_dist] [Rank 0] step=130, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:21:08,122] [INFO] [timer.py:198:stop] epoch=0/micro_step=130/global_step=130, RunningAvgSamplesPerSec=0.9519581323683804, CurrSamplesPerSec=0.9097415872625454, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.7138671875, global_step:130
[2023-04-05 11:21:30,079] [INFO] [logging.py:75:log_dist] [Rank 0] step=140, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:21:30,092] [INFO] [timer.py:198:stop] epoch=0/micro_step=140/global_step=140, RunningAvgSamplesPerSec=0.9488380169259213, CurrSamplesPerSec=0.9104971694468935, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.5849609375, global_step:140
[2023-04-05 11:21:52,071] [INFO] [logging.py:75:log_dist] [Rank 0] step=150, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:21:52,084] [INFO] [timer.py:198:stop] epoch=0/micro_step=150/global_step=150, RunningAvgSamplesPerSec=0.946092990646761, CurrSamplesPerSec=0.9081305461043666, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.280517578125, global_step:150
[2023-04-05 11:22:14,049] [INFO] [logging.py:75:log_dist] [Rank 0] step=160, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:22:14,062] [INFO] [timer.py:198:stop] epoch=0/micro_step=160/global_step=160, RunningAvgSamplesPerSec=0.9437471176872136, CurrSamplesPerSec=0.9070488307887239, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.92626953125, global_step:160
[2023-04-05 11:22:36,058] [INFO] [logging.py:75:log_dist] [Rank 0] step=170, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:22:36,071] [INFO] [timer.py:198:stop] epoch=0/micro_step=170/global_step=170, RunningAvgSamplesPerSec=0.9416068021879145, CurrSamplesPerSec=0.9075004102820523, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:2.05859375, global_step:170
[2023-04-05 11:22:58,194] [INFO] [logging.py:75:log_dist] [Rank 0] step=180, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:22:58,207] [INFO] [timer.py:198:stop] epoch=0/micro_step=180/global_step=180, RunningAvgSamplesPerSec=0.9394111033791612, CurrSamplesPerSec=0.8988332943667124, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.759765625, global_step:180
[2023-04-05 11:23:20,147] [INFO] [logging.py:75:log_dist] [Rank 0] step=190, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:23:20,160] [INFO] [timer.py:198:stop] epoch=0/micro_step=190/global_step=190, RunningAvgSamplesPerSec=0.9378762254417848, CurrSamplesPerSec=0.9153941594927067, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.74072265625, global_step:190
[2023-04-05 11:23:42,286] [INFO] [logging.py:75:log_dist] [Rank 0] step=200, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:23:42,299] [INFO] [timer.py:198:stop] epoch=0/micro_step=200/global_step=200, RunningAvgSamplesPerSec=0.9360898627800226, CurrSamplesPerSec=0.9025733634700118, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.58056640625, global_step:200
[2023-04-05 11:24:04,714] [INFO] [logging.py:75:log_dist] [Rank 0] step=210, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:24:04,727] [INFO] [timer.py:198:stop] epoch=0/micro_step=210/global_step=210, RunningAvgSamplesPerSec=0.9338748713219513, CurrSamplesPerSec=0.8784005054730082, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.57421875, global_step:210
[2023-04-05 11:24:27,411] [INFO] [logging.py:75:log_dist] [Rank 0] step=220, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:24:27,424] [INFO] [timer.py:198:stop] epoch=0/micro_step=220/global_step=220, RunningAvgSamplesPerSec=0.9313347066926911, CurrSamplesPerSec=0.9132118614444052, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.615234375, global_step:220
[2023-04-05 11:24:49,394] [INFO] [logging.py:75:log_dist] [Rank 0] step=230, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:24:49,407] [INFO] [timer.py:198:stop] epoch=0/micro_step=230/global_step=230, RunningAvgSamplesPerSec=0.9303850896886615, CurrSamplesPerSec=0.9101330457747033, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.90234375, global_step:230
[2023-04-05 11:25:11,593] [INFO] [logging.py:75:log_dist] [Rank 0] step=240, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:25:11,606] [INFO] [timer.py:198:stop] epoch=0/micro_step=240/global_step=240, RunningAvgSamplesPerSec=0.9291217749680194, CurrSamplesPerSec=0.8994428640525456, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.40283203125, global_step:240
[2023-04-05 11:25:33,675] [INFO] [logging.py:75:log_dist] [Rank 0] step=250, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:25:33,688] [INFO] [timer.py:198:stop] epoch=0/micro_step=250/global_step=250, RunningAvgSamplesPerSec=0.9281684835993161, CurrSamplesPerSec=0.8940742669479436, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.52294921875, global_step:250
[2023-04-05 11:25:55,605] [INFO] [logging.py:75:log_dist] [Rank 0] step=260, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:25:55,618] [INFO] [timer.py:198:stop] epoch=0/micro_step=260/global_step=260, RunningAvgSamplesPerSec=0.9275428767715796, CurrSamplesPerSec=0.9136429306991944, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.525390625, global_step:260
[2023-04-05 11:26:17,601] [INFO] [logging.py:75:log_dist] [Rank 0] step=270, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:26:17,614] [INFO] [timer.py:198:stop] epoch=0/micro_step=270/global_step=270, RunningAvgSamplesPerSec=0.9268594389902394, CurrSamplesPerSec=0.9048889153532813, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.75048828125, global_step:270
[2023-04-05 11:26:39,829] [INFO] [logging.py:75:log_dist] [Rank 0] step=280, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:26:39,842] [INFO] [timer.py:198:stop] epoch=0/micro_step=280/global_step=280, RunningAvgSamplesPerSec=0.925869536940587, CurrSamplesPerSec=0.9090092235859744, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.775390625, global_step:280
[2023-04-05 11:27:01,838] [INFO] [logging.py:75:log_dist] [Rank 0] step=290, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:27:01,851] [INFO] [timer.py:198:stop] epoch=0/micro_step=290/global_step=290, RunningAvgSamplesPerSec=0.9252748349102661, CurrSamplesPerSec=0.9115410686425, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.4052734375, global_step:290
[2023-04-05 11:27:23,803] [INFO] [logging.py:75:log_dist] [Rank 0] step=300, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:27:23,816] [INFO] [timer.py:198:stop] epoch=0/micro_step=300/global_step=300, RunningAvgSamplesPerSec=0.9247824018489739, CurrSamplesPerSec=0.9124430349084235, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.57958984375, global_step:300
[2023-04-05 11:27:45,807] [INFO] [logging.py:75:log_dist] [Rank 0] step=310, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:27:45,820] [INFO] [timer.py:198:stop] epoch=0/micro_step=310/global_step=310, RunningAvgSamplesPerSec=0.924270142554032, CurrSamplesPerSec=0.9115573134349767, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.31591796875, global_step:310
[2023-04-05 11:28:07,877] [INFO] [logging.py:75:log_dist] [Rank 0] step=320, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:28:07,890] [INFO] [timer.py:198:stop] epoch=0/micro_step=320/global_step=320, RunningAvgSamplesPerSec=0.9237019839881618, CurrSamplesPerSec=0.9079150972588211, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.697265625, global_step:320
[2023-04-05 11:28:29,838] [INFO] [logging.py:75:log_dist] [Rank 0] step=330, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:28:29,852] [INFO] [timer.py:198:stop] epoch=0/micro_step=330/global_step=330, RunningAvgSamplesPerSec=0.923308629837534, CurrSamplesPerSec=0.9093083751984496, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.454833984375, global_step:330
[2023-04-05 11:28:52,049] [INFO] [logging.py:75:log_dist] [Rank 0] step=340, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:28:52,062] [INFO] [timer.py:198:stop] epoch=0/micro_step=340/global_step=340, RunningAvgSamplesPerSec=0.9226262057332437, CurrSamplesPerSec=0.8352693921943818, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.68994140625, global_step:340
[2023-04-05 11:29:14,245] [INFO] [logging.py:75:log_dist] [Rank 0] step=350, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:29:14,258] [INFO] [timer.py:198:stop] epoch=0/micro_step=350/global_step=350, RunningAvgSamplesPerSec=0.9220018016539515, CurrSamplesPerSec=0.8478712121587321, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.62109375, global_step:350
[2023-04-05 11:29:36,459] [INFO] [logging.py:75:log_dist] [Rank 0] step=360, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:29:36,472] [INFO] [timer.py:198:stop] epoch=0/micro_step=360/global_step=360, RunningAvgSamplesPerSec=0.9213917004842379, CurrSamplesPerSec=0.8962323043541575, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.329345703125, global_step:360
[2023-04-05 11:29:59,191] [INFO] [logging.py:75:log_dist] [Rank 0] step=370, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:29:59,204] [INFO] [timer.py:198:stop] epoch=0/micro_step=370/global_step=370, RunningAvgSamplesPerSec=0.9202199276468866, CurrSamplesPerSec=0.9120594043818299, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.5517578125, global_step:370
[2023-04-05 11:30:21,984] [INFO] [logging.py:75:log_dist] [Rank 0] step=380, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:30:21,996] [INFO] [timer.py:198:stop] epoch=0/micro_step=380/global_step=380, RunningAvgSamplesPerSec=0.9190437024220418, CurrSamplesPerSec=0.873035422395503, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.51171875, global_step:380
[2023-04-05 11:30:44,493] [INFO] [logging.py:75:log_dist] [Rank 0] step=390, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:30:44,506] [INFO] [timer.py:198:stop] epoch=0/micro_step=390/global_step=390, RunningAvgSamplesPerSec=0.9182403129525399, CurrSamplesPerSec=0.9052362524510992, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.849609375, global_step:390
[2023-04-05 11:31:06,569] [INFO] [logging.py:75:log_dist] [Rank 0] step=400, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:31:06,582] [INFO] [timer.py:198:stop] epoch=0/micro_step=400/global_step=400, RunningAvgSamplesPerSec=0.9179365633347868, CurrSamplesPerSec=0.9162064941313188, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.5361328125, global_step:400
[2023-04-05 11:31:28,939] [INFO] [logging.py:75:log_dist] [Rank 0] step=410, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:31:28,952] [INFO] [timer.py:198:stop] epoch=0/micro_step=410/global_step=410, RunningAvgSamplesPerSec=0.917344781870788, CurrSamplesPerSec=0.9195614180959596, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.372314453125, global_step:410
[2023-04-05 11:31:51,164] [INFO] [logging.py:75:log_dist] [Rank 0] step=420, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:31:51,177] [INFO] [timer.py:198:stop] epoch=0/micro_step=420/global_step=420, RunningAvgSamplesPerSec=0.9169268729548522, CurrSamplesPerSec=0.9144074021953126, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.34130859375, global_step:420
[2023-04-05 11:32:13,206] [INFO] [logging.py:75:log_dist] [Rank 0] step=430, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:32:13,219] [INFO] [timer.py:198:stop] epoch=0/micro_step=430/global_step=430, RunningAvgSamplesPerSec=0.9167093846791612, CurrSamplesPerSec=0.9097597412773563, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.355224609375, global_step:430
[2023-04-05 11:32:35,251] [INFO] [logging.py:75:log_dist] [Rank 0] step=440, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:32:35,264] [INFO] [timer.py:198:stop] epoch=0/micro_step=440/global_step=440, RunningAvgSamplesPerSec=0.9164981050442088, CurrSamplesPerSec=0.9171127081311214, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.485107421875, global_step:440
[2023-04-05 11:32:57,517] [INFO] [logging.py:75:log_dist] [Rank 0] step=450, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:32:57,530] [INFO] [timer.py:198:stop] epoch=0/micro_step=450/global_step=450, RunningAvgSamplesPerSec=0.9160888408987005, CurrSamplesPerSec=0.8814049846416752, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.442138671875, global_step:450
[2023-04-05 11:33:19,373] [INFO] [logging.py:75:log_dist] [Rank 0] step=460, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:33:19,386] [INFO] [timer.py:198:stop] epoch=0/micro_step=460/global_step=460, RunningAvgSamplesPerSec=0.9160739334276033, CurrSamplesPerSec=0.9104720685613489, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.62255859375, global_step:460
[2023-04-05 11:33:41,438] [INFO] [logging.py:75:log_dist] [Rank 0] step=470, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:33:41,451] [INFO] [timer.py:198:stop] epoch=0/micro_step=470/global_step=470, RunningAvgSamplesPerSec=0.9158727805827236, CurrSamplesPerSec=0.9166882763433433, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.59228515625, global_step:470
[2023-04-05 11:34:03,459] [INFO] [logging.py:75:log_dist] [Rank 0] step=480, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:34:03,472] [INFO] [timer.py:198:stop] epoch=0/micro_step=480/global_step=480, RunningAvgSamplesPerSec=0.9157180764005522, CurrSamplesPerSec=0.9050730479546055, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.27099609375, global_step:480
[2023-04-05 11:34:25,400] [INFO] [logging.py:75:log_dist] [Rank 0] step=490, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:34:25,413] [INFO] [timer.py:198:stop] epoch=0/micro_step=490/global_step=490, RunningAvgSamplesPerSec=0.9156377718781994, CurrSamplesPerSec=0.9168077990775755, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.60888671875, global_step:490
[2023-04-05 11:34:47,250] [INFO] [logging.py:75:log_dist] [Rank 0] step=500, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:34:47,263] [INFO] [timer.py:198:stop] epoch=0/micro_step=500/global_step=500, RunningAvgSamplesPerSec=0.9156384035876582, CurrSamplesPerSec=0.9142159651428658, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.5654296875, global_step:500
[2023-04-05 11:35:09,648] [INFO] [logging.py:75:log_dist] [Rank 0] step=510, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:35:09,661] [INFO] [timer.py:198:stop] epoch=0/micro_step=510/global_step=510, RunningAvgSamplesPerSec=0.915186587841061, CurrSamplesPerSec=0.8843449606216701, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.6396484375, global_step:510
[2023-04-05 11:35:31,591] [INFO] [logging.py:75:log_dist] [Rank 0] step=520, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:35:31,604] [INFO] [timer.py:198:stop] epoch=0/micro_step=520/global_step=520, RunningAvgSamplesPerSec=0.9151201633726329, CurrSamplesPerSec=0.9110135268318456, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.712890625, global_step:520
[2023-04-05 11:35:53,632] [INFO] [logging.py:75:log_dist] [Rank 0] step=530, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:35:53,645] [INFO] [timer.py:198:stop] epoch=0/micro_step=530/global_step=530, RunningAvgSamplesPerSec=0.9149784553854662, CurrSamplesPerSec=0.8932829945003398, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.145263671875, global_step:530
[2023-04-05 11:36:16,114] [INFO] [logging.py:75:log_dist] [Rank 0] step=540, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:36:16,127] [INFO] [timer.py:198:stop] epoch=0/micro_step=540/global_step=540, RunningAvgSamplesPerSec=0.9144993766021015, CurrSamplesPerSec=0.914435710988185, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.33056640625, global_step:540
[2023-04-05 11:36:38,029] [INFO] [logging.py:75:log_dist] [Rank 0] step=550, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:36:38,042] [INFO] [timer.py:198:stop] epoch=0/micro_step=550/global_step=550, RunningAvgSamplesPerSec=0.9144713686791487, CurrSamplesPerSec=0.9138780318359043, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.27978515625, global_step:550
[2023-04-05 11:37:00,164] [INFO] [logging.py:75:log_dist] [Rank 0] step=560, skipped=21, lr=[1e-05], mom=[[0.9, 0.95]]
[2023-04-05 11:37:00,177] [INFO] [timer.py:198:stop] epoch=0/micro_step=560/global_step=560, RunningAvgSamplesPerSec=0.9142791362334722, CurrSamplesPerSec=0.8834383946429685, MemAllocated=15.62GB, MaxMemAllocated=24.74GB
loss:0.185546875, global_step:560
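
To compare loss curves between two runs, as requested earlier in this thread, the `loss:..., global_step:...` pairs can be pulled out of a log like this one with a regular expression. The following is only a minimal sketch: the function names `parse_losses` and `moving_average` are made up for illustration and are not part of this repository, and the pattern assumes the exact `loss:<value>, global_step:<n>` format seen in the log.

```python
import re

# Matches the "loss:<value>, global_step:<n>" pairs printed by the training script.
# Assumption: the loss is always printed in plain decimal form, as in this log.
LOSS_PATTERN = re.compile(r"loss:\s*([0-9.]+),\s*global_step:\s*(\d+)")


def parse_losses(log_text):
    """Return (global_step, loss) pairs in the order they appear in the log."""
    return [(int(step), float(loss)) for loss, step in LOSS_PATTERN.findall(log_text)]


def moving_average(values, window):
    """Trailing moving average; early entries average over fewer values."""
    averaged = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged


# A short excerpt taken from the log in this thread.
sample = (
    "MemAllocated=15.62GB, MaxMemAllocated=24.74GB loss:2.384765625, global_step:20 "
    "MemAllocated=15.62GB, MaxMemAllocated=24.74GB loss:8.1015625, global_step:30 "
    "MemAllocated=15.62GB, MaxMemAllocated=24.74GB loss:2.419921875, global_step:40"
)

if __name__ == "__main__":
    pairs = parse_losses(sample)
    smoothed = moving_average([loss for _, loss in pairs], window=2)
    for (step, loss), avg in zip(pairs, smoothed):
        print(step, loss, avg)
```

Because the per-step loss is noisy (it jumps to 8.1 at step 30 and to 2.06 at step 170 here), comparing smoothed values from two runs is likely more informative than comparing individual steps.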

kywen1119 commented 1 year ago

Got it, thank you very much.

eatcosmos commented 1 year ago

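[image] Hello, when does this stop? The log above shows epoch=0/micro_step=560/global_step=560, while my run is at global_step=3800 and has been training for several hours. Why hasn't it stopped? I ran finetuning_pt.py directly. Could you explain whether some parameter in the code needs to be changed?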
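
As a general rule, a run like this ends after a fixed number of optimizer steps determined by the dataset size, the effective batch size, and the number of epochs. The sketch below shows that arithmetic under stated assumptions: the parameter names are hypothetical and are not taken from finetuning_pt.py, and whether a final partial batch counts as a step depends on the data loader. The 2.2 seconds per step used in the example comes from the log posted above, where 10 steps take roughly 22 seconds (step 40 at 11:17:49, step 50 at 11:18:11); other hardware will differ.

```python
import math


def estimate_total_steps(num_samples, per_device_batch_size,
                         num_devices=1, grad_accum_steps=1, num_epochs=1):
    """Estimate the number of optimizer steps for a whole run.

    All parameter names are illustrative; map them to whatever the
    training script actually uses.
    """
    samples_per_step = per_device_batch_size * num_devices * grad_accum_steps
    # Assumes the last partial batch of each epoch is kept, not dropped.
    steps_per_epoch = math.ceil(num_samples / samples_per_step)
    return steps_per_epoch * num_epochs


def estimate_hours(total_steps, seconds_per_step):
    """Convert a step count into wall-clock hours."""
    return total_steps * seconds_per_step / 3600


if __name__ == "__main__":
    # Hypothetical numbers: 10,000 samples, batch size 2, 2 epochs.
    steps = estimate_total_steps(10000, 2, num_epochs=2)
    print(steps, round(estimate_hours(steps, 2.2), 2))
```

If the estimated total is far above the current global_step, the run is most likely still in progress rather than stuck; reducing the number of epochs or the dataset size is the usual way to make it finish sooner.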