google-research / pegasus

Apache License 2.0

Time for Training #102

Open cindycandy opened 3 years ago

cindycandy commented 3 years ago

Following the steps in the README, I started fine-tuning on the aeslc dataset at 18:30 yesterday, and it is still running now. Is this normal? Can I stop the process? Here is the log output:

INFO:tensorflow:examples/sec: 0.158775
I0924 14:56:51.580260 140351671805760 tpu_estimator.py:2308] examples/sec: 0.158775
INFO:tensorflow:global_step/sec: 0.0225665
I0924 14:57:35.893372 140351671805760 tpu_estimator.py:2307] global_step/sec: 0.0225665
INFO:tensorflow:examples/sec: 0.180532
I0924 14:57:35.893888 140351671805760 tpu_estimator.py:2308] examples/sec: 0.180532
INFO:tensorflow:global_step/sec: 0.0223011
I0924 14:58:20.734148 140351671805760 tpu_estimator.py:2307] global_step/sec: 0.0223011
INFO:tensorflow:examples/sec: 0.178409
I0924 14:58:20.734622 140351671805760 tpu_estimator.py:2308] examples/sec: 0.178409
INFO:tensorflow:global_step/sec: 0.0240086
I0924 14:59:02.385759 140351671805760 tpu_estimator.py:2307] global_step/sec: 0.0240086
INFO:tensorflow:examples/sec: 0.192069
I0924 14:59:02.386494 140351671805760 tpu_estimator.py:2308] examples/sec: 0.192069
INFO:tensorflow:global_step/sec: 0.0265066
I0924 14:59:40.112306 140351671805760 tpu_estimator.py:2307] global_step/sec: 0.0265066
INFO:tensorflow:examples/sec: 0.212053
I0924 14:59:40.112984 140351671805760 tpu_estimator.py:2308] examples/sec: 0.212053
INFO:tensorflow:global_step/sec: 0.0259231
I0924 15:00:18.687996 140351671805760 tpu_estimator.py:2307] global_step/sec: 0.0259231
INFO:tensorflow:examples/sec: 0.207385
I0924 15:00:18.688565 140351671805760 tpu_estimator.py:2308] examples/sec: 0.207385

Your help will be much appreciated.

zouweidong91 commented 3 years ago

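global_step/sec: 0.0265066 is your training speed. Check the total number of steps set in your steps parameter and work out the remaining time yourself. On a GPU, training normally runs at around global_step/sec: 1, so it would certainly finish within a day. My guess is that you are running on the CPU.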

cindycandy commented 3 years ago

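Thank you very much! I am running on a server and assumed it would pick up the GPU automatically. Where can I find the steps parameter? Can I stop the training and start it again? Would that cause an error?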

zouweidong91 commented 3 years ago

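pegasus\params\public_params.py, line 159: "train_steps": 32000. To avoid checkpoint errors, stop the run, delete the model files it has already generated, and start again from scratch. Check beforehand that the GPU environment actually works.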
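To make the "work it out yourself" step concrete, here is a rough sketch of the arithmetic. It assumes the step rate stays constant and uses the global_step/sec readings posted above together with train_steps = 32000. The helper names `mean_step_rate` and `remaining_hours` are made up for illustration and are not part of pegasus.

```python
import re

# Matches the "global_step/sec: <number>" entries that tpu_estimator.py logs.
STEP_RATE = re.compile(r"global_step/sec:\s*([0-9]*\.?[0-9]+)")


def mean_step_rate(log_text):
    """Average every global_step/sec reading found in the log text.

    In the raw log each reading appears twice (an INFO line and a
    timestamped line); the duplicates do not change the mean.
    """
    rates = [float(value) for value in STEP_RATE.findall(log_text)]
    if not rates:
        raise ValueError("no global_step/sec entries found")
    return sum(rates) / len(rates)


def remaining_hours(train_steps, steps_per_sec, steps_done=0):
    """Hours left to reach train_steps, assuming a constant step rate."""
    return (train_steps - steps_done) / steps_per_sec / 3600.0


# The five readings posted at the top of this thread.
log = """
INFO:tensorflow:global_step/sec: 0.0225665
INFO:tensorflow:global_step/sec: 0.0223011
INFO:tensorflow:global_step/sec: 0.0240086
INFO:tensorflow:global_step/sec: 0.0265066
INFO:tensorflow:global_step/sec: 0.0259231
"""

rate = mean_step_rate(log)
print(f"mean rate: {rate:.4f} steps/sec")
print(f"at this rate: {remaining_hours(32000, rate) / 24:.1f} days")
print(f"at 1 step/sec: {remaining_hours(32000, 1.0):.1f} hours")
```

At the logged rate the full 32000 steps would take roughly two weeks, while at about 1 step/sec they take around nine hours, which is why the slow run points to the CPU.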

cindycandy commented 3 years ago

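I confirmed that a GPU is free, deleted the files, and ran it again, but the per-second values are still very low, and the deleted files were regenerated almost immediately. I don't know how to fix this.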

zouweidong91 commented 3 years ago

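The training files are of course created as soon as training starts. While training is running, run nvidia-smi to check GPU usage. If utilization is very low, the GPU is not being used and the environment is probably set up incorrectly. It is best to reinstall tf-gpu: pip install --force-reinstall tensorflow-gpu==1.15.2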
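If you would rather check this from a script than read nvidia-smi by eye, something like the following sketch should work. It relies on nvidia-smi's `--query-gpu` and `--format=csv,noheader,nounits` options. The function names and the 10% "idle" threshold are my own assumptions, and the parser assumes every field is a plain integer (some GPUs report `[N/A]` for utilization, which this sketch does not handle).

```python
import shutil
import subprocess

# Ask nvidia-smi for one CSV row per GPU: index, utilization (%), memory used (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_rows(csv_text):
    """Turn 'index, util, mem' lines into (index, util_percent, mem_mib) tuples."""
    rows = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, util, mem = (field.strip() for field in line.split(","))
        rows.append((int(index), int(util), int(mem)))
    return rows


def idle_gpus(rows, util_threshold=10):
    """Indices of GPUs whose utilization is below the (assumed) threshold."""
    return [index for index, util, _ in rows if util < util_threshold]


def query_gpus():
    """Run nvidia-smi and parse its output; returns [] if the tool is missing."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_rows(result.stdout)


# Made-up sample output for two GPUs, only to show the parsing.
sample = "0, 3, 412\n1, 97, 10240\n"
rows = parse_gpu_rows(sample)
print(rows)
print("idle GPUs:", idle_gpus(rows))
```

Call `idle_gpus(query_gpus())` while training is running. If the GPU you expect to be busy shows up as idle, TensorFlow is most likely running on the CPU.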

cindycandy commented 3 years ago

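Thanks again for your answer. It was a tf-gpu installation problem: because I was using an image, the GPU was not detected. I will try reinstalling tf-gpu or raising the permissions later. Now I have another question: can I view the generated summaries and the corresponding articles? Which files are they in?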

yuye2133 commented 3 years ago

A question: what are the steps for pre-training?

jayasridharmireddi commented 4 months ago

Hello, when I started training on the aeslc dataset, I got the following error. Can you please help me with this?

raise NonMatchingChecksumError(resource.url, tmp_path)
tensorflow_datasets.core.download.download_manager.NonMatchingChecksumError: Artifact https://github.com/ryanzhumich/AESLC/archive/master.zip, downloaded to /home/tbvl/tensorflow_datasets/downloads/ryanzhumich_AESLC_archive_masterACSpoxw627Ay4UrkswMeyz6RrOey8kKfkhEM4VySJWU.zip.tmp.0ec5f533500544b9b901328f413cbb6b/master, has wrong checksum.
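The message says the downloaded archive does not match the checksum that tensorflow_datasets has on record for that URL. One possible cause is that a master.zip archive follows the branch, so its bytes can change after the checksum was recorded; a truncated download is another. As a small diagnostic sketch, assuming the recorded value is a SHA-256 hex digest plus a file size (please verify this against your tensorflow_datasets version), you could hash the file yourself and compare. The helper name `sha256_and_size` is made up for illustration.

```python
import hashlib
import os
import tempfile


def sha256_and_size(path, chunk_size=1 << 20):
    """Return (hex SHA-256 digest, size in bytes) of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest(), os.path.getsize(path)


# Small self-contained demo on a temporary file; point the function at the
# downloaded archive instead when diagnosing the real error.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name

demo_digest, demo_size = sha256_and_size(demo_path)
print(demo_digest, demo_size)
os.remove(demo_path)
```

If the size is far smaller than expected, the download was probably cut short and retrying may be enough. If the file is complete but the digest differs, the recorded checksum is likely out of date for that archive.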
