Time for Training - Githubissues

cindycandy commented 3 years ago

Following the steps in the README, I started training on the aeslc dataset at 18:30 yesterday. However, it is still fine-tuning now. Is this normal? Can I stop the process? Here is the log output:

INFO:tensorflow:examples/sec: 0.158775
I0924 14:56:51.580260 140351671805760 tpu_estimator.py:2308] examples/sec: 0.158775
INFO:tensorflow:global_step/sec: 0.0225665
I0924 14:57:35.893372 140351671805760 tpu_estimator.py:2307] global_step/sec: 0.0225665
INFO:tensorflow:examples/sec: 0.180532
I0924 14:57:35.893888 140351671805760 tpu_estimator.py:2308] examples/sec: 0.180532
INFO:tensorflow:global_step/sec: 0.0223011
I0924 14:58:20.734148 140351671805760 tpu_estimator.py:2307] global_step/sec: 0.0223011
INFO:tensorflow:examples/sec: 0.178409
I0924 14:58:20.734622 140351671805760 tpu_estimator.py:2308] examples/sec: 0.178409
INFO:tensorflow:global_step/sec: 0.0240086
I0924 14:59:02.385759 140351671805760 tpu_estimator.py:2307] global_step/sec: 0.0240086
INFO:tensorflow:examples/sec: 0.192069
I0924 14:59:02.386494 140351671805760 tpu_estimator.py:2308] examples/sec: 0.192069
INFO:tensorflow:global_step/sec: 0.0265066
I0924 14:59:40.112306 140351671805760 tpu_estimator.py:2307] global_step/sec: 0.0265066
INFO:tensorflow:examples/sec: 0.212053
I0924 14:59:40.112984 140351671805760 tpu_estimator.py:2308] examples/sec: 0.212053
INFO:tensorflow:global_step/sec: 0.0259231
I0924 15:00:18.687996 140351671805760 tpu_estimator.py:2307] global_step/sec: 0.0259231
INFO:tensorflow:examples/sec: 0.207385
I0924 15:00:18.688565 140351671805760 tpu_estimator.py:2308] examples/sec: 0.207385

Your help will be much appreciated.

zouweidong91 commented 3 years ago

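global_step/sec: 0.0265066 is your training speed. Check how many steps your train_steps parameter is set to in total and work out the remaining time yourself. Training on a GPU normally gives a global_step/sec of around 1, so it would certainly finish within a single day. My guess is that you are running on the CPU.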
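A single reading can be noisy, so it may help to average the global_step/sec values across the log. The following is a minimal sketch in Python, assuming the log is available as plain text; the function name `mean_step_rate` is made up for illustration. It matches only the `INFO:tensorflow:` entries, because each value is also repeated in the `tpu_estimator.py` entry that follows it.

```python
import re

# Match only the "INFO:tensorflow:" entries; the same value is repeated in the
# "tpu_estimator.py:2307]" entry, which this pattern deliberately skips.
STEP_RATE = re.compile(r"INFO:tensorflow:global_step/sec: ([0-9.]+)")


def mean_step_rate(log_text):
    """Return the average global_step/sec found in log_text, or None if absent."""
    rates = [float(value) for value in STEP_RATE.findall(log_text)]
    if not rates:
        return None
    return sum(rates) / len(rates)


# Two entries copied from the log in this thread.
sample = (
    "INFO:tensorflow:global_step/sec: 0.0225665 "
    "I0924 14:57:35.893372 140351671805760 tpu_estimator.py:2307] global_step/sec: 0.0225665 "
    "INFO:tensorflow:global_step/sec: 0.0223011 "
    "I0924 14:58:20.734148 140351671805760 tpu_estimator.py:2307] global_step/sec: 0.0223011"
)
print(mean_step_rate(sample))
```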

cindycandy commented 3 years ago

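Thank you very much! I am running on a server and assumed it would pick up the GPU automatically. Where can I find the steps parameter? Can I stop the training and start it again? Would that cause an error?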

zouweidong91 commented 3 years ago

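pegasus\params\public_params.py, line 159: "train_steps": 32000. To avoid checkpoint errors, after stopping, delete the model files generated so far and start again from scratch. Also check beforehand that the GPU environment actually works.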
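With train_steps known, the total training time follows from dividing it by the step rate. Here is a minimal sketch of that arithmetic in Python, using the 32000 steps quoted above and the 0.0265066 global_step/sec reported in the log; the helper name `eta_hours` is made up for illustration, and the estimate assumes the rate stays constant for the whole run.

```python
def eta_hours(train_steps, steps_per_sec):
    """Hours needed to run train_steps at a constant steps_per_sec."""
    if steps_per_sec <= 0:
        raise ValueError("steps_per_sec must be positive")
    return train_steps / steps_per_sec / 3600.0


TRAIN_STEPS = 32000  # "train_steps" value quoted in this thread

# Rate observed in the log versus the ~1 step/sec suggested for a GPU.
for label, rate in (("observed", 0.0265066), ("typical GPU", 1.0)):
    hours = eta_hours(TRAIN_STEPS, rate)
    print(f"{label}: {hours:.1f} hours ({hours / 24:.1f} days)")
```

At the observed rate the run would take roughly two weeks, whereas at about 1 step per second it would take just under 9 hours, which matches the suggestion that the job is running on the CPU.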

cindycandy commented 3 years ago

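I confirmed that a GPU is free, deleted the files, and reran the training, but the sec values are still very low, and the deleted files were regenerated almost immediately. I don't know how to fix this.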

zouweidong91 commented 3 years ago

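The training files are of course generated as soon as training starts. While training is running, run nvidia-smi to check GPU usage. If utilization is very low, the GPU is not being used and the environment is probably set up incorrectly; it is best to reinstall tf-gpu: pip install --force-reinstall tensorflow-gpu==1.15.2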
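The nvidia-smi check can also be scripted so it is easy to repeat while training runs. The following is a minimal sketch in Python, assuming nvidia-smi is on the PATH and reports numeric values for every GPU; the function names and the 10% idle threshold are made up for illustration and are not part of pegasus or TensorFlow.

```python
import shutil
import subprocess

# nvidia-smi's CSV query mode: one line per GPU, no header, no unit suffixes.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_query(csv_text):
    """Parse lines such as '0, 3, 120' into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        # Assumes numeric fields; some GPUs report "[N/A]", which would fail here.
        index, util, mem = [field.strip() for field in line.split(",")]
        rows.append(
            {"index": int(index), "util_percent": int(util), "memory_used_mib": int(mem)}
        )
    return rows


def query_gpus():
    """Return per-GPU usage, or None when nvidia-smi is not installed."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_query(result.stdout)


def looks_idle(rows, util_threshold=10):
    """True when every listed GPU is below the (arbitrary) utilization threshold."""
    return all(row["util_percent"] < util_threshold for row in rows)
```

Calling `query_gpus()` during training and passing the result to `looks_idle` gives a quick signal: if it stays true while global_step/sec remains low, the job is most likely not using the GPU.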

cindycandy commented 3 years ago

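Thanks again for your answer. It was a tf-gpu installation problem: because I am using an image, the GPU was not detected. I will shortly try reinstalling tf-gpu or raising the permissions. I now have another question: can I view the generated summaries and the corresponding articles? Which files are they in?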

yuye2133 commented 3 years ago

May I ask what the steps for pre-training are?

jayasridharmireddi commented 4 months ago

Hello, when I started training on the aeslc dataset, it showed me the following error. Can you please help me with this?

raise NonMatchingChecksumError(resource.url, tmp_path)
tensorflow_datasets.core.download.download_manager.NonMatchingChecksumError: Artifact https://github.com/ryanzhumich/AESLC/archive/master.zip, downloaded to /home/tbvl/tensorflow_datasets/downloads/ryanzhumich_AESLC_archive_masterACSpoxw627Ay4UrkswMeyz6RrOey8kKfkhEM4VySJWU.zip.tmp.0ec5f533500544b9b901328f413cbb6b/master, has wrong checksum.
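This error is not resolved in the thread. One way to narrow it down is to compute the SHA-256 digest and size of the downloaded archive and compare them with the values tensorflow_datasets has recorded for that URL. The following is a minimal sketch in Python using only the standard library; the function name is made up for illustration, and the path to pass in is whatever file the error message names on your machine.

```python
import hashlib
import os


def sha256_and_size(path, chunk_size=1 << 20):
    """Return (hex SHA-256 digest, size in bytes) of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest(), os.path.getsize(path)
```

If the digest differs between two fresh downloads, the download itself is probably being corrupted or truncated. If it is stable but still rejected, the recorded checksum may simply be out of date for that archive; that is an assumption, not something confirmed in this thread.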


google-research / pegasus

Time for Training #102