tencent-ailab / hok_env

Honor of Kings AI Open Environment of Tencent
https://aiarena.tencent.com/aiarena/en/open-gamecore
Apache License 2.0

gamecore error after training 5~6 hours #23

Closed: zszzlmt closed this issue 10 months ago

zszzlmt commented 1 year ago

Hello, I trained the model following this guide: https://github.com/tencent-ailab/hok_env/blob/master/docs/run_with_prebuilt_image.md Everything went well, but the gamecore crashed after 5~6 hours of training with the error message: "wait cmd failed".

Please help me find the cause. Also, is there any way to load the trained checkpoint and continue training? Thanks!

hongyangqin commented 1 year ago

Please check the disk usage, the memory usage, and the log of simulator at gamecore/simulator_output.
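The three checks above can be scripted. The following is only a minimal sketch, assuming a Linux host where /proc/meminfo is available and that the script is run from the directory containing gamecore/simulator_output; the helper names (`disk_usage_percent`, `parse_meminfo`, `tail`, `report`) are hypothetical and not part of hok_env, and the log file names inside that directory are not assumed, so the sketch simply shows the most recently modified files.

```python
import shutil
from collections import deque
from pathlib import Path


def disk_usage_percent(path="/"):
    """Return the percentage of space used on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def tail(path, n=20):
    """Return the last `n` lines of a text file."""
    with open(path, "r", errors="replace") as fh:
        return list(deque(fh, maxlen=n))


def report(log_dir="gamecore/simulator_output", n=20):
    """Print disk usage, memory usage, and the tail of the newest log files."""
    print(f"disk used: {disk_usage_percent('/'):.1f}%")

    meminfo = Path("/proc/meminfo")  # assumption: Linux host
    if meminfo.exists():
        info = parse_meminfo(meminfo.read_text())
        if "MemTotal" in info and "MemAvailable" in info:
            used = info["MemTotal"] - info["MemAvailable"]
            print(f"memory used: {100.0 * used / info['MemTotal']:.1f}%")

    log_path = Path(log_dir)
    if not log_path.is_dir():
        print(f"log directory not found: {log_path}")
        return
    files = [p for p in log_path.iterdir() if p.is_file()]
    # Show the three most recently modified files, oldest of them first.
    for p in sorted(files, key=lambda f: f.stat().st_mtime)[-3:]:
        print(f"--- {p} ---")
        print("".join(tail(p, n)), end="")


if __name__ == "__main__":
    report()
```

A disk that is nearly full or a steadily climbing memory figure after several hours of training would point at resource exhaustion, while the last lines of the simulator log may show the failure that came just before "wait cmd failed".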

Kim-Q commented 1 year ago

Pardon me, but does that guide (https://github.com/tencent-ailab/hok_env/blob/master/docs/run_with_prebuilt_image.md) really work?

I've been trying step '5.i. Start learner' and it always gets stuck on the paths, although I have changed the path as instructed. It still prints:

root@f2b39780a63b:/workspace/hok_env/hok_env/code/code/gpu_code/script# bash start_gpu.sh

[2023-05-25 09:35:54] init dir
start_gpu.sh: line 16: cd: /code/code/gpu_code/script/: No such file or directory
cp: cannot create regular file '/code/code/gpu_code/learner/tool/': No such file or directory
cp: cannot create regular file '/code/code/gpu_code/learner/tool/': No such file or directory
cp: cannot create regular file '/code/code/gpu_code/learner/tool/': No such file or directory
[2023-05-25 09:35:54] start run set_gpu.sh
start_gpu.sh: line 26: cd: /code/code/gpu_code/learner/tool/: No such file or directory
start_gpu.sh: line 31: cd: /code/code/gpu_code/learner/: No such file or directory
bash: kill.sh: No such file or directory
bash: clean.sh: No such file or directory
start_gpu.sh: line 33: cd: /code/code/gpu_code/learner/: No such file or directory
[2023-05-25 09:35:54] start rl learner
start_gpu.sh: line 39: cd: /code/code/gpu_code/script/: No such file or directory
[2023-05-25 09:35:54] start monitor

I'd appreciate it if somebody could point me to an up-to-date tutorial that fits hok-1.2.1.

Kim-Q commented 1 year ago

Besides, I also tried modifying the paths in the shell scripts. Some processes still do not seem to work, because 'kill' reports an error.

root@f2b39780a63b:/workspace/hok_env/hok_env/code/gpu_code/script# bash start_gpu.sh

[2023-05-25 08:51:51] init dir
[2023-05-25 08:51:51] start run set_gpu.sh
[main] task_name=test_auto training_type=async steps=157 game_name=1v1 env_name=1v1 use_zmq=0 use_socket=0 mem_pool_num= task_id=130432 task_uuid=e32ae3a6 task_id= task_uuid=
howe start set_gpu0 gpu_num, 1 actor_num, gpu_ips, 127.0.0.1 gpu_list, "127.0.0.1:35911" Nodelist, 127.0.0.1:1
kill: (9177): No such process
[2023-05-25 08:51:52] start rl learner
Start service...
Complete!
[2023-05-25 08:51:52] start monitor

hongyangqin commented 1 year ago

@Kim-Q I noticed that your work directory is /workspace/hok_env/hok_env/code/gpu_code/script. It seems like you're not using the pre-built image which requires /code/code/gpu_code/script as the working directory.
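A quick way to confirm this before running start_gpu.sh is to check that the directories the script tries to enter actually exist. The sketch below is only an illustration: the directory list is taken from the error log earlier in this thread, not from the script itself, and `missing_dirs` with its `root` parameter is a hypothetical helper, not something hok_env provides.

```python
from pathlib import Path

# Directories that start_gpu.sh failed to enter in the log earlier in this
# thread. Assumption: this list is not exhaustive.
EXPECTED_DIRS = [
    "/code/code/gpu_code/script",
    "/code/code/gpu_code/learner",
    "/code/code/gpu_code/learner/tool",
]


def missing_dirs(paths, root="/"):
    """Return the entries of `paths` that are not directories under `root`.

    `root` defaults to the filesystem root; pointing it somewhere else lets
    the same check run against a checkout in a non-standard location.
    """
    base = Path(root)
    return [p for p in paths if not (base / p.lstrip("/")).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs(EXPECTED_DIRS)
    if missing:
        print("missing directories:")
        for p in missing:
            print("  " + p)
    else:
        print("all expected directories exist")
```

If every path is reported missing, the container is most likely not the pre-built image, or the code was unpacked somewhere other than /code.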

Kim-Q commented 1 year ago

Right, but I have already modified the working directory and the errors still exist. See https://github.com/tencent-ailab/hok_env/issues/23#issuecomment-1562623849.

I think it might be a version problem, because the downloaded 'hok_env' files are not laid out the way the shell scripts (start_gpu.sh, set_gpu.sh, start.sh, etc.) expect.