tencent-ailab / hok_env

Honor of Kings AI Open Environment of Tencent
https://aiarena.tencent.com/aiarena/en/open-gamecore
Apache License 2.0
626 stars, 72 forks

Hardware support issues with gamecore #61

Open. FDUAI opened this issue 3 months ago.

FDUAI commented 3 months ago

I am having trouble running the gamecore container. I followed the official installation and configuration instructions and successfully ran gamecore on both a Linux server and a Windows PC. However, I cannot get gamecore to run in a container on an Ubuntu 22.04 machine with dual AMD EPYC 9754 CPUs. Even with all environment variables identical to those on the working machines, gamecore still fails in the container on this server, and it also fails under the Windows system installed on the same server. I suspect the problem is related to the underlying hardware, or that some of gamecore's library calls misbehave on this hardware. The first screenshot below shows gamecore run directly with wine inside the official container, the second shows it run with the official shell script, and the third shows the gamecore output log.

[Screenshots: (1) gamecore run directly with wine in the official container; (2) gamecore run with the official sh script; (3) gamecore output log]
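The report says the environment variables were kept identical across machines. One way to double-check that is to dump `env` on each machine or container (for example `docker exec gamecore0 env > epyc.env`) and diff the dumps. A minimal sketch, assuming one `KEY=VALUE` per line; the file names and the script name `env_diff.py` are hypothetical, and multi-line values are not handled:

```python
import sys


def parse_env_dump(text: str) -> dict[str, str]:
    """Parse `KEY=VALUE` lines (as printed by `env`) into a dict."""
    env = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key:  # skip blank or malformed lines
            env[key] = value
    return env


def diff_env(a: dict[str, str], b: dict[str, str]) -> dict[str, tuple]:
    """Return {key: (value_in_a, value_in_b)} for every key that differs.

    A key missing on one side is reported as None on that side.
    """
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (hypothetical file names): python env_diff.py good.env epyc.env
    with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
        diffs = diff_env(parse_env_dump(f1.read()), parse_env_dump(f2.read()))
    for key, (left, right) in diffs.items():
        print(f"{key}: {left!r} != {right!r}")
```

An empty diff would support the claim that the environments match; anything printed is a candidate difference worth ruling out before blaming the hardware.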

Starlight0798 commented 3 months ago

How did you run gamecore on Linux? I failed to build the gamecore Docker image because of a wine signature issue.

TimerChen commented 3 months ago

Please see the solution in #59. Maybe Docker is not forwarding ports 35150-35151.

FDUAI commented 3 months ago

> How did you run gamecore on Linux? I failed to build the gamecore Docker image because of a wine signature issue.

Perhaps you can build the image on another machine and then import it.

FDUAI commented 3 months ago

> Please see the solution in #59. Maybe Docker is not forwarding ports 35150-35151.

I tried that, but it didn't help. My client and agent can communicate with each other, and my output shows that communication works, which is not what I would expect from a port-forwarding problem.
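To confirm connectivity independently of gamecore, a plain TCP connection test can be run from whichever machine or container needs to reach the port. A minimal sketch, assuming the traffic on 35150/35151 is TCP (not stated in the thread); the host and timeout values are illustrative:

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # connection refused, timed out, host unreachable, ...
        return False


if __name__ == "__main__":
    # Ports taken from this thread; run this while the listening side is up.
    for port in (35150, 35151):
        status = "reachable" if port_is_reachable("127.0.0.1", port) else "NOT reachable"
        print(port, status)
```

A successful connection only shows that something is listening and the path is open; it says nothing about whether gamecore then answers correctly, which matches the symptom described here.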

TimerChen commented 3 months ago

Is this error log for the test script test_env.py? Can you provide the content of xxx.conf.new?

Also, I wrote a Chinese tutorial on running the game environment on Linux without Docker: https://zhuanlan.zhihu.com/p/704549313. I hope it helps you figure out this issue. I will also add an English tutorial to this repo soon.

FDUAI commented 3 months ago

> Is this error log for the test script test_env.py?

1. Is this error log for the test script `test_env.py`? Yes.

2. Can you provide the content of xxx.conf.new? Here is `kaiwu-test-env-1719230614749757468-261.conf.new`:

```json
{
  "game_mode": "1v1",
  "abs_file": "../../rl_framework/gamecore/scene/1V1.abs",
  "core_assets": "../../rl_framework/gamecore/core_assets",
  "game_id": "kaiwu-test-env-1719230614749757468-261",
  "hero_conf": [
    {
      "hero_id": 157,
      "request_info": {"ip": "127.0.0.1", "port": 35150, "timeout": 3000},
      "skill_id": 80115,
      "symbol": [1514, 1514, 1514, 1514, 1514, 1514, 1514, 1514, 1514, 1514, 3515, 3515, 3515, 3515, 3515, 3515, 3515, 3515, 3515, 3515, 2520, 2520, 2520, 2520, 2520, 2520, 2520, 2503, 2503, 2503]
    },
    {
      "hero_id": 154,
      "request_info": {},
      "skill_id": 80115,
      "symbol": [1504, 1504, 1504, 1504, 1504, 1504, 1504, 1504, 1504, 1520, 3514, 3514, 3514, 3514, 3514, 3514, 3514, 3514, 3514, 3514, 2517, 2517, 2515, 2515, 2515, 2515, 2515, 2515, 2515, 2515]
    }
  ]
}
```

3. I used the following command to run gamecore:

```sh
docker run -d --name gamecore0 --network host -e SIMULATOR_USE_WINE=1 -it gamecore sh /rl_framework/remote-gc-server/run_and_monitor_gamecore_server.sh
```

The same bug occurs even when I run it on the Windows 11 system of this machine. The screenshot below is the output of `test_env.py`; it clearly never receives a response from gamecore.

[Screenshot]
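When comparing configs across machines, it can help to list exactly which endpoint each hero's `request_info` points to and check that it matches the port configured in `unit_test/test_env.py`. A minimal sketch, assuming (not confirmed in the thread) that a non-empty `request_info` names the ip/port gamecore sends requests to for that hero, and that an empty one means no remote endpoint:

```python
import json
import sys


def request_endpoints(conf: dict) -> list[tuple]:
    """Return (hero_id, ip, port, timeout) for each hero with a non-empty request_info."""
    endpoints = []
    for hero in conf.get("hero_conf", []):
        info = hero.get("request_info") or {}
        if info:
            endpoints.append((hero["hero_id"], info["ip"], info["port"], info.get("timeout")))
    return endpoints


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python show_endpoints.py kaiwu-test-env-...conf.new  (script name is hypothetical)
    with open(sys.argv[1]) as f:
        for hero_id, ip, port, timeout in request_endpoints(json.load(f)):
            print(f"hero {hero_id}: {ip}:{port} (timeout {timeout})")
```

For the config posted above this prints `hero 157: 127.0.0.1:35150 (timeout 3000)` and skips hero 154, whose `request_info` is empty.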

TimerChen commented 3 months ago

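Sorry, I still have no idea what the reason is. Here are some directions you can try to troubleshoot the problem:

* Maybe you can try to change the default port in `unit_test/test_env.py` from 35150 to 35300.

* Is there a program occupying port 35150?

* Is there a firewall blocking the port?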

Also, I am confused about your testing environment.

> ... successfully ran gamecore on both a Linux server and a Windows PC. However, I am unable to successfully run gamecore in a container on an Ubuntu 22.04 machine with dual 9754 CPUs.

> Even when I run it on the Windows 11 system of this machine, the same bug occurs.

Does the error only occur when running the tests with docker? Do both gamecore_server and test_env.py run in the same docker container?
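To check whether another program already holds port 35150 (one of the troubleshooting directions in this thread), a quick test is to try binding the port yourself. A minimal sketch, assuming TCP and Linux-style bind semantics; the port number comes from the thread:

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if binding (host, port) fails.

    Failure usually means another process holds the port, but other bind
    errors (e.g. permission denied, or a connection lingering in TIME_WAIT)
    are reported as "in use" too.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return True
        return False


if __name__ == "__main__":
    # Run this while gamecore and test_env.py are stopped.
    print("35150 in use:", port_in_use(35150))
```

Running it with everything stopped helps separate a port conflict from the response timeout reported above.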

FDUAI commented 3 months ago

> Sorry, I still have no idea what's the reason. There are some directions you can go to troubleshoot the problem:
>
> * Maybe you can try to change the default port in `unit_test/test_env.py` from 35150 to 35300.
>
> * Is there a program occupying port 35150?
>
> * Is there a firewall blocking the port?
>
> Also, I am confused about your testing environment.
>
> > ... successfully ran gamecore on both a Linux server and a Windows PC. However, I am unable to successfully run gamecore in a container on an Ubuntu 22.04 machine with dual 9754 CPUs.
>
> > Even when I run it on the Windows 11 system of this machine, the same bug occurs.
>
> Does the error only occur when running the tests with docker? Do both gamecore_server and test_env.py run in the same docker container?

I tried all three directions you suggested, but the error is still the same. It is not limited to Docker; it also occurs when I run gamecore with Wine directly. The second screenshot in the question shows the error, which points to a problem in a low-level random number generation function. Since the error also occurs under Windows, I suspect a hardware issue: I'm using an AMD EPYC 9754, which may not support this function or may restrict it in some way. For now, I can only run the actor and learner nodes on this machine, while the gamecore node runs on other machines.

Thank you for your help!
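To probe the random-number hypothesis a little further on Linux, one could check which hardware RNG instructions the CPU advertises. A minimal sketch, assuming (not confirmed anywhere in this thread) that the failing function relies on the x86 RDRAND/RDSEED instructions; it only reads the flags the kernel reports in `/proc/cpuinfo` and does not test whether the instructions return good values:

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Collect the union of all 'flags' entries from /proc/cpuinfo text."""
    flags: set[str] = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":  # ignores e.g. "vmx flags" lines
            flags.update(value.split())
    return flags


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")  # Linux only; not available on Windows
    if path.exists():
        flags = cpu_flags(path.read_text())
        for name in ("rdrand", "rdseed"):
            print(name, "present" if name in flags else "absent")
```

Comparing this output between the EPYC 9754 server and one of the machines where gamecore works would show whether the advertised RNG support differs at all.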

FDUAI commented 3 months ago

> Does the error only occur when running the tests with docker? Do both gamecore_server and test_env.py run in the same docker container?

No, gamecore and the actor node are not in the same Docker container. This setup should be fine, since it works on other machines. I have also turned off the firewall.

TimerChen commented 3 months ago

Given all these tests, I agree this is a hardware issue. It is beyond what I can diagnose, so let's wait for the official maintainers to handle it.