Compiling mini-AlphaStar v1.04

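Hello, I'm back. I have been busy with other work for a while and recently decided to get on with studying mini-AlphaStar. I was pleasantly surprised to see that mini-AlphaStar has already reached v1.04! I followed the README and ran into the following problems: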

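(1) Following the Zhihu article, I downloaded replay files that match my StarCraftII version, copied them to ./data/replays/, and ran transform_replay_data.test. No files were generated in the ./data/replay_data_tensor_new folder, and the output was as follows: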

8bf226bac73e87f9996a71.SC2Replay
replay_path: /home/zhq/Doctor/RL_Project/mini-AlphaStar/data/replays/ffe17a49ecefd2396f634545d63751a0392c3300b81b647dbf4b7285da6925e1.SC2Replay
replay_path: /home/zhq/Doctor/RL_Project/mini-AlphaStar/data/replays/ffe4d74a07f04da3f137d6fd967d25ee31f08f1623d6ae20f663809997660b92.SC2Replay
replay_path: /home/zhq/Doctor/RL_Project/mini-AlphaStar/data/replays/ffea13e5c795065700f7aee8f7427c0fb417e09acdfffe09f9c9b2aab5570f6d.SC2Replay
replay_path: /home/zhq/Doctor/RL_Project/mini-AlphaStar/data/replays/fff40a57d7de2b96da6cb7eb42064c2daf694cbc26c848335ae641f099bac6fa.SC2Replay
100%|███████████████████████████████████████████████████████████████| 200687/200687 [00:02<00:00, 95753.21it/s]
unable to parse websocket frame.
RequestQuit command received.
Closing Application...
end
replay_length_list: []
noop_length_list: []
run over
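
To see at a glance which replays never produced a tensor file, a small helper like the one below may be useful. It is only a sketch: the function name `unconverted_replays` is hypothetical, and it assumes each converted replay is written as a `.pt` file with the same base name as the `.SC2Replay` file, which is how the file names in this report look.

```python
from pathlib import Path


def unconverted_replays(replay_dir, tensor_dir):
    """Return names of replay files that have no matching .pt file in tensor_dir.

    Assumption: a converted replay keeps its base name and only changes
    its extension from .SC2Replay to .pt.
    """
    converted = {p.stem for p in Path(tensor_dir).glob("*.pt")}
    return sorted(
        p.name
        for p in Path(replay_dir).glob("*.SC2Replay")
        if p.stem not in converted
    )
```

Calling `unconverted_replays("./data/replays", "./data/replay_data_tensor_new")` from the repository root would list every replay that was skipped or failed during conversion.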

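(2) Since the file conversion in step (1) did not succeed, I filtered the files in ./data/replays/, keeping only Terran-vs-Terran replays, and saved the filtered replays to ./data/filtered_replays_1/. I then ran transform_replay_data.test again, and only one .pt file was generated in the replay_data_tensor_new/ folder. Assuming nothing had gone wrong up to that point, I went on to run sl_train_by_tensor.test, which failed with the following output: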

(base) zhq@Ubuntu20:~/Doctor/RL_Project/mini-AlphaStar$ python run.py 
pygame 2.0.1 (SDL 2.0.14, Python 3.8.5)
Hello from the pygame community. https://www.pygame.org/contribute.html
run init
cudnn available
cudnn version 7605
==> Making model..
The number of parameters of model is 2638715
==> Preparing data..
  0%|                                                                                    | 0/1 [00:00<?, ?it/s]replay_path: ./data/replay_data_tensor_new/d55bf6b94d417716ca9caf36f065727efe93aa0059893e1000e9a5afb79a98a2.pt
100%|████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.54it/s]
  0%|                                                                                    | 0/1 [00:00<?, ?it/s]replay_path: ./data/replay_data_tensor_new/d55bf6b94d417716ca9caf36f065727efe93aa0059893e1000e9a5afb79a98a2.pt
100%|█████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 47127.01it/s]
Traceback (most recent call last):
  File "run.py", line 47, in <module>
    sl_train_by_tensor.test(on_server=P.on_server)
  File "/home/zhq/Doctor/RL_Project/mini-AlphaStar/alphastarmini/core/sl/sl_train_by_tensor.py", line 463, in test
    main_worker(DEVICE)
  File "/home/zhq/Doctor/RL_Project/mini-AlphaStar/alphastarmini/core/sl/sl_train_by_tensor.py", line 161, in main_worker
    val_set = ConcatDataset(val_list)
  File "/home/zhq/anaconda3/lib/python3.8/site-packages/torch/utils/data/dataset.py", line 200, in __init__
    assert len(datasets) > 0, 'datasets should not be an empty iterable'  # type: ignore
AssertionError: datasets should not be an empty iterable
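
The assertion comes from `ConcatDataset` receiving an empty `val_list`. One plausible reading is that a single .pt file is not enough to fill both a training and a validation set. I have not checked how sl_train_by_tensor.py actually divides the files, so the following is only a sketch of a guarded split; `split_train_val` and `val_ratio` are hypothetical names, not part of mini-AlphaStar.

```python
def split_train_val(files, val_ratio=0.2):
    """Split a list of files into train and validation parts, both non-empty."""
    files = sorted(files)
    if len(files) < 2:
        # Fail early with a readable message instead of an empty ConcatDataset.
        raise ValueError(
            f"need at least 2 tensor files to build train and val sets, got {len(files)}"
        )
    # Keep at least one file on each side of the split.
    n_val = min(max(1, round(len(files) * val_ratio)), len(files) - 1)
    return files[:-n_val], files[-n_val:]
```

With a check like this, the single-file case above would stop with a message saying that more replays need to be converted before training can start.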

(3) Running rl_vs_computer_wo_replay.test directly produces the following error:

(base) zhq@Ubuntu20:~/Doctor/RL_Project/mini-AlphaStar$ python run.py 
pygame 2.0.1 (SDL 2.0.14, Python 3.8.5)
Hello from the pygame community. https://www.pygame.org/contribute.html
run init
cudnn available
cudnn version 7605
initialed player
initialed teacher
learner trajectories size: 0
learner trajectories size: 0
start_time before training: 2021-12-07 10:26:38
map name: AbyssalReef
player.name: MainPlayer
player.race: Race.protoss
Version: B60604 (SC2.4.01)
Build: May  1 2018 19:24:12
Command Line: '"/home/zhq/StarCraftII/Versions/Base60321/SC2_x64" -listen 127.0.0.1 -port 18844 -dataDir /home/zhq/StarCraftII/ -tempDir /tmp/sc-k2b9f5n_/ -dataVersion 33D9FE28909573253B7FC352CE7AEA40'
Starting up...
Startup Phase 1 complete
learner trajectories size: 0
Startup Phase 2 complete
Creating stub renderer...
Listening on: 127.0.0.1:18844
Startup Phase 3 complete. Ready for commands.
learner trajectories size: 0
Requesting to join a single player game
Configuring interface options
Configure: raw interface enabled
Configure: feature layer interface enabled
Configure: score interface enabled
Configure: render interface disabled
Entering load game phase.
Launching next game.
Next launch phase started: 2
Next launch phase started: 3
Next launch phase started: 4
Next launch phase started: 5
Next launch phase started: 6
Next launch phase started: 7
Next launch phase started: 8
learner trajectories size: 0
learner trajectories size: 0
learner trajectories size: 0
learner trajectories size: 0
learner trajectories size: 0
learner trajectories size: 0
Game has started.
Sending ResponseJoinGame
start_time before reset: 2021-12-07 10:26:46
total_episodes: 1
start_episode_time before is_final: 2021-12-07 10:26:46
/opt/conda/conda-bld/pytorch_1616554788289/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:197: sampleMultinomialOnce: block: [0,0,0], thread: [672,0,0] Assertion `val >= zero` failed.
/opt/conda/conda-bld/pytorch_1616554788289/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:197: sampleMultinomialOnce: block: [0,0,0], thread: [673,0,0] Assertion `val >= zero` failed.
/opt/conda/conda-bld/pytorch_1616554788289/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:197: sampleMultinomialOnce: block: [0,0,0], thread: [674,0,0] Assertion `val >= zero` failed.
/opt/conda/conda-bld/pytorch_1616554788289/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:197: sampleMultinomialOnce: block: [0,0,0], thread: [675,0,0] Assertion `val >= zero` failed.
/opt/conda/conda-bld/pytorch_1616554788289/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:197: sampleMultinomialOnce: block: [0,0,0], thread: [676,0,0] Assertion `val >= zero` failed.
/opt/conda/conda-bld/pytorch_1616554788289/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:197: sampleMultinomialOnce: block: [0,0,0], thread: [677,0,0] Assertion `val >= zero` failed.
/opt/conda/conda-bld/pytorch_1616554788289/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:197: sampleMultinomialOnce: block: [0,0,0], thread: [678,0,0] Assertion `val >= zero` failed.
/opt/conda/conda-bld/pytorch_1616554788289/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:197: sampleMultinomialOnce: block: [0,0,0], thread: [679,0,0] Assertion `val >= zero` failed.
/opt/conda/conda-bld/pytorch_1616554788289/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:197: sampleMultinomialOnce: block: [0,0,0], thread: [680,0,0] Assertion `val >= zero` failed.
/opt/conda/conda-bld/pytorch_1616554788289/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:197: sampleMultinomialOnce: block: [0,0,0], thread: [681,0,0] Assertion `val >= zero` failed.
/opt/conda/conda-bld/pytorch_1616554788289/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:197: sampleMultinomialOnce: block: [0,0,0], thread: [682,0,0] Assertion `val >= zero` failed.
/opt/conda/conda-bld/pytorch_1616554788289/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:197: sampleMultinomialOnce: block: [0,0,0], thread: [683,0,0] Assertion `val >= zero` failed.
/opt/conda/conda-bld/pytorch_1616554788289/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:197: sampleMultinomialOnce: block: [0,0,0], thread: [684,0,0] Assertion `val >= zero` failed.
/opt/conda/conda-bld/pytorch_1616554788289/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:197: sampleMultinomialOnce: block: [0,0,0], thread: [685,0,0] Assertion `val >= zero` failed.
/opt/conda/conda-bld/pytorch_1616554788289/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:197: sampleMultinomialOnce: block: [0,0,0], thread: [686,0,0] Assertion `val >= zero` failed.
/opt/conda/conda-bld/pytorch_1616554788289/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:197: sampleMultinomialOnce: block: [0,0,0], thread: [687,0,0] Assertion `val >= zero` failed.
/opt/conda/conda-bld/pytorch_1616554788289/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:197: sampleMultinomialOnce: block: [0,0,0], thread: [688,0,0] Assertion `val >= zero` failed.
/opt/conda/conda-bld/pytorch_1616554788289/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:197: sampleMultinomialOnce: block: [0,0,0], thread: [689,0,0] Assertion `val >= zero` failed.
/opt/conda/conda-bld/pytorch_1616554788289/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:197: sampleMultinomialOnce: block: [0,0,0], thread: [690,0,0] Assertion `val >= zero` failed.
/opt/conda/conda-bld/pytorch_1616554788289/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:197: sampleMultinomialOnce: block: [0,0,0], thread: [691,0,0] Assertion `val >= zero` failed.
/opt/conda/conda-bld/pytorch_1616554788289/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:197: sampleMultinomialOnce: block: [0,0,0], thread: [692,0,0] Assertion `val >= zero` failed.
/opt/conda/conda-bld/pytorch_1616554788289/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:197: sampleMultinomialOnce: block: [0,0,0], thread: [693,0,0] Assertion `val >= zero` failed.
/opt/conda/conda-bld/pytorch_1616554788289/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:197: sampleMultinomialOnce: block: [0,0,0], thread: [694,0,0] Assertion `val >= zero` failed.
/opt/conda/conda-bld/pytorch_1616554788289/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:197: sampleMultinomialOnce: block: [0,0,0], thread: [695,0,0] Assertion `val >= zero` failed.
/opt/conda/conda-bld/pytorch_1616554788289/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:197: sampleMultinomialOnce: block: [0,0,0], thread: [696,0,0] Assertion `val >= zero` failed.
/opt/conda/conda-bld/pytorch_1616554788289/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:197: sampleMultinomialOnce: block: [0,0,0], thread: [697,0,0] Assertion `val >= zero` failed.
/opt/conda/conda-bld/pytorch_1616554788289/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:197: sampleMultinomialOnce: block: [0,0,0], thread: [698,0,0] Assertion `val >= zero` failed.
/opt/conda/conda-bld/pytorch_1616554788289/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:197: sampleMultinomialOnce: block: [0,0,0], thread: [699,0,0] Assertion `val >= zero` failed.
/opt/conda/conda-bld/pytorch_1616554788289/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:197: sampleMultinomialOnce: block: [0,0,0], thread: [700,0,0] Assertion `val >= zero` failed.
/opt/conda/conda-bld/pytorch_1616554788289/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:197: sampleMultinomialOnce: block: [0,0,0], thread: [701,0,0] Assertion `val >= zero` failed.
/opt/conda/conda-bld/pytorch_1616554788289/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:197: sampleMultinomialOnce: block: [0,0,0], thread: [702,0,0] Assertion `val >= zero` failed.
/opt/conda/conda-bld/pytorch_1616554788289/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:197: sampleMultinomialOnce: block: [0,0,0], thread: [703,0,0] Assertion `val >= zero` failed.
unable to parse websocket frame.
RequestQuit command received.
Closing Application...
ActorLoop.run() Exception cause return, Detials of the Exception: CUDA error: device-side assert triggered
Traceback (most recent call last):
  File "/home/zhq/Doctor/RL_Project/mini-AlphaStar/alphastarmini/core/rl/rl_vs_computer_wo_replay.py", line 175, in run
    player_step = self.player.agent.step_logits(home_obs, player_memory)
  File "/home/zhq/Doctor/RL_Project/mini-AlphaStar/alphastarmini/core/rl/alphastar_agent.py", line 197, in step_logits
    action, action_logits, new_state, select_units_num = self.step_nn(observation=obs, last_state=last_state)
  File "/home/zhq/Doctor/RL_Project/mini-AlphaStar/alphastarmini/core/rl/alphastar_agent.py", line 164, in step_nn
    action_logits, action, hidden_state, select_units_num = self.agent_nn.action_logits_by_state(state, single_inference=True,
  File "/home/zhq/Doctor/RL_Project/mini-AlphaStar/alphastarmini/core/arch/agent.py", line 278, in action_logits_by_state
    action_logits, actions, new_state, select_units_num = self.model.forward(state, batch_size = batch_size,
  File "/home/zhq/Doctor/RL_Project/mini-AlphaStar/alphastarmini/core/arch/arch_model.py", line 128, in forward
    target_location_logits, target_location = self.location_head(autoregressive_embedding, action_type, map_skip)
  File "/home/zhq/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/zhq/Doctor/RL_Project/mini-AlphaStar/alphastarmini/core/arch/location_head.py", line 287, in forward
    location_id = torch.multinomial(target_location_probs, num_samples=1, replacement=True)
RuntimeError: CUDA error: device-side assert triggered

Actor stops!
Learner also stops!
run over
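
Because CUDA kernels run asynchronously, a device-side assert can be reported at a later Python line than the one that caused it. A common way to get a trustworthy traceback is to force synchronous kernel launches with the `CUDA_LAUNCH_BLOCKING` environment variable. A minimal sketch, assuming it is placed at the very top of run.py before torch is imported:

```python
import os

# Must be set before CUDA is initialised; doing it before the first
# `import torch` is the safe choice. Every kernel launch then blocks until
# it finishes, so the Python traceback points at the failing operation.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

The same effect can be had without editing code by running `CUDA_LAUNCH_BLOCKING=1 python run.py`. This slows execution, so it is meant for debugging only.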

(4) Running rl_train_with_replay.test for RL training produces the following error:


(base) zhq@Ubuntu20:~/Doctor/RL_Project/mini-AlphaStar$ python run.py 
pygame 2.0.1 (SDL 2.0.14, Python 3.8.5)
Hello from the pygame community. https://www.pygame.org/contribute.html
run init
No models are found!
No models are found!
learner trajectories size: 0
learner trajectories size: 0
start_time before training: 2021-12-07 11:09:13
map name: AbyssalReef
player.name: MainPlayer
opponent.name: MainPlayer
player.race: Race.protoss
opponent.race: Race.protoss
Version: B60604 (SC2.4.01)
Build: May  1 2018 19:24:12
Command Line: '"/home/zhq/StarCraftII/Versions/Base60321/SC2_x64" -listen 127.0.0.1 -port 24993 -dataDir /home/zhq/StarCraftII/ -tempDir /tmp/sc-2o9j_vnz/'
Starting up...
Startup Phase 1 complete
learner trajectories size: 0
Startup Phase 2 complete
Creating stub renderer...
Listening on: 127.0.0.1:24993
Startup Phase 3 complete. Ready for commands.
learner trajectories size: 0
Version: B60604 (SC2.4.01)
Build: May  1 2018 19:24:12
Command Line: '"/home/zhq/StarCraftII/Versions/Base60321/SC2_x64" -listen 127.0.0.1 -port 16658 -dataDir /home/zhq/StarCraftII/ -tempDir /tmp/sc-k86c8jn9/'
Starting up...
Startup Phase 1 complete
learner trajectories size: 0
Startup Phase 2 complete
Creating stub renderer...
Listening on: 127.0.0.1:16658
Startup Phase 3 complete. Ready for commands.
learner trajectories size: 0
Requesting to join a multiplayer game
Configuring interface options
Configure: raw interface enabled
Configure: feature layer interface enabled
Configure: score interface enabled
Configure: render interface disabled
Requesting to join a multiplayer game
Configuring interface options
Configure: raw interface enabled
Configure: feature layer interface enabled
Configure: score interface enabled
Configure: render interface disabled
Entering load game phase.
Create game with map path: AbyssalReefLE.SC2Map
Entering load game phase.
learner trajectories size: 0
Attempting to join a game with the map: Ladder2017Season4/AbyssalReefLE.SC2Map
learner trajectories size: 0
learner trajectories size: 0
learner trajectories size: 0
learner trajectories size: 0
learner trajectories size: 0
Game has started.
Sending ResponseJoinGame
Game has started.
Sending ResponseJoinGame
start_time before reset: 2021-12-07 11:09:24
total_episodes: 1
learner trajectories size: 0
Version: B60604 (SC2.4.01)
Build: May  1 2018 19:24:12
Command Line: '"/home/zhq/StarCraftII/Versions/Base60321/SC2_x64" -listen 127.0.0.1 -port 19791 -dataDir /home/zhq/StarCraftII/ -tempDir /tmp/sc-5ebl19qe/ -dataVersion 33D9FE28909573253B7FC352CE7AEA40 -eglpath libEGL.so'
Starting up...
Startup Phase 1 complete
learner trajectories size: 0
Startup Phase 2 complete
Attempting to initialize EGL from file libEGL.so ...
Successfully loaded EGL library!
Successfully initialized display on device idx: 0, EGL version: 1.5

Running CGLSimpleDevice::HALInit...
Calling glGetString: 0x7f7351fc85e0
Version: 4.6.0 NVIDIA 460.91.03
Vendor: NVIDIA Corporation
Renderer: GeForce GTX 1650/PCIe/SSE2
OpenGL initialized!
Disabling compressed textures
Listening on: 127.0.0.1:19791
Startup Phase 3 complete. Ready for commands.
learner trajectories size: 0
replay_path: /home/zhq/Doctor/RL_Project/mini-AlphaStar/data/filtered_replays_1/df4f689685d451cd2ae60f6e19af81dde8b812ddb42264cd1ae468e91796c677.SC2Replay
Configuring interface options
Configure: raw interface enabled
Configure: feature layer interface enabled
Configure: score interface enabled
Configure: render interface disabled
Launching next game.
Next launch phase started: 2
Next launch phase started: 3
Next launch phase started: 4
Next launch phase started: 5
Next launch phase started: 6
Next launch phase started: 7
Next launch phase started: 8
learner trajectories size: 0
learner trajectories size: 0
Starting replay 'TempStartReplay.SC2Replay'
learner trajectories size: 0
learner trajectories size: 0
Game has started.
learner trajectories size: 0
start_episode_time before is_final: 2021-12-07 11:09:31
unable to parse websocket frame.
RequestQuit command received.
Closing Application...
unable to parse websocket frame.
RequestQuit command received.
Closing Application...
RequestQuit command received.
Closing Application...
unable to parse websocket frame.
ActorLoop.run() Exception cause return, Detials of the Exception: probability tensor contains either `inf`, `nan` or element < 0
Traceback (most recent call last):
  File "/home/zhq/Doctor/RL_Project/mini-AlphaStar/alphastarmini/core/rl/actor_plus_z.py", line 242, in run
    player_step = self.player.agent.step_logits(home_obs, player_memory)
  File "/home/zhq/Doctor/RL_Project/mini-AlphaStar/alphastarmini/core/rl/alphastar_agent.py", line 197, in step_logits
    action, action_logits, new_state, select_units_num = self.step_nn(observation=obs, last_state=last_state)
  File "/home/zhq/Doctor/RL_Project/mini-AlphaStar/alphastarmini/core/rl/alphastar_agent.py", line 164, in step_nn
    action_logits, action, hidden_state, select_units_num = self.agent_nn.action_logits_by_state(state, single_inference=True,
  File "/home/zhq/Doctor/RL_Project/mini-AlphaStar/alphastarmini/core/arch/agent.py", line 278, in action_logits_by_state
    action_logits, actions, new_state, select_units_num = self.model.forward(state, batch_size = batch_size,
  File "/home/zhq/Doctor/RL_Project/mini-AlphaStar/alphastarmini/core/arch/arch_model.py", line 117, in forward
    action_type_logits, action_type, autoregressive_embedding = self.action_type_head(lstm_output, scalar_context)
  File "/home/zhq/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/zhq/Doctor/RL_Project/mini-AlphaStar/alphastarmini/core/arch/action_type_head.py", line 157, in forward
    action_type = torch.multinomial(action_type_probs.reshape(batch_size, -1), 1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Actor stops!
Learner also stops!
run over
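
The error message names three conditions: `inf`, `nan`, or a negative element in the probability tensor. To find out which one occurs and at which index, the probabilities can be inspected just before sampling. The helper below is a sketch in plain Python; `invalid_prob_entries` is a hypothetical name, and it takes an ordinary list of floats rather than a tensor.

```python
import math


def invalid_prob_entries(probs):
    """Return (index, value) pairs that are nan, inf, or negative."""
    return [
        (i, p)
        for i, p in enumerate(probs)
        if math.isnan(p) or math.isinf(p) or p < 0
    ]
```

It could be called, for example, as `invalid_prob_entries(action_type_probs.flatten().tolist())` right before the `torch.multinomial` line in action_type_head.py. If every entry is `nan`, the cause most likely lies upstream of the softmax (in the logits or the model weights) rather than in the sampling step itself.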
