polaris-sc2 / polaris-training

For code etc. relating to the network training process.
Apache License 2.0

Define Neural Network I/O for 4.7.1 replay data. #9

Open · Error323 opened 5 years ago

Error323 commented 5 years ago

This issue is focused on the 4.7.1 version of SC2 and its corresponding replay data. Our first objective is to train a CNN encoder -> LSTM -> CNN decoder network from that data which is able to defeat the default AI and play on all maps/races.

In order to achieve this we need to formalize the input and output of the network so that we can start implementing the various components (replay parser, training pipeline, agent).
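As a rough illustration only (the framework, layer sizes, and feature shapes below are placeholders, not the I/O definition this issue is meant to produce), the encoder -> LSTM -> decoder layout could look like this in PyTorch:

```python
import torch
import torch.nn as nn

class ReplayNet(nn.Module):
    """Placeholder CNN encoder -> LSTM -> CNN decoder over replay frames."""

    def __init__(self, in_channels=32, hidden=256, out_channels=8, map_size=64):
        super().__init__()
        # CNN encoder: spatial feature planes -> flat per-frame latent vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        enc_dim = 64 * (map_size // 4) ** 2
        # LSTM carries temporal context across frames.
        self.lstm = nn.LSTM(enc_dim, hidden, batch_first=True)
        # CNN decoder: latent -> spatial output planes (e.g. a screen-action heatmap).
        self.fc = nn.Linear(hidden, enc_dim)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, out_channels, 4, stride=2, padding=1),
        )
        self.map_size = map_size

    def forward(self, frames, state=None):
        # frames: (batch, time, channels, H, W) with H == W == map_size.
        b, t, c, h, w = frames.shape
        z = self.encoder(frames.reshape(b * t, c, h, w)).reshape(b, t, -1)
        z, state = self.lstm(z, state)
        z = self.fc(z).reshape(b * t, 64, self.map_size // 4, self.map_size // 4)
        out = self.decoder(z).reshape(b, t, -1, h, w)
        return out, state
```

Whether the decoder outputs action heatmaps, values, or something else entirely is exactly what the input/output definition in this issue should settle.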

Error323 commented 5 years ago

From https://openreview.net/pdf?id=HkxaFoC9KQ (p7, p16)

The input:

The output:

Please provide feedback!

Matuiss2 commented 5 years ago

I think effects should be in the inputs (e.g. storms, colossus and lurker attacks).

inoryy commented 5 years ago

I have a general idea of how to do it as an autoregressive policy, but I'm still quite fuzzy on the embedded approach.

Error323 commented 5 years ago

Thanks @inoryy, that's useful. I found this article on embeddings, https://towardsdatascience.com/neural-network-embeddings-explained-4d028e6f0526, that seems to explain it well. It's a way to reduce the dimensionality of categories by mapping them into a smaller continuous space. Reminds me of PCA.
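For concreteness: a learned embedding is just a trainable lookup table that maps each category id to a small dense vector, trained jointly with the rest of the network (unlike PCA, which is fit separately). A hypothetical PyTorch example with made-up sizes:

```python
import torch
import torch.nn as nn

# Trainable lookup table: each of 1000 category ids maps to a 16-dim vector.
embedding = nn.Embedding(num_embeddings=1000, embedding_dim=16)

ids = torch.tensor([3, 42, 7])   # three example category ids
vectors = embedding(ids)         # shape: (3, 16), continuous and trainable
```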

inoryy commented 5 years ago

There's a bit of a terminology clash here: the embedded policy vector is unrelated to embeddings. Understanding those is still useful, though, because they're used extensively to process the inputs (there are a bunch of categorical spatial features).
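To make that concrete, a categorical spatial feature plane (e.g. a per-pixel unit-type id on the screen/minimap) can be embedded per pixel into a few continuous channels before the CNN sees it. A minimal sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

num_categories, emb_dim = 256, 8
embed = nn.Embedding(num_categories, emb_dim)

# A 64x64 plane of category ids, as in the categorical spatial features.
plane = torch.randint(0, num_categories, (1, 64, 64))   # (batch, H, W)
channels = embed(plane).permute(0, 3, 1, 2)             # (batch, emb_dim, H, W)
# `channels` can now be concatenated with other planes and fed to a CNN.
```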

inoryy commented 5 years ago

Duplicating my thoughts from Discord.

The relevant part of the article is this: [screenshot, 2019-02-07]

Now that I think about it, maybe they do mean categorical embeddings. So the pipeline would be: sample an action id -> embed it from ~1700 levels (the number of unique action ids) down to 16 dims -> sample args from those 16 dims. But if that's the case, I've never seen it done this way before. We'd also have to be careful propagating gradients with this setup; we might need the Gumbel-Softmax trick since action id sampling is part of the computation.
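A minimal sketch of that pipeline, assuming PyTorch and made-up sizes and head names (`arg_head` etc. are hypothetical); the straight-through Gumbel-Softmax is shown as one option for letting gradients flow through the sampled action id:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_ACTIONS, EMB_DIM, STATE_DIM = 1700, 16, 256

action_head = nn.Linear(STATE_DIM, NUM_ACTIONS)     # action id logits
action_embed = nn.Embedding(NUM_ACTIONS, EMB_DIM)   # ~1700 ids -> 16 dims
arg_head = nn.Linear(STATE_DIM + EMB_DIM, 10)       # hypothetical argument head

state = torch.randn(1, STATE_DIM)                   # e.g. LSTM output for one step
logits = action_head(state)

# Straight-through Gumbel-Softmax: hard one-hot on the forward pass,
# soft (differentiable) probabilities on the backward pass.
one_hot = F.gumbel_softmax(logits, tau=1.0, hard=True)   # (1, NUM_ACTIONS)
action_emb = one_hot @ action_embed.weight               # (1, EMB_DIM), differentiable

# Argument sampling is conditioned on the sampled action id's embedding.
arg_logits = arg_head(torch.cat([state, action_emb], dim=-1))
action_id = one_hot.argmax(dim=-1)                       # the sampled action id
```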