titu1994 / neural-architecture-search

Basic implementation of [Neural Architecture Search with Reinforcement Learning](https://arxiv.org/abs/1611.01578).
MIT License

Implementation details #13

Open brendenpetersen opened 5 years ago

brendenpetersen commented 5 years ago

Hi, thank you for releasing this implementation! I have several questions regarding how this works relative to my understanding of the original paper:

  1. Why is the action always the mode (i.e. the argmax of the logits) of the policy each iteration, rather than a sample from the policy? The idea behind policy gradients is that you sample an action from a stochastic policy. Is this because the batch size is 1, so rather than sampling you just take the mode?

  2. You also take a random/exploratory action with probability epsilon. However, the log-probability (and thus the loss) of exploratory actions seems to be computed incorrectly. Your loss uses self.policy_classifiers to compute the log-probability; however, these tensors are conditioned on the argmax of the policy, not on the action that was actually taken. The loss should always use the log-probability of the action taken (in this case the exploratory one), not of the argmax of the policy (see the sketch after this list).

  3. Each time you generate a list of actions (i.e. a single architecture), you start by feeding in the state from the previous architecture (L78 of train.py). Is this correct, and if so, what is the reason for it? Each iteration the controller has new parameters and will thus emit different architectures, so feeding the controller the previous architecture's state does not seem necessary.

  4. I don't understand the discounting, or rather, why discounting is performed at all. A single architecture gives you a single reward, in what is essentially a one-step episode.
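To make (2) concrete, here is a minimal sketch of what I mean (plain NumPy rather than your TF graph; `reinforce_loss`, `taken_action`, and `reward` are names I made up for illustration):

```python
import numpy as np

def reinforce_loss(logits, taken_action, reward):
    """REINFORCE loss for one controller decision.

    logits: 1-D array of unnormalised scores over the choices for this step.
    taken_action: index of the action actually executed (greedy OR exploratory).
    reward: scalar reward, e.g. baselined validation accuracy.
    """
    z = logits - np.max(logits)                   # numerically stable log-softmax
    log_probs = z - np.log(np.sum(np.exp(z)))
    # The gradient must flow through log pi(a_taken), not log pi(argmax(logits)).
    return -log_probs[taken_action] * reward
```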

Based on 3 and 4, it seems like you view the entire process as a single RL "episode," as opposed to each architecture comprising its own episode. Is this correct?

I understand the original paper is very unclear on a lot of these points, so I appreciate any feedback or discussion!

titu1994 commented 5 years ago

I'll preface this with two things: this is my first foray into RL, and the paper was woefully inadequate on implementation details.

1) My understanding from the paper is that during evaluation they use argmax to select the actual action (in this case cell type, filter count, etc.). I considered using tf.multinomial to sample during training and then testing with argmax, but I eventually gave up on the idea as it would add significant complexity. In hindsight it makes sense to do this, but I got somewhat stable results without it.

2) I didn't know this was an issue. In hindsight, this is what I do (I think) in my Progressive NAS implementation, so it should be done here as well. I would appreciate a PR.

3) This is from the paper: the state is chained in for each classifier to make it semantically dependent on the previous prediction (at least that's the idea).

4) Discounting was just something I found to stabilise training a lot. If I remember the theory correctly, the mathematical purpose of discounting is to make the infinite sum of rewards converge. In this implementation, I simply found that it helped training significantly; NAS was highly unstable without it. So that's my empirical explanation. I would appreciate it if someone more knowledgeable could explain in detail why and how it helps here.
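For reference, the discounting I'm referring to is just the standard discounted-return computation, roughly like this (a standalone illustration, not code lifted from train.py):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# With a single terminal reward (e.g. validation accuracy 0.92 after 3 decisions),
# discounted_returns([0.0, 0.0, 0.92]) spreads a damped copy of that reward back
# over the earlier controller steps.
```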

Yes, I modeled the entire thing as episodic-memory-style training, because that's what I was familiar with (my knowledge is mainly from DeepMind's papers on AlphaGo, Atari DQNs, etc., so I transferred what I knew to this).

Of course, the original authors might have used something more advanced (I am pretty sure they did), and therefore their implementation would be vastly superior to mine.

brendenpetersen commented 5 years ago

Thanks for the quick response! And again, I understand about the serious lack of detail in the paper! My background is more on the RL side, but I'm new to optimizing NN architectures. I agree with your points on (1) and (2).

Regarding (1), yes, in RL you typically take the mode during evaluation but sample during training. The policy gradient theorem itself depends on actions being sampled from a stochastic policy. However, for batch size = 1, I can see how taking the mode might stabilize training.
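In other words, something like the following (a toy NumPy sketch of the idea, not your actual TF code; `choose_action` is a hypothetical helper):

```python
import numpy as np

def choose_action(logits, training):
    """Sample from the softmax during training; take the mode at evaluation."""
    z = logits - np.max(logits)
    probs = np.exp(z) / np.sum(np.exp(z))
    if training:
        return int(np.random.choice(len(probs), p=probs))  # stochastic policy
    return int(np.argmax(probs))                           # deterministic mode
```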

Regarding (3), the state is chained when selecting the different architecture components. If you view one architecture as a single RL episode, there are multiple actions in that episode which correspond to the multiple building blocks of the architecture. Within an episode, the policy depends on the previous actions. However, in your implementation, the start of one architecture is also chained with the end of the previous architecture. I don't think the paper suggests this part. I also don't see the intuition. The RNN controller defines a distribution over architectures. You should be able to take i.i.d. samples from this distribution.

The authors' ENAS implementation (which has more to it, but the basics are the same as NAS) uses the variable self.g_emb (there are no comments; I assume "emb" stands for embedding, but I have no idea what "g" refers to) as the input each time a new architecture is sampled. self.g_emb is actually a trainable parameter; however, from what I can see, the final state of the previous architecture is never fed into the first state of the subsequent architecture. In other words, there are within-architecture dependencies but no between-architecture dependencies.
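To illustrate the structure I mean, here is a self-contained toy (NumPy; not the actual ENAS code, and every name below is a placeholder): the hidden state is chained within one architecture, but each new architecture starts from the same trainable start embedding and a zero state, so samples are i.i.d. given the controller parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, NUM_CHOICES, NUM_BLOCKS = 16, 4, 6

# stand-ins for the controller's trainable parameters
W_emb = rng.normal(scale=0.1, size=(NUM_CHOICES, HIDDEN))  # embedding of the last action
W_h = rng.normal(scale=0.1, size=(HIDDEN, HIDDEN))
W_out = rng.normal(scale=0.1, size=(HIDDEN, NUM_CHOICES))
start_emb = rng.normal(scale=0.1, size=HIDDEN)             # analogue of self.g_emb

def sample_architecture():
    h = np.zeros(HIDDEN)     # fresh zero state for every architecture
    x = start_emb            # same learned start input every time
    actions = []
    for _ in range(NUM_BLOCKS):
        h = np.tanh(x + h @ W_h)                    # toy RNN step
        logits = h @ W_out
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        a = int(rng.choice(NUM_CHOICES, p=probs))   # depends only on this architecture
        actions.append(a)
        x = W_emb[a]         # next step is conditioned on the chosen action
    return actions

print(sample_architecture())  # independent of any previously sampled architecture
```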

Again, it seems like there could be a difference in viewing the entire procedure as one long episode (with each action corresponding to an architecture), versus each architecture comprising an episode (with each action corresponding to a building block of an architecture). This also explains why (4) didn't make sense to me. I don't see any discounting in the ENAS code but I haven't looked at everything.

titu1994 commented 5 years ago

Hmm, the comments on (3) and (4) are quite interesting. In this code I don't sample the first action, so if each architecture started from a zero state every time, I would get the same architecture on every "sample".

It makes sense to sample the first cell and then chain the remainder, as ENAS does.

This stems from my decision not to use sampling during training, in order to reduce complexity. Because of that, at test time I was forced to chain multiple architectures together; otherwise, starting from a zero state vector each time would lead to the same architecture being sampled repeatedly.

I've kind of moved on to Progressive NAS, since its SMBO training methodology is something I understand a bit better.