jzilly / RecurrentHighwayNetworks

Recurrent Highway Networks - Implementations for Tensorflow, Torch7, Theano and Brainstorm
MIT License

Variational RHN + WT (depth=10) with 517 units per layer is enough vs original 830 #17

Open · wenwei202 opened this issue 6 years ago

wenwei202 commented 6 years ago

The homogeneity of RHNs makes it easy to learn sparse structures within them. In our recent work on ISS (https://arxiv.org/pdf/1709.05027.pdf), we find that we can reduce the "#Units/Layer" of "Variational RHN + WT" in your Table 1 from 830 to 517 without losing perplexity. This shrinks the model from 23.5M to 11.1M parameters, much smaller than the model found by "Neural Architecture Search". In case it is of interest, the results are covered in Table 2 of our paper.
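[Editor's note: the 23.5M and 11.1M figures are easy to sanity-check with a back-of-the-envelope parameter count. The sketch below assumes the PTB setup from the RHN paper: a 10K-word vocabulary, tied input/output embeddings (the WT), a coupled carry gate (so each micro-layer has only H and T transforms), and input projections at the first micro-layer only. The function name is illustrative, not code from this repo or the paper.]

```python
def rhn_param_count(hidden, depth=10, vocab=10000):
    """Rough parameter count for a Variational RHN + WT on PTB.

    Assumes tied input/output embeddings, a coupled carry gate
    (only H and T transforms per micro-layer), and input
    projections W_H, W_T at the first micro-layer only.
    Back-of-the-envelope, not the repo's exact bookkeeping.
    """
    embedding = vocab * hidden          # shared with the softmax layer (WT)
    output_bias = vocab
    recurrent = depth * 2 * hidden**2   # R_H, R_T at each micro-layer
    recurrent_bias = depth * 2 * hidden
    input_proj = 2 * hidden**2          # W_H, W_T at the first micro-layer
    return embedding + output_bias + recurrent + recurrent_bias + input_proj

print(rhn_param_count(830))  # 23,482,400 -> ~23.5M
print(rhn_param_count(517))  # 11,070,698 -> ~11.1M
```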

Let us know if this is interesting to you.
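[Editor's note: for readers skimming the thread, here is a minimal NumPy sketch of the group-sparsity idea that ISS builds on: a group Lasso penalty makes all weights belonging to one hidden unit shrink to zero together, after which whole units can be pruned. This is a simplification; in the paper, an ISS group ties together every weight that touches one hidden dimension across all gates and recurrence depths. All names here are illustrative.]

```python
import numpy as np

def group_lasso_penalty(W):
    """Group Lasso over the columns of W: sum of per-column L2 norms.

    If column j holds all weights feeding hidden unit j, adding
    lam * group_lasso_penalty(W) to the training loss pushes whole
    units toward zero rather than individual weights.
    """
    return np.sqrt((W ** 2).sum(axis=0)).sum()

def prune_units(W, tol=1e-4):
    """Drop columns (units) whose group norm fell below tol."""
    norms = np.sqrt((W ** 2).sum(axis=0))
    keep = norms > tol
    return W[:, keep], int(keep.sum())

# Toy usage: a matrix where the penalty has squeezed two units to ~zero.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 6))
W[:, [1, 4]] = 1e-6
W_pruned, n_units = prune_units(W)
print(W_pruned.shape, n_units)  # (8, 4) 4
```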

jzilly commented 6 years ago

Dear wenwei202,

Does this also imply that, instead of reducing model size, it would be possible to improve performance?

Thank you for making us aware of this. I will have a look at the paper.

wenwei202 commented 6 years ago

@julian121266 That is a good point. Our current finding is that ISS can slightly improve performance, reaching 67.5/65.0 perplexity with a smaller size of 726 units, as shown in Table 2. Let me check whether we can improve further by starting from a larger model and compressing it. By the way, did you try model sizes beyond 830 units for your depth-10 RHNs? If that didn't improve performance, was it because larger models are more difficult to optimize?

jzilly commented 6 years ago

@wenwei202 We had similar findings. Optimization was fine; the model simply did not generalize much better. In fact, depth 8 ended up working slightly better than depth 10. Most likely the relationship is submodular, with diminishing returns for increased depth. A new iteration on the RHN idea was published half a year later: https://arxiv.org/abs/1705.08639