henryhungle / NADST

Code for the paper Non-Autoregressive Dialog State Tracking (ICLR20)
MIT License

Some confusion about neural network architecture in NADST #8

Closed YourThomasLee closed 3 years ago

YourThomasLee commented 3 years ago
  1. I have noticed that the fertility decoder takes cross-entropy loss as its optimization objective, while KL-divergence loss is used for the state decoder. Is there any particular consideration behind this choice of losses?
henryhungle commented 3 years ago

Hi @YourThomasLee, sorry for the late response. I chose cross-entropy as the fertility optimization objective because I noticed that training was very similar to training with a KL-divergence loss. This may be due to the small vocabulary of the fertility decoder (e.g. a limited range of numerical values), so a KL-divergence loss does not do much to improve training there. In the state decoder, the vocabulary is much larger (e.g. word tokens), and using KL-divergence loss with label smoothing can reduce overfitting to some extent.
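For readers comparing the two objectives, below is a minimal sketch, not the repository's exact code, of a label-smoothed KL-divergence loss in the style popularized by the Annotated Transformer; the class name and default values are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelSmoothingKL(nn.Module):
    """KL divergence against a label-smoothed target distribution."""
    def __init__(self, vocab_size, padding_idx=0, smoothing=0.1):
        super().__init__()
        self.vocab_size = vocab_size
        self.padding_idx = padding_idx
        self.smoothing = smoothing

    def forward(self, log_probs, target):
        # log_probs: (N, vocab_size) log-softmax outputs; target: (N,) gold ids.
        # Spread `smoothing` mass over all non-gold, non-padding entries.
        true_dist = torch.full_like(
            log_probs, self.smoothing / (self.vocab_size - 2))
        true_dist.scatter_(1, target.unsqueeze(1), 1.0 - self.smoothing)
        true_dist[:, self.padding_idx] = 0            # never predict padding
        true_dist[target == self.padding_idx] = 0     # ignore padded positions
        return F.kl_div(log_probs, true_dist, reduction='sum')
```

With `smoothing=0` the target distribution is one-hot and the summed KL reduces to cross-entropy, which is consistent with the observation above that the two losses behave similarly on the small fertility vocabulary.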

YourThomasLee commented 3 years ago

Thanks a lot!

  1. Actually, I have tested several combinations of $\alpha, \beta$ for the fertility prediction loss and the state value prediction loss, and it seems there is still room for improvement. So I want to make sure: is the combination $\alpha=1, \beta=1$ the one that gave your best training result? (I am afraid that everything I did merely repeats what you have already done, and that my result is not reliable.)
  2. When I try to reproduce the reported results, I find the random seed is essential. More often than not, I get a trained model with joint accuracy of only 45–47%. If possible, could you tell me how you obtained your best model? Did you fix the random seed in the program?
  3. I find that NADST has no dedicated module for representation learning of the context/dialogue history. Is there any consideration behind this design choice?
  4. In the Transformer, multi-head attention is usually followed by a position-wise feed-forward layer, but in NADST the model is designed as follows (writing $Z_{ds}$ for the domain-slot pair representations, $Z_{del}$ for the delexicalized context, and $Z_{ctx}$ for the raw context): $$Z_{ds} = \mathrm{Attention}(Z_{ds}, Z_{ds}, Z_{ds}) \rightarrow Z_{ds} = \mathrm{Attention}(Z_{ds}, Z_{del}, Z_{del}) \rightarrow Z_{ds} = \mathrm{Attention}(Z_{ds}, Z_{ctx}, Z_{ctx}) \rightarrow Z_{ds} = \mathrm{FFN}(Z_{ds})$$ That is, there is no position-wise feed-forward layer between the two attention layers. I would appreciate it if you could explain this or point me to material/papers on the design rationale.

Thanks very much for your replies!

henryhungle commented 3 years ago

Hi @YourThomasLee,

> 1. Actually, I have tested several combinations of $\alpha, \beta$ for the fertility prediction loss and the state value prediction loss, and it seems there is still room for improvement. So I want to make sure: is the combination $\alpha=1, \beta=1$ the one that gave your best training result?

The tuning of $\alpha$ and $\beta$ during our project might not be optimal, and it is possible that you can find a better combination. I am just curious: could you share which combination worked best for you?
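For concreteness, the two objectives enter training as a weighted sum; here is a minimal runnable sketch with stand-in loss values (the variable names and numbers are illustrative, not taken from the repository):

```python
import torch

# Stand-ins for the two loss terms; in real training these come from the
# fertility decoder (cross-entropy) and the state decoder (label-smoothed KL).
fertility_loss = torch.tensor(0.7, requires_grad=True)
state_loss = torch.tensor(1.3, requires_grad=True)

alpha, beta = 1.0, 1.0  # the default combination discussed above
loss = alpha * fertility_loss + beta * state_loss
loss.backward()
```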

> 2. When I try to reproduce the reported results, I find the random seed is essential. More often than not, I get a trained model with joint accuracy of only 45–47%. If possible, could you tell me how you obtained your best model? Did you fix the random seed in the program?

The random seed was actually not fixed. It is possible that model performance fluctuates due to different initializations. One possible approach you can explore is to use pretrained word embeddings, so that the models are not trained completely from scratch.
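As an illustration of that suggestion, here is a hedged sketch of initializing an embedding layer from pretrained vectors; the `vocab` and `pretrained` dictionary formats are assumptions for the example, not part of NADST:

```python
import torch
import torch.nn as nn

def build_embedding(vocab, pretrained, d_model=256):
    """vocab: token -> index; pretrained: token -> 1-D tensor (assumed formats)."""
    # Random initialization for tokens without a pretrained vector.
    weight = torch.randn(len(vocab), d_model) * 0.1
    for token, idx in vocab.items():
        if token in pretrained:
            weight[idx] = pretrained[token]
    emb = nn.Embedding(len(vocab), d_model)
    emb.weight.data.copy_(weight)
    return emb
```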

> 3. I find that NADST has no dedicated module for representation learning of the context/dialogue history. Is there any consideration behind this design choice?

The current model encodes the context/dialogue history only through simple word embeddings with positional encoding, since our main goal is to reduce system latency. The design is open to other, more sophisticated encoding techniques.
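A minimal sketch of such an encoder, using the standard sinusoidal positional encoding of Vaswani et al. (2017); the dimensions and vocabulary size here are illustrative:

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Add fixed sinusoidal position signals to token embeddings."""
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2).float()
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer('pe', pe.unsqueeze(0))   # (1, max_len, d_model)

    def forward(self, x):                             # x: (batch, len, d_model)
        return x + self.pe[:, :x.size(1)]

# Context encoder = embedding lookup + positional encoding, nothing more.
d_model = 256
embed = nn.Embedding(10000, d_model)
encode = PositionalEncoding(d_model)
tokens = torch.randint(0, 10000, (2, 50))             # (batch, seq_len)
context = encode(embed(tokens) * math.sqrt(d_model))
```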

> 4. In the Transformer, multi-head attention is usually followed by a position-wise feed-forward layer, but in NADST [...] there is no position-wise feed-forward layer between the two attention layers. I would appreciate it if you could explain this or point me to material/papers on the design rationale.

Actually, there should be a position-wise feed-forward layer between the attention layers in the current implementation; it is defined in the following code: https://github.com/henryhungle/NADST/blob/afdc1d1f7ecb855b03933e441c0b2fcefbc28feb/model/modules.py#L43-L65 This function is called after each attention step here: https://github.com/henryhungle/NADST/blob/afdc1d1f7ecb855b03933e441c0b2fcefbc28feb/model/modules.py#L84-L106
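To make the pattern concrete, here is a simplified sketch of one such "attention step" (multi-head attention plus a position-wise feed-forward layer, each wrapped with a residual connection and layer normalization); it paraphrases the linked modules.py code rather than reproducing it:

```python
import torch
import torch.nn as nn

class AttentionStep(nn.Module):
    """Multi-head attention followed by a position-wise feed-forward layer."""
    def __init__(self, d_model=256, nhead=8, d_ff=1024, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead,
                                          dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, query, memory):
        # Attention sublayer with residual connection.
        out, _ = self.attn(query, memory, memory)
        x = self.norm1(query + out)
        # Position-wise feed-forward sublayer with residual connection,
        # applied after *every* attention, not only at the end of the stack.
        return self.norm2(x + self.ff(x))

# The decoder chains three such steps per layer:
step1, step2, step3 = AttentionStep(), AttentionStep(), AttentionStep()
z_ds = torch.randn(2, 30, 256)       # domain-slot pair representations
z_del = torch.randn(2, 100, 256)     # delexicalized dialogue history
z_ctx = torch.randn(2, 100, 256)     # raw dialogue history
z_ds = step1(z_ds, z_ds)             # self-attention
z_ds = step2(z_ds, z_del)            # attend to delexicalized context
z_ds = step3(z_ds, z_ctx)            # attend to raw context
```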

YourThomasLee commented 3 years ago
  1. Thanks for your detailed replies! Actually, I tried to fix the random seed with the following code,
    import os
    import random

    import numpy as np
    import torch

    def seed_torch(seed=1029):
        """Fix every library's RNG seed for (best-effort) reproducibility."""
        random.seed(seed)
        os.environ['PYTHONHASHSEED'] = str(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # if you are using multi-GPU
        torch.backends.cudnn.benchmark = False
        torch.backends.cudnn.deterministic = True

    but, disappointingly, even with the random seed fixed, model training is still not reproducible (see also the determinism note at the end of this thread). So I cannot tell whether one combination of $\alpha, \beta$ is better than another. I would appreciate any advice you can provide.

  2. Thanks for your explanation! As I understand this part, a sublayer consists of four steps: I. self-attention over the domain-slot pairs; II. attention from the domain-slot pairs to the dialogue history; III. attention from the domain-slot pairs to the delexicalized history; IV. a position-wise feed-forward layer. Indeed, what I want to know is the rationale for having no position-wise feed-forward layer between I and II, or between II and III.
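A closing note on the reproducibility issue in point 1: fixed seeds alone do not guarantee bitwise-identical GPU training, because some CUDA kernels (for example, atomic-add based scatter and index operations) are nondeterministic. On recent PyTorch versions (>= 1.8) one can additionally request deterministic algorithms at some speed cost; this is a general PyTorch facility, not something in the NADST code:

```python
import os
import torch

# Required by cuBLAS for deterministic behavior on CUDA >= 10.2.
os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'
# Ask PyTorch to use deterministic kernels and raise an error on
# operations that have no deterministic implementation.
torch.use_deterministic_algorithms(True)
```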