tech-srl / slm-code-generation

TensorFlow code for the neural network presented in the paper: "Structural Language Models of Code" (ICML'2020)
https://AnyCodeGen.org
MIT License

Missing EOS? #12

Open Zadagu opened 3 years ago

Zadagu commented 3 years ago

Hi, I'm looking through your training data (the JSON representation) and found an instance where tokens are not followed by an EOS node.

  "targets": [
    "Nm",
    "idle,sources",
    "Cal",
    "Nm",
    "get",
    "EOS",
    "Nm",
    "i",
    "EOS",
    "EOS"
  ],
  "target_seq": "idleSources.get(i)",

Could you please elaborate on why there is no EOS after "idle,sources" in this case?

urialon commented 3 years ago

Hi @Zadagu , Thank you for your interest in SLM.

I understand the confusion - the order of the targets is not their sequential (left-to-right) order. Specifically, it seems that idleSources can be copied as a full token from the context, so the model is trained to predict it by copying the entire token from another head_path.

Because it is copied as a full token (marked by the comma in idle,sources), there is no need to predict an EOS after it.

Does that make sense? I hope it helps, Uri

Zadagu commented 3 years ago

Hi @urialon, thank you for your explanation. Might I ask one more question? I read your paper and I'm still not sure how s("idle,sources") is composed. Is it:

s("idle,sources") = \sum_{val(l)="idle,sources"} s_copy_token(l) + \sum_{val(l0)="idle"} s_copy_0(l) + \sum_{val(l1)="sources"} s_copy_1(l)

or is it just:

s("idle,sources") = \sum_{val(l)="idle,sources"} s_copy_token(l)
urialon commented 3 years ago

Hi @Zadagu , Sure, you can ask as many questions as you like :-)

Let me separate my answer into two parts:

Training

If idle,sources appears in the context, that is, if it can be copied as a full token, its score is computed as in your second option (s_{copy\_token}(l)): the model is trained to copy the full token, i.e., to predict a single "target" without an EOS. (There is a rare case where "idle,sources" is so frequent that it appears in the vocabulary; in that case, it also gets its vocabulary score.)

If idle,sources does not appear in the context as a full token, then the model is trained to predict it as a sequence of multiple subtokens (there are 2 "targets", and an additional EOS). Each subtoken gets its score from the copy-subtoken pointer and its (possible) vocabulary score.
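
To make the training rule concrete, here is a minimal sketch (not the repo's actual code; the helper name and its inputs are made up) of how the targets for one terminal would be chosen under this scheme:

```python
def build_targets(token_subtokens, context_full_tokens):
    """token_subtokens: e.g. ['idle', 'sources'] for the token idleSources."""
    full_token = ",".join(token_subtokens)          # 'idle,sources'
    if full_token in context_full_tokens:
        # Copyable as a full token: train on a single target, no EOS needed.
        return [full_token]
    # Otherwise: train to predict subtoken by subtoken, terminated by EOS.
    return token_subtokens + ["EOS"]

print(build_targets(["idle", "sources"], {"idle,sources", "i"}))  # ['idle,sources']
print(build_targets(["file", "status"], {"idle,sources"}))        # ['file', 'status', 'EOS']
```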

Beam Search

For simplicity, let's assume that we're doing beam search with a beam size of 1, but the same idea holds for a beam of any size.

During beam search, the model computes a score for each of "idle", "sources", and "idle,sources". Each of them gets a score from two sources: a vocabulary score plus a copy score. "idle" and "sources" get their copy score from the sub-token copy pointer, while "idle,sources" gets its copy score from the full-token copy pointer.

Then, we just take the argmax (or top-k if the beam is greater than 1). If the argmax is "idle,sources", beam search can consider this token as "done", without needing to predict EOS. If the argmax is "idle" or "sources", the model continues predicting subtokens until reaching an EOS.
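
The decision at a single decoding step could be sketched like this (a toy illustration with made-up score dictionaries, not the actual implementation): every candidate combines its vocabulary score with the appropriate copy score, and a full-token winner finishes the token without an EOS.

```python
def greedy_step(vocab_scores, subtoken_copy_scores, fulltoken_copy_scores):
    candidates = {}
    for w, s in subtoken_copy_scores.items():                # sub-token pointer
        candidates[w] = vocab_scores.get(w, 0.0) + s
    for w, s in fulltoken_copy_scores.items():               # full-token pointer
        candidates[w] = candidates.get(w, vocab_scores.get(w, 0.0)) + s
    best = max(candidates, key=candidates.get)
    token_done = ("," in best) or best == "EOS"   # full token or EOS ends the token
    return best, token_done

best, done = greedy_step(
    vocab_scores={"idle": 0.2, "sources": 0.1},
    subtoken_copy_scores={"idle": 0.3, "sources": 0.2, "EOS": 0.0},
    fulltoken_copy_scores={"idle,sources": 0.9},
)
print(best, done)  # idle,sources True
```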

Does that make sense? Feel free to ask any question. Uri

Zadagu commented 3 years ago

Thank you Uri. This makes sense.

Zadagu commented 3 years ago

Hi @urialon,
Thank you for giving me the opportunity to ask questions. :)

I am wondering what the dimensions of \tilde{h} are. I tried to infer them from the expression E^{type} * \tilde{h}, but that conflicts with s_{gen}(w) = E^{subtoken} * \tilde{h}, because E^{subtoken} has a different width than E^{type}.

urialon commented 3 years ago

Hi @Zadagu , I agree that this is not clear from the paper. The dimensions of \tilde{h}, E^{type} and E^{subtoken} are all equal (they must be, as you noticed) and in our experiments, they all were equal to 512.
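
As a quick sanity check on the shapes (the vocabulary sizes below are made up; only the shared width of 512 matters):

```python
import numpy as np

d = 512
h_tilde = np.zeros(d)                 # \tilde{h}
E_type = np.zeros((100, d))           # node-type embedding matrix (100 types, made up)
E_subtoken = np.zeros((10000, d))     # subtoken embedding matrix (10k entries, made up)

print((E_type @ h_tilde).shape)       # (100,)   -- type scores
print((E_subtoken @ h_tilde).shape)   # (10000,) -- s_gen(w); both products need width d
```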

There is a minor inaccuracy in the paper, in the function e(n_i) (page 4, bottom left):

if n_i is a subtoken, its embedding is also concatenated with the subtoken's child id, where subtokens are numbered according to their order. For example, fileStatus consists of two subtokens, ['file', 'status'], with child ids [1, 2].
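
For illustration, a sketch of what the corrected e(n_i) could look like for subtoken nodes (TensorFlow-style; the dimensions, the choice of a learned child-id embedding table, and the names are my assumptions, not taken from the repo):

```python
import tensorflow as tf

NUM_SUBTOKENS, MAX_CHILD_ID = 10000, 16   # made-up vocabulary sizes
SUBTOKEN_DIM, CHILD_DIM = 512, 32         # made-up embedding widths

subtoken_emb = tf.keras.layers.Embedding(NUM_SUBTOKENS, SUBTOKEN_DIM)
child_id_emb = tf.keras.layers.Embedding(MAX_CHILD_ID, CHILD_DIM)

def embed_subtoken_node(subtoken_id, child_id):
    # e(n_i) for a subtoken node: the subtoken embedding concatenated with an
    # embedding of its child id (e.g. 'file' -> 1, 'status' -> 2 in fileStatus).
    return tf.concat([subtoken_emb(subtoken_id), child_id_emb(child_id)], axis=-1)

vec = embed_subtoken_node(tf.constant([3]), tf.constant([1]))
print(vec.shape)  # (1, 544)
```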

I hope it helps :-) Uri

Zadagu commented 3 years ago

Hi @urialon,

thank you for the clarification. In the meantime I tried to fix my reimplementation according to your comments, but it ended up with far fewer parameters (9.1M) than yours (15M).

Now I wonder what I might have forgotten. My model consists of:

For most of the weight matrices I'm already using an FC layer with bias. Could you please tell me where I'm wrong?

Are you using a positional embedding like in "Attention Is All You Need" for the child ids, or is this just another randomly initialized matrix?

Kind regards, Zadagu

urialon commented 3 years ago

Hi @Zadagu , Sorry for the late reply.

I'm not sure why I did not mention this in the paper, but I think the source of the discrepancy is as follows:

After the LSTM layers, I projected the states using an FC layer (+ bias) to a dimension of 512. The projection matrix has size 256x512, and there is one projection matrix for the "standard" paths and a separate projection matrix for the "root path".

I think I did that to help the model distinguish between the "standard paths" (the blue paths in Figure 2) and the "root path" (the orange path in Figure 2). I probably just meant to distinguish between them and did not notice that I was also increasing their dimensions.

So this adds some weights and increases the size of other tensors. For example, Ci must be 9x512x512, which is an increase of 1.5M parameters.
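
A sketch of that projection step (the 256-dimensional LSTM outputs and the 512 target size come from this comment; batch sizes, path counts, and layer names are made up):

```python
import tensorflow as tf

LSTM_DIM, MODEL_DIM = 256, 512

# Two separate FC (+bias) projections: one for the "standard" paths and a
# different one for the "root path", so the two kinds of paths stay distinguishable.
project_standard = tf.keras.layers.Dense(MODEL_DIM, use_bias=True)
project_root = tf.keras.layers.Dense(MODEL_DIM, use_bias=True)

standard_paths = tf.zeros([8, 20, LSTM_DIM])   # (batch, num_paths, lstm_dim), dummy
root_path = tf.zeros([8, LSTM_DIM])            # (batch, lstm_dim), dummy

print(project_standard(standard_paths).shape)  # (8, 20, 512)
print(project_root(root_path).shape)           # (8, 512)
# Each projection adds 256*512 + 512 = 131,584 parameters.
```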

Sorry for this inconsistency, let me know how many parameters you are getting now.

Zadagu commented 3 years ago

Hi @urialon,

I'm very grateful for your reply. Thank you very much for looking into this.

After adding the projection layer, I got a total of 21.5M parameters. Are you using the projected path embeddings everywhere (e.g., in the copy mechanism) or just in the transformer? That could explain the difference, because I use the larger size everywhere.

Kind regards, Zadagu

urialon commented 3 years ago

Hi Zadagu, Thank you for reporting this.

Yes, I am using the projected paths everywhere, including the copy mechanism, not only in the transformer.

But I found another possible difference: due to GPU memory constraints, in the final model I reduced the size of the feedforward layer of the transformer. Usually the feedforward layer of a transformer has a hidden size of d x 4, so in your implementation it is probably 512 x 4 = 2048. But since that kept exhausting GPU memory, I had to reduce it to d x 2, so the hidden size of the feedforward layer is only 1024.
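
A sketch of that reduced feedforward block (d = 512, hidden size 2d = 1024 instead of the usual 4d = 2048; the ReLU activation and Keras layers are my assumptions):

```python
import tensorflow as tf

D_MODEL = 512
D_FF = 2 * D_MODEL   # 1024, instead of the usual 4 * D_MODEL = 2048

feed_forward = tf.keras.Sequential([
    tf.keras.layers.Dense(D_FF, activation="relu"),  # activation is an assumption
    tf.keras.layers.Dense(D_MODEL),
])

x = tf.zeros([8, 20, D_MODEL])      # (batch, sequence, d_model), dummy input
print(feed_forward(x).shape)        # (8, 20, 512)
# Per layer: 512*1024 + 1024 + 1024*512 + 512 ≈ 1.05M parameters,
# versus ≈ 2.1M with the usual 4x hidden size.
```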

Best, Uri

Zadagu commented 3 years ago

Hi @urialon,

We are getting closer. Now I'm at 17.2M parameters. :nerd_face: But I'm running out of ideas about where the remaining difference comes from.

Kind regards, Zadagu

urialon commented 3 years ago

Hi Zadagu,

> Ci with a size of 9 x 256 x 256

In my implementation, the first dimension is 8 instead of 9 (that is: 8 x 512 x 512). How close are we now? Uri
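
For reference, a quick back-of-the-envelope check of what that change alone is worth (my own arithmetic, not from the thread):

```python
# Dropping the first dimension of Ci from 9 to 8 at size 512 x 512:
delta = (9 - 8) * 512 * 512
print(delta)  # 262,144, i.e. roughly 0.26M parameters
```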