tech-srl / code2seq

Code for the model presented in the paper: "code2seq: Generating Sequences from Structured Representations of Code"
http://code2seq.org
MIT License

Question about batch dimension in build_training_graph function #55

Closed · zaataylor closed this issue 4 years ago

zaataylor commented 4 years ago

Hi @urialon, I had a quick question about the batch dimension used in the build_training_graph method. I'm new to ML/DL and TensorFlow, but I was interested in seeing what research is like, and this seemed like a really cool project. I'm currently annotating the code for the entire project so I can understand how everything fits together.

I understand the concept of batches as used in training, but I'm confused about the batch dimension used in the code here:

# (batch, max_contexts, decoder_size)
batched_contexts = self.compute_contexts(subtoken_vocab=subtoken_vocab, nodes_vocab=nodes_vocab,
                                         source_input=path_source_indices, nodes_input=node_indices,
                                         target_input=path_target_indices,
                                         valid_mask=valid_context_mask,
                                         path_source_lengths=path_source_lengths,
                                         path_lengths=path_lengths, path_target_lengths=path_target_lengths)

The reason I'm confused is that the input to this function, input_tensors, represents (based on my understanding) a single processed example from the dataset. So I don't know whether the shape comments you added here refer to an implicit batch dimension, meaning that when following the execution of one example during training I shouldn't really think about that dimension and should focus on the others instead, or whether it is an explicit dimension. In the latter case, I'm confused as to how that's possible, given my assumptions about the shape of the input_tensors parameter.

I'm sure you are busy with research, but I was hoping you might be able to explain. I'm sure I must be overlooking something simple.

urialon commented 4 years ago

Hi @zaataylor , Thank you for your interest in code2seq and for choosing this project as your first DL project :-)

input_tensors is a batch of examples, not a single example. It is a dictionary that maps strings to tensors, where every such tensor has an explicit batch dimension. Batching happens here: https://github.com/tech-srl/code2seq/blob/master/reader.py#L186 in the map_and_batch call in the reader, so the tensors that arrive at the model are already batched.
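Roughly, the pattern looks like this. This is only a toy sketch with made-up data and keys, not the actual reader code (the repo uses tf.contrib.data.map_and_batch; in newer TF the same transformation lives at tf.data.experimental.map_and_batch):

import tensorflow as tf

# Toy sketch: map_and_batch parses each raw element and batches the
# results in one step, so downstream tensors already carry a leading
# batch dimension. parse_example and the keys here are hypothetical.
def parse_example(x):
    # Stand-in for the real row parser: returns a dict of tensors
    # shaped like a *single* example.
    return {'TARGET_INDEX_KEY': x, 'PATH_LENGTHS': tf.fill([3], x)}

dataset = tf.data.Dataset.range(8)
dataset = dataset.apply(tf.data.experimental.map_and_batch(
    map_func=parse_example, batch_size=4))

for batch in dataset:  # TF 2.x eager iteration; the repo uses sess.run instead
    print(batch['TARGET_INDEX_KEY'].shape)  # (4,)   -- batch dim added
    print(batch['PATH_LENGTHS'].shape)      # (4, 3) -- batch dim prepended
    break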

I know that some TF tutorials ignore batching and represent every single example as a single tensor. The reason might be that the authors ignore batching for the sake of a simpler tutorial; of course, this makes their code very inefficient. Alternatively, some frameworks (like the newer tf.keras API) have APIs where you can ignore the batch dimension and the framework does the batching for you.

I hope it helps, let me know if you have any more questions.

zaataylor commented 4 years ago

Ah, I see! I'd overlooked the batch part of map_and_batch during my annotation.

So with that in mind, the iterator over the dataset created in reader.py really emits a batch of examples whenever sess.run() is called, rather than one example at a time like I was thinking. That makes sense.
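To check my understanding, here's a toy version of that pattern I tried (made-up data, written with the TF 1.x-style session API the repo uses, via tf.compat.v1):

import tensorflow as tf
tf.compat.v1.disable_eager_execution()

# Each sess.run() on the iterator's output returns one whole batch,
# not a single example.
dataset = tf.compat.v1.data.Dataset.range(8).batch(4)
iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)
next_batch = iterator.get_next()

with tf.compat.v1.Session() as sess:
    print(sess.run(next_batch))  # [0 1 2 3] -- four examples at once
    print(sess.run(next_batch))  # [4 5 6 7] -- the next batch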

I did have one more question, though, about the indexing used here: https://github.com/tech-srl/code2seq/blob/9a06b35575852b05246d06b1e7fe84c1b9242551/model.py#L333 in build_training_graph. I was trying my best to make sense of this but was at a bit of a loss. I thought the process_dataset function works on one example at a time and returns a tensor representing a dictionary with the same keys as the ones used here. But if input_tensors is a batch of examples, how does this key lookup work? Does TF essentially merge all of the returned keys into one key, so that indexing by that key is analogous to doing a SELECT <key> FROM over every tensor in the batch? I've been trying to find an example that uses dictionary indexing like that, but haven't had much luck.

@urialon Thank you so much for answering my questions! I've learned so much through this project already! :)

urialon commented 4 years ago

Yeah, this is a little confusing, but actually quite useful:

TL;DR: input_tensors is a dictionary whose keys are strings and whose values are tensors.

This dictionary is created here: https://github.com/tech-srl/code2seq/blob/master/reader.py#L163 as a mapping from string to tensor.

When we call map_and_batch, TensorFlow is smart enough to apply the batching to each tensor separately. So we are left with the same keys (for example, 'TARGET_INDEX_KEY'), except that each key is now assigned a "batched" value: one with a new, additional 0th dimension, representing a batch of examples rather than a single example.
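For example (hypothetical keys and shapes; written in TF 2.x eager style just for illustration):

import tensorflow as tf

# Batching a dataset whose elements are dicts batches each value
# independently while the keys stay the same.
dataset = tf.data.Dataset.from_tensor_slices({
    'TARGET_INDEX_KEY': tf.range(8),            # one scalar per example
    'PATH_LENGTHS': tf.ones([8, 3], tf.int32),  # shape (3,) per example
}).batch(4)

print(dataset.element_spec)
# {'TARGET_INDEX_KEY': TensorSpec(shape=(None,), dtype=tf.int32, ...),
#  'PATH_LENGTHS': TensorSpec(shape=(None, 3), dtype=tf.int32, ...)}
# Same keys, but every value gained a new 0th (batch) dimension.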

Does that help?

zaataylor commented 4 years ago

Yep, that makes perfect sense! I was trying to look at the TF source and documentation to see if I could find details on the mechanism you described here, but must’ve been looking in the wrong places and/or didn’t have enough context to understand what was going on. I appreciate you clearing things up for me.

I don’t have any more code-related questions currently, but I’ll open a new issue later on if I do. I’m currently right at the part in model.py where contexts are sent to the decoder and attended over.

Once again, thank you for taking time out of your day to explain @urialon ! :)

urialon commented 4 years ago

Sure, let me know if you have additional questions. That dictionary-of-tensors mechanism is not well-documented anywhere, as far as I know.