Clarification about variables in the code

I'm having trouble understanding what each of the input variables represents, and I was hoping someone could explain them.

This is from text_cloze.py, with comments added to show my understanding and questions:

    # input theano vars
    # these are just the images from the context panels
    in_context_fc7 = T.tensor3(name='context_images') 

    # unsure of what the bbmask contains vs the context_bb 
    in_context_bb = T.tensor4(name='context_bb')
    in_bbmask = T.tensor3(name='bounding_box_mask')

    # is in_context the actual text from the context panels?
    in_context = T.itensor4(name='context')

    # what is this mask vs the bb mask?
    in_cmask = T.tensor4(name='context_mask')

    # are these the image and bb for the answer panel?
    in_answer_fc7 = T.matrix(name='answer_images')
    in_answer_bb = T.matrix(name='answer_bb')

    # I see that answers is of shape 3 x max_words, where 3 is the num of context panels, but what do the numbers in this tensor mean?
    # when I printed it out, it looked like
    # [[[ 5547    17  1547 ...     0     0     0]
    #   [  776 20000 20000 ...     0     0     0]
    #   [  102     4    13 ...     0     0     0]]
    in_answers = T.itensor3(name='answers')

    # what is the mask for?
    in_amask = T.tensor3(name='answer_mask')

    # the labels indicate which answers are the correct ones 
    in_labels = T.imatrix(name='labels')

Hello. I might be a bit late to this, but here are the shapes and meanings of the inputs:

    # input theano vars
    in_context_fc7 = T.tensor3(name='context_images') # bsz x 3 x 4096: 3 context panels, one 4096-dim fc7 feature vector each
    in_context_bb = T.tensor4(name='context_bb') # bsz x 3 x 3 x 4: 3 context panels, up to 3 speech boxes each, 4 coordinates per box
    in_bbmask = T.tensor3(name='bounding_box_mask') # bsz x 3 x 3: same panels and boxes; entry i is 1 if the panel contains an ith speech box
    in_context = T.itensor4(name='context') # bsz x 3 x 3 x 30: 3 context panels, up to 3 speech boxes each, up to 30 words of speech per box
    in_cmask = T.tensor4(name='context_mask') # bsz x 3 x 3 x 30: same layout as context; entry i is 1 if the ith word exists in that box's speech
    in_answer_fc7 = T.matrix(name='answer_images') # bsz x 4096: fc7 feature of the panel whose speech we want to guess
    in_answer_bb = T.matrix(name='answer_bb') # bsz x 4: the answer panel has one speech box, described by 4 coordinates
    in_answers = T.itensor3(name='answers') # bsz x 3 x 30: 3 candidate answers, up to 30 words each
    in_amask = T.tensor3(name='answer_mask') # bsz x 3 x 30: mask for the 3 candidate answers; entry i is 1 if the ith word exists in the candidate
    in_labels = T.imatrix(name='labels') # bsz x 3: among the 3 candidate answers, the correct one has a 1
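
To make the mask idea concrete, here is a minimal sketch of how a padded `answers` array and its `answer_mask` could be built for one example. `pad_and_mask` is a hypothetical helper, not something from text_cloze.py, and it assumes (as the trailing zeros in the printed tensor suggest, though this thread does not confirm it) that nonzero entries are word indices and that 0 is the padding value. The word indices and the `max_len=6` below are made up for illustration.

```python
# Hypothetical helper, not part of text_cloze.py: pads lists of word
# indices to a fixed length and builds the matching 0/1 mask.
def pad_and_mask(sequences, max_len, pad_value=0):
    padded, mask = [], []
    for seq in sequences:
        seq = list(seq)[:max_len]              # truncate overly long sequences
        n_pad = max_len - len(seq)             # number of padding slots needed
        padded.append(seq + [pad_value] * n_pad)
        # 1.0 where a real word exists, 0.0 where the slot is padding
        mask.append([1.0] * len(seq) + [0.0] * n_pad)
    return padded, mask


# Three candidate answers of different lengths (illustrative indices only).
candidates = [[5547, 17, 1547], [776, 20000], [102, 4, 13, 9]]
answers, amask = pad_and_mask(candidates, max_len=6)

for row, m in zip(answers, amask):
    print(row, m)
```

The same pattern would apply to `context` / `context_mask` (per speech box) and to `context_bb` / `bounding_box_mask` (per panel, with boxes in place of words): the mask marks which slots hold real data so that padded slots can be ignored downstream.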
