Open · amirj opened this issue 8 years ago
Thank you for pointing this out. prepare_data is used in two different places, with different behavior (although I agree that there is a redundancy in the filtering).

Since the gradient is not truncated (see the truncate_gradient parameter in the scan function here), we store all the activations from the forward pass to be used in the backward pass, which has an impact on computation time and on the amount of memory being used. For longer sequences (like 1000 in your case) you might need to play with the truncate_gradient parameter of scan.
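For illustration, here is a minimal, self-contained sketch (a toy recurrence, not the tutorial's own layers) of how the truncate_gradient argument is passed to theano.scan: the default of -1 back-propagates through the full sequence, while an integer N limits backpropagation through time to the last N steps.

```python
# Toy RNN showing theano.scan's truncate_gradient argument.
import numpy
import theano
import theano.tensor as tensor

floatX = theano.config.floatX
dim = 3

x = tensor.matrix('x')                    # (n_timesteps, dim)
h0 = tensor.zeros((dim,), dtype=floatX)   # initial hidden state
W = theano.shared(0.1 * numpy.random.randn(dim, dim).astype(floatX), name='W')

def step(x_t, h_tm1):
    # simple recurrent transition
    return tensor.tanh(tensor.dot(h_tm1, W) + x_t)

h, _ = theano.scan(step,
                   sequences=x,
                   outputs_info=h0,
                   truncate_gradient=50)  # BPTT through at most the last 50 steps
                                          # (-1, the default, keeps the full history)

cost = h[-1].sum()
grad_W = tensor.grad(cost, W)             # gradient only flows back 50 steps
f = theano.function([x], [cost, grad_W])
f(numpy.random.randn(1000, dim).astype(floatX))  # e.g. a length-1000 sequence
```

Truncating trades some gradient accuracy for the reduced computation time and memory mentioned above.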
I have the same problem, but for me the line shows a maxlen smaller than what I want (I want 100 words, but I am only getting 15 as maxlen). I don't want my training to be carried out only on sentences of length 15.
def train(dim_word=100, # word vector dimensionality
dim=1000, # the number of LSTM units
encoder='gru',
decoder='gru_cond',
patience=10, # early stopping patience
max_epochs=5000,
finish_after=100000000000000000000000, # finish after this many updates
dispFreq=100,
decay_c=0., # L2 regularization penalty
alpha_c=0., # alignment regularization
clip_c=-1., # gradient clipping threshold
lrate=0.01, # learning rate
n_words_src=65000, # source vocabulary size
n_words=50000, # target vocabulary size
maxlen=100, # maximum length of the description
optimizer='rmsprop',
hans@hans-Lenovo-IdeaPad-Y500:~/Documents/HANS/MAC/SUCCESSFUL MODELS/ADD/dl4mt-tutorial-master/session3$ ./train.sh
Using gpu device 0: GeForce GT 650M (CNMeM is disabled, cuDNN 4007)
{'use-dropout': [True], 'dim': [1000], 'optimizer': ['rmsprop'], 'dim_word': [150], 'reload': [False], 'clip-c': [1.0], 'n-words': [50000], 'model': ['/home/hans/git/dl4mt-tutorial/session3/model.npz'], 'learning-rate': [0.0001], 'decay-c': [0.99]}
Loading data
Building model
Building sampler
Building f_init... Done
Building f_next.. Done
Building f_log_probs... Done
Building f_cost... Done
Computing gradient... Done
Building optimizers... Done
Optimization
...................................
...................................
...................................
Epoch 0 Update 65 Cost 17509.4765625 UD 0.767469167709
Epoch 0 Update 66 Cost 17504.859375 UD 0.822523832321
Minibatch with zero sample under length 15
Minibatch with zero sample under length 15
Minibatch with zero sample under length 15
Minibatch with zero sample under length 15
Minibatch with zero sample under length 15
Minibatch with zero sample under length 15
Epoch 0 Update 67 Cost 17467.9296875 UD 0.752150058746
Minibatch with zero sample under length 15
Minibatch with zero sample under length 15
Minibatch with zero sample under length 15
Minibatch with zero sample under length 15
Minibatch with zero sample under length 15
Epoch 0 Update 68 Cost 17452.5976562 UD 0.831667900085
Epoch 0 Update 69 Cost 17394.2402344 UD 0.73230099678
Epoch 0 Update 70 Cost 17384.1113281 UD 0.830217123032
Minibatch with zero sample under length 15
Epoch 0 Update 71 Cost 17374.1601562 UD 0.820451974869
Minibatch with zero sample under length 15
Minibatch with zero sample under length 15
Minibatch with zero sample under length 15
Minibatch with zero sample under length 15
Epoch 0 Update 72 Cost 17322.9296875 UD 0.877825975418
Epoch 0 Update 73 Cost 17319.2441406 UD 0.862649917603
Epoch 0 Update 74 Cost 17258.2480469 UD 0.820302963257
Minibatch with zero sample under length 15
Epoch 0 Update 75 Cost 17266.3398438 UD 0.854918003082
Minibatch with zero sample under length 15
So please help me with what I should do to ensure that the model gets trained on sentences of up to 100 words in length. Also, can you point out where the actual value 15 comes from?
Hi @hanskrupakar, by default the maxlen parameter is set to 50, as you can check here; please compare it with your fork. This value is passed to the data iterator, and TextIterator filters the sequences accordingly.

In your case, please check the average sequence length of your dataset. If your sequences are short on average, you may need to further adjust maxlen, or you could even introduce another hyper-parameter like minlen.
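As a rough illustration of the filtering described above (a simplified sketch, not the actual TextIterator from data_iterator.py; the minlen argument is the hypothetical extra hyper-parameter suggested above):

```python
# Simplified length filter for parallel text, in the spirit of the data iterator.
# Sentence pairs whose source or target length falls outside the bounds are
# skipped, so they never reach the training loop.
def filter_by_length(source_sents, target_sents, maxlen=50, minlen=1):
    kept_src, kept_trg = [], []
    for src, trg in zip(source_sents, target_sents):
        src_words = src.strip().split()
        trg_words = trg.strip().split()
        if len(src_words) > maxlen or len(trg_words) > maxlen:
            continue  # too long: dropped, like TextIterator does with maxlen
        if len(src_words) < minlen or len(trg_words) < minlen:
            continue  # too short: the hypothetical minlen filter
        kept_src.append(src_words)
        kept_trg.append(trg_words)
    return kept_src, kept_trg
```

With maxlen set too low for the corpus (e.g. 15 on long sentences), almost every pair is dropped, which is why so few samples survive per minibatch.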
'maxlen' is one of the parameters in 'train_nmt.py', set to 50 by default. I get the following message during the training process: "Minibatch with zero sample under length 100". Investigating the source code shows that this message appears when a minibatch contains no sample whose source and target lengths are below 'maxlen'. On the other hand, in 'data_iterator.py' training samples are already skipped when the length of the source or target is greater than 'maxlen'.
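For context, the "Minibatch with zero sample under length ..." message is produced by logic along these lines (a paraphrased sketch, not the exact code in the repository): prepare_data applies its own maxlen filter to the minibatch it receives, and if nothing survives it returns None and the batch is skipped.

```python
# Sketch of prepare_data-style filtering and how an empty minibatch is reported.
def prepare_data_sketch(seqs_x, seqs_y, maxlen=None):
    if maxlen is not None:
        kept = [(sx, sy) for sx, sy in zip(seqs_x, seqs_y)
                if len(sx) < maxlen and len(sy) < maxlen]
        if len(kept) == 0:
            return None, None      # nothing in this minibatch survived the filter
        seqs_x, seqs_y = zip(*kept)
    # padding and masking of the surviving pairs would happen here
    return seqs_x, seqs_y

# Hypothetical training-loop usage:
# x, y = prepare_data_sketch(batch_x, batch_y, maxlen=maxlen)
# if x is None:
#     print('Minibatch with zero sample under length', maxlen)
#     continue
```

Since data_iterator.py already skips over-length pairs, this amounts to the double filtering mentioned at the top of the thread.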