microsoft / CNTK

Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit
https://docs.microsoft.com/cognitive-toolkit/

CNTK Brain Script API Definition of Mini Batch Size #3430

Open ghost opened 5 years ago

ghost commented 5 years ago

As per the documentation of the CNTK BrainScript API, the definition of minibatch size seems to differ from other frameworks: in the BrainScript API, the minibatch size denotes the number of samples (items/data points) across sequences rather than the number of individual sequences themselves. Can someone help me clarify this? For example, say we have 100 sequences of length 50 (items/data points) each. If we use a minibatch size of, say, 20, does the first minibatch contain only the first 20 data points of the first sequence?

robert1826 commented 5 years ago

I think the first minibatch will include only the first sequence, because minibatchsize is NOT a hard constraint but rather a soft one: CNTK will try to fit as many whole sequences as possible such that the sum of their individual data points stays close to the chosen minibatchsize.

ghost commented 5 years ago

@robert1826 Thanks for the response. Does that mean that the sequences are not broken in the middle? In the case of a Recurrent Neural Network, I was thinking that CNTK performs some sort of truncated backpropagation through time.

robert1826 commented 5 years ago

According to the CNTK docs at https://docs.microsoft.com/en-us/cognitive-toolkit/interpreting-epoch_size-and-minibatch_size_in_samples-and-minibatchsource.next_minibatch-in-cntk:

In case of variable-length inputs, minibatch_size_in_samples refers to the number of items in these sequences, not the number of sequences. SGD will try to fit up to as many sequences as possible into the minibatch that do not exceed minibatch_size_in_samples total samples. If several inputs are given, tensors are added to the current minibatch until one of the inputs exceeds minibatch_size_in_samples.

Despite our clear definition of minibatch_size_in_samples being the number of samples between model updates, there are two occasions where we must relax the definition:

Sequential data: Variable-length sequences do not generally sum up to exactly the requested minibatch size. In this case, as many sequences as possible are packed into a minibatch without exceeding the requested minibatch size (with one exception: if the next sequence in the randomized corpus exceeds the minibatch size, the minibatch will consist of this sequence).

So my original answer holds true: the next minibatch will contain just that one sequence (a size of 1, counted in sequences) and the sequence will NOT be truncated.
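The packing behavior described in the docs can be sketched in plain Python. This is a hypothetical illustration, not actual CNTK code: `pack_minibatches` and its signature are made up for this sketch, which just mimics the greedy rule "fit whole sequences until the sample budget is reached; never split a sequence, and let an oversized sequence form its own minibatch".

```python
# Hypothetical sketch (NOT CNTK source code): how variable-length sequences
# are packed into minibatches under minibatch_size_in_samples semantics.
def pack_minibatches(sequence_lengths, minibatch_size_in_samples):
    """Greedily pack whole sequences into minibatches without splitting any."""
    minibatches = []
    current = []          # lengths of the sequences in the current minibatch
    current_samples = 0   # total number of samples in the current minibatch
    for length in sequence_lengths:
        # A sequence is never truncated: if adding it would exceed the
        # budget, flush the current minibatch first. A sequence longer than
        # the budget therefore becomes a minibatch of its own.
        if current and current_samples + length > minibatch_size_in_samples:
            minibatches.append(current)
            current, current_samples = [], 0
        current.append(length)
        current_samples += length
    if current:
        minibatches.append(current)
    return minibatches

# The scenario from the question: sequences of length 50, minibatch size 20.
# Each minibatch holds exactly one whole sequence -- nothing is cut mid-sequence.
print(pack_minibatches([50, 50, 50], 20))      # [[50], [50], [50]]

# With shorter sequences, several are packed per minibatch:
print(pack_minibatches([5, 5, 5, 5, 12], 20))  # [[5, 5, 5, 5], [12]]
```

Counted in sequences, the first minibatch in the 50-length scenario has size 1, matching robert1826's answer.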

ghost commented 5 years ago

Thanks for the clear explanation @robert1826. Now I understand it.