`n_ctx` corresponds to the number of positions that can be encoded by the network. In the article, the authors mention this:

> We train for 100 epochs on minibatches of 64 randomly sampled, contiguous sequences of 512 tokens.

This means that the network does not know how to encode positions after the 512th one, so 512 is the maximum value that `n_ctx` can take.

When using the model, all your inputs will be of length `n_ctx`, so you should reduce its value as much as possible: it will give you large performance improvements in terms of both training and inference time.
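As a rough illustration, here is a minimal PyTorch sketch of how this kind of setup is typically wired up. The single embedding table holding tokens, special symbols, and positions mirrors the scheme used in the OpenAI GPT code, but all names and sizes here are illustrative assumptions, not the repo's actual definitions:

```python
import torch
import torch.nn as nn

# Illustrative sizes only; the real values come from the model config.
n_vocab, n_special, n_ctx, n_embd = 40478, 3, 512, 768

# One embedding table holds everything: rows [0, n_vocab) are tokens,
# the next n_special rows are special symbols, and the last n_ctx rows
# are the learned position embeddings.
embed = nn.Embedding(n_vocab + n_special + n_ctx, n_embd)

tokens = torch.randint(0, n_vocab, (1, n_ctx))  # (batch, seq_len)
positions = torch.arange(
    n_vocab + n_special, n_vocab + n_special + tokens.size(1)
).unsqueeze(0)

h = embed(tokens) + embed(positions)  # token + position embeddings
print(h.shape)  # torch.Size([1, 512, 768])

# A sequence longer than n_ctx would need position indices past the end
# of the table, so the lookup would fail with an index-out-of-range error.
```

This is why there are no learned embeddings for positions beyond the 512th: the table simply has no rows for them.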
Oh, I believe that's why `n_ctx` is used to define the network structure and why dynamic lengths don't work. Thank you for the explanation!
From the model definition, `vocab` is used to define the size of the embedding. I am guessing that the `n_ctx` here is used for the position embedding, but it's still not clear to me. In my case, I sometimes run into the following shape error if `n_ctx` is very large. Can anybody explain the code? Should I restrict `n_ctx` to a certain value? Thanks!
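My best guess at where such a shape error can come from is weight loading: the embedding table's first dimension depends on `n_ctx`, so a model built with a larger `n_ctx` no longer matches the pretrained weights. A minimal sketch under that assumption (names and sizes are illustrative, not taken from the repo):

```python
import torch.nn as nn

n_vocab, n_special, n_embd = 40478, 3, 768

def build_embed(n_ctx):
    # The first dimension of the table depends on n_ctx, so changing
    # n_ctx changes the parameter's shape.
    return nn.Embedding(n_vocab + n_special + n_ctx, n_embd)

pretrained = build_embed(512).state_dict()  # stand-in for checkpoint weights

model = build_embed(1024)                   # n_ctx larger than the checkpoint
model.load_state_dict(pretrained)           # raises RuntimeError: size mismatch
```

If that's the failure mode, keeping `n_ctx` at or below the value the checkpoint was trained with (512 for the published model) should avoid the error.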