lucidrains / DALLE-pytorch

Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in Pytorch

Creating batches of tokenized text #33

Open achen353 opened 3 years ago

achen353 commented 3 years ago

How do you go about stacking tokenized text sequences of unequal lengths to create batches for training?

lucidrains commented 3 years ago

@achen353 Hi Andrew, thanks for your interest! I'll be instrumenting the full training script this weekend :) stay tuned..

achen353 commented 3 years ago

@lucidrains Thanks Phil, that would be great. My GPU doesn't have PyTorch support yet, so I'm working on implementing the model in TF. One idea I had was to create a fixed-length batch, fill the unused tail of each sequence with 0, and make sure the corresponding mask positions are `False`. This would require bumping each encoded value up by one so that the value 0 doesn't represent any real token.
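
Roughly what I have in mind, as a PyTorch sketch for illustration only (`pad_batch`, `pad_offset`, and the plain-list tokenizer output are just assumptions here):

```python
import torch

def pad_batch(token_id_lists, max_len, pad_offset=1):
    # token_id_lists: variable-length lists of ints from the tokenizer.
    # Shift every real token id up by pad_offset so 0 is free to act purely as padding.
    batch = torch.zeros(len(token_id_lists), max_len, dtype=torch.long)
    mask = torch.zeros(len(token_id_lists), max_len, dtype=torch.bool)
    for i, ids in enumerate(token_id_lists):
        ids = ids[:max_len]                               # truncate overly long sequences
        batch[i, :len(ids)] = torch.tensor(ids) + pad_offset
        mask[i, :len(ids)] = True                         # True = real token, False = padding
    return batch, mask

text, mask = pad_batch([[5, 9, 2], [7]], max_len=8)
```

The boolean mask can then be passed to the model alongside the padded text tensor so attention ignores the zero-filled positions.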

lucidrains commented 3 years ago

Yup, that is the right way to go about it! I'll create all the necessary tools for training end to end without much coding, soon.

Edit: And yup, you need to reserve 0 for padding and 1 for `<eos>`, so add 2 to your encoded text ids!
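
For example (a rough sketch only; the exact special-token layout depends on how the final training script ends up handling text):

```python
# Convention above: 0 = padding, 1 = <eos>, real token ids start at 2.
encoded = [5, 17, 42]                        # hypothetical tokenizer output
shifted = [t + 2 for t in encoded] + [1]     # shift ids and append <eos>
# then zero-pad up to text_seq_len and build the mask as in the earlier sketch
```

Note that the text vocabulary size passed to the model then needs to be the tokenizer's vocabulary size plus 2, since the embedding table must cover the two reserved ids.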