lucidrains / routing-transformer

Fully featured implementation of Routing Transformer
MIT License

Sequence length limited #17

Closed Henrykwokkk closed 3 years ago

Henrykwokkk commented 3 years ago

I tried this model, but the sequence length the Routing Transformer can handle seems limited. I set the batch size to 16 and the sequence length to 1024, but it ran out of GPU memory.

lucidrains commented 3 years ago

Hmm, that doesn't seem right. Have you tried running the colab? Want to share your script?

lucidrains commented 3 years ago

How deep is your network? Try turning on reversibility?

Henrykwokkk commented 3 years ago

> How deep is your network? Try turning on reversibility?

Both the encoder and the decoder have a depth of 3. I'll get back to you with more details later.

lucidrains commented 3 years ago

how much memory are you working with? can you show me your full settings?

Henrykwokkk commented 3 years ago

The settings are as follows:

```python
NUM_BATCHES = int(1e5)
BATCH_SIZE = 32
LEARNING_RATE = 1e-4
GENERATE_EVERY = 100
NUM_TOKENS = 256 + 2
ENC_SEQ_LEN = 1024
DEC_SEQ_LEN = 2048

model = RoutingTransformerEncDec(
    dim = 512,
    enc_num_tokens = NUM_TOKENS,
    enc_depth = 3,
    enc_heads = 8,
    enc_max_seq_len = ENC_SEQ_LEN,
    enc_window_size = 32,
    dec_num_tokens = NUM_TOKENS,
    dec_depth = 3,
    dec_heads = 8,
    dec_max_seq_len = DEC_SEQ_LEN,
    dec_window_size = 32,
).cuda()
```

A RuntimeError occurred:

```
RuntimeError: Tried to allocate 64.00 MiB (GPU 0; 10.76 GiB total capacity; 9.58 GiB already allocated; 20.94 MiB free; 9.89 GiB reserved in total by PyTorch)
```

I also ran into this problem when using the reformer model: after roughly 500 batches, it ran out of memory.

lucidrains commented 3 years ago

So first, turn on reversibility, and second, you can decrease your batch size and do gradient accumulation instead
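
Roughly something like this (untested sketch, written from memory, so double-check the exact kwarg names for the reversible flags and `return_loss` against the README; `train_loader` is a hypothetical dataloader):

```python
import torch
from routing_transformer import RoutingTransformerEncDec

BATCH_SIZE = 4
GRAD_ACCUM_EVERY = 8   # effective batch size = BATCH_SIZE * GRAD_ACCUM_EVERY
NUM_TOKENS = 256 + 2

model = RoutingTransformerEncDec(
    dim = 512,
    enc_num_tokens = NUM_TOKENS,
    enc_depth = 3,
    enc_heads = 8,
    enc_max_seq_len = 1024,
    enc_window_size = 32,
    enc_reversible = True,   # reversible encoder blocks (flag name assumed)
    dec_num_tokens = NUM_TOKENS,
    dec_depth = 3,
    dec_heads = 8,
    dec_max_seq_len = 2048,
    dec_window_size = 32,
    dec_reversible = True    # reversible decoder blocks (flag name assumed)
).cuda()

optim = torch.optim.Adam(model.parameters(), lr = 1e-4)

for i, (src, tgt) in enumerate(train_loader):             # train_loader is hypothetical
    loss = model(src.cuda(), tgt.cuda(), return_loss = True)
    (loss / GRAD_ACCUM_EVERY).backward()                   # scale so accumulated gradients average out

    if (i + 1) % GRAD_ACCUM_EVERY == 0:                    # step once every GRAD_ACCUM_EVERY micro-batches
        optim.step()
        optim.zero_grad()
```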

Henrykwokkk commented 3 years ago

> So first, turn on reversibility, and second, you can decrease your batch size and do gradient accumulation instead

I turned on reversibility and set the batch size to 8, but training stopped at batch 172 with a RuntimeError because CUDA ran out of memory.

lucidrains commented 3 years ago

@guohanyang1994 make your batch size even smaller and increase your gradient accumulation

Henrykwokkk commented 3 years ago

Could I ask why the CUDA out-of-memory error occurs partway through training (at batch 172) rather than at the beginning?

tomweingarten commented 3 years ago

Hard to say without seeing the code, but are your batches different sizes? It's possible it takes that long to hit the longest combination of sequence lengths.

Henrykwokkk commented 3 years ago

The batch size is fixed at 4 and the maximum sequence length is 2048, but training still stops at around batch 1200. I am still confused :( But thanks for your reply.

lucidrains commented 3 years ago

@guohanyang1994 are you sure you don't have a memory leak? Routing Transformer has been trained on GPT-3-sized datasets successfully by others, so I doubt there are any problems with the framework
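
The usual culprit in a training loop is accumulating a tensor that still carries autograd history. A generic PyTorch illustration (not your code, just the common pattern):

```python
import torch
from torch import nn

# toy model and optimizer, purely for illustration
model = nn.Linear(10, 1).cuda()
optim = torch.optim.Adam(model.parameters(), lr = 1e-4)

total_loss = 0
for step in range(1000):
    x = torch.randn(16, 10).cuda()
    loss = model(x).pow(2).mean()
    loss.backward()
    optim.step()
    optim.zero_grad()

    total_loss += loss           # leak: keeps autograd history from every iteration alive
    # total_loss += loss.item()  # fix: accumulate a plain Python float instead
```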

Henrykwokkk commented 3 years ago

Oh yeah, it was exactly a memory leak problem. I have fixed it, thank you so much. Sorry to bother you; I am an NLP beginner.

lucidrains commented 3 years ago

ok np :D