avacaondata commented 4 years ago

Hi there, I've pre-trained a REFORMER for 4 days with 500MB of text data, just to try how it works. Now I'm trying to use it for fine-tuning and it's taking huge time for each epoch... I'm using a nice GPU (the one you were jealous about :P ) but it's still taking too long, as you can see below. When compared to a normal BERT, for example, there's no point of comparison, as the latter needs only a couple of secs for fine-tuning while this one is taking hours.

EPOCH: 0%| | 0/40 [00:00<?, ?it/s] Training epoch 0: 0%| | 0/1041 [00:00<?, ?it/s] Training epoch 0: 0%| | 1/1041 [00:13<3:46:44, 13.08s/it] Training epoch 0: 0%| | 2/1041 [00:24<3:39:14, 12.66s/it] Training epoch 0: 0%| | 3/1041 [00:36<3:33:28, 12.34s/it] Training epoch 0: 0%| | 4/1041 [00:48<3:31:05, 12.21s/it] Training epoch 0: 0%| | 5/1041 [01:00<3:29:03, 12.11s/it] Training epoch 0: 1%| | 6/1041 [01:11<3:26:42, 11.98s/it] Training epoch 0: 1%| | 7/1041 [01:23<3:24:39, 11.88s/it] Training epoch 0: 1%| | 8/1041 [01:35<3:25:09, 11.92s/it] Training epoch 0: 1%| | 9/1041 [01:46<3:22:59, 11.80s/it] Training epoch 0: 1%| | 10/1041 [01:58<3:23:07, 11.82s/it] Training epoch 0: 1%| | 11/1041 [02:11<3:25:52, 11.99s/it] Training epoch 0: 1%| | 12/1041 [02:23<3:25:39, 11.99s/it] Training epoch 0: 1%| | 13/1041 [02:34<3:21:48, 11.78s/it] Training epoch 0: 1%|▏ | 14/1041 [02:46<3:23:27, 11.89s/it] Training epoch 0: 1%|▏ | 15/1041 [02:57<3:19:09, 11.65s/it] Training epoch 0: 2%|▏ | 16/1041 [03:10<3:22:35, 11.86s/it] Training epoch 0: 2%|▏ | 17/1041 [03:22<3:22:47, 11.88s/it] Training epoch 0: 2%|▏ | 18/1041 [03:33<3:22:16, 11.86s/it] Training epoch 0: 2%|▏ | 19/1041 [03:45<3:23:15, 11.93s/it] Training epoch 0: 2%|▏ | 20/1041 [03:57<3:20:54, 11.81s/it] Training epoch 0: 2%|▏ | 21/1041 [04:09<3:19:35, 11.74s/it] Training epoch 0: 2%|▏ | 22/1041 [04:21<3:22:12, 11.91s/it] Training epoch 0: 2%|▏ | 23/1041 [04:32<3:20:29, 11.82s/it] Training epoch 0: 2%|▏ | 24/1041 [04:44<3:16:36, 11.60s/it] Training epoch 0: 2%|▏ | 25/1041 [04:56<3:18:51, 11.74s/it] Training epoch 0: 2%|▏ | 26/1041 [05:07<3:17:10, 11.66s/it] Training epoch 0: 3%|▎ | 27/1041 [05:18<3:15:37, 11.58s/it] Training epoch 0: 3%|▎ | 28/1041 [05:30<3:15:43, 11.59s/it] Training epoch 0: 3%|▎ | 29/1041 [05:42<3:16:18, 11.64s/it] Training epoch 0: 3%|▎ | 30/1041 [05:54<3:16:54, 11.69s/it] Training epoch 0: 3%|▎ | 31/1041 [06:05<3:12:38, 11.44s/it] Training epoch 0: 3%|▎ | 32/1041 [06:16<3:11:49, 11.41s/it] Training epoch 0: 3%|▎ | 33/1041 [06:27<3:11:52, 11.42s/it] Training epoch 0: 3%|▎ | 34/1041 [06:39<3:13:15, 11.51s/it] Training epoch 0: 3%|▎ | 35/1041 [06:50<3:10:34, 11.37s/it] Training epoch 0: 3%|▎ | 36/1041 [07:02<3:12:29, 11.49s/it] Training epoch 0: 4%|▎ | 37/1041 [07:13<3:11:37, 11.45s/it] Training epoch 0: 4%|▎ | 38/1041 [07:24<3:09:23, 11.33s/it] Training epoch 0: 4%|▎ | 39/1041 [07:36<3:09:00, 11.32s/it] Training epoch 0: 4%|▍ | 40/1041 [07:47<3:09:20, 11.35s/it] Training epoch 0: 4%|▍ | 41/1041 [07:58<3:08:17, 11.30s/it]

Do you know which may be the problem? I've created this class for NER: class ReformerForTokenClassification(nn.Module):

def __init__(self, num_labels, model_dim, depth, 
             n_tokens, maxlen, heads, weights_file, n_hashes, dropout=0.2):
    super(ReformerForTokenClassification, self).__init__()
    self.num_labels = num_labels
    self.model_dim = model_dim
    self.reformer = ReformerLM(n_tokens, model_dim, depth, maxlen, heads,
                              n_hashes, return_embeddings=True)
    model_dict = self.reformer.state_dict()
    pretrained_dict = torch.load(weights_file)
    weights_dict = {k:v for k, v in pretrained_dict.items() if 'to_logits' not in k}
    self.reformer.load_state_dict(weights_dict)
    self.dropout = nn.Dropout(dropout)
    self.classifier = nn.Linear(self.model_dim, self.num_labels)

def forward(self, input_ids=None, labels=None):

    outputs = self.reformer(input_ids)
    sequence_output = self.dropout(outputs)
    logits = self.classifier(sequence_output)
    outputs = (logits, outputs[2:])

    if labels is not None:

        loss_fct = nn.CrossEntropyLoss()

        loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

        outputs = (loss, outputs[0], outputs[1])

    return outputs

model = ReformerForTokenClassification(num_labels=9, model_dim=768, depth=12, maxlen=512, n_tokens=tokenizer.vocab_size, heads=8, n_hashes=4, weights_file='ckpts_pequeño_oscar/model_state_dict.pt')

lucidrains commented 4 years ago

Yea, the LSH attention is definitely much slower at shorter sequence lengths. I introduced a setting full_attn_thres, where you can set the minimum sequence length at which the network switches to using LSH attention vs full attention.

avacaondata commented 4 years ago

And is it possible to use a model which has been trained with LSHAttention for fine-tuning using full-attention? Are there any other ways to speed up the model when fine-tuning? Thanks a lot for the help again man! :)

lucidrains commented 4 years ago

I haven't tried it yet, but I imagine it should be possible. After all, LSH is meant to be the sparser variant of full qk attention. No problem! Part of my goal in writing this is so people like you explore and share the results.

acriptis commented 4 years ago

Yeah I also stuck with this problem:

bert
--- Train duration: 54.78013753890991 seconds ---                                                                                       

reformer
--- Train duration: 2113.113841533661 seconds ---

I tried to set:

    use_full_attn: True  # use full self attention, for comparison
    full_attn_thres: 1024  # use full attention if context length is less than set value

But training speed has not changed... (I have sequences that are mostly shorter than 768 tokens).

mauceri commented 4 years ago

Hi, I didn't make any comparison with BERT but I built a toy training set extracted from Wikipedia.fr. The training set contains 590k lines, packed in 976 batches, the vocabulary contains 15132 tokens (fast learnbpe 10k). I created the following model : model = ReformerLM( num_tokens= 15132, dim = 512, emb_dim = 512, depth = 12, heads = 8, max_seq_len = 512, lsh_dropout = 0.1, causal = True, use_full_attn = True, full_attn_thres = 1024 ).cuda()

On an AWS g3s.xlarge (NVIDIA Tesla M60) a complete training loop over the 976 batches takes around 4000s or 4s by batch and I find it rather low, I expected better results. But I maybe missed something...

lucidrains commented 4 years ago

@acriptis @mauceri for a while I was trying to fit as many tokens into attention without running out of memory, so I chunked the processing of attention by default by the number of heads. could you both try again, but when instantiating the Reformer classes, add attn_chunk = 1. In the latest version, I defaulted it back to 1 again. Also, @acriptis , for turning on full attention at a lower length, you just need to set the full_attn_thres. use_full_attn will have full attention on all the time.

lucidrains commented 4 years ago

@mauceri also, just want to comment that with a length of 512, you really won't see too much benefits to using the Reformer. The returns for using the Reformer comes at lengths 2048 or greater. Figure 5 in the paper shows this nicely

mauceri commented 4 years ago

Hi,

Thanks for you kind answer.

I do not know the usages so forgive me if I'm going against the etiquette I posted my code below.

If I used 512, it's because I noticed it made the code running faster. I'm now in the configuration you recommend I think, and the time taken by each iteration rose to 6s (24s for each iteration of 4 gradient accumulation) vs 4 previously.

GRADIENT_ACCUMULATE_EVERY = 4 LEARNING_RATE = 1e-5 VALIDATE_EVERY = 100

SEQ_LEN = 2048 #On a pris soin plus haut de ne garder que des phrases de moins de 512 mots

ckpt_dir = './small/checkpoint/' torch.cuda.synchronize() stime = time.time() epochs = 100 model = ReformerLM( num_tokens= 15132, dim = 512, emb_dim = 512, depth = 12, heads = 8, max_seq_len = SEQ_LEN, lsh_dropout = 0.1, causal = True,

use_full_attn = True,

attn_chunks = 1,
full_attn_thres = 1024

) model.cuda()

model = TrainingWrapper(model, ignore_index = 0, pad_value = 0)

optim = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE) device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

if ckpt_dir is not None: assert os.path.isdir(ckpt_dir) try: logger.info(f'{datetime.now()} | Continuing from checkpoint...') print(f'{datetime.now()} | Continuing from checkpoint...') model.load_state_dict(torch.load(f'{ckpt_dir}/model.save.pt', map_location=device)) optim.load_state_dict(torch.load(f'{ckpt_dir}/optim.save.pt'))

except Exception as e:
    logger.info(f'{datetime.now()} | No checkpoint was found | {e}')
    print(f'{datetime.now()} | No checkpoint was found | {e}')

model.train()

model = torch.load(save_path)

for epoch in range(epochs): print("epoch %s"%epoch) train_it = train_ds.get_iterator(False)

i = 0   
while i < train_ds.n_batches :
    #torch.cuda.empty_cache() 
    model.train()

    k = 0
    while k < GRADIENT_ACCUMULATE_EVERY and i < train_ds.n_batches:
        i += 1
        k += 1
        x = next(train_it)
        x = x.cuda()
        loss = model(x, return_loss = True)
        loss.backward()

    optim.step()
    optim.zero_grad()

    print("\rtrain loss = ",loss.sum(),", etime = ",time.time() - stime, ", it = ",i, "/",train_ds.n_batches, end='')

    if i != 0 and i % VALIDATE_EVERY == 0:
        print("\ntrain loss = ",loss.sum(),", etime = ",time.time() - stime, ", it = ",i)

        torch.save(model.state_dict(), f'{ckpt_dir}/model.save.pt')
        torch.save(optim.state_dict(), f'{ckpt_dir}/optim.save.pt')
        cont = test_model(model,"Le département est un des plus vastes de France :")
        print(cont)
torch.save(model.state_dict(), f'{ckpt_dir}/model.save.pt')
torch.save(optim.state_dict(), f'{ckpt_dir}/optim.save.pt')

Le 4 mars 2020 à 19:07, Phil Wang notifications@github.com a écrit :

@mauceri https://github.com/mauceri also, just want to comment that with a length of 512, you really won't see too much benefits to using the Reformer. The returns for using the Reformer comes at lengths 2048 or greater. Figure 5 in the paper shows this nicely

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/lucidrains/reformer-pytorch/issues/46?email_source=notifications&email_token=AAHXAPYGSYQWUSV77HBAP6LRF2KHRA5CNFSM4K4BMXR2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOENZKGKY#issuecomment-594715435, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAHXAP5NGKGEYJF5HCF2BGTRF2KHRANCNFSM4K4BMXRQ.

lucidrains commented 4 years ago

@mauceri it increased because now you have switched to LSH at a sequence length of 2048. What is the batch size and sequence length of the training data you are feeding into the network?

mauceri commented 4 years ago

The batch size is 52048, it was 20512 previously (I cannot increase it more with this machine). I understand it increases because of the new sequence length, it is precisely the reason why I reduced it previously. With these settings it will take around 16 hours to run a single epoch on around 4% of wikipedia.fr http://wikipedia.fr/ with a very small vocabulary on a small spot AWS GPU instance, which is around $100 by epoch. Is it what you expected ? I thought it would have been faster, but maybe I missed a point in the paper.

Le 4 mars 2020 à 21:18, Phil Wang notifications@github.com a écrit :

@mauceri https://github.com/mauceri it increased because now you have switched to LSH at a sequence length of 2048. What is the batch size and sequence length of the training data you are feeding into the network?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/lucidrains/reformer-pytorch/issues/46?email_source=notifications&email_token=AAHXAP4CK4Z4ERZUGX2MRYDRF2ZRTA5CNFSM4K4BMXR2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEN2B6ZA#issuecomment-594812772, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAHXAPZFBKY47X6RU2GJB7TRF2ZRTANCNFSM4K4BMXRQ.

mauceri commented 4 years ago

$100 by epoch for the entire corpus indeed.

Le 4 mars 2020 à 21:41, Christian Mauceri cmauceri@gmail.com a écrit :

$100 by epoch

lucidrains commented 4 years ago

@mauceri It should be at least twice as slow as the regular transformer because the reversible net is rerunning the forward pass to reconstitute the activations removed from memory

mauceri commented 4 years ago

Ah, thank you ! That’s the point I missed :) And thank you for your great work !

Christian Mauceri, PhD Le 4 mars 2020 à 22:29 +0100, Phil Wang notifications@github.com, a écrit :

@mauceri It should be at least twice as slow as the regular transformer because the reversible net is rerunning the forward pass to reconstitute the activations removed from memory — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

lucidrains commented 4 years ago

@mauceri thank you! and if you dig into the code and find any points of further optimization, please share a pull request!

mauceri commented 4 years ago

Sure, I’ll do 😊

Christian Mauceri, PhD Le 4 mars 2020 à 23:07 +0100, Phil Wang notifications@github.com, a écrit :

@mauceri thank you! and if you dig into the code and find any points of further optimization, please share a pull request! — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

lucidrains commented 4 years ago

@mauceri I added a new setting reverse_thres, which you can set at a very high value to turn off reversible nets altogether, and explore the speed / memory tradeoffs a bit more

mauceri commented 4 years ago

Hi @lucidrains,

reverse_thresh works fine but unfortunately I could not test speed impact precisely because it works. Indeed when I set it to a value greater than max_seq_len CUDA falls on its knees because the model becomes too big, so it is a good news. I'm going to try this GPT-2 like model https://github.com/Andras7/gpt2-pytorch for comparisons pupose (not sure my little instance can take it). After that I'm going to request an increase of my AWS limits in order to test your model on a multi GPU, I saw you already tested it. I keep you posted on the results next week or the week after.

lucidrains commented 4 years ago

@mauceri perfect!

lucidrains commented 4 years ago

@mauceri i've done another refactor of the reversible net in the latest version, it should be a bit faster than before

lucidrains commented 4 years ago

closing because issue has been addressed

lucidrains / reformer-pytorch

Using a pre-trained REFORMER for fine-tuning takes soooo looong #46

use_full_attn = True,

model = torch.load(save_path)