lucidrains / RETRO-pytorch

Implementation of RETRO, Deepmind's Retrieval based Attention net, in Pytorch
Apache License 2.0
850 stars 106 forks

How to give Prompt to trained RETRO Model? #33

Open shahmeer99 opened 1 year ago

shahmeer99 commented 1 year ago

I am following the instructions in the RETRO-pytorch GitHub repo. After training my model, how do I go about using it to generate responses?

retro = RETRO(
    chunk_size = 64,                         # the chunk size that is indexed and retrieved (needed for proper relative positions as well as causal chunked cross attention)
    max_seq_len = 2048,                      # max sequence length
    enc_dim = 896,                           # encoder model dim
    enc_depth = 2,                           # encoder depth
    dec_dim = 796,                           # decoder model dim
    dec_depth = 12,                          # decoder depth
    dec_cross_attn_layers = (3, 6, 9, 12),   # decoder cross attention layers (with causal chunk cross attention)
    heads = 8,                               # attention heads
    dim_head = 64,                           # dimension per head
    dec_attn_dropout = 0.25,                 # decoder attention dropout
    dec_ff_dropout = 0.25,                   # decoder feedforward dropout
    use_deepnet = True                       # turn on post-normalization with DeepNet residual scaling and initialization, for scaling to 1000 layers
)

seq = torch.randint(0, 20000, (2, 2048 + 1))      # plus one since it is split into input and labels for training
retrieved = torch.randint(0, 20000, (2, 32, 2, 128)) # retrieved tokens - (batch, num chunks, num retrieved neighbors, retrieved chunk with continuation)

loss = retro(seq, retrieved, return_loss = True)
loss.backward()

wrapper = TrainingWrapper(
    retro = retro,                                 # the RETRO instance above
    knn = 2,                                       # knn (2 in paper was sufficient)
    chunk_size = 64,                               # chunk size (64 in paper)
    documents_path = './retro_training_set/',              # path to folder of text
    glob = '**/*.txt',                             # text glob
    chunks_memmap_path = './train.chunks.dat',     # path to chunks
    seqs_memmap_path = './train.seq.dat',          # path to sequence data
    doc_ids_memmap_path = './train.doc_ids.dat',   # path to document ids per chunk (used for filtering neighbors belonging to same document)
    max_chunks = 1_000_000,                        # maximum cap to chunks
    max_seqs = 100_000,                            # maximum seqs
    knn_extra_neighbors = 100,                     # num extra neighbors to fetch
    max_index_memory_usage = '100m',
    current_memory_available = '1G'    
)
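The `seq` and `retrieved` shapes above follow directly from the config; a quick arithmetic check (plain Python, not part of the library):

```python
chunk_size = 64
max_seq_len = 2048

num_chunks = max_seq_len // chunk_size     # 2048 / 64 = 32 chunks per sequence
retrieved_len = 2 * chunk_size             # chunk + continuation = 128 tokens
seq_len = max_seq_len + 1                  # one extra token for the input/label split

print(num_chunks, retrieved_len, seq_len)  # 32 128 2049
```

which matches the `(2, 2048 + 1)` and `(2, 32, 2, 128)` tensors in the snippet.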

Now when I want to give this model a text input (any prompt), how would I go about doing that? Which method or function would I use? Which model/tokenizer should I use to encode the input prompt and then decode the model output tensor? Is there a method for that?

Example Prompt: "The movie Dune was released in"

filipesilva commented 1 year ago

https://github.com/lucidrains/RETRO-pytorch/issues/23 contains a notebook with a good example.

I think putting it together with the README instructions looks like this:

import torch
from retro_pytorch import RETRO, TrainingWrapper

# instantiate RETRO, fit it into the TrainingWrapper with correct settings

retro = RETRO(
    max_seq_len = 2048,                      # max sequence length
    enc_dim = 896,                           # encoder model dimension
    enc_depth = 3,                           # encoder depth
    dec_dim = 768,                           # decoder model dimensions
    dec_depth = 12,                          # decoder depth
    dec_cross_attn_layers = (1, 3, 6, 9),    # decoder cross attention layers (with causal chunk cross attention)
    heads = 8,                               # attention heads
    dim_head = 64,                           # dimension per head
    dec_attn_dropout = 0.25,                 # decoder attention dropout
    dec_ff_dropout = 0.25                    # decoder feedforward dropout
).cuda()

wrapper = TrainingWrapper(
    retro = retro,                                 # the RETRO instance above
    knn = 2,                                       # knn (2 in paper was sufficient)
    chunk_size = 64,                               # chunk size (64 in paper)
    documents_path = './text_folder',              # path to folder of text
    glob = '**/*.txt',                             # text glob
    chunks_memmap_path = './train.chunks.dat',     # path to chunks
    seqs_memmap_path = './train.seq.dat',          # path to sequence data
    doc_ids_memmap_path = './train.doc_ids.dat',   # path to document ids per chunk (used for filtering neighbors belonging to same document)
    max_chunks = 1_000_000,                        # maximum cap to chunks
    max_seqs = 100_000,                            # maximum seqs
    knn_extra_neighbors = 100,                     # num extra neighbors to fetch
    max_index_memory_usage = '100m',
    current_memory_available = '1G'
)

# get the dataloader and optimizer (AdamW with all the correct settings)

train_dl = iter(wrapper.get_dataloader(batch_size = 2, shuffle = True))
optim = wrapper.get_optimizer(lr = 3e-4, wd = 0.01)

# now do your training
# ex. one gradient step

seq, retrieved = map(lambda t: t.cuda(), next(train_dl))

# seq       - (2, 2049)         - 1 extra token since split by seq[:, :-1], seq[:, 1:]
# retrieved - (2, 32, 2, 128)   - 128 since chunk + continuation, each 64 tokens

loss = retro(
    seq,
    retrieved,
    return_loss = True
)

# one gradient step

loss.backward()
optim.step()
optim.zero_grad()

# do above for many steps, then ...

# encode prompt
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

prompt_str = "The movie Dune was released in"

prompt_ids = tokenizer(prompt_str)['input_ids'][1:-1]   # drop the [CLS]/[SEP] special tokens BERT's tokenizer adds

prompt = torch.tensor([prompt_ids])

sampled = wrapper.generate(prompt, filter_thres = 0.9, temperature = 1.0)

# decode sample
decoded = tokenizer.decode(sampled.tolist()[0])

print(decoded)

The code in the notebook for training several times is probably needed for good results though.
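As an aside, `documents_path` and `glob` above just select the raw `.txt` files that get chunked. A quick way to preview what the glob will match, using only `pathlib` (the folder and file names here are made up for illustration):

```python
import tempfile
from pathlib import Path

# build a throwaway text folder with one nested file
root = Path(tempfile.mkdtemp())
(root / "sub").mkdir()
(root / "a.txt").write_text("first document")
(root / "sub" / "b.txt").write_text("second document")

# '**/*.txt' matches .txt files at any depth, like the wrapper's glob
matched = sorted(p.name for p in root.glob("**/*.txt"))
print(matched)  # ['a.txt', 'b.txt']
```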

aakashgoel12 commented 1 year ago

@filipesilva Can you please share the notebook you are referencing? It's not accessible. Or if you can share code for training multiple epochs, that would be really helpful. Thanks

filipesilva commented 1 year ago

@aakashgoel12 looks like the notebook that was in #23 is not there anymore. I don't have a copy of it, unfortunately. All the code I have is what I put in the comment.

aakashgoel12 commented 1 year ago

> @aakashgoel12 looks like the notebook that was in #23 is not there anymore. I don't have a copy of it, unfortunately. All the code I have is what I put in the comment.

Thanks @filipesilva. Can you please check if what I have written below is correct or needs some modification. Thanks in advance.

from tqdm import tqdm

num_epochs = 3
train_dl = wrapper.get_dataloader(batch_size = 4, shuffle = True)   # keep it a DataLoader (no iter()) so it can be re-iterated each epoch
for epoch in range(num_epochs):
    for counter, batch in enumerate(tqdm(train_dl)):
        seq, retrieved = map(lambda t: t.cuda(), batch)
        loss = retro(
            seq,
            retrieved,
            return_loss = True)
        # one gradient step
        loss.backward()
        optim.step()
        optim.zero_grad()
        if counter % 10 == 0:
            print("Epoch: {}, BatchNo: {}, Loss: {}".format(epoch, counter, loss.item()))
    print("After epoch {}, loss: {}".format(epoch, loss.item()))

filipesilva commented 1 year ago

I really can't tell 😅 I only played around with this a couple of months ago and never really tried again.

yerinNam commented 6 months ago

Hello,

documents_path = './text_folder',              # path to folder of text
glob = '**/*.txt',                             # text glob
chunks_memmap_path = './train.chunks.dat',     # path to chunks
seqs_memmap_path = './train.seq.dat',          # path to sequence data
doc_ids_memmap_path = './train.doc_ids.dat',   # path to document ids per chunk

Are these paths somewhere inside the retro repo? Or what dataset do they point to?