nlpxucan / WizardLM

LLMs built upon Evol-Instruct: WizardLM, WizardCoder, WizardMath

What are Beams? #118

Closed nixtoshi closed 1 year ago

nixtoshi commented 1 year ago

Could you explain in a simple way what Beams are when configuring the model? 😅

I can't find a simple explanation of how to adjust the beams slider when testing WizardLM at: https://33aaac370fece283.gradio.app/

I understand all the other variables when configuring generations, but I don't understand Beams.

Online, I only found that using more Beams results in a better generation, but it is more computationally costly and the generation takes more time.

I also understood that what Beams basically do is test more token chains, trying to find the token chain with the highest probability of fitting the other criteria. But I'm not sure whether this is accurate, or if I understood it correctly.

Thank you in advance!

nlpxucan commented 1 year ago

Beam refers to the beam search of the generation model. A larger beam expands the search space of the generation model, leading to better results but slower performance.
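For concreteness: assuming the Gradio demo wraps Hugging Face transformers' `generate()` (an assumption, since the backend code isn't shown in this thread), the Beams slider would correspond to the `num_beams` argument. A minimal sketch, with a placeholder checkpoint name:

```python
# Hypothetical sketch: the "Beams" slider most likely maps to num_beams in
# Hugging Face transformers' generate(); the model name and values below are
# illustrative, not taken from the demo's actual code.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "WizardLM/WizardLM-7B-V1.0"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Explain beam search in one sentence.", return_tensors="pt")

# num_beams=1 is greedy decoding; larger values widen the search
# (better candidates, but slower and more memory-hungry).
outputs = model.generate(
    **inputs,
    num_beams=4,          # the "Beams" setting
    max_new_tokens=64,
    early_stopping=True,  # stop once all beams have finished
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```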

nixtoshi commented 1 year ago

Thank you! I'm closing this issue.

ChatGPT using GPT4 gave me this answer as well:

In the context of language models, and in particular, large language models (LLMs), Beam Search is a heuristic search algorithm used to improve the quality of the generated text. It is mainly used in sequence generation tasks such as machine translation, text generation, and speech recognition.

At its core, Beam Search is an extension of the Greedy Search algorithm. In a greedy search, at each time step, the model selects the word with the highest predicted probability as the next word in the sequence, and this selection is irreversible. However, this method is "greedy" because it makes locally optimal choices at each step without considering the impact on the overall sequence.

Beam Search tries to mitigate this by considering multiple (a "beam" of) possibilities at each step. Instead of choosing the best option at each step, it keeps track of multiple options (the "beam width") and explores these parallel paths, updating them as it goes along. At each step, it extends all the options in the current beam, then selects the best options from this extended set to form the beam for the next step.

The advantage of Beam Search over Greedy Search is that it can potentially find better overall sequences by exploring more paths. However, it comes with increased computational cost because the model needs to keep track of and extend multiple paths instead of just one.

One important thing to note is that Beam Search doesn't guarantee finding the globally optimal solution because it's still a form of greedy algorithm - it only considers a limited number of best options at each step (determined by the beam width) rather than exploring all possible paths. However, in practice, it often provides a good trade-off between computational efficiency and solution quality.
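A tiny worked example of that "locally optimal" pitfall, with made-up probabilities: greedy decoding commits to the highest-probability first token, even when a sequence starting with the second-best first token has a higher overall probability.

```python
# Invented two-step toy distributions (not from any real model) showing how
# greedy decoding can miss the sequence that a wider beam would keep.
step1 = {"A": 0.6, "B": 0.4}                      # P(first token)
step2 = {"A": {"x": 0.55, "y": 0.45},             # P(second | first = "A")
         "B": {"x": 0.90, "y": 0.10}}             # P(second | first = "B")

# Greedy: commit to the single best token at each step.
first = max(step1, key=step1.get)                 # "A" (0.6)
second = max(step2[first], key=step2[first].get)  # "x" (0.55)
print("greedy:", first + second, round(step1[first] * step2[first][second], 2))  # Ax 0.33

# What a wide-enough beam (here: exhaustive search) would find instead.
best = max(((f + s, step1[f] * step2[f][s]) for f in step1 for s in step2[f]),
           key=lambda pair: pair[1])
print("beam  :", best[0], round(best[1], 2))      # Bx 0.36
```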

And GPT-3.5's response:

In the context of language models, Beam Search is a decoding algorithm commonly used in natural language processing tasks such as machine translation, text generation, and speech recognition. It is used to generate the most likely sequence of words given a source input or prompt.

When generating text with a language model, the model assigns probabilities to each possible next word in the sequence. Beam Search is an algorithm that helps select the most probable sequence of words by exploring multiple candidates in parallel.

Here's how Beam Search works:

  1. Initialization: The algorithm starts with an initial input, usually a seed word or a prompt, and generates the first set of candidate words based on the language model's probabilities.

  2. Expanding candidates: From the initial set of candidates, the algorithm considers all possible next words and calculates their probabilities based on the language model. It expands the search by creating new sequences with each candidate word appended to the existing candidates.

  3. Selecting top candidates: The algorithm selects the top candidates with the highest probabilities based on the language model's output. The number of candidates to keep is determined by a parameter called the beam width or beam size. The higher the beam width, the more candidates are considered, but it also increases computational complexity.

  4. Pruning: After selecting the top candidates, the algorithm prunes the remaining candidates, discarding the low-probability sequences to reduce computational resources.

  5. Iteration: Steps 2-4 are repeated iteratively, with the selected candidates becoming the input for the next iteration. This process continues until a maximum length is reached, or a termination condition is met (e.g., generating an end-of-sentence token).

  6. Final selection: Once the termination condition is met, the algorithm selects the sequence with the highest overall probability as the final output.

By exploring multiple candidates simultaneously and selecting the most probable sequences, Beam Search helps overcome the limitations of greedy decoding, where only the most probable word is chosen at each step. Beam Search allows for more diverse outputs and captures a wider range of possibilities, but it can also be sensitive to choosing the right beam width, as a narrow beam may lead to overly conservative predictions, while a wide beam may generate less coherent or more redundant outputs.

Researchers and practitioners often combine Beam Search with other techniques, such as length normalization, to improve the quality and diversity of the generated text.
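A minimal, self-contained sketch of steps 1-6 above, using an invented toy next-token distribution in place of a real language model (the token names, log-probabilities, and the helper `next_token_logprobs` are all illustrative):

```python
# Toy "language model": invented next-token log-probabilities; a real model
# would run a forward pass over the prefix here.
VOCAB_LOGPROBS = {"the": -0.5, "cat": -1.0, "sat": -1.2, "<eos>": -2.0}

def next_token_logprobs(prefix):
    # Stand-in: returns the same distribution regardless of the prefix.
    return VOCAB_LOGPROBS

def beam_search(beam_width=3, max_len=5):
    # 1. Initialization: one empty hypothesis with log-probability 0.
    beams = [([], 0.0)]          # each entry: (token list, cumulative log-prob)
    finished = []

    for _ in range(max_len):
        # 2. Expanding candidates: extend every live hypothesis by every token.
        candidates = []
        for tokens, score in beams:
            for tok, logp in next_token_logprobs(tokens).items():
                candidates.append((tokens + [tok], score + logp))

        # 3./4. Selecting the top candidates and pruning the rest.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates:
            if tokens[-1] == "<eos>":
                # 5. Termination: hypotheses that emit <eos> are set aside.
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
            if len(beams) == beam_width:
                break
        if not beams:            # every hypothesis has finished
            break

    # 6. Final selection: highest-scoring hypothesis overall.
    return max(finished + beams, key=lambda c: c[1])

print(beam_search())
```

This sketch keeps the cumulative log-probability as the score; production decoders usually also apply length normalization (as mentioned above) so that longer hypotheses aren't unfairly penalized.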