gsuuon / ad-llama

Structured inference with Llama 2 in your browser
https://gsuuon.github.io/ad-llama/
MIT License
52 stars 2 forks source link
javascript llm local-llm mlc npm

ad-llama 🦙

Use tagged template literals for structured Llama 2 inference locally in browser via mlc-llm. Runs on Chromium browsers with WebGPU support.

https://github.com/gsuuon/ad-llama/assets/6422188/d38ef8c3-8ba1-4757-a992-ea8720776780

Check out the playground, a simple dnd npc generator, or a multi-step color generator built on solid-js. There's also a documentation page -- say hi in discord!

Usage

npm install -S ad-llama

import { loadModel, ad, report } from 'ad-llama'

const loadedModel = await loadModel('Llama-2-7b-chat-hf-q4f32_1', report(console.info))
const { context, a } = ad(loadedModel)

const dm = context(
  'You are a dungeon master.',
  'Create an NPC based on the Dungeons and Dragons universe.'
)

const npc = dm`{
  "description": "${a('short description')}",
  "name": "${a('character name')}",
  "weapon": "${a('weapon')}",
  "class": "${a('primary class')}"
}`

const generatedNpc = await npc.collect(partial => {
  switch (partial.type) {
    case 'gen':
    case 'lit':
      console.info(partial.content)
      break
  }
})

console.log(generatedNpc)

For an example of more complicated usage including validation, retry logic and transforms check the hackernews who's hiring example.

Generation

Each expression in the template literal is a new prompt and options. The prompt given for each expression is added to the system and preprompt established in context, and prior completion text (literal parts and as well as inferences) are added to the end of the LLM prompt as a partially completed assistant response (i.e. after [/INST]).

Each template expression can be configured independently - you can set a different temperature, token count, max length and more. Check the TemplateExpressionOptions type in ./src/types.ts for all options. a adds the preword to the expression prompt (by default "Generate"), use prompt to provide a naked prompt or set the preword option. A plain string gets inserted as literal text, just like normal template literals.

template`{
  "name": "${characterName}",
  "description": "${a('clever description', {
    maxTokens: 1000,
    stops: ['\n']
  })}",
  "class": "${a('a primary class for the character')}"
}`

Biased Sampling

You can modify the sampler for each expression -- this allows you to adjust the logits before sampling. You can, for example, only accept number characters for one expression, while in another avoid specific strings. The main example here shows some of these options. You can build your own, but there are some helper functions exposed as sample to build samplers. The loaded model object has a bias field which configures the sampling - avoid, prefer allow you to adjust relative likelihood of certain tokens ids, while reject, accept change the logits to negative infinity for some ids (or all other ids).

import { sample } from 'ad-llama'

const model = await loadModel(...)

const { bias } = model
const { oneOf, consistsOf } = sample

const { context, a } = ad(model)
const character = context('Create a character')

character`{
  "weapon": "${a('special weapon', {
    sampler: bias.prefer(oneOf(['Nun-chucks', 'Beam Cannon']), 10),
  })}",
  "age": ${a('age', {
    sampler: bias.accept(consistsOf(['0','1','2','3','4','5','6', '7', '8', '9'])),
    maxTokens: 3
  })},
}`

As tokenizers are stateful, a simple encoding of just the provided strings won't necessarily produce tokens that would fit into the existing sequence. For example, with Llama 2's tokenizer foo and "foo" encode to completely different tokens:

encode('"foo"') === [376, 5431, 29908]
encode('foo') === [7953]
decode([5431]) === 'foo'
decode([7953]) === 'foo'

oneOf, consistsOf try to figure out relevant tokens for the provided strings within the context of the currently inferring expression.

consistsOf is for classes of characters - each sample will have the same token logits modified based on the given relevant tokens. So in consistsOf(['a','b']), every sample will have tokens for 'a' and 'b' modified.

oneOf is for strings - each sample has logits modified depending on the tokens which are still relevant given the already sampled tokens. For example, if we have oneOf(['ranger', 'wizard']) and we've already sampled 'w', the only next relevant tokens would be from 'izard'. If you want oneOf to stop at one of the choices, include the stop character (by default the next character after the expression), eg: oneOf(['ranger"', 'wizard"']).

Even though reject(oneOf(['ranger', 'wizard'])) will never make it past the first token for either of the strings, giving the entire string still lets you target the correct tokens for completing those specific strings.

Validation

You can provide an expression validation function with a retry count. If validation fails, that expression will be attempted again up to retry times, after which whatever was generated is taken. You can also transform the result of the expression generation (this happens whether validation passes or not).

validate?: {
  check?: (partial: string) => boolean
  transform?: (partial: string) => string
  retries: number
}

Vite HMR

Waiting for models to reload can be tedious, even when they're cached. ad-llama should work with vite HMR so the loaded models stay in memory. Put this in your source file to create an HMR boundary:

import { loadModel, ad, guessModelSpecFromPrebuiltId } from 'ad-llama'

+ if (import.meta.hot) { import.meta.hot.accept() }

const loadedModel = await loadModel(guessModelSpecFromPrebuiltId('Llama-2-7b-chat-hf-q4f32_1'))

Build

Motivation

I was inspired by guidance but felt that tagged template literals were a better way to express structured inference. I also think grammar based sampling is neat, and wanted to add a way to plug something like that into MLC infrastructure.

Todos