A tiny DSL to make it easier to explore and experiment with AI pipelines.
summarize.aipl
Here's a prime example, a multi-level summarizer in the "map-reduce" style of langchain:
#!/usr/bin/env bin/aipl
# fetch url, split webpage into chunks, summarize each chunk, then summarize the summaries.
# the inputs are urls
!read
# extract text from html
!extract-text
# split into chunks of lines that can fit in the context window
!split maxsize=8000 sep=\n
# have GPT summary each chunk
!format
Please read the following section of a webpage (500-1000 words) and provide a
concise and precise summary in a few sentences, optimized for keywords and main
content topics. Write only the summary, and do not include phrases like "the
article" or "this webpage" or "this section" or "the author". Ensure the tone
is precise and concise, and provide an overview of the entire section:
"""
{_}
"""
!llm model=gpt-3.5-turbo
# join the section summaries together
!join sep=\n-
# have GPT summarize the combined summaries
!format
Based on the summaries of each section provided, create a one-paragraph summary
of approximately 100 words. Begin with a topic sentence that introduces the
overall content topic, followed by several sentences describing the most
relevant subsections. Provide an overview of all section summaries and include
a conclusion or recommendations only if they are present in the original
webpage. Maintain a precise and concise tone, and make the overview coherent
and readable, while preserving important keywords and main content topics.
Remove all unnecessary text like "The document" and "the author".
"""
{_}
"""
!llm model=gpt-3.5-turbo
!print
usage: aipl [-h] [--debug] [--test] [--interactive] [--step STEP] [--step-breakpoint] [--step-rich] [--step-vd] [--dry-run] [--cache-db CACHEDBFN] [--no-cache]
[--output-db OUTDBFN] [--split SEPARATOR]
[script_or_global ...]
AIPL interpreter
positional arguments:
script_or_global scripts to run, or k=v global parameters
options:
-h, --help show this help message and exit
--debug, -d abort on exception
--test, -t enable test mode
--interactive, -i interactive REPL
--step STEP call aipl.step_<func>(cmd, input) before each step
--step-breakpoint, -x
breakpoint() before each step
--step-rich, -v output rich table before each step
--step-vd, --vd open VisiData with input before each step
--dry-run, -n do not execute @expensive operations
--cache-db CACHEDBFN, -c CACHEDBFN
sqlite database for caching operators
--no-cache sqlite database for caching operators
--output-db OUTDBFN, -o OUTDBFN
sqlite database accessible to !db operators
--split SEPARATOR, --separator SEPARATOR, -s SEPARATOR
separator to split input on
This is the basic syntax:
#
as the first character of a line, and ignore the whole line.!
as the first character of a line.!
command.Commands can take positional and/or keyword arguments, separated by whitespace.
!cmd arg1 key=value arg2
Keyword arguments have an =
between the key and the value, and non-keyword arguments are those without a =
in them.
!cmd
will call the Python function registered to the cmd
operator with the arguments given, as an operator on the current value.
Any text following the command line is dedented (and stripped) and added verbatim as a prompt=
keyword argument.
Argument values may include Python formatting like {input}
which will be replaced by values from the current row (falling back to parent rows, and ultimately the provided globals).
Prompt values, on the other hand, are not automatically formatted. !format
go over every leaf row and return the formatted prompt as its output.
!literal will set its prompt as the toplevel input, without formatting.
The AIPL syntax will continue to evolve and be clarified over time as it's used and developed.
Notes:
!literal
) to the end of the pipeline (often !print
).!abort
(in=None out=None)
Abort the current chain.
!cluster
(in=1 out=1)
Cluster rows by embedding into n clusters; add label column.
!columns
(in=1.5 out=1.5)
Create new table containing only these columns.
!comment
(in=None out=None)
Do nothing (ignoring args and prompt).
!cross
(in=0.5 out=1.5)
Construct cross-product of current input with given global table
!global
(in=100 out=1.5)
Save toplevel input into globals.
!unbox
(in=1.5 out=1.5)
None
!csv-parse
(in=None out=1.5)
Converts a .csv into a table of rows.
!dbopen
(in=None out=0)
Open connection to database.
!dbquery
(in=0.5 out=1.5)
Query database table.
!dbdrop
(in=None out=None)
Drop database table.
!dbinsert
(in=0.5 out=None)
Insert each row into database table.
!option
(in=None out=None)
Set option=value.
!debug
(in=None out=None)
set debug flag and call breakpoint() before each command
!def
(in=0 out=None)
Define composite operator from cmds in prompt (must be indented).
!extract-text-all
(in=0 out=0)
Extract all text from HTML
!extract-text
(in=0 out=0)
Extract meaningful text from HTML
!extract-links
(in=0 out=1.5)
Extract (linktext, title, href) from tags in HTML
!filter
(in=1.5 out=1.5)
Return copy of table, keeping only rows whose value is Truthy.
!format
(in=0.5 out=0)
Format prompt text (right operand) as a Python string template, substituting values from row (left operand) and global context.
!groupby
(in=1.5 out=1.5)
Group rows into tables, by set of columns given as args.
!require-input
(in=100 out=100)
Ensure there is any input at all; if not, display the prompt and read input from the user.
!join
(in=1 out=0)
Join inputs with sep into a single output scalar.
!json
(in=100 out=0)
Convert Table into a json blob.
!json-parse
(in=0 out=1.5)
Convert a json blob into a Table.
!literal
(in=None out=0)
Set prompt as top-level input, without formatting.
!llm
(in=0 out=0)
Send chat messages to model
(default: gpt-3.5-turbo). Lines beginning with @@@s or @@@a are sent as system or assistant messages respectively (default user). Passes all named args directly to API.
!llm-embedding
(in=0 out=0.5)
Get a text embedding for a string from model
: a measure of text-relatedness, to be used with e.g. !cluster.
!match
(in=0 out=0)
Return a bool with whether value matched regex. Used with !filter.
!metrics-accuracy
(in=1.5 out=0)
None
!metrics-precision
(in=1.5 out=0)
None
!metrics-recall
(in=1.5 out=0)
None
!name
(in=1.5 out=1.5)
Rename current input column to given name.
!nop
(in=None out=None)
No operation.
!pdf-extract
(in=0 out=0)
Extract contents of pdf to value.
!print
(in=0 out=None)
Print to stdout.
!python
(in=None out=None)
exec() Python toplevel statements.
!python-expr
(in=0.5 out=0)
Add columns for Python expressions.
!python-input
(in=0 out=1.5)
eval() Python expression and use as toplevel input table.
!ravel
(in=100 out=1.5)
All of the leaf scalars in the value column become a single 1-D array.
!read
(in=0 out=0)
Return contents of local filename.
!read-bytes
(in=0 out=0)
Return contents of URL or local filename as bytes.
!ref
(in=1.5 out=1.5)
Move column on table to end of columns list (becoming the new .value)
!regex-capture
(in=0 out=0.5)
Capture from prompt regex into named matching groups.
!regex-translate
(in=0 out=0)
Translate input according to regex translation rules in prompt, one per line, with regex and output separated by whitespace:
Dr.? Doctor
Jr.? Junior
!replace
(in=0 out=0)
Replace find
in all leaf values with repl
.
!sample
(in=1.5 out=1.5)
Sample n random rows from the input table.
!save
(in=0 out=None)
Save to given filename.
!sh
(in=0 out=1.5)
Run the command described by args. Return (retcode, stderr, stdout) columns.
!shtty
(in=None out=0.5)
Run the command described by args. Return (retcode, stderr, stdout) columns.
!sort
(in=1.5 out=1.5)
Sort the table by the given columns.
!grade-up
(in=1.5 out=1)
Assign ranks to unique elements in an array, incrementally increasing each by its corresponding rank value.
!split
(in=0 out=1)
Split text into chunks based on sep, keeping each chunk below maxsize.
!split-into
(in=0 out=0.5)
Split text by sep into the given column names.
!take
(in=1.5 out=1.5)
Return a table with first n rows of t
!test-input
(in=100 out=1.5)
In test mode, replace input with prompt.
!test-equal
(in=0 out=None)
In test mode, error if value is not equal to prompt.
!test-json
(in=100 out=None)
Error if value Column is not equal to json blob in prompt.
!url-split
(in=0 out=0.5)
Split url into components (scheme, netloc, path, params, query, fragment).
!url-defrag
(in=0 out=0)
Remove fragment from url.
!xml-xpath
(in=0 out=1)
Return a vector of XMLElements from parsing entries in value.
!xml-xpaths
(in=0 out=0.5)
Return a vector of XMLElements from parsing entries in value; kwargs become column_name=xpath.
!aipl-ops
(in=0 out=0)
None
It's pretty easy to define a new operator that can be used right away.
For instance, here's how the !join
operator might be defined:
@defop('join', rankin=1, rankout=0)
def op_join(aipl:AIPL, v:List[str], sep=' ') -> str:
'Concatenate text values with *sep* into a single string.'
return sep.join(v)
@defop(...)
registers the decorated function as the named operator.
rankin
/rankout
indicate what the function takes as input, and what it returns:
0
: a scalar (number or string)0.5
: a whole row (a mapping of key/value pairs)1
: a vector of scalar values (e.g. List[str]
as above)1.5
: a whole Table (list of the whole table (array of rows)None
: nothing (the operator is an input "source" if rankin is None; it is a pass-through if rankout is None)arity
is how many operands it takes (only 0
and 1
supported currently)The join operator is rankin=1 rankout=0
which means that it takes a list of strings and outputs a single string.
@expensive
decorator to operators that actually go to the network or use an LLM; this will persistently cache the results in a local sqlite database.
The fundamental data structure is a Table: an array of hashmaps ("rows"), with named Columns that key into each Row to get its value.
A value can be a string or a number or another Table.
The value of a row is the value in the rightmost column of its table. The rightmost column of a table is a vector of values representing the whole table.
A simple vector has only strings or numbers. A simple table has a simple rightmost value vector and is Rank 0. Each nesting of tables in the rightmost value vector increases its Rank by 1.
Each operator consumes 0 or 1 or 2 operands (its arity
), and produces one result, which becomes the operand for the next operator.
Each operator has an "in rank" and an "out rank", which is the rank of the operands they input and output.
By default, each operator is applied across the deepest nested table. The result of each operator is then placed in the deepest nested table (or its parent).
With rankin=0
and rankout
of:
With rankin=0.5
, and rankout
of:
With rankin=1
, and rankout
of:
With rankin=2
, and rankout
of:
In addition to operands, operators also take parameters, both positional and named (args
and kwargs
in Python).
These cannot have spaces, but they can have Python format strings like {input}
.
The identifiers available to Python format strings come from a chain of contexts:
Come chat with us on Discord bluebird.sh/chat or Mastodon @saulpw@fosstodon.org.
If you want to get updates about I'm playing with, you can sign up for my AI mailing list.
Licensed under MIT.