simonw / llm

Access large language models from the command-line
https://llm.datasette.io
Apache License 2.0

Command showing available options for installed models #82

Closed: simonw closed this issue 1 year ago

simonw commented 1 year ago

This might be part of llm models list, or it may be something else.

Follows:

simonw commented 1 year ago

I'll do this:

llm models list --options

And introspect the Options class.
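A minimal sketch of what that introspection could look like, assuming a pydantic v1-style Options class whose fields carry Field(description=...) metadata; the class and fields below are illustrative, not llm's actual internals:

```python
from typing import Optional

from pydantic import BaseModel, Field  # assumes pydantic v1


class Options(BaseModel):
    # Illustrative fields only - the real Options classes live on each model plugin
    temperature: Optional[float] = Field(
        default=None,
        description="What sampling temperature to use, between 0 and 2.",
    )
    max_tokens: Optional[int] = Field(
        default=None, description="Maximum number of tokens to generate"
    )


def describe_options(options_class) -> None:
    # pydantic v1 exposes declared fields via __fields__ (name -> ModelField)
    for name, field in options_class.__fields__.items():
        type_name = getattr(field.outer_type_, "__name__", str(field.outer_type_))
        description = field.field_info.description
        suffix = f" - {description}" if description else ""
        print(f"  {name}: {type_name}{suffix}")


describe_options(Options)
```

Run directly, that prints lines in roughly the shape shown in the next comment.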

simonw commented 1 year ago

I got this working:

OpenAI Chat: gpt-3.5-turbo (aliases: 3.5, chatgpt)
  temperature: float - What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.
  max_tokens: int - Maximum number of tokens to generate
  top_p: float - An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. Recommended to use top_p or temperature but not both.
  frequency_penalty: float - Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
  presence_penalty: float - Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
  stop: str - A string where the API will stop generating further tokens.
  logit_bias: Union[dict, str, NoneType] - Modify the likelihood of specified tokens appearing in the completion.
OpenAI Chat: gpt-3.5-turbo-16k (aliases: chatgpt-16k, 3.5-16k)
  temperature: float - What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.
  max_tokens: int - Maximum number of tokens to generate
  top_p: float - An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. Recommended to use top_p or temperature but not both.
  frequency_penalty: float - Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
  presence_penalty: float - Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
  stop: str - A string where the API will stop generating further tokens.
  logit_bias: Union[dict, str, NoneType] - Modify the likelihood of specified tokens appearing in the completion.
OpenAI Chat: gpt-4 (aliases: 4, gpt4)
  temperature: float - What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.
  max_tokens: int - Maximum number of tokens to generate
  top_p: float - An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. Recommended to use top_p or temperature but not both.
  frequency_penalty: float - Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
  presence_penalty: float - Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
  stop: str - A string where the API will stop generating further tokens.
  logit_bias: Union[dict, str, NoneType] - Modify the likelihood of specified tokens appearing in the completion.
OpenAI Chat: gpt-4-32k (aliases: 4-32k)
  temperature: float - What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.
  max_tokens: int - Maximum number of tokens to generate
  top_p: float - An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. Recommended to use top_p or temperature but not both.
  frequency_penalty: float - Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
  presence_penalty: float - Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
  stop: str - A string where the API will stop generating further tokens.
  logit_bias: Union[dict, str, NoneType] - Modify the likelihood of specified tokens appearing in the completion.
Markov: markov
  length: int
  delay: float
PaLM 2: chat-bison-001 (aliases: palm, palm2)
gpt4all: orca-mini-3b - Orca (Small), 1.80GB download, needs 4GB RAM (installed)
gpt4all: ggml-gpt4all-j-v1 - Groovy, 3.53GB download, needs 8GB RAM (installed)
gpt4all: orca-mini-7b - Orca, 3.53GB download, needs 8GB RAM (installed)
gpt4all: ggml-replit-code-v1-3b - Replit, 4.84GB download, needs 4GB RAM (installed)
gpt4all: ggml-vicuna-13b-1 - Vicuna (large), 7.58GB download, needs 16GB RAM (installed)
gpt4all: nous-hermes-13b - Hermes, 7.58GB download, needs 16GB RAM (installed)
gpt4all: ggml-model-gpt4all-falcon-q4_0 - GPT4All Falcon, 3.78GB download, needs 8GB RAM
gpt4all: ggml-vicuna-7b-1 - Vicuna, 3.92GB download, needs 8GB RAM
gpt4all: ggml-wizardLM-7B - Wizard, 3.92GB download, needs 8GB RAM
gpt4all: ggml-mpt-7b-base - MPT Base, 4.52GB download, needs 8GB RAM
gpt4all: ggml-mpt-7b-instruct - MPT Instruct, 4.52GB download, needs 8GB RAM
gpt4all: ggml-mpt-7b-chat - MPT Chat, 4.52GB download, needs 8GB RAM
gpt4all: orca-mini-13b - Orca (Large), 6.82GB download, needs 16GB RAM
gpt4all: GPT4All-13B-snoozy - Snoozy, 7.58GB download, needs 16GB RAM
gpt4all: ggml-nous-gpt4-vicuna-13b - Nous Vicuna, 7.58GB download, needs 16GB RAM
gpt4all: ggml-stable-vicuna-13B - Stable Vicuna, 7.58GB download, needs 16GB RAM
gpt4all: wizardLM-13B-Uncensored - Wizard Uncensored, 7.58GB download, needs 16GB RAM
Mpt30b: mpt30b (aliases: mpt)
  verbose: <class 'bool'>

I don't like that it outputs the same help text multiple times, though. I'm going to have it print the detailed descriptions only once for each distinct Options class.
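One way that deduplication could look, reusing the same pydantic v1 introspection as the sketch above and tracking which Options classes have already been described (again illustrative, not the actual implementation):

```python
# Print full descriptions only the first time an Options class is seen;
# later models sharing that class get plain "name: type" lines.
seen_option_classes = set()


def output_options(options_class) -> None:
    show_descriptions = options_class not in seen_option_classes
    seen_option_classes.add(options_class)
    for name, field in options_class.__fields__.items():
        type_name = getattr(field.outer_type_, "__name__", str(field.outer_type_))
        description = field.field_info.description
        if show_descriptions and description:
            print(f"  {name}: {type_name} - {description}")
        else:
            print(f"  {name}: {type_name}")
```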

simonw commented 1 year ago

Tests are failing now: cog detects docs/usage.md as changed on Python 3.8, for some reason.

simonw commented 1 year ago

Could be related to plugin order. It shouldn't be, though; I would expect this order to be the same each time with only the default openai plugin installed: https://github.com/simonw/llm/blob/18f34b5df25f20afceaa6a85dbd55d2b6dd47cb3/llm/__init__.py#L49-L56

simonw commented 1 year ago

Could be the order of this bit: https://github.com/simonw/llm/blob/18f34b5df25f20afceaa6a85dbd55d2b6dd47cb3/llm/cli.py#L381

simonw commented 1 year ago

Actually the problem was something else. I copied the generated text to my local environment, ran a diff, and got this:

diff --git a/docs/usage.md b/docs/usage.md
index 8e76010..dfc4c16 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -98,53 +98,53 @@ cog.out("```\n{}\n```".format(result.output))
 OpenAI Chat: gpt-3.5-turbo (aliases: 3.5, chatgpt)
-  temperature: float
+  temperature: Union[float, NoneType]
     What sampling temperature to use, between 0 and 2. Higher values like
     0.8 will make the output more random, while lower values like 0.2 will
     make it more focused and deterministic.
-  max_tokens: int
+  max_tokens: Union[int, NoneType]
     Maximum number of tokens to generate
-  top_p: float
+  top_p: Union[float, NoneType]
     An alternative to sampling with temperature, called nucleus sampling,
     where the model considers the results of the tokens with top_p
     probability mass. So 0.1 means only the tokens comprising the top 10%
     probability mass are considered. Recommended to use top_p or
     temperature but not both.
-  frequency_penalty: float
+  frequency_penalty: Union[float, NoneType]
     Number between -2.0 and 2.0. Positive values penalize new tokens based
     on their existing frequency in the text so far, decreasing the model's
     likelihood to repeat the same line verbatim.
-  presence_penalty: float
+  presence_penalty: Union[float, NoneType]
     Number between -2.0 and 2.0. Positive values penalize new tokens based
     on whether they appear in the text so far, increasing the model's
     likelihood to talk about new topics.
-  stop: str
+  stop: Union[str, NoneType]
     A string where the API will stop generating further tokens.
   logit_bias: Union[dict, str, NoneType]
     Modify the likelihood of specified tokens appearing in the completion.
 OpenAI Chat: gpt-3.5-turbo-16k (aliases: chatgpt-16k, 3.5-16k)
-  temperature: float
-  max_tokens: int
-  top_p: float
-  frequency_penalty: float
-  presence_penalty: float
-  stop: str
+  temperature: Union[float, NoneType]
+  max_tokens: Union[int, NoneType]
+  top_p: Union[float, NoneType]
+  frequency_penalty: Union[float, NoneType]
+  presence_penalty: Union[float, NoneType]
+  stop: Union[str, NoneType]
   logit_bias: Union[dict, str, NoneType]
 OpenAI Chat: gpt-4 (aliases: 4, gpt4)
-  temperature: float
-  max_tokens: int
-  top_p: float
-  frequency_penalty: float
-  presence_penalty: float
-  stop: str
+  temperature: Union[float, NoneType]
+  max_tokens: Union[int, NoneType]
+  top_p: Union[float, NoneType]
+  frequency_penalty: Union[float, NoneType]
+  presence_penalty: Union[float, NoneType]
+  stop: Union[str, NoneType]
   logit_bias: Union[dict, str, NoneType]
 OpenAI Chat: gpt-4-32k (aliases: 4-32k)
-  temperature: float
-  max_tokens: int
-  top_p: float
-  frequency_penalty: float
-  presence_penalty: float
-  stop: str
+  temperature: Union[float, NoneType]
+  max_tokens: Union[int, NoneType]
+  top_p: Union[float, NoneType]
+  frequency_penalty: Union[float, NoneType]
+  presence_penalty: Union[float, NoneType]
+  stop: Union[str, NoneType]
   logit_bias: Union[dict, str, NoneType]

So clearly on Python 3.8 this bit of code produces different output: https://github.com/simonw/llm/blob/18f34b5df25f20afceaa6a85dbd55d2b6dd47cb3/llm/cli.py#L382-L384
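The diff shows Python 3.8 rendering Optional[float] as Union[float, NoneType] where newer Pythons render a bare float. Assuming the mismatch comes from the step that turns the annotation into a display string, here is one way to normalize it across versions (a sketch, not necessarily the fix that was applied):

```python
import typing


def option_type_name(annotation) -> str:
    """Render Optional[T] consistently as T's name across Python versions."""
    args = typing.get_args(annotation)
    if typing.get_origin(annotation) is typing.Union:
        non_none = [a for a in args if a is not type(None)]
        if len(non_none) == 1:
            # Optional[float] -> "float", on 3.8 and newer alike
            return option_type_name(non_none[0])
        # Leave genuine unions such as Union[dict, str, None] spelled out
        return str(annotation).replace("typing.", "")
    return getattr(annotation, "__name__", str(annotation))


print(option_type_name(typing.Optional[float]))         # float
print(option_type_name(typing.Union[dict, str, None]))  # Union[dict, str, NoneType]
print(option_type_name(int))                            # int
```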

simonw commented 1 year ago

To test locally I ran:

pyenv install 3.8.17

Then waited for that to compile.

Then:

~/.pyenv/versions/3.8.17/bin/python -m venv /tmp/pvenv
source /tmp/pvenv/bin/activate
pip install -e '.[test]'
/tmp/pvenv/bin/cog --check docs/usage.md

And to rewrite it:

/tmp/pvenv/bin/cog -r docs/usage.md

simonw commented 1 year ago

Extracted a TIL: https://til.simonwillison.net/python/quick-testing-pyenv

simonw commented 1 year ago

Now documented here: https://llm.datasette.io/en/latest/usage.html#listing-available-models - including a cog powered example.