marcom / Llama.jl

Julia interface to llama.cpp, a C/C++ library for running language models
MIT License

Fixed inference on Metal GPU backend + updated docs for run_* programs #5

Closed — svilupp closed 9 months ago

svilupp commented 9 months ago

Default parameter changes:

Added a README reference on how to start the server.
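A minimal sketch of what starting the server could look like from Julia, assuming Llama.jl exposes a `run_server` wrapper around llama.cpp's `server` program (the keyword names and model path below are illustrative, not confirmed API):

```julia
using Llama

# Point this at a local GGUF model file (placeholder path):
model_path = "models/model.gguf"

# Hypothetical call; check the README for the actual signature.
run_server(; model = model_path, host = "127.0.0.1", port = 10897)
```

Once running, the server can be queried over HTTP on the configured host and port.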

svilupp commented 9 months ago

As part of this, I've noticed that run_chat runs forever :D The Ctrl+C control sequence gets eaten by the REPL, so we can't kill it... Will open a separate issue.
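For context, one common pattern for making a blocking call killable from the REPL is to catch `InterruptException` around it. This is only a sketch of that general technique, not the run_chat implementation:

```julia
# Wrap a blocking operation so Ctrl+C (InterruptException) is handled
# cleanly instead of being swallowed. `f` is any zero-argument callable.
function interruptible(f)
    try
        f()
    catch e
        # Re-throw anything that isn't a user interrupt.
        e isa InterruptException || rethrow()
        @info "Interrupted by user; shutting down."
    end
end

# Usage (hypothetical): interruptible(() -> run_chat(...))
```

Note that when the blocking work happens inside a child process, the SIGINT may be delivered to that process rather than to Julia, which is likely part of why the REPL appears to eat the keystroke.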

svilupp commented 9 months ago

This one is good to go.

I'll fix the SIGINT issue separately, but we need to merge this first.