bmwoodruff closed this 1 month ago
I see the chatbot as a great introductory exploratory tool. As we're trying to figure out good prompts (how much code to include from a method that's missing documentation, how much of an issue to include, with or without comments, etc.), the 8B chatbot can provide a quick idea of what capabilities we have. The no-chatbot version can be incorporated into a workflow that then applies what we learned via that experimentation over a much larger set of issues, methods missing documentation, and so on. I'll be creating another page similar to this one which examines how we can use RAG.
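To make the prompt experimentation concrete, here is a minimal sketch of the kind of prompt builder we might iterate on in the chatbot. The function name, parameters, and prompt wording are all hypothetical illustrations, not code from the shared notebooks:

```python
def build_docstring_prompt(method_source: str, issue_text: str = "") -> str:
    """Assemble a prompt asking the model to document an undocumented method.

    Both arguments are plain strings; how much of each to include is exactly
    the knob we want to experiment with in the chatbot.
    """
    parts = [
        "You are helping document a Python codebase.",
        "Write a clear docstring (summary, parameters, return value, and a short example)",
        "for the method below.",
    ]
    if issue_text:
        parts += ["", "Related issue discussion:", issue_text]
    parts += ["", "Method source:", "```python", method_source, "```"]
    return "\n".join(parts)
```

Pasting the output of something like this into the chatbot lets us compare how the model responds as we vary how much source and issue context we include.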
If you encounter a VRAM overflow error, make sure you've adjusted the `model_split` variable to pick the correct 8B version. You may also need to "Shut Down All Kernels" to free up any VRAM that is still being used. Once you start the chatbot, it will keep using VRAM until you kill the kernel.
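If you'd rather check VRAM from inside a notebook than from a terminal, a quick sketch using PyTorch (which should already be in the environment, since it's needed to run the model) looks like this:

```python
import torch

# Print free vs. total VRAM for each visible GPU (mem_get_info returns bytes).
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
```

If the free number stays low after you shut down kernels, something is still holding the GPU.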
The nebari server we're using is located at https://possee.openteams.com/.
1. Login to the server. If you forget your username or password (or need a new one), feel free to DM @bmwoodruff on Zulip.
2. Launch a server. To do this, click on the JupyterLab icon, select "Launch Server", and choose a machine. Choose the T4 GPU Instance 1x for using Llama 8B. This is the smallest size needed to run Llama 8B; any larger size costs more money to run without providing any benefit for the 8B version. It will take a few minutes for the machine to spin up.
3. Create a folder in the `/shared/users` folder with your username.
4. From the `/shared/analyst` folder, copy both `inference_example_no_chatbox.ipynb` and `Panel_Chat_Example_GPU_Split.ipynb` into the directory you created.
5. Open `Panel_Chat_Example_GPU_Split.ipynb`. Familiarize yourself with the file (run each line, experiment with the chatbot).
6. Open a terminal and type the command `nvidia-smi`. You'll see how much memory you are currently using.
7. The example panel chat can use the smaller model `Llama-3-8B-Instruct-262k-5.0bpw-h6-exl2` and the larger 70B version. You'll need more memory to run the 70B (you'll have to spin up a higher-tier machine, which means more costs associated with compute time). Note that the panel chat can be slow with the 70B version.
8. When you're not using a chat interface, remember to kill the kernel so that the memory is not constantly being used in the background. You can select Kernel -> Shut Down All Kernels, then run `nvidia-smi` to verify that you've shut down the kernels.
9. Let's explore how to use Llama 3 without needing a chatbot. Open `inference_example_no_chatbox.ipynb`. Familiarize yourself with the file (run each line, experiment with the tool); a rough sketch of this kind of scripted inference appears after this list. When you're done, remember to shut down all kernels.
10. The server will automatically shut down after a period of inactivity. However, we can keep costs lower by remembering to stop services we no longer need. To do this, return to the home page (click the Nebari icon in the upper left). Then use the triple dot (meatballs) menu and select "stop".
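For reference, the no-chatbot workflow amounts to loading the model once and then calling a generate function in a loop over whatever prompts we build. Below is a rough sketch of what that looks like with the exllamav2 library (the `exl2` suffix on the model above is ExLlamaV2's quantization format); the model path, `model_split` value, and sampler settings here are assumptions, so defer to `inference_example_no_chatbox.ipynb` for the actual code:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Hypothetical path -- use the model directory from the shared notebook.
model_dir = "/shared/models/Llama-3-8B-Instruct-262k-5.0bpw-h6-exl2"
model_split = [14]  # GB of VRAM allowed per GPU; a single T4 for the 8B model (assumption)

config = ExLlamaV2Config()
config.model_dir = model_dir
config.prepare()

model = ExLlamaV2(config)
model.load(model_split)
tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7

# A batchable version of what we were doing one message at a time in the chatbot.
prompts = [
    "Write a docstring for the following method: ...",
    "Summarize the following issue discussion: ...",
]
for prompt in prompts:
    print(generator.generate_simple(prompt, settings, 256))
```

The point is that once the prompts are nailed down with the chatbot, a loop like this can be run over a much larger set of methods or issues without any interactive UI.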
I'd love a discussion below about any issues you encountered, ideas about how to use this for the project, etc.