bmwoodruff closed this 1 month ago
I see the chatbot as a great introductory exploratory tool. As we're trying to figure out good prompts (how much code to include from a method that's missing documentation, how much of an issue to include, with or without comments, etc.), the 8B chatbot can provide a quick idea of what capabilities we have. The no-chatbot version can be incorporated into a workflow that then applies what we learned via that experimentation over a much larger set of issues, methods missing documentation, and so on. I'll be creating another page similar to this one which examines how we can use RAG.
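To make the prompt experimentation concrete, here is a minimal sketch of the kind of prompt builder we might iterate on in the chatbot. The function name, parameters, and prompt wording are all hypothetical illustrations, not code from the shared notebooks:

```python
def build_docstring_prompt(method_source: str, issue_text: str = "") -> str:
    """Assemble a prompt asking the model to document an undocumented method.

    Both arguments are plain strings; how much of each to include is exactly
    the knob we want to experiment with in the chatbot.
    """
    parts = [
        "You are helping document a Python codebase.",
        "Write a clear docstring (summary, parameters, return value, and a short example)",
        "for the method below.",
    ]
    if issue_text:
        parts += ["", "Related issue discussion:", issue_text]
    parts += ["", "Method source:", "```python", method_source, "```"]
    return "\n".join(parts)
```

Pasting the output of something like this into the chatbot lets us compare how the model responds as we vary how much source and issue context we include.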
If you encounter a VRAM overflow error, make sure you've adjusted the `model_split` variable to pick the correct 8B version. You may also need to "Shut Down All Kernels" to free up any VRAM that is still being used. Once you start the chatbot, it will keep using VRAM until you kill the kernel.
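If you'd rather check VRAM from inside a notebook than from a terminal, a quick sketch using PyTorch (which should already be in the environment, since it's needed to run the model) looks like this:

```python
import torch

# Print free vs. total VRAM for each visible GPU (mem_get_info returns bytes).
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
```

If the free number stays low after you shut down kernels, something is still holding the GPU.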
The nebari server we're using is located at https://possee.openteams.com/.
1. Login to the server. If you forget your username or password (or need a new one), feel free to DM @bmwoodruff on Zulip.
2. Launch a server. To do this, click on the JupyterLab icon, select "Launch Server", and choose a machine. Choose the T4 GPU Instance 1x for using Llama 8B. This is the smallest size needed to run Llama 8B; any larger size costs more money to run without providing any benefit for the 8B version. It will take a few minutes for the machine to spin up.
3. Create a folder in the `/shared/users` folder with your username.
4. From the `/shared/analyst` folder, copy both `inference_example_no_chatbox.ipynb` and `Panel_Chat_Example_GPU_Split.ipynb` into the directory you created.
5. Open `Panel_Chat_Example_GPU_Split.ipynb`. Familiarize yourself with the file (run each line, experiment with the chatbot).
6. Open a terminal and type the command `nvidia-smi`. You'll see how much memory you are currently using.
7. The example panel chat can use the smaller model `Llama-3-8B-Instruct-262k-5.0bpw-h6-exl2` and the larger 70B version. You'll need more memory to run the 70B (you'll have to spin up a higher-tier machine, which means more costs associated with compute time). Note that the panel chat can be slow with the 70B version.
8. When you're not using a chat interface, remember to kill the kernel so that the memory is not constantly being used in the background. You can select Kernel -> Shut Down All Kernels, then run `nvidia-smi` to verify that you've shut down the kernels.
9. Let's explore how to use Llama 3 without needing a chatbot. Open `inference_example_no_chatbox.ipynb`. Familiarize yourself with the file (run each line, experiment with the tool); a rough sketch of this kind of scripted inference appears after this list. When you're done, remember to shut down all kernels.
10. The server will automatically shut down after a period of inactivity. However, we can keep costs lower by remembering to stop services we no longer need. To do this, return to the home page (click the Nebari icon in the upper left). Then use the triple dot (meatballs) menu and select "stop".
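For reference, the no-chatbot workflow amounts to loading the model once and then calling a generate function in a loop over whatever prompts we build. Below is a rough sketch of what that looks like with the exllamav2 library (the `exl2` suffix on the model above is ExLlamaV2's quantization format); the model path, `model_split` value, and sampler settings here are assumptions, so defer to `inference_example_no_chatbox.ipynb` for the actual code:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Hypothetical path -- use the model directory from the shared notebook.
model_dir = "/shared/models/Llama-3-8B-Instruct-262k-5.0bpw-h6-exl2"
model_split = [14]  # GB of VRAM allowed per GPU; a single T4 for the 8B model (assumption)

config = ExLlamaV2Config()
config.model_dir = model_dir
config.prepare()

model = ExLlamaV2(config)
model.load(model_split)
tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7

# A batchable version of what we were doing one message at a time in the chatbot.
prompts = [
    "Write a docstring for the following method: ...",
    "Summarize the following issue discussion: ...",
]
for prompt in prompts:
    print(generator.generate_simple(prompt, settings, 256))
```

The point is that once the prompts are nailed down with the chatbot, a loop like this can be run over a much larger set of methods or issues without any interactive UI.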
I'd love a discussion below about any issues you encountered, ideas about how to use this for the project, etc.