gpu-mode / discord-cluster-manager

Hook up GPUs to your Discord channel and start running jobs via DMs!

Status updates #1

Open msaroufim opened 3 weeks ago

msaroufim commented 3 weeks ago

As of d17b626a8cfe7459be8ccb0a9d0c80ea29a3bb5c

Can trigger a GitHub Action that runs a script, stores the logs in a GitHub artifact, and then prints the artifact contents to stdout

(discord) ➜  discord-cluster-manager git:(main) python bot.py 
GitHub Action triggered successfully! Run ID: 11675205122
Monitoring progress...
Workflow still running... Status: queued
Live view: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/11675205122

Workflow completed with status: success

Training Logs:
[5 7 9]

View the full run at: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/11675205122
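The "Monitoring progress..." part of the transcript above can be sketched as a small polling loop. This is only a sketch, not the code in bot.py: `wait_for_run`, `get_status`, `poll_interval`, `timeout` and `sleep` are hypothetical names, and the status fetcher is injected so the loop can be exercised without calling GitHub. The status strings `queued`, `in_progress` and `completed` are the ones GitHub reports for workflow runs.

```python
import time
from typing import Callable


def wait_for_run(
    get_status: Callable[[], str],
    poll_interval: float = 30.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it returns "completed" or the timeout elapses.

    get_status is a hypothetical callable that returns the current status
    of one workflow run. Time is tracked by summing the sleep intervals,
    which is enough for a sketch.
    """
    waited = 0.0
    while True:
        status = get_status()
        if status == "completed":
            return status
        if waited >= timeout:
            raise TimeoutError(f"run still '{status}' after {waited:.0f}s")
        print(f"Workflow still running... Status: {status}")
        sleep(poll_interval)
        waited += poll_interval


if __name__ == "__main__":
    # Fake status source standing in for the GitHub API.
    fake = iter(["queued", "in_progress", "completed"])
    final = wait_for_run(lambda: next(fake), poll_interval=0.1)
    print(f"Workflow finished with status: {final}")
```

Injecting `sleep` and `get_status` keeps the loop independent of the HTTP client, so the same function would work whether the status comes from PyGithub, raw REST calls, or a test double.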
msaroufim commented 3 weeks ago

As of df3d3b308aba3e16256a8e1f738f3340c391ba7a


(discord) ➜  discord-cluster-manager git:(main) python discord-bot.py
2024-11-04 17:47:37 - INFO - Environment variables loaded
2024-11-04 17:47:37 - INFO - Using GitHub repo: gpu-mode/discord-cluster-manager
2024-11-04 17:47:37 - INFO - Starting bot...
2024-11-04 17:47:37 INFO     discord.client logging in using static token
2024-11-04 17:47:37 - INFO - logging in using static token
2024-11-04 17:47:38 INFO     discord.gateway Shard ID None has connected to Gateway (Session ID: fda5d70f82bed675973eb8e910f2d9d9).
2024-11-04 17:47:38 - INFO - Shard ID None has connected to Gateway (Session ID: fda5d70f82bed675973eb8e910f2d9d9).
2024-11-04 17:47:40 - INFO - Logged in as Cluster-Bot#5007
2024-11-04 17:47:45 - INFO - Bot mentioned in message with 1 attachments
2024-11-04 17:47:45 - INFO - Processing attachment: train.py
2024-11-04 17:47:45 - INFO - Downloading train.py content
2024-11-04 17:47:46 - INFO - Successfully read train.py content
2024-11-04 17:47:46 - INFO - Attempting to trigger GitHub action
2024-11-04 17:47:46 - INFO - Looking for workflow 'train_workflow.yml' in repo gpu-mode/discord-cluster-manager
2024-11-04 17:47:46 - INFO - Found workflow, attempting to dispatch
2024-11-04 17:47:47 - INFO - Workflow dispatch result: True
2024-11-04 17:47:49 - INFO - Found 18 total runs
2024-11-04 17:47:49 - INFO - Checking run 11676018557 created at 2024-11-05 01:47:48+00:00
2024-11-04 17:47:49 - INFO - Found matching run with ID: 11676018557
2024-11-04 17:47:49 - INFO - Successfully triggered workflow with run ID: 11676018557
2024-11-04 17:47:50 - INFO - Starting to monitor workflow status for run 11676018557
2024-11-04 17:47:50 - INFO - Current status: queued
2024-11-04 17:48:21 - INFO - Current status: completed
2024-11-04 17:48:21 - INFO - Workflow completed, downloading artifacts
2024-11-04 17:48:21 - INFO - Attempting to download artifacts for run 11676018557
2024-11-04 17:48:22 - INFO - Found 1 artifacts
2024-11-04 17:48:23 - INFO - Found artifact: training-logs
2024-11-04 17:48:23 - INFO - Successfully downloaded artifact
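The "Found 18 total runs" / "Found matching run" lines suggest the bot identifies its own run by creation time after dispatching. A minimal sketch of that idea, assuming each run is a plain dict with an `id` and a timezone-aware `created_at` (a hypothetical shape, not the actual objects used in discord-bot.py); `find_matching_run` and `slack` are also hypothetical names. Note that the bot logs in local time while the run timestamp is in UTC, so the comparison has to use timezone-aware datetimes.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def find_matching_run(
    runs: list[dict],
    triggered_at: datetime,
    slack: timedelta = timedelta(seconds=5),
) -> Optional[int]:
    """Return the id of the earliest run created at or after the dispatch time.

    slack tolerates small clock differences between the local machine and
    GitHub. Returns None when no run is recent enough.
    """
    candidates = [r for r in runs if r["created_at"] >= triggered_at - slack]
    if not candidates:
        return None
    return min(candidates, key=lambda r: r["created_at"])["id"]


if __name__ == "__main__":
    # Assumes the local clock in the log above is UTC-8.
    local = timezone(timedelta(hours=-8))
    triggered = datetime(2024, 11, 4, 17, 47, 47, tzinfo=local)
    runs = [
        # Hypothetical older run that must not be picked up.
        {"id": 1, "created_at": datetime(2024, 11, 5, 0, 30, 0, tzinfo=timezone.utc)},
        # Run id and creation time taken from the log above.
        {"id": 11676018557, "created_at": datetime(2024, 11, 5, 1, 47, 48, tzinfo=timezone.utc)},
    ]
    print(find_matching_run(runs, triggered))
```

Matching by timestamp can pick the wrong run if two jobs are dispatched within the same few seconds, so this is best treated as a heuristic.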
msaroufim commented 2 weeks ago

Threaded replies now work as of c1e2b1aaac9ed99b66640c9a57e1b9911e8a7c6d

Screenshot 2024-11-05 at 10 35 25 AM

msaroufim commented 2 weeks ago

Got caching of torch working

Screenshot 2024-11-05 at 11 17 10 AM

EDIT: Actually this didn't work lol, using cache takes as much time as not using the cache

msaroufim commented 2 weeks ago

The bot is now always on: whenever main is updated, Heroku picks up the changes and redeploys automatically

I get emails if the bot ever crashes, and otherwise its status can be checked here: https://dashboard.heroku.com/apps/discord-cluster-manager

To repro

git checkout -b msaroufim/heroku
brew tap heroku/brew && brew install heroku
heroku login
heroku git:remote -a discord-cluster-manager
heroku config:set
heroku logs --tail
heroku ps:scale worker=1
heroku ps

So testing just got significantly simpler

Screenshot 2024-11-06 at 12 20 06 PM

Screenshot 2024-11-06 at 12 21 38 PM

msaroufim commented 2 weeks ago

Server health can now be monitored here

Screenshot 2024-11-06 at 12 33 27 PM

AndreSlavescu commented 2 weeks ago

Example leaderboard command usage:

@Cluster-Bot leaderboard

msaroufim commented 2 weeks ago

Can now queue GPU jobs to the AMD runner https://github.com/gpu-mode/discord-cluster-manager/pull/16

msaroufim commented 2 weeks ago

NVIDIA jobs now working https://github.com/gpu-mode/discord-cluster-manager/pull/17

Screenshot 2024-11-11 at 4 35 11 PM
msaroufim commented 2 weeks ago

The bot does not create a separate message to start a thread from

Screenshot 2024-11-11 at 5 31 05 PM
msaroufim commented 1 week ago

Can now support arbitrary filenames and not just train.py

Screenshot 2024-11-18 at 10 07 35 AM

msaroufim commented 1 week ago

AMD runners are now connected

Screenshot 2024-11-18 at 10 41 41 AM

msaroufim commented 1 week ago

Modal scheduler is now merged https://github.com/gpu-mode/discord-cluster-manager/pull/25

Fastest scheduler we have so far for python jobs

Screenshot 2024-11-18 at 6 40 25 PM

msaroufim commented 6 days ago

Major update: slash commands now work and make the usage instructions much simpler

https://github.com/gpu-mode/discord-cluster-manager/pull/27

run github/modal/resync/ping

msaroufim commented 4 days ago

Major refactor landed by @S1ro1, which modularizes our codebase: new commands or functionality can be split into separate cogs, so accepting new contributions will be easier