Open msaroufim opened 3 weeks ago
As of df3d3b308aba3e16256a8e1f738f3340c391ba7a
(discord) ➜ discord-cluster-manager git:(main) python discord-bot.py
2024-11-04 17:47:37 - INFO - Environment variables loaded
2024-11-04 17:47:37 - INFO - Using GitHub repo: gpu-mode/discord-cluster-manager
2024-11-04 17:47:37 - INFO - Starting bot...
2024-11-04 17:47:37 INFO discord.client logging in using static token
2024-11-04 17:47:37 - INFO - logging in using static token
2024-11-04 17:47:38 INFO discord.gateway Shard ID None has connected to Gateway (Session ID: fda5d70f82bed675973eb8e910f2d9d9).
2024-11-04 17:47:38 - INFO - Shard ID None has connected to Gateway (Session ID: fda5d70f82bed675973eb8e910f2d9d9).
2024-11-04 17:47:40 - INFO - Logged in as Cluster-Bot#5007
2024-11-04 17:47:45 - INFO - Bot mentioned in message with 1 attachments
2024-11-04 17:47:45 - INFO - Processing attachment: train.py
2024-11-04 17:47:45 - INFO - Downloading train.py content
2024-11-04 17:47:46 - INFO - Successfully read train.py content
2024-11-04 17:47:46 - INFO - Attempting to trigger GitHub action
2024-11-04 17:47:46 - INFO - Looking for workflow 'train_workflow.yml' in repo gpu-mode/discord-cluster-manager
2024-11-04 17:47:46 - INFO - Found workflow, attempting to dispatch
2024-11-04 17:47:47 - INFO - Workflow dispatch result: True
2024-11-04 17:47:49 - INFO - Found 18 total runs
2024-11-04 17:47:49 - INFO - Checking run 11676018557 created at 2024-11-05 01:47:48+00:00
2024-11-04 17:47:49 - INFO - Found matching run with ID: 11676018557
2024-11-04 17:47:49 - INFO - Successfully triggered workflow with run ID: 11676018557
2024-11-04 17:47:50 - INFO - Starting to monitor workflow status for run 11676018557
2024-11-04 17:47:50 - INFO - Current status: queued
2024-11-04 17:48:21 - INFO - Current status: completed
2024-11-04 17:48:21 - INFO - Workflow completed, downloading artifacts
2024-11-04 17:48:21 - INFO - Attempting to download artifacts for run 11676018557
2024-11-04 17:48:22 - INFO - Found 1 artifacts
2024-11-04 17:48:23 - INFO - Found artifact: training-logs
2024-11-04 17:48:23 - INFO - Successfully downloaded artifact
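The log above shows the bot listing recent runs after the dispatch and picking one by its creation time, which suggests the dispatch call itself did not hand back a run ID. A minimal sketch of that matching step, assuming each run is a plain dict with an `id` and a timezone-aware `created_at` (the function name, the dict shape, the two-minute tolerance and the older run with ID 1 are hypothetical, not taken from the bot's code):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def find_matching_run(
    runs: list,
    triggered_at: datetime,
    tolerance: timedelta = timedelta(minutes=2),
) -> Optional[dict]:
    """Return the newest run created within `tolerance` of the dispatch time."""
    candidates = [
        run for run in runs
        if abs(run["created_at"] - triggered_at) <= tolerance
    ]
    # max() with default=None handles the "no run showed up yet" case.
    return max(candidates, key=lambda run: run["created_at"], default=None)


# Assumption: the bot host logs in UTC-8, which would explain why the local
# timestamp 17:47:47 lines up with a run created at 01:47:48 UTC the next day.
local_tz = timezone(timedelta(hours=-8))
triggered_at = datetime(2024, 11, 4, 17, 47, 47, tzinfo=local_tz)

runs = [
    # Run ID and creation time taken from the log above.
    {"id": 11676018557,
     "created_at": datetime(2024, 11, 5, 1, 47, 48, tzinfo=timezone.utc)},
    # Hypothetical older run that should be ignored.
    {"id": 1,
     "created_at": datetime(2024, 11, 4, 1, 0, 0, tzinfo=timezone.utc)},
]

match = find_matching_run(runs, triggered_at)
```

Comparing timezone-aware datetimes avoids the trap of treating the local log time and the UTC `created_at` as if they were a day apart.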
Threaded replies now work as of c1e2b1aaac9ed99b66640c9a57e1b9911e8a7c6d
Got caching of torch working
EDIT: Actually this didn't work lol, using cache takes as much time as not using the cache
The bot is now always on: if you push an update to main, Heroku picks up the changes and redeploys automatically
I get emails if the bot ever crashes and otherwise can check the status here https://dashboard.heroku.com/apps/discord-cluster-manager
To repro:
git checkout -b msaroufim/heroku
brew tap heroku/brew && brew install heroku
heroku login
heroku git:remote -a discord-cluster-manager
heroku config:set
heroku logs --tail
heroku ps:scale worker=1
heroku ps
So testing just got significantly simpler
Server health can now be monitored here
Example leaderboard command usage:
@Cluster-Bot leaderboard
Can now queue GPU jobs to the AMD runner https://github.com/gpu-mode/discord-cluster-manager/pull/16
NVIDIA jobs now working https://github.com/gpu-mode/discord-cluster-manager/pull/17
Bot no longer creates a new message just to start a thread from it
Can now support arbitrary filenames and not just train.py
AMD runners are now connected
Modal scheduler is now merged https://github.com/gpu-mode/discord-cluster-manager/pull/25
Fastest scheduler we have so far for python jobs
Major update: slash commands now work, which makes the usage instructions much simpler
https://github.com/gpu-mode/discord-cluster-manager/pull/27
run github/modal/resync/ping
Major refactor landed by @S1ro1 which modularizes our codebase: new commands or functionality can be split into separate cogs, so accepting new contributions will be easier
As of d17b626a8cfe7459be8ccb0a9d0c80ea29a3bb5c
Can trigger a github action that runs a script, puts logs in a github artifact and then posts the artifact results to stdout
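The trigger-then-collect flow needs a wait step between dispatching the workflow and downloading the artifact. A rough sketch of that polling loop, assuming a caller-supplied `get_status` callable that returns the run's status string (the function name, the 30-second interval and the 10-minute timeout are illustrative assumptions, not the bot's actual values):

```python
import time
from typing import Callable


def wait_for_completion(
    get_status: Callable[[], str],
    poll_interval: float = 30.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll `get_status` until it reports "completed" or the timeout passes."""
    waited = 0.0
    while True:
        status = get_status()
        if status == "completed":
            return status
        if waited >= timeout:
            raise TimeoutError(f"run still '{status}' after {waited:.0f}s")
        # `sleep` is injectable so the loop can be exercised without waiting.
        sleep(poll_interval)
        waited += poll_interval


# Example with canned statuses and a fake sleep that records each pause.
statuses = iter(["queued", "in_progress", "completed"])
pauses = []
result = wait_for_completion(lambda: next(statuses), sleep=pauses.append)
```

In the bot, `get_status` would wrap whatever call fetches the workflow run's status; only once it returns "completed" is there an artifact to download.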