wandb / server

W&B Server is the self hosted version of Weights & Biases
MIT License

The `--git-hash` option in `wandb job create` is not working. #132

Open zfhxi opened 10 months ago

zfhxi commented 10 months ago

I created a job using wandb local:

wandb job create git  https://xxx.git --project="TEST"  \
    --entity="username" --entry-point="main.py" --name="test1" \
    --git-hash="b7baca74dd034cb900ea0e3f48c397ea51c4c481"

wandb local created the job in the TEST project, with the following wandb-job.json:

{
    "_version": "v0",
    "source_type": "repo",
    "runtime": "3.7",
    "source": {
        "git": {
            "remote": "https://xxx.git",
            "commit": "b7baca74dd034cb900ea0e3f48c397ea51c4c481"
        },
        "entrypoint": [
            "python3.7",
            "main.py"
        ],
        "notebook": false
    },
    // ...
}
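
To double-check which commit a job is pinned to, the relevant field can be read straight from the job spec. A minimal sketch, assuming the file has the shape shown above (the elided `// ...` part is left out, since plain JSON does not allow comments); the helper name `pinned_commit` is hypothetical and not part of wandb:

```python
import json

# Trimmed copy of the wandb-job.json shown above; the elided fields are omitted.
JOB_SPEC = """
{
    "_version": "v0",
    "source_type": "repo",
    "runtime": "3.7",
    "source": {
        "git": {
            "remote": "https://xxx.git",
            "commit": "b7baca74dd034cb900ea0e3f48c397ea51c4c481"
        },
        "entrypoint": ["python3.7", "main.py"],
        "notebook": false
    }
}
"""


def pinned_commit(job_spec_text):
    """Return (remote, commit) recorded in a repo-sourced job spec."""
    spec = json.loads(job_spec_text)
    git_info = spec["source"]["git"]
    return git_info["remote"], git_info["commit"]


remote, commit = pinned_commit(JOB_SPEC)
print(f"job is pinned to {commit} on {remote}")
```

Here the spec does record the requested hash, which suggests the problem lies in how the job is checked out at launch time rather than in how it was created.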

After that, I modified my code and pushed it to the remote repository. The commits are as follows:

$  git log --pretty=oneline -10
8a6b803c530e800cdf3304d12c6467dcfd655bf5 (HEAD -> main, origin/main) now1001
49ae7364f743c1b699d7a00f51e9805030c38c18 now1000
b7baca74dd034cb900ea0e3f48c397ea51c4c481 now1002
# ...

Then, I launched the job by pushing it to the existing queue (screenshot not preserved).

After the run completed, I located the code that the wandb local server had cloned from the remote repository and checked the commit:

$ cd "/tmp/tmpavc8q10w" 
$ git log --pretty=oneline -10
8a6b803c530e800cdf3304d12c6467dcfd655bf5 (grafted, HEAD -> main, origin/main) now1001

The expected commit, as specified by --git-hash, is b7baca74dd034cb900ea0e3f48c397ea51c4c481, not the HEAD commit!

This indicates that:

  1. The wandb local server clones the latest version of the remote repository when launching the job.
  2. The --git-hash option in wandb job create does not seem to work.
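
The `grafted` marker in the log above is what git prints at the boundary of a shallow clone, which would explain why only the HEAD commit is present in the temporary directory. A rough sketch for checking both observations from Python, assuming a regular clone where `.git` is a directory and `git` is on the PATH; all helper names are hypothetical:

```python
import subprocess
from pathlib import Path


def is_shallow_clone(workspace):
    """A shallow clone records its boundary commits in .git/shallow."""
    return (Path(workspace) / ".git" / "shallow").is_file()


def commits_match(actual, expected):
    """Compare two hashes, allowing `expected` to be an abbreviated prefix."""
    actual, expected = actual.strip().lower(), expected.strip().lower()
    return bool(expected) and actual.startswith(expected)


def head_commit(workspace):
    """Ask git for the full hash of HEAD in `workspace`."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=workspace, check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def report(workspace, expected):
    """Print whether the checkout matches the commit the job was pinned to."""
    head = head_commit(workspace)
    print(f"HEAD={head} expected={expected} match={commits_match(head, expected)}")
    print(f"shallow clone: {is_shallow_clone(workspace)}")
```

Calling `report("/tmp/tmpavc8q10w", "b7baca74dd034cb900ea0e3f48c397ea51c4c481")` on the directory from this run would show the mismatch and whether the clone is shallow.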

Can anyone help solve this?

rsanandres-wandb commented 10 months ago

Hello! Thank you for sending this information! Could you send a link to your workspace so we can look at it? Only wandb employees will be able to view your project if this is a private project.

Also, could you verify that the launch job you created corresponds to the run id avc8q10w? Just to make sure that we are looking at the same run as the one created.

zfhxi commented 10 months ago

Thank you for your response. I've created a demo at https://github.com/zfhxi/test_wandb_launch_job

zfhxi commented 10 months ago

After hours of work, I've found this solution:

import os
import argparse
import subprocess
import sys
from git import Repo  # provided by the GitPython package

def restart_program():
    # Re-run this script with the same arguments, now on the reset working tree
    returncode = subprocess.call([sys.executable] + sys.argv)
    print("Finished the sub program!")
    sys.exit(returncode)

def reset_commit(repo, commit_id, workspace):
    # Hard-reset HEAD, the index, and the working tree to the given commit
    commit = repo.commit(commit_id)
    print(f"Resetting workspace {workspace} to {commit_id} ...")
    repo.head.reset(commit=commit, index=True, working_tree=True)

def prerun(args):
    # Check whether the current commit matches the commit the job was created from
    if args.wandb_job_commit:
        repo = Repo(args.workspace)
        current_commit = repo.head.commit.hexsha
        if current_commit != args.wandb_job_commit:
            print(f"Current commit {current_commit} is not equal to the job commit {args.wandb_job_commit}!")
            try:
                reset_commit(repo, args.wandb_job_commit, args.workspace)
            except Exception as e:
                # The commit is probably missing from the shallow clone, so deepen it
                print(e)
                print("Trying to fetch the latest 20 commits ...")
                repo.git.fetch("origin", "--depth=20")
                reset_commit(repo, args.wandb_job_commit, args.workspace)
            restart_program()
        else:
            print(f"Current commit {current_commit} == job commit {args.wandb_job_commit}!")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--wandb-job-commit", type=str, default=None, help="commit hexsha the job should run on")
    args = parser.parse_args()
    args.workspace = os.path.dirname(os.path.abspath(__file__))

    prerun(args)
    # main code

The code performs the following actions:

  1. Check the current workspace's commit.
  2. If the job commit is not available locally, fetch the latest 20 commits from the remote repository.
  3. Reset the workspace to the specified commit.
  4. Restart the current script.
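
The steps above could also be expressed with plain git commands rather than GitPython. The sketch below is one possible variant, not what wandb itself does: it fetches only the wanted commit and detaches HEAD onto it. It assumes the remote is named `origin` and that the server allows fetching a commit by its full hash (not every server does); `build_pin_commands` and `pin_workspace` are hypothetical helper names.

```python
import re
import subprocess

FULL_SHA1 = re.compile(r"[0-9a-f]{40}")


def build_pin_commands(commit_id, remote="origin"):
    """Return the git commands that fetch `commit_id` and check it out."""
    if not FULL_SHA1.fullmatch(commit_id):
        raise ValueError(f"expected a full 40-character SHA-1, got {commit_id!r}")
    return [
        ["git", "fetch", "--depth=1", remote, commit_id],
        ["git", "checkout", "--detach", commit_id],
    ]


def pin_workspace(workspace, commit_id, remote="origin"):
    """Run the commands inside `workspace`, raising if any of them fails."""
    for command in build_pin_commands(commit_id, remote):
        subprocess.run(command, cwd=workspace, check=True)


# Show the commands for the commit from this issue without running them.
for command in build_pin_commands("b7baca74dd034cb900ea0e3f48c397ea51c4c481"):
    print(" ".join(command))
```

Unlike the hard reset in the script above, `git checkout --detach` refuses to overwrite conflicting local changes, and fetching by hash avoids guessing how many commits back the target is.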

I look forward to more elegant solutions!
