ariG23498 closed this issue 2 years ago.
I'm having the same issue with runs being overwritten in the UI. I would really appreciate it if the wandb team could comment on whether this behavior is unsupported or if I am doing something wrong. Many thanks!
Hey @phil-fradkin you'll need to provide the code you're running that's causing the unexpected behavior for us to advise.
Right, so in general I would like to set up a sweep where every single one of my runs is actually a cross-validation. Instead of getting a single metric from a set of hyperparameters, the sweep would either get k metrics (for k-fold) or I could take the average. At a high level my code looks like this:
```python
args = get_args()
group_id = timestamp()
for i in range(3):
    # one run per fold; job_type ties the folds together in the UI
    wandb_run = wandb.init(project=project, name=name, job_type=group_id, reinit=True)
    data = load_data(i)
    model = load_model(args)
    # inside train_model there is wandb.log
    train_model(model, data)
    summary_dict = evaluation(model, data)
    wandb.log(summary_dict)
    checkpoint_fp = os.path.join(wandb.run.dir, "checkpoint.pkl")
    torch.save(model.state_dict(), checkpoint_fp)
    wandb_run.finish()
```
When I run this script outside the sweep the UI summarizes information across all 3 runs:
The sidebar also visualizes the 3 individual runs:
However, when I try to do the same thing in a sweep, it creates a single run for every `job_type` and shows a single loss curve in the UI.
I've tried replacing `job_type` with `group`, but the result is the same. Ideally, I would like my sweep either to optimize the average of the metric taken across the models from different folds, or to treat the cross-validated models individually.
Please let me know if I should clarify anything further, or if this use case isn't supported.
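For concreteness, here is a runnable sketch of the first option: one run per sweep trial that trains every fold and logs a single averaged metric for the sweep to optimise. `train_fold` is a hypothetical helper, and a plain dict stands in for the `wandb.log` calls so the flow runs on its own:

```python
import statistics

def run_cv_trial(train_fold, num_folds=3):
    """One sweep trial: train all folds, log per-fold scores and their average.

    `train_fold(fold_idx) -> score` is a hypothetical stand-in for the real
    training loop; `logged` stands in for wandb.log calls on a single run.
    """
    logged = {}
    scores = []
    for fold in range(num_folds):
        score = train_fold(fold)
        logged[f"fold_{fold}/score"] = score  # would be wandb.log({...})
        scores.append(score)
    # The sweep would optimise this single scalar.
    logged["avg_score"] = statistics.mean(scores)
    return logged
```

For example, `run_cv_trial(lambda fold: 0.8 + 0.01 * fold)["avg_score"]` gives the mean score over the three folds.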
Hi @vanpelt does what I wrote make sense or do you need me to provide a working code example?
Hi @vanpelt!
I've tried running the example without modification in wandb version 0.12.6. I was expecting something like what is linked in the readme here, but this is what I see: there is only one entry, corresponding to a null group, presumably the average over all runs. Do you know how I could tweak the example to get a point in this plot for each job type, corresponding to the average over all runs with that job type (i.e., hyperparameter combo)?
Hi @vanpelt I'm also having some issues with the example. I get the same overwriting behaviour reported above when I run without using multiprocessing. When I add multiprocessing, everything looks like the example on CPU, but when I run on multiple GPUs using
```bash
#!/bin/bash
for i in {0..7}
do
    CUDA_VISIBLE_DEVICES=$(($i % 8)) wandb agent bchamberlain/research-repo-sheaf_exp/$1 &
done
```
everything hangs and only 1 GPU is utilised. Thanks in advance for any help on this.
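A variant I plan to try next detaches each agent from the terminal with `nohup` and gives each its own log file, so a backgrounded agent cannot block on terminal I/O (a sketch, not a confirmed fix; the sweep path is a placeholder):

```shell
#!/bin/bash
# Detach each agent with nohup and redirect its output to a per-agent
# log file, so backgrounded agents cannot hang on terminal I/O.
# The default sweep path below is a placeholder.
SWEEP_ID="${1:-entity/project/sweep-id}"
for i in {0..7}
do
    CUDA_VISIBLE_DEVICES=$(($i % 8)) nohup wandb agent "$SWEEP_ID" > "agent_$i.log" 2>&1 &
done
wait  # keep the script alive until all agents exit
```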
Do you still need help on this? I am closing this issue; feel free to re-open.
The example on cross-validation is a little tricky to understand. These were my key takeaways from `train-cross-validation.py`:

- There is a `sweep_run` that gets initialised.
- For each `sweep_run`, there are `num` processes created, one for each fold, each with its own `Job-Type`.
- These runs log the `accuracy` of the respective folds.
- The `sweep_run` logs the average accuracy of all the individual folds as `average_accuracy`.

Let me know if there is something that I missed. My issue is that the individual runs for each process (k-folds) get overwritten in the UI. I am not sure why this happens. This might be because of the process `.join()` or the `wandb.join()` method. I have also tried with `wandb.finish()`.
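For reference, this is the structure of the example as I understand it, with the wandb calls replaced by comments and a queue carrying each fold's accuracy back to the parent (the accuracy values are dummies):

```python
import multiprocessing as mp
import statistics

def fold_worker(fold, queue):
    # Real example: run = wandb.init(group=..., job_type=f"fold-{fold}", reinit=True)
    accuracy = 0.9 - 0.01 * fold      # dummy stand-in for train + evaluate
    # Real example: run.log({"accuracy": accuracy}); run.finish()
    queue.put((fold, accuracy))

def cross_validate(num_folds=3):
    # Real example: sweep_run = wandb.init()  # the run the sweep optimises
    queue = mp.Queue()
    procs = [mp.Process(target=fold_worker, args=(f, queue)) for f in range(num_folds)]
    for p in procs:
        p.start()
    # Drain the queue before joining so workers never block on a full pipe.
    results = dict(queue.get() for _ in procs)
    for p in procs:
        p.join()
    average_accuracy = statistics.mean(results.values())
    # Real example: sweep_run.log({"average_accuracy": average_accuracy})
    return average_accuracy
```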