I'm trying to run several experiments in parallel on Vertex AI using xmanager. As an example, I run the following code from the tutorial notebook, adjusted to have more than 10 hyperparameter combinations:
```python
import itertools
import os

from xmanager import xm
from xmanager import xm_local


async def launch_experiment():
  async with xm_local.create_experiment(experiment_title='cifar10') as experiment:
    [executable] = experiment.package([
        xm.python_container(
            executor_spec=xm_local.Vertex.Spec(),
            path=os.path.expanduser('/content/xmanager_repo/examples/cifar10_torch'),
            entrypoint=xm.ModuleName('cifar10'),
        )
    ])
    batch_sizes = [64, 128, 256]
    learning_rates = [0.01, 0.001]
    momentums = [0.95, 0.90]
    trials = [
        {'batch_size': bs, 'learning_rate': lr, 'momentum': m}
        for (bs, lr, m) in itertools.product(batch_sizes, learning_rates, momentums)
    ]
    for hyperparameters in trials:
      experiment.add(xm.Job(
          executable=executable,
          executor=xm_local.Vertex(requirements=xm.JobRequirements(T4=1)),
          args=hyperparameters,
      ))
```
The jobs run successfully, but only in batches of 10, despite my quota for the Vertex AI API being set to 20. No other GCP resources are causing a bottleneck when running the job. How can I set up xmanager to run jobs concurrently in groups of more than 10?
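For reference, the sweep in the snippet expands to 3 × 2 × 2 = 12 trials, which is what pushes the experiment past the 10-job boundary. The grid can be reproduced standalone with just the standard library (same values as above, no xmanager dependency):

```python
import itertools

# Hyperparameter values taken from the snippet above.
batch_sizes = [64, 128, 256]
learning_rates = [0.01, 0.001]
momentums = [0.95, 0.90]

# itertools.product yields the full Cartesian product: 3 * 2 * 2 = 12 trials.
trials = [
    {'batch_size': bs, 'learning_rate': lr, 'momentum': m}
    for bs, lr, m in itertools.product(batch_sizes, learning_rates, momentums)
]

print(len(trials))  # → 12
print(trials[0])    # → {'batch_size': 64, 'learning_rate': 0.01, 'momentum': 0.95}
```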