google-deepmind / xmanager

A platform for managing machine learning experiments
Apache License 2.0

Vertex AI Compute Quota not used #35

Closed williambankes closed 1 year ago

williambankes commented 1 year ago

Hi,

I'm trying to run several experiments in parallel on Vertex AI using xmanager. As an example, I run the following code from the tutorial notebook, adjusted to have more than 10 hyperparameter combinations:

```python
import itertools
import os

from xmanager import xm
from xmanager import xm_local


async def launch_experiment():
  async with xm_local.create_experiment(experiment_title='cifar10') as experiment:
    # Package the CIFAR-10 example as a container that runs on Vertex AI.
    [executable] = experiment.package([
        xm.python_container(
            executor_spec=xm_local.Vertex.Spec(),
            path=os.path.expanduser('/content/xmanager_repo/examples/cifar10_torch'),
            entrypoint=xm.ModuleName('cifar10'),
        )
    ])

    # 3 x 2 x 2 = 12 hyperparameter combinations, one job each.
    batch_sizes = [64, 128, 256]
    learning_rates = [0.01, 0.001]
    momentums = [0.95, 0.90]
    trials = list(
        dict([('batch_size', bs), ('learning_rate', lr), ('momentum', m)])
        for (bs, lr, m) in itertools.product(batch_sizes, learning_rates, momentums)
    )
    for hyperparameters in trials:
      experiment.add(xm.Job(
          executable=executable,
          executor=xm_local.Vertex(requirements=xm.JobRequirements(T4=1)),
          args=hyperparameters,
      ))
```

The jobs run successfully, but only in batches of 10, even though my quota for the Vertex AI API is set to 20. No other GCP resources are causing a bottleneck when running the jobs. How can I set up xmanager to run more than 10 jobs concurrently?

(Attached screenshot: Xmanager Issue picture)
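For reference, the grid above works out to 3 × 2 × 2 = 12 trials, so a cap of 10 concurrent jobs would split the sweep into one batch of 10 and one of 2. Below is a minimal standalone sketch of that arithmetic. It makes no xmanager or Vertex AI calls, and `OBSERVED_CONCURRENCY` is a hypothetical name for the limit I am seeing, not a documented xmanager setting.

```python
import itertools

# Same hyperparameter grid as in the snippet above.
batch_sizes = [64, 128, 256]
learning_rates = [0.01, 0.001]
momentums = [0.95, 0.90]

trials = [
    {'batch_size': bs, 'learning_rate': lr, 'momentum': m}
    for bs, lr, m in itertools.product(batch_sizes, learning_rates, momentums)
]

# Assumption: 10 is the concurrency observed in practice, not a known config value.
OBSERVED_CONCURRENCY = 10

# Split the trials into consecutive groups of at most OBSERVED_CONCURRENCY jobs.
batches = [
    trials[i:i + OBSERVED_CONCURRENCY]
    for i in range(0, len(trials), OBSERVED_CONCURRENCY)
]

print(len(trials), [len(batch) for batch in batches])
```

With a quota of 20, I would expect all 12 jobs to fit in a single batch.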