nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io

Support accelerator directive for Azure Batch #3789

Open bentsherman opened 1 year ago

bentsherman commented 1 year ago

The accelerator directive is currently supported for AWS Batch, Google Batch, and Kubernetes. Azure Batch also has GPUs, so this directive should also work for Azure Batch.

Not sure yet how to implement it with the Azure Batch Java SDK. All I've found so far is the GpuResource class. @abhi18av do you have any idea how we would support GPU requests here? I don't know if there are specific GPU-enabled instance types, or if we can attach a GPU to any instance, or something else.

In any case, I think we can map the accelerator type to the GpuSku enum:

final accel = task.config.getAccelerator()
// map the directive's count and type onto the Azure SDK's GpuResource;
// this assumes GpuSku.fromString accepts the bare SKU name, e.g. 'V100'
new GpuResource()
    .withCount( accel.getRequest() )
    .withType( GpuSku.fromString(accel.getType()) )

So an example request would look like:

accelerator 1, type: 'V100'

vsmalladi commented 1 year ago

@bentsherman can you post how you would describe the type for each cloud and how that would be reproducible?

Right now I see 'k80' on Azure vs 'nvidia-tesla-k80' on Google Cloud

bentsherman commented 1 year ago

Nextflow doesn't try to provide a uniform interface; it uses the same terminology as each cloud provider, so that it's as thin a wrapper as possible. So for Azure we would use P100, V100, etc. to match the options in GpuSku.
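
To illustrate, here's how the same request would read side by side; the Google type string matches the existing docs, while the Azure one is the proposed GpuSku-style naming, not something implemented yet:

// Google Batch: provider-native accelerator name
accelerator 4, type: 'nvidia-tesla-k80'

// Azure Batch (proposed): bare SKU name matching the GpuSku enum
accelerator 1, type: 'V100'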

abhi18av commented 1 year ago

@bentsherman I think GPU usage on Azure Batch is documented in the links below:

  1. https://learn.microsoft.com/en-us/azure/batch/batch-pool-compute-intensive-sizes#example-nvidia-gpu-drivers-on-a-linux-nc-vm-pool

  2. https://learn.microsoft.com/en-us/azure/virtual-machines/extensions/hpccompute-gpu-linux

Basically, it seems like a GPU can only be attached to an NC-type VM, so if the accelerator directive is implemented for Azure Batch, users would need to make sure they have access to that particular VM family - right @vsmalladi?

vsmalladi commented 1 year ago

@abhi18av that is correct, they are these machines: https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-gpu

abhi18av commented 1 year ago

Thanks Venkat!

@bentsherman, in this case what kind of implementation of accelerator are you thinking of? AFAICT, since the GPUs are not attachable to all kinds of machines, perhaps this might need to be mapped onto the queue and Azure Batch pool mechanism 🤔

Here's what I'm imagining if the accelerator directive is specified (a sketch follows this list):

  1. Ignore the main pool (or queue) configured for the process

  2. Spawn a separate GPU VM pool (if one isn't already specified) and submit the process's tasks there
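
A minimal sketch of that flow on the executor side, assuming hypothetical helper names (resolvePoolId, findGpuPool, and createGpuPool are illustrative, not the actual AzBatchService API):

// decide which pool a task should be submitted to
String resolvePoolId(TaskRun task) {
    final accel = task.config.getAccelerator()
    if( !accel )
        return task.config.queue as String    // no GPU requested: keep the configured pool
    // GPU requested: bypass the configured queue and use (or create) a GPU pool
    return findGpuPool(accel) ?: createGpuPool(accel)
}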

vsmalladi commented 1 year ago

Yeah, we can do that. @abhi18av can you determine the VM mapping on the fly from the compute resources requested?

abhi18av commented 1 year ago

I'd say we might model this using a GPU-focused nomenclature. Keeping future-friendliness in mind, and accounting for VM types that get deprecated, let's take the NCasT4_v3 series as an example; the directive might look something like:

process <NAME> {

    accelerator family: 'NCasT4_v3', gpus: 1, memory: 16.GB, data_disk: 8.GB
    ...

}

And then we can infer the exact VM size needed to deploy the pool, and hence override process.queue for that process.
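
For instance, the inference could be a lookup over the family's published sizes; a rough sketch with a hard-coded table (the NCasT4_v3 GPU counts below are taken from Azure's size documentation and should be double-checked):

// pick the smallest NCasT4_v3 size that satisfies the requested GPU count
static String inferVmSize(int gpus) {
    final sizes = [
        'Standard_NC4as_T4_v3' : 1,
        'Standard_NC8as_T4_v3' : 1,
        'Standard_NC16as_T4_v3': 1,
        'Standard_NC64as_T4_v3': 4,
    ]
    return sizes.find { name, count -> count >= gpus }?.key
}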

adamrtalbot commented 1 year ago

Related issue: https://github.com/epi2me-labs/wf-basecalling/issues/6

stale[bot] commented 10 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

iferres commented 4 months ago

Hi, did this ever get implemented? Should we use a Docker image with the NVIDIA drivers installed, or a custom VM with all the requirements already installed?

bentsherman commented 4 months ago

@adamrtalbot do you think you could write a section for the Nextflow/Azure docs explaining how to use GPUs with the current approach? I recall there was some way to do it through the container image which automatically gets mapped to a GPU-enabled VM, etc. That would also help us evaluate the benefit of supporting the accelerator directive explicitly.

adamrtalbot commented 4 months ago

On Azure, if you use a machine with a GPU, the GPU is automatically mounted into the container; no additional options are required. When using the autopools feature, if you set process.machineType to one of those GPU-enabled sizes, it should work automatically.

It could be automated to restrict the machineType to that set of VMs when accelerator != null, which I think is what @abhi18av is suggesting in the comment above.

Personally, I've just used the queue directive to point at an Azure Batch pool composed of GPU-enabled machines. So I guess the documentation would just be: "to use a GPU-enabled VM, select a GPU-enabled VM"?
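
For concreteness, here's roughly what those two approaches look like in nextflow.config; the size and pool names below are placeholders:

// Option 1: autopools - request a GPU-enabled size directly
process.machineType = 'Standard_NC4as_T4_v3'

// Option 2: point the process at a pre-built pool of GPU-enabled machines
// process.queue = 'my-gpu-pool'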