bentsherman opened this issue 1 year ago
@bentsherman can you post how you would describe the type for each cloud and how that would be reproducible?
Right now I see 'k80' in azure, vs 'nvidia-tesla-k80' in AWS
Nextflow doesn't try to provide a uniform interface; it just uses the same terminology as the cloud provider, so that it's as thin a wrapper as possible. So for Azure we would use `P100`, `V100`, etc. to match the options in `GpuSku`.
@bentsherman I think that GPU support on Azure Batch is documented in the links below.

Basically, it seems like the chosen GPU can only be attached to an NC-type VM, so if the `accelerator` directive is to be implemented for AZB then users would need to make sure that they have access to that particular VM - right @vsmalladi?
@abhi18av that is correct, they are these machines: https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-gpu
Thanks Venkat!
@bentsherman, in this case what kind of implementation of `accelerator` are you thinking of? AFAICT, since the GPUs are not attachable to all kinds of machines, this might perhaps need to be mapped onto the `queue` and Azure Batch pool mechanism 🤔
Here's what I am imagining: if the `accelerator` directive is mentioned for a process, then

- ignore the main pool (or `queue`) for that process
- spawn another GPU VM pool (if not already specified) and submit the tasks there
Ya we can do that. @abhi18av can you determine the VM mapping from the requested compute resources on the fly?
I'd say that we might model this using a GPU-focused nomenclature. Keeping future friendliness in mind and accounting for deprecated VM types, let's take the `NCasT4_v3` series as an example; then the directive might look something like:
```groovy
process <NAME> {
    accelerator { family: 'NCasT4_v3', gpus: 1, memory: 16.GB, data_disk: 8.GB }
    ...
}
```
And then we can infer the exact VM which would be needed to deploy the pool, and hence override the `process.queue` for that process.
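Roughly, that inference might end up being equivalent to a config along these lines (a sketch only; the pool name, VM size, and process selector here are assumptions, not an implemented mapping):

```groovy
// Sketch only: 'gpu_ncast4' and the VM size are assumed values. The inferred VM
// would have to belong to the requested family and satisfy the gpus/memory/disk
// constraints from the accelerator block.
azure {
    batch {
        pools {
            gpu_ncast4 {
                vmType  = 'Standard_NC4as_T4_v3'   // smallest NCasT4_v3 size with 1 GPU and >= 16 GB RAM
                vmCount = 1
            }
        }
    }
}

process {
    withName: '<NAME>' {
        queue = 'gpu_ncast4'   // process.queue overridden to target the GPU pool
    }
}
```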
Related issue: https://github.com/epi2me-labs/wf-basecalling/issues/6
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi, is this a thing? Should we use a Docker image with the NVIDIA drivers installed? Or should we use a custom VM with all the requirements already installed?
@adamrtalbot do you think you could write a section for the Nextflow/Azure docs explaining how to use GPUs with the current approach? I recall there was some way to do it through the container image which automatically gets mapped to a GPU-enabled VM, etc. That would also help us evaluate the benefit of supporting the `accelerator` directive explicitly.
On Azure, if you use a machine with a GPU it automatically mounts the GPU to the container, no additional options required. When using the autopools feature, if you set `process.machineType` to one of these it should work automatically.
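For example, a minimal config along these lines should be enough (a sketch; the VM size and label are just examples):

```groovy
// Assumes the autopools feature; 'Standard_NC6s_v3' is just one size from the
// GPU VM list linked above.
azure {
    batch {
        autoPoolMode      = true
        allowPoolCreation = true
    }
}

process {
    withLabel: 'gpu' {
        machineType = 'Standard_NC6s_v3'   // GPU-enabled VM; the GPU is exposed to the container automatically
    }
}
```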
It could be automated to restrict the `machineType` to that set of VMs when `accelerator != null`, which I think is what @abhi18av is saying in this comment.
Personally, I've just used the `queue` directive to point at the relevant Azure Batch pool composed of GPU-enabled machines. So I guess the documentation would just be, "to use a GPU-enabled VM, select a GPU-enabled VM"?
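i.e. something like this, where the pool and process names are just placeholders for a pre-existing Azure Batch pool built from GPU-enabled VMs:

```groovy
process basecall {
    queue 'my-gpu-pool'   // existing Azure Batch pool of GPU VMs

    script:
    """
    nvidia-smi
    """
}
```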
The `accelerator` directive is currently supported for AWS Batch, Google Batch, and Kubernetes. Azure Batch also has GPUs, so this directive should also work for Azure Batch.

Not sure yet how to implement it with the Azure Batch Java SDK. All I found so far is the `GpuResource` class. @abhi18av do you have any idea how we would support GPU requests here? I don't know if there are specific GPU-enabled instance types, or if we can just attach a GPU to any instance or what.

In any case, I think we can map the accelerator type to the `GpuSku` enum.

So an example request would look like:
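```groovy
// Sketch only: 'V100' assumes the GpuSku naming above, and the process name is a placeholder.
process gpu_task {
    accelerator 1, type: 'V100'

    script:
    """
    nvidia-smi
    """
}
```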