marcodelapierre opened this issue 2 years ago
I like it! And I wonder if we need to make it easier to parse these feature groups - e.g., recent changes to default_version have a similar kind of logic: check the value and act differently depending on the case. We would soon have the same for mpi (and maybe others in the future). I can give this a shot at implementation, although I want to work on update first (probably this weekend), since I think the binoc runs are generating incorrect listings!
Haven't gotten to try this out yet - I worked on the update functionality today! Not sure how much progress I made there, but this is next on my TODO list to play around with.
okay this is next in my queue @marcodelapierre ! I haven't forgotten!
no rush. my goodness, this week I am completely work flooded, too
I'm not terribly flooded (yet, knock on wood!), but I like working on one shpc new feature at a time! So in Linux terms let's just say my brain works fairly serially, or in HPC terms I'm single-threaded, within a single project. :thread: :laughing:
Ahah always a great metaphor, love it! 😄 Despite my job, my brain is proudly single threaded, too lol, that's as much as it can do!
[In other SHPC issues, hopefully I will get to comment on the environments/views, it is a very powerful concept, and I do have a scenario to share with you and the other contributors]
@marcodelapierre one quick question! So this approach:
Assumes that a container can only be built for one GPU type. E.g., tensorflow/tensorflow could be matched to nvidia, but not amd. Is that correct? And wouldn't we run into issues with different tags being intended for different GPUs? This does feel like something that should still be general in the container recipe, to not hard code a bias (e.g., true/false), but then on a particular install it should be up to the admin to decide the customizations. Our previous approach assumed a center is using one GPU type, and currently the admin would need a "one off" to install the same container name with a different GPU. Is that the action that is annoying / can be improved upon? Some more thinking:
So TLDR: I think we want to make this easy and support it, but we want to ensure that we don't hard code a preference into a container.yaml that might be different / change with tags, and I think we should find the right way to scope this (e.g., scoped in a view I think would make sense!)
Great points, sorry @vsoch I have been swamped these days, trying to catch up!
To be honest, I would consider it very unlikely that a single container image tag contains builds for multiple GPU vendors (happy to be proven wrong...). If not for other reasons, because it would imply maintaining both CUDA and ROCm in the same image... which seems an unnecessary nightmare to me. So I think most, if not all, Dockerfile writers would avoid this situation. I am saying this because it seems like, on the SHPC side, supporting a single tag with two vendors would add a lot of complexity, so I first stopped for a sec to think about whether it is a likely usage scenario.
But on the other hand...
One scenario which I agree we definitely need to support is the one where different tags of the same image are built for different vendors. To this end... remember we were talking about tag-specific customisations in the recipe? That would do the job, wouldn't it?
Here is the issue on this feature: https://github.com/singularityhub/singularity-hpc/issues/536
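For illustration only, a tag-specific customisation could look something like this. The syntax below is entirely hypothetical (the actual schema would be settled in #536); the `tag_overrides` key and the specific tag names are made up for this sketch:

```yaml
# container.yaml - hypothetical sketch, NOT the final schema from #536
docker: tensorflow/tensorflow
features:
  gpu: nvidia          # default vendor for most tags
tag_overrides:         # made-up key, for illustration
  "2.9.1-rocm":
    gpu: amd           # this tag is built for AMD cards
```

The idea being that the recipe keeps a sensible default, while tags known to target a different vendor can override it.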
So, bottom line, I agree with you that we need to improve this aspect of this functionality, starting from the case where multiple tags of the same image support distinct vendors.
What do you think?
Thinking more about environments in this context, and your point on AMD+Nvidia containers... why not?! In the end, once there are envs/views that manage the thing, it would just be a matter of allowing a list of values for the container.yaml GPU feature; then the environment setting would let you pick the desired one.
I am not really adding much here, just paraphrasing your thoughts, which I can say I support!
Just to loop back here to discussion - when you review #545 think of it in context of some of these questions. E.g., if we can find a way to customize a specific module install (still maintaining symbolic links or something else?) I think we could handle specifics like this.
See my comment on #545, where I suggest the following:
If we restrict the scope of the current issue to a single GPU vendor, then I would just suggest changing the functionality inside the container yaml, from

```yaml
gpu: # True or False
```

to

```yaml
gpu: # amd, nvidia or false/null
```

on the grounds that typically a container is only built for one vendor. However, if you think it's better to keep the flexibility, then no update is needed for the current functionality.
This is next in the queue after views! I did start working on it actually, but paused with views in case it turned out to be a subset of that (right now it looks like it will be in addition to views).
@marcodelapierre now that we have views could there be a way to allow this additional customization through them?
Hi @vsoch,
I think we could provide the functionality in two ways:
My personal preference is the first, as it still seems to be simple and flexible at the same time. However, we've also learnt that it is good to provide multiple ways to achieve the same setup, as different people/centres will have different preferences. SHPC views seem great in providing this additional flexibility in setups.
This thought came out of the issue on MPI #527, so thanks @georgiastuart for the inspiration!
Current interface of the GPU feature:
I have realised that the current interface does not specify, for a given recipe, whether the corresponding package/container was built for Nvidia or AMD cards, which is known beforehand. As a consequence, this is limiting in the (probably unlikely?) scenario where a centre has both Nvidia and AMD cards.
Updated interface/configuration, for consideration:
- `--rocm` flag if the global setting contains `amd`; ignore if the latter is null(?)
- `--nv` flag if the global setting contains `nvidia`; ignore if the latter is null(?)

Small implication: update the documents, and update the few preexisting recipes which have `gpu: true` (all Nvidia, apart from `tensorflow/tensorflow`, for which it is to be checked).
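The mapping above could be sketched roughly as follows. This is just an illustrative snippet, not shpc's actual code; the function name `gpu_flag` is made up for the example:

```python
def gpu_flag(gpu_value):
    """Map a container.yaml 'gpu' value to a Singularity runtime flag.

    Per the proposed interface: amd -> --rocm, nvidia -> --nv,
    false/null -> no flag at all.
    """
    if gpu_value in (None, False):
        return None
    mapping = {"amd": "--rocm", "nvidia": "--nv"}
    try:
        return mapping[gpu_value]
    except KeyError:
        raise ValueError(f"Unrecognized gpu value: {gpu_value!r}")
```

So a recipe with `gpu: amd` would add `--rocm` to the run/exec commands, while a null value would leave them untouched.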
What do you think @vsoch?