@j-xiong @sayantansur @sungeunchoi @hppritcha @jeffhammond @pkcoff -- To assist in defining this, I need to understand better how job keys are used currently, along with other details. First, are there any problems with the description for this issue? Second, would a 64-bit key be sufficient for all existing networks?
I know that Omni-Path defines a 16-bit job key. I'm OK with the description.
For example, it can be assigned to every transmit/receive operation, or a single key can be associated with a specific endpoint.
The former seems like an unnecessary overhead. Associating these keys with endpoints makes sense.
@pkcoff should talk to Venkat about this. We should also invite feedback from e.g. ADIOS folks. These I/O and/or analysis folks seem to have nontrivial usage models for this type of feature.
The 16-bit OPA-1 job key is managed by the driver and serves as a way to prevent cross traffic between different users. We can ignore that for PSM2. The real job separation in PSM2 is through the UUID, which is 16 bytes.
For a job key to work, doesn't it need to be carried in the message protocol? What does psm2 do with a 16B UUID?
The psm2 code is large and difficult to follow, but it appears that it uses a 16-bit job key. The 16-byte value passed in through the API and labeled as a "job_key" isn't an actual job key. I'm not sure what it is. I see a job key hashed to an 8-bit value, 16-bit protocol fields, uses for a 32-bit value, and a uuid... But I'm not sure it matters.
I can think of a couple of ways that a provider could implement some sort of feature related to this. A direct way is to carry the key in every packet or message header. The receiver needs to verify the key before accepting the message. Alternatively, keys could be exchanged as part of a communication setup, which would authorize the source as a verified sender. The latter allows for arbitrarily large keys.
I don't want the API to limit the implementation, so I think we need to support very large keys, which I will henceforth refer to as "authorization" keys, rather than job keys.
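As a rough illustration of the first approach, a receiver-side check might look like the sketch below. This assumes a fixed 64-bit key for simplicity; the header layout and names are purely illustrative, not a real wire format.

#include <stdint.h>
#include <rdma/fi_errno.h>

struct msg_hdr {
    uint64_t auth_key;  /* key carried in every message */
    /* ... remaining protocol fields ... */
};

/* Verify the carried key against the receiving endpoint's key before
 * accepting the message. */
static int verify_msg(const struct msg_hdr *hdr, uint64_t local_key)
{
    return hdr->auth_key == local_key ? 0 : -FI_EACCES;
}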
Sounds good to me.
What level of exclusivity are you planning for these keys? Does this form a logical partition of the fabric? Should EPs be allowed to have multiple keys, or only a single key?
Each endpoint would be allowed a single key. An app would need to use multiple endpoints if they needed to use multiple keys. The app would have access to multiple keys through the fabric structure. See the initial proposal at:
https://github.com/shefty/libfabric/commit/492c72bd47dce0aae1cb491a9ef90ad8baf7cb04
Adding multiple keys per endpoint would essentially make the keys a per operation input value.
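For reference, a hedged sketch of what this looks like, based on the description in this thread rather than the exact contents of the linked commit; the field names are illustrative additions:

struct fi_fabric_attr {
    /* ... existing fields ... */
    size_t auth_key_cnt;   /* number of keys assigned to the app; 0 if none */
    void **auth_keys;      /* table of authorization keys */
};

struct fi_ep_attr {
    /* ... existing fields ... */
    size_t auth_key_index; /* selects one key from the fabric's table */
};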
Adding multiple keys per endpoint would essentially make the keys a per operation input value.
I thought so. I just wanted to clarify.
So this has been a hot topic with the GNI provider.
From the perspective of the GNI provider, the proposal seems fine. There are a few things that raise issues:
For example, it can be assigned to every transmit/receive operation, or a single key can be associated with a specific endpoint. (There may be other options.)
The domain seems to be the appropriate level for this type of API. A domain would represent the entirety of a job's partition of the network. Additional domains could use different protection keys, creating a different partition of the fabric.
Similarly, when dealing with RMA, the registered memory region should only be exposed to a specific group of peers. This could be done by adding a job key as part of the registration request, or associating the registration with an endpoint.
I believe this is something already accomplished by the domain limitations. Memory regions are tied to a domain, and cannot be used in a different domain. This seems like a natural use of a domain, and not something that should be pushed down the stack.
"A domain defines the boundary for associating different resources together." -- The fi_domain man page.
To assist in defining this, I need to understand better how job keys are used currently, along with other details.
We use a WLM provided protection key within the GNI provider when creating communication domains. These domains are used to create the GNI endpoints that ultimately carry the traffic to and from each instance. The protection key protects traffic from a GNI endpoint in the system from interacting with another GNI endpoint in the system if they don't share the same protection key. The endpoints inherit the protection key from the communication domain they were associated with.
Our internal discussions on the matter are still on-going. There is support for the notion of it being a per-domain field that is inherited by endpoints under the domain, and there is support of the auth key being used on a per-transaction basis.
Should we consider multiple levels of support for this type of feature that could work for most, if not all, providers? Similar to the FI_THREAD_* family of thread safety levels, perhaps providers could provide multiple levels of protection key inheritance that they support.
I prefer domain level inheritance because it feels like a clear level of separation, but others might prefer a per-ep level of inheritance because it fits the model of their hardware or fabric.
In this manner, models could be supported in descending order of support by providers willing to implement them.
Providers that only support the EP or domain levels of inheritance could ignore the protection key fields added to lower-level APIs, inheriting the pkey from the respective domain or EP rather than taking the API value. A hypothetical set of levels is sketched below.
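Something along these lines, mirroring the FI_THREAD_* pattern; the names and values are invented here for illustration only:

enum fi_auth_key_level {
    FI_AUTH_KEY_UNSPEC,  /* provider default */
    FI_AUTH_KEY_DOMAIN,  /* key set on the domain, inherited by EPs */
    FI_AUTH_KEY_EP,      /* key set per endpoint */
    FI_AUTH_KEY_OP,      /* key supplied per transaction */
};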
Thoughts?
For example, it can be assigned to every transmit/receive operation, or a single key can be associated with a specific endpoint. (There may be other options.)
I selected per endpoint, as that minimizes the changes to the existing interfaces, and will likely provide better performance for most use cases.
The domain seems to be the appropriate level for this type of API. A domain would represent the entirety of a job's partition of the network. Additional domains could use different protection keys, creating a different partition of the fabric.
I placed the full set of keys with the fabric, but maybe a domain is a better match. This needs more discussion.
Similarly, when dealing with RMA, the registered memory region should only be exposed to a specific group of peers. This could be done by adding a job key as part of the registration request, or associating the registration with an endpoint.
I added the authorization key to the memory registration request.
Could we consider multiple levels of support for this type of feature that could work for most, if not all, providers? Similar to the FI_THREAD_* family of thread safety levels, perhaps providers could provide multiple levels of protection key inheritance that they support. I prefer domain level inheritance because it feels like a clear level of separation, but others might prefer a per-ep level of inheritance because it fits the model of their hardware or fabric.
We can, but I'm really wanting to start reducing the number of options that apps need to deal with. This also impacts the API, as the keys must be provided in different calls. Per transaction is particularly painful to expose.
Two thoughts for now after looking at shefty@492c72b:
One auth key per domain seems like a better fit to me. That way, MRs, CQs, and other domain objects are protected by the auth key without any further API changes. It's not clear to me how allowing an auth key per EP will enable better performance for a given HW.
Most jobs will probably be able to initialize with an auth key in the list returned from fi_getinfo() and be done with it. However, some jobs may want to use keys that don't become accessible until after a job launches. To support using keys that are not advertised by the fi_info structure returned from fi_getinfo at the start of a job, we may need a way to set a domain's job key with a provider specific value. Something like:
struct fi_domain_attr {
    ...
    enum auth_key_type auth_key_type;  // index or key value types
    size_t auth_key_index;
    void *auth_key;
    size_t auth_key_len;
    ...
};
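Hypothetical usage under this sketch, assuming hints came from fi_allocinfo(); FI_AUTH_KEY_VALUE is an illustrative constant, not a real one, as are wlm_key_buf/wlm_key_len:

hints->domain_attr->auth_key_type = FI_AUTH_KEY_VALUE;
hints->domain_attr->auth_key = wlm_key_buf;      /* e.g., from the workload manager */
hints->domain_attr->auth_key_len = wlm_key_len;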
One auth key per domain seems unnecessarily restrictive. There could be use cases where one domain object is used to connect multiple jobs. For example, in MPI dynamic job connection, the MPI library can keep one Completion queue, but have different endpoints per inter-communicator. If we have just one domain key, MPI is forced to create multiple completion queues and poll them.
I think similar arguments can be made for when compute jobs connect to object based I/O servers where the compute job itself might have a different key than the I/O objects it is permitted to access.
@shefty You quoted the original text with some of your responses.
I selected per endpoint, as that minimizes the changes to the existing interfaces, and will likely provide better performance for most use cases.
Would there be much, if any, change if this were per-domain rather than per-endpoint?
What use cases are you envisioning? I've discussed this issue with the IO/ADIOS folks in the past so I have some context for the use cases that can occur with this type of feature.
I placed the full set of keys with the fabric, but maybe a domain is a better match. This needs more discussion.
Are you talking about the list that you've proposed in the fabric?
I added the authorization key to the memory registration request.
Couldn't the authorization key simply be tied to the domain? If anything, that would make the logic for the provider super simple because now the endpoints in the domain are already limited to using the memory registrations created within the domain. Adding any fields to lower-level APIs would be unnecessary.
As I understand this proposal, you already need to create endpoints to use different authorization keys, and you also need to create memory registrations for those endpoints to use because they would need to be under the same auth key.
We can, but I'm really wanting to start reducing the number of options that apps need to deal with. This also impacts the API, as the keys must be provided in different calls. Per transaction is particularly painful to expose.
I agree that per-transaction is painful to expose, so I don't recommend it. I only want to provide a more flexible solution in the event that other providers would want to support a different model.
Whether we use the domain as the logical splitting point (by embedding the auth key), or use the authorization key as a separate logical splitting point seems to be the crux of the discussion.
What should the domain be used for? Is it meant for splitting resource groups locally, or globally?
For example, in MPI dynamic job connection, the MPI library can keep one Completion queue, but have different endpoints per inter-communicator.
That inter-communicator for the external parallel job represents a completely disjoint subset of resources from the resources in the MPI_COMM_WORLD communicator. This is why I'm saying a separate domain makes sense.
I think similar arguments can be made for when compute jobs connect to object based I/O servers where the compute job itself might have a different key than the I/O objects it is permitted to access.
I'm pretty sure that you and I are thinking of the same use case. I'd say these are distinct resource groups.
I'm not fond of the idea of pushing the logic further down the stack for the benefit of having one CQ.
I'm not fond of the idea of pushing the logic further down the stack for the benefit of having one CQ.
I'm also not fond of polling a dozen CQs unnecessarily. It hurts performance. It is not just a matter of a CQ, but all the objects that are associated with a domain will need to get replicated.
If the provider supports only one key, it can do so by indicating the number of keys it actually supports.
FYI .. some hardware does not support job keys or protection domains at all. This proposal should have a way to expose this fact to the application... :)
Yup :) you can set the count of supported keys to 0 in Sean's proposal.
Re-posting this issue to the ofiwg mailing list to get broader discussion.
I'm also not fond of polling a dozen CQs unnecessarily. It hurts performance. It is not just a matter of a CQ, but all the objects that are associated with a domain will need to get replicated.
That is a downside, but is it a major performance concern? Are there issues with other domain objects? It might help to list the pros/cons on each domain object of associating auth keys with the domain vs. the EP (am I missing other domain objects?):
Object | Domain Keys | EP Keys |
---|---|---|
CQ/CNTR | Con: Must use unique completion objects for each auth. domain. A process can still poll completion objects on separate domains using wait sets or FD wait objects. | Pro: Transactions using separate auth. domains can use same completion stream for separate auth. domains. |
MRs | Pro: No Changes to existing MR API | Con: Changes, added complexity to the MR API. |
AVs | Pro: Separate auth. domains may not use the same set of addresses, so sharing an AV provides no benefit. | Pro: No overhead for extra container. Con: There may be collision problems on networks with hardware support for remapping PE addresses in different auth. domains. |
Con for domain: Disallows interactions, such as triggered operations, between different keys.
Pro for EP: keys seem more naturally associated with an EP or part of the transport protocol.
Pro for domain: Protections, such as MR keys, are currently associated with domains.
We can alleviate the Con of the API change by introducing appropriate flags such that only apps that are interested in using security keys need to change. If I'm reading right, Sean's proposal allows apps to ignore the key (by setting index 0) or to use the older ABI, in which case the library will ignore the key for you.
Polling CQ is only one thing I picked on. @ztiffany pointed out that not sharing AVs can be pretty hard on memory requirements. Imagine the memory cost of adding an inter-communicator if all the addresses in the giant O(N) table have to be duplicated!
I'm not sure that the AV is an issue. You need 1 AV entry per peer EP, independent of using keys. So I don't think splitting the AV would cause duplications of addresses.
If anything, reducing the AV size would mean allowing multiple keys per EP, which makes the key a per operation parameter. :/
True, scratch the AV memory size issue. I got carried away :)
FYI .. some hardware does not support job keys or protection domains at all. This proposal should have a way to expose this fact to the application... :)
Agreed. However, depending on where is this exposed, it can be done by software if we wanted to support it. Sockets could easily take part in this by rejecting during connection setup if the auth key from the other socket doesn't match.
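A minimal sketch of that software enforcement, assuming fixed-length keys exchanged during the provider's connection handshake; the names and the handshake itself are illustrative:

#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <rdma/fi_errno.h>

static int check_peer_key(int sockfd, const void *key, size_t key_len)
{
    char peer_key[64];  /* assumes key_len <= 64 for this sketch */

    /* Exchange keys as part of connection setup. */
    if (send(sockfd, key, key_len, 0) != (ssize_t) key_len)
        return -errno;
    if (recv(sockfd, peer_key, key_len, MSG_WAITALL) != (ssize_t) key_len)
        return -errno;

    /* Reject the connection if the peer presented a different key. */
    return memcmp(peer_key, key, key_len) ? -FI_EACCES : 0;
}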
Con for domain: Disallows interactions, such as triggered operations, between different keys.
In general, that is the intent by design. Authorization keys are supposed to prevent interaction between objects utilizing different keys, so I would think this applies to endpoints as well.
Pro for EP: keys seem more naturally associated with an EP or part of the transport protocol.
To me, the keys are more naturally associated with the domain. Each object within a domain has a specific use. For our current set of use cases, there is rarely, if ever, a reason to distinguish resources from other resources using a separate domain. However, the use case presented by auth keys is a clear fit for it.
One reason why you'd use an auth key is to join an external parallel job to the current application. This generates traffic to the host application that may be completely orthogonal to the traffic within the host application. For example, consider a simple application performing iterative analysis over long periods, with the option to inject more data via an external parallel job. The traffic between the ranks of the host application has a specific set of memory registrations, endpoints, and resources specific to the host. These resources are used for communicating with the host ranks, and not with the external job.
However, the traffic between the host application and the external parallel job may be entirely different in nature. The external job may be attempting to read data (debugging), write data (injecting data via an out-of-band channel), or something else entirely. The memory registrations and endpoints associated with this use are specific to the external parallel job, and may not be relevant to the other endpoints and memory registrations. The application might decide to forward the data received from the external application to some of the internal host ranks, but maybe not.
We can alleviate the Con of the API change by introducing appropriate flags such that only apps who are interested in using security keys need to change.
The API change will still require newer applications to adopt the new function call and its arguments, regardless of whether they ever intend to leverage the feature. This complicates application design, even if only slightly.
Anyone who is registering memory in an internal cache may need to differentiate registrations within the same cache by the different auth key. For example, the GNI provider utilizes a memory registration cache to avoid duplicate memory registrations when appropriate since the process of registering memory is expensive for us. Since memory registrations are tied to the domain, caches would now need to take in even more information to support duplicate registrations (same address, same length) with differing auth keys.
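To illustrate the cache point, the lookup key of such a registration cache would need to grow along these lines; this is a sketch with illustrative field names, since the same (addr, len) pair can now map to distinct registrations that differ only by auth key:

struct mr_cache_key {
    uint64_t addr;
    uint64_t len;
    uint64_t auth_key;  /* new: distinguishes otherwise-identical registrations */
};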
Con(EP): If the hardware requires a unique memory registration for endpoints utilizing differing protection/auth keys, then resources will be duplicated anyway.
Pro(EP): Reduced container overhead at the libfabric level. Some resources could be shared between EPs with differing auth keys, but it isn't clear this is true yet.
Pro(Domain): Better logical segmentation. This fits well with larger jobs that may have huge internal structures with large amounts of resources allocated for the host job. Consider the case that you are adding a small parallel job to a large host job. If they shared the same objects, searches for elements of the parallel job would also have to search through the host application resources, increasing search times.
Con(Domain): More restrictive on resources than EP.
Pro(Domain): Better logical segmentation. This fits well with larger jobs that may have huge internal structures with large amounts of resources allocated for the host job. Consider the case that you are adding a small parallel job to a large host job. If they shared the same objects, searches for elements of the parallel job would also have to search through the host application resources, increasing search times.
I didn't understand this. What is being searched here? Every completion has its own context, so it is easy to find what completed and what part of the application it belongs to by dereferencing the context.
Let's keep in mind that the application can just as easily assign a separate CQ for the external job. i.e. adding auth_key to EP does not take any flexibility away.
Looking from MPI down:
(1) MPI has specific entry points for communication (send/recv) with external jobs. It is easy to reference an external job specific endpoint while initiating any communication call.
(2) MPI libraries are structured to perform anonymous polls. That is, when you need to do network related work, there is usually one routine that gets called that polls all incoming paths, since we don't know where the messages can be coming from. Adding the key to the domain causes changes in the MPI library, since the polling routines now need to be changed, and the entry point is no longer specific to an external job.
Adding the key to the EP retains more flexibility in the app. The app can either choose to just poll one CQ or separate out, if there is a need. MPI, for example, will not want to separate it out due to the above reasons.
I didn't understand this. What is being searched here?
Address vectors.
Indirect references to objects. (memory registrations on keys longer than what is represented by 64 bits)
Internal data structures when resolving objects. (tag matching)
Looking from MPI down
While MPI can make a good reference argument, we need to consider non-MPI cases, especially for the IO domain that is likely to make use of this.
Adding the key to the EP retains more flexibility in the app. The app can either choose to just poll one CQ or separate out, if there is a need. MPI, for example, will not want to separate it out due to the above reasons.
MPI could also just write a wrapper function to cover the polling of two or more CQs instead of one, which would satisfy the above.
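A minimal sketch of such a wrapper, scanning several CQs and returning the first completion found:

#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

static ssize_t poll_all_cqs(struct fid_cq **cqs, size_t ncqs,
                            struct fi_cq_tagged_entry *entry)
{
    for (size_t i = 0; i < ncqs; i++) {
        ssize_t ret = fi_cq_read(cqs[i], entry, 1);
        if (ret != -FI_EAGAIN)
            return ret;  /* a completion or a real error */
    }
    return -FI_EAGAIN;   /* all CQs empty */
}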
Adding the key to the EP retains more flexibility in the app.
Let's discuss why having more flexibility is better or worse in this situation than being restrictive. I think we both agree that the endpoint is more flexible and the domain is more restrictive, but my argument is that the domain not only meshes well with the documentation (fi_domain), but also with the nature of the domain itself. The domain acts as a natural point of separation, whereas the endpoint is much more granular and can be mixed with other resources that do not share the same attributes.
The restrictive aspect of the domain works to our benefit here as provider developers because it simplifies our logic. Yes, the endpoint option would be more flexible, but many of the underlying structures would then need to take the auth key into account.
Hardware support may not be in place for some providers to handle that level of granularity easily either. Depending on where we place this (domain or endpoint), we risk making it more difficult for someone, and easier for someone else.
The questions I'd like to ask: By making this more flexible for the application developer, what do we burden ourselves with? By allowing them to bind it at an EP level, do we encourage a programming model where it is difficult to discern what is wrong because we've mixed objects? Do we allow transactions to occur that should not, in software, because our interpretation of the protection key is looser than what is defined by the hardware? (shefty's triggered ops)
On the flip side, if we make this more restrictive, is it to a point where the developer has no interest in using said feature? What types of costs does a more restrictive approach incur, and can we quantify them in a manner that lets us make an informed decision?
I don't want to cite provider-specific examples because what might be easy for one provider is probably going to be more difficult for another provider.
These were the thoughts behind my proposal:
The fabric is associated with multiple auth keys. The thought behind this was to support the app talking with different groups of peers; each group with its own key. By putting the keys with the fabric, the app could make use of separate domains, where each domain represents a NIC. That is, the keys are being assigned to the application, not a specific hardware resource.
This model still supports keys controlled at the domain/NIC level, since fi_getinfo can restrict which domains are reported for which fabric.
I think of a single key being associated with a transport protocol, either embedded within the protocol or part of the endpoint configuration. So, IMO, the key is better associated with a protocol related object, such as the EP, than a resource related object, such as the domain/NIC.
Good questions.
By making this more flexible for the application developer, what do we burden ourselves with? By allowing them to bind it at an EP level, do we encourage a programming model where it is difficult to discern what is wrong because we've mixed objects?
Can you give an example where this would lead to a problem?
All operations that initiate communication are on the endpoint, so any TX operation that matches an RX operation gets checked with the keys. Whether or not the -completion- of these operations is reported on shared objects, or the network address is taken from a shared object, doesn't seem like an issue to me.
Do we allow transactions to occur that should not, in software, because our interpretation of the protection key is looser than what is defined by the hardware?
Also, an example will be useful. The way I'm reading this: the provider cannot simply define this looser since that would defeat the point of an authorization key. The provider may implement authorization keys on top of a given NIC using some mechanism that is not trivial for an application to bypass. Ideally, such mechanisms would require support from some system component that only authorized entities can have write access to.
By putting the keys with the fabric, the app could make use of separate domains, where each domain represents a NIC.
To be clear, are you talking about a mapping to a virtual NIC(software construct), or a physical device?
Can you give an example where this would lead to a problem?
First example: MPI has a habit of calling abort whenever it believes the sky is falling. Suppose the out-of-band job dies unexpectedly. Do we provide enough information or context to understand that only one of the channels has died, as opposed to a dreadful general failure, whether currently or in the future?
Second example:
Suppose we are running a large job (4K PEs or larger). We have large address vectors due to the size of the job. As we add external parallel jobs, those TX, RX, and AV entries share space with the host application's rank information. How do we avoid address conflicts with the existing addresses even though they have a different auth key? When tags are matched, or addresses resolved, would we have the information to pass to consumers of the libfabric API to inform them that the sharing of resources may be slowing down their application because all entries are co-resident? I suspect that some performance issues in the future will be due to too many things being packed into one structure rather than distributed.
In either case, since we allow the EPs to remain in the same domain, do we inherently allow for bad behavior that may be difficult to diagnose because it isn't clear whether it is a design issue or an organization issue?
Also, an example will be useful. The way I'm reading this: the provider cannot simply define this looser since that would defeat the point of an authorization key.
Software would need to adhere to the concept of an authorization key, but that allows for interpretation. If hardware prevents X from talking to Y because they don't share the same protection key, then should the provider be allowed to issue a triggered OP from X to Y because they are in the same domain?
Sockets are an example of something that could defy the definition of an auth key. We often hold our implementations up against sockets to verify correct functionality, when sockets supports the feature in question. However, if it isn't implemented carefully, sockets could allow an invalid operation that would easily be caught by hardware on another provider. Hardware in this case would be correct, but coming to that determination takes valuable time.
A domain is abstract, so it could map to a virtual or physical NIC. I was referring to a physical NIC, but I don't know that it matters. I am viewing the authorization key as defining communication boundaries between applications, not a hardware isolation mechanism.
I am viewing the authorization key as defining communication boundaries between applications, not a hardware isolation mechanism.
I think this is an important topic. There is very little difference here, and that's going to make a difference for some providers. For the GNI provider, we use hardware isolation mechanisms in the form of protection keys. Those protection keys are ingrained in our communication domains, which are ultimately tied to a virtual NIC, which is tied to endpoints and memory registrations. This has the effect that all nodes in the job are protected from nodes outside of the job.
Both constructs do similar, if not the same, things, but differentiating them is a problem. Saying you have an auth key in software without a corresponding key in hardware means very little. I suspect the Intel guys (@j-xiong) might have some feedback here as well, based on their earlier response about the job UUID.
In general, how do you plan to prevent an endpoint with auth key x from sending an RMA to memory that is not registered to the auth key? Are you planning to embed the auth key into the memory registration key as some of the 64 bits? The remote side only knows what the key is from the 64 bits, and doesn't have any attached metadata, because the user wasn't aware of the metadata when the key was passed from one rank to the next.
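One conceivable software scheme, sketched purely hypothetically, would reserve some of the 64 bits for the auth key, letting the remote side recover it from the key itself without extra metadata; the bit split here is arbitrary:

#include <stdint.h>

#define AUTH_BITS 16  /* hypothetical number of key bits reserved in the MR key */

static inline uint64_t make_mr_key(uint64_t mr_bits, uint16_t auth_key)
{
    return ((uint64_t) auth_key << (64 - AUTH_BITS)) |
           (mr_bits & ((UINT64_C(1) << (64 - AUTH_BITS)) - 1));
}

static inline uint16_t mr_key_auth(uint64_t key)
{
    return (uint16_t) (key >> (64 - AUTH_BITS));
}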
My understanding is that psm/OPA provider enforces job keys as part of the transport. Incoming messages are dropped unless they carry the correct key. A secure process/kernel agent is responsible for assigning which keys a process can use.
With IB, nodes can be limited in their communication through the use of pkeys, but those map to ports, not NICs. And each port can be assigned multiple keys, with the applications selecting which key a specific endpoint will use. Pkeys aren't directly intended to be used as job keys, but are more of a hardware partitioning mechanism. Pkeys are also usable with OPA.
I do think there's a difference between exposing a hardware isolation mechanism, versus an application level authorization key. Hardware isolation mechanisms are by nature implementation dependent. OFI intentionally avoids defining hardware-specific structures. Virtual fabrics/hardware partitioning could be exposed as separate fabric objects. I doubt we even need to expose the concept of job keys for that model to work. The fabric attributes would limit which nodes an application can communicate with, with the job key(s) implied by the selected fabric/domain.
But within that fabric, the application would be able to communicate with any other application (regardless if a scheduler limits which app runs on a specific virtual fabric). An application level key, on the other hand, allows different apps to use the same virtual fabric, but have restricted communication.
A secure process/kernel agent is responsible for assigning which keys a process can use. (shefty)
Most jobs will probably be able to initialize with an auth key in the list returned from fi_getinfo() and be done with it. However, some jobs may want to use keys that don't become accessible until after a job launches. To support using keys that are not advertised by the fi_info structure returned from fi_getinfo at the start of a job, we may need a way to set a domain's job key with a provider specific value. (ztiffany)
Thoughts?
I do think there's a difference between exposing a hardware isolation mechanism, versus an application level authorization key.
I am not saying they should be directly exposed. That would be directly in conflict with making this an abstraction for the various providers. I'm saying that an auth key would be analogous to hardware isolation mechanisms for some providers.
I'm saying that an auth key would be analogous to hardware isolation mechanisms for some providers.
This is overly restrictive, since the process that owns the various objects under the domain, such as CQs, counters, AVs, etc., clearly possesses all the authorization keys. To me, it makes no sense to isolate local accesses to them based on a key.
The key is meant for the -remote- side where it is used to determine whether the incoming operation matches the authorization key that the receiving process intended.
In summary, it does look like keys per EP allow a lot more flexibility. Sharing triggers, CQs, and AVs between auth domains could be helpful, and I don't see a reason here to disallow it. It's just unfortunate that MRs aren't created using the libfabric object where protection is done. In hindsight, maybe fi_mr_reg() should have taken an EP.
I think this other point I made got missed:
Most jobs will probably be able to initialize with an auth key in the list returned from fi_getinfo() and be done with it. However, some jobs may want to use keys that don't become accessible until after a job launches. To support using keys that are not advertised by the fi_info structure returned from fi_getinfo at the start of a job, we may need a way to set a domain's job key with a provider specific value. Something like:
struct fi_domain_attr {
    ...
    enum auth_key_type auth_key_type;  // index or key value types
    size_t auth_key_index;
    void *auth_key;
    size_t auth_key_len;
    ...
};
Sharing triggers, CQs and AVs between auth domains could be helpful and I don't see a reason here to disallow it. It's just unfortunate that MRs aren't created using the libfabric object where protection is done. In hindsight, maybe fi_mr_reg() should have taken an EP.
The MR inline calls can support EPs, since they just take a fid as input, and not a domain. (This was intentional to support MR <-> EP bindings in the future). The defined API (MR associated with domain) supports the behavior of IB, iWarp, and RoCE hardware, and is necessary for connection-oriented applications.
Adding registration calls to an EP is possible, but the resulting behavior needs to be defined in an implementation independent way. For example, with IB and iwarp, this feature would map to memory windows, which consume an entry on the transmit queue and generate a completion entry. I'm not convinced that that's the behavior that's desired, but it's what the most widely deployed hardware will do...
Most jobs will probably be able to initialize with an auth key in the list returned from fi_getinfo() and be done with it. However, some jobs may want to use keys that don't become accessible until after a job launches. To support using keys that are not advertised by the fi_info structure returned from fi_getinfo at the start of a job, we may need a way to set a domain's job key with a provider specific value.
We lack the definitions needed to report these sorts of events. New keys can be used by an app under the current proposal, provided that the app opens a new fabric. Alternatively, there needs to be some mechanism to report auth_key changes to the app using an EQ, plus some (likely impossible) synchronization scheme defined to avoid race conditions with the older keys.
On MR per EP:
@shefty I think your last comments further point out the fact that we lack a libfabric object that represents a job protection domain. Until this discussion I kind of assumed that the libfabric domain was that object. I think that's what @jswaro was trying to point out also. I don't think it's a deal breaker that we don't have an object that closely maps to a protection domain, but that is where some confusion is coming from.
On dynamic auth keys:
You seemed to throw away the idea of allowing a user to provide their own auth key obtained out of band. Any explanation for that?
Good point on the dynamic auth key. In my survey of usage models, I didn't come across a strong desire to have that support. Taking the MPI example, I found that users were willing to start a job with all the possible keys they are required to support later.
The job, when launched, has all the authorizations it needs. This ties into the implementation of the keys as well, as keys can't just be made up, and the communication of the keys and their association to underlying fabric resources must be secure (in an implementation-defined way).
Not saying that this cannot be done dynamically, but I found it fairly hard, and given that most hardware now only supports one or relatively few authorization keys, I didn't think this was a very strong usage model (for now).
The domain is intended to be a resource domain. We don't have a job protection domain object. Figuring that out, if it is even needed, is part of this issue.
I associated an endpoint with a single authorization key. This association is made at EP creation, because that seems the time that is most easily supportable by different implementation choices. EP creation takes as input an fi_info structure. I chose to use an index to select an appropriate key because it made the API safer -- I did not want a pointer here.
Supporting out of band communication is fine, but IMO a network API should not be designed around apps using some other network API in order to make it work. So in-band features are essential.
The list of available keys were stored with the fabric object in a table. It would not be difficult to provide calls for the app to modify that table after fabric creation. We could even go all out and make the table an actual object, like an AV, but that seems like overkill.
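As a sketch, such calls might look like the following; the names are invented here for illustration and are not a proposed API:

int fi_fabric_insert_auth_key(struct fid_fabric *fabric, const void *key,
                              size_t key_len, size_t *index);
int fi_fabric_remove_auth_key(struct fid_fabric *fabric, size_t index);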
The job, when launched, has all the authorizations it needs. This ties into the implementation of the keys as well, as keys can't just be made up, and the communication of the keys and their association to underlying fabric resources must be secure (in an implementation-defined way).
This differs from our environment slightly. Not all of the keys available to the fabric will be available at the start of the job.
Applications are authorized to use the network and communicate with other ranks of the same application using the key that was provided at the start of the job. However, if requested, the application can request additional keys through a secure process and share those keys with other applications/jobs.
Not saying that this cannot be done dynamically, but I found it fairly hard, and given that most hardware now only supports one or relatively few authorization keys, I didn't think this was a very strong usage model (for now).
ADIOS and a few other projects have interests in parallel job communication using dynamically allocated authorization keys. Primarily, those cases are the foundation for the API request to add auth keys after a fabric has been created.
We don't have a job protection domain object. Figuring that out, if it is even needed, is part of this issue.
It's an interesting idea.
Supporting out of band communication is fine, but IMO a network API should not be designed around apps using some other network API in order to make it work. So in-band features are essential.
Could you elaborate? If you want to provide an API to request/provision additional auth keys, then it sounds like an interesting opportunity. Providers that don't have more than one key, or that don't support dynamic allocation of authorization keys, wouldn't have to implement such functions.
There are use cases where applications are long running: you might start a separate application that needs to gain access, and the long-running application wouldn't want to share its internal communication key, but would instead set up a separate auth key with the new application. This key could also be revoked. Take a long-running data generator application, where a separate visualization component could be started and perform RDMAs from the data generator. The generator application wouldn't give out the auth key for its internal communication system. Allowing dynamic addition and revocation of these keys would allow secure communication to be formed. You might have a file server that wants to make sure that, while all clients are talking to it, each client can only access the specific things it is authorized for, so each client that connects is provided a specific key.
There are specific users who wish to have a secure way of creating HSN links between disparate applications, where the security is done not at job launch but through application APIs and internal authorization. It should be on the underlying provider to verify that the application actually is authorized to utilize the key.
Does having the key imply authorization?
@jswaro, @jshimek - thanks for the clarifications. Looks like adding support for keys that are discovered dynamically might be useful.
@shefty - just a thought, would it work if we just took the auth keys out of the struct fi_fabric_attr and made them part of the endpoint? To be clear, just this:
struct fi_ep_attr {
    ...
    void *auth_key;
    size_t auth_key_len;
};
How the application discovers the key is left outside of the scope of libfabric.
Yes - that should work.
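For illustration, usage under that change might look like this sketch; oob_key and oob_key_len are assumed to come from the app's own out-of-band mechanism:

struct fi_info *hints = fi_allocinfo();
hints->ep_attr->auth_key = oob_key;        /* discovered outside libfabric */
hints->ep_attr->auth_key_len = oob_key_len;
/* ... then fi_getinfo(), fi_fabric(), fi_domain(), fi_endpoint() as usual ... */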
For purposes of clarification, is it expected that the addresses of two EPs which differ only by their authorization key would be identical? Or said differently, would it be expected that fi_getname would return the same address for EP_DGRAM/RDM endpoints differing only by the authorization keys they are using?
My understanding is that this would be true. Otherwise, it would make keeping track of lots of addresses a big memory problem. If we can map back to a single location, it's not so bad.
For purposes of clarification, is it expected that two EPs which differ only by their authorization key would have the same address?
No - they are still different endpoints with different addresses. The alternative is to have the key be a per operation parameter, which everyone has rejected as necessary so far.
No - they are still different endpoints with different addresses. The alternative is to have the key be a per operation parameter, which everyone has rejected as necessary so far.
Ugh. In that case, I'm definitely less in favor of this version. If we have to duplicate each endpoint for each job key, this could be a big deal if we use a lot of MPI dynamic processes. I realize that there are currently very few apps that use this, but it could become a problem for memory usage if that model starts to become more popular.
Would this be compatible if we attached the job key to the domain?
Several HPC applications make use of job keys to restrict communication between parallel running jobs. There is currently no mechanism in libfabric to support such a feature and would need to be handled out of band. We need to analyze job key support and determine what, if any, changes will be needed to support it.
The request as far as I understand it so far is that each application may be assigned zero or more keys. Each key allows them to communicate with some group of peers. Because the app needs to know which peer group it is talking with, the key should be part of the API. For example, it can be assigned to every transmit/receive operation, or a single key can be associated with a specific endpoint. (There may be other options.)
Similarly, when dealing with RMA, the registered memory region should only be exposed to a specific group of peers. This could be done by adding a job key as part of the registration request, or associating the registration with an endpoint.
Any anticipated solution is likely to expand the API (at least with new flags), but could also add new fields to some of the existing attributes (domain, MR, endpoint).