deadcafe commented 7 years ago

Hi

I am using this library for DPDK APP, but there are some problems.

CBC mode, CTR mode and GCM are intermingled and are used at the same time in IPsec. So an abstraction layer is being prepared by APP and MB API and GCM API are being used appropriately.

The gcm_data has substance of KEY information.Therefore it's necessary to copy KEY information in gcm_data from IPsec SA and eliminate after use.

If the KEY information in gcm_data is changed to pointer reference from substance, it becomes easier to use.

It's an ideal that GCM is also integrated into MB API, but that makes that a different issue about that.

tkanteck commented 7 years ago

Hi, I understand that it may be difficult for an application to provide common API for, let's say, AES-CTR and AES-GCM because of major differences in the library API's for these modes. DPDK crypto drivers (aesni and gcm) provide good level of abstraction and allow for API unification at application level. I think they also support session-less mode. Perhaps these drivers could fit into your application and reduce overhead of dealing with multiple API's.

The problem I see with gcm_data and the inclusion of key pointer within the structure is that 90% of gcm_data are expanded keys. Upon SA change key expansion operation has to be done. It seems natural to keep gcm_data as part of session structure.

Thanks, Tomasz

deadcafe commented 7 years ago

In the DPDK implementation, the drivers are different between GCM and CBC, so sequence control is required on the upper AP. Considering these implementation costs and performance improvement, I am using direct MB library. In general, extended KEY is kept in IPsec SA. In order to do GCM as it is, gcm_data is required in the SA. However, it is not possible to process multiple packets like the MB API by the same SA. Also, since IPsec SA exists in large quantities, it is not preferable to have a work area inside.

sgmonroy commented 7 years ago

Why do you need sequence control between GCM and CBC? Can you also elaborate regarding "it is not possible to process multiple packets like the MB API by the same SA"?

deadcafe commented 7 years ago

As it is L3, order management is not mandatory. However, since poor firewalls see the order of IP packets, order control may be necessary.

In my AP implementation, in order to process IP packet processing efficiently, we adopted a design that processes multiple packets at the same time. So, regardless of the algorithm, implementation is easy if crypto_request (packet_buffer, num_of_packets), crypto_poll (packet_buffer, sizeof buffer) style can be realized. At this time, when the GCM packet referring to the same SA exists in the buffer, it is preferable that the key information is a pointer reference.

After that, if you add AUTH_TAG at the time of decryption and a counter block at the time of CTR mode into the MB API JOB, session management becomes almost unnecessary.

It is totally unacceptable to use the packet area for AUTH_TAG validation during decoding of the DPDK implementation.

sgmonroy commented 7 years ago

The only scenario where I could see the order between CBC and GCM needs to be preserved is during SA re-keying where the new SA has a different algo than the old SA. Otherwise, to keep order of L4/TCP flows, each flow should match a single SA, which means either CBC or GCM, therefore no need to keep order between CBC and GCM.

I am not understanding your point regarding "GCM packet referring to the same SA". A session/context is related to an SA algo and key. Thus, all packets that belong to the same SA should use the same session/context. The session/context is just some static data that can be pre-processed, thus avoiding having to do it for every single packet. Your suggested model is what DPDK Cryptodev implements as session-less type of operation, where you provide the key. But effectively, you are just generating a session for the associated key before doing the encryption.

I agree that using the packet area for AUTH_TAG validation is not the best approach, but could you provide more details as to why it is totally unacceptable? We could submit a request to change the DPDK approach.

deadcafe commented 7 years ago

Thank you for your advice.

However, since we released the product directly using the MB library, there is a concern that performance will degrade when passing through DPDK. We will consider switching when chained mbuf supports CBC etc.

The problem with AUTH_TAG is that it is impossible to communicate with myself in the size of the MTU.Tunnel MTU needs to be considered up to the outside of AP management.

I decided to implement AUTH_TAG on MB JOB and compare it when I completed HASH. The theme of this proposal is that if the key information of gcm_data is a pointer reference it can easily be unified into the MB API.

sgmonroy commented 7 years ago

There is obviously some overhead of going through DPDK Cryptodev layer, that's the price to pay for a common API.

Re AUTH_TAG, thanks for the additional information. I think I understand your concern which is absolutely legitimate. With the current DPDK approach we could hit corner cases where there is not enough space left in the existing mbuf to store the AUTH_TAG and we would fail to process the packet (I said corner cases because it would depend on the application mbuf size and MTU). This is something that we should change in DPDK.

Back to your issue/proposal, Tomasz already mentioned that ~90% of the gcm_data is key related material, so your proposal would have a performance penalty:

processing GCM packets through MB API.
expanding keys for every packet

deadcafe commented 7 years ago

Ok. I assigned AUTH_TAG to an unused part in the headroom top in quick hack(bad know-how).

The KEY is just changed to pointer reference, and is there impact in performance? gcm_data { key_val; ctx; } -> gcm_data { key_ptr; ctx; } KEY pointer was also in JOB in MB API, so it was the recognition that it's no problem.

POC code which makes GCM MB API was made. It seemed good as POC in make, so it's proposed order. https://github.com/deadcafe/intel-ipsec-mb/tree/GCM_w_MB

deadcafe commented 7 years ago

The essence of my argument is that mapping GCM contexts to SAs is not acceptable. I noticed it now.That is because it is not permanently necessary information like the KEY information. It is not desirable to add meta information that does not need to be persisted, as SA members and SAs themselves are rounded and large in size.

However, I think that it is not a problem if it is smaller than extended AES KEY + HMAC KEY (specifically 512 bytes).

tkanteck commented 7 years ago

Hi deadcafe, I don't fully understand why expanded keys should not be associated with an SA. An SA contains algorithm and key information and whenever key changes it has be expanded for the library purpose. I am also not convinced that having pointers to expanded keys in gcm_data is essential. Please help me understand the use case here.

To me, the proposed change would only make sense if multiple SA's use the same keys. Then same expanded keys could be re-used for memory conservation purpose. This perhaps could be also achieved with multiple SA's pointing to the same gcm_data with expanded keys inside. Such solution shouldn't require code re-work - if this approach cannot be used then please explain why.

(after briefly looking into the pull request) Another potential benefit I see here is: if expanded keys and shifted_hkeyX structures are decoupled from gcm_data then one may avoid expanding keys when switching between AES modes (with the same key size and gcm needs enc keys only). However, I don't think this operation is critical enough to justify the change.

Thanks, Tomasz

deadcafe commented 7 years ago

Hi,

Since it seems to be efficient at the time of GMAC, I tried to separate HASH KEY from AES KEY, but that is not my Issue.

I want to handle CONTEXT separately from const KEY. This is because there is no need to hold CONTEX AREA like the KEY of the SA. Since SA has many numbers, the smaller one is preferred. It might be possible to move CONTEXT into JOB.

At the moment gcm_data is integrated into the MB API as it is.

tkanteck commented 7 years ago

why not link SA with the gcm_data structure through a pointer? app could also define own structure with the key and gcm_data inside in order to keep it separate from the rest of SA. This would't require any library API changes. Please provide reason to make the proposed change and explain how library users will benefit from it.

deadcafe commented 7 years ago

struct SA { cipher_and_authentication_algorithm; struct { aes_expanded_key; hmac_or_xcbc_or_other_expanded_key; CRYPTO CONTEXT; } bundle_key; / == gcm_data / ... };

Is it required to map CRYPTO CONTEXT here? There seems to be an alternative way to map to other areas. It expands the design choices of the program.

I prefer not to reserve CRYPTO CONTEXT and keys in continuous area, but reduce SA size. CRYPTO CONTEXT only needs to be present during the crypto request period.

How does the MB API correspond to chained buffers?

I think that CRYPTO REQUEST will only be executed if all buffers are available, even in the case of chain buffers. Therefore, I believe that CRYPTO CONTEXT is not essential to SA.

Since we are using hyperthreading, we need to save huge pages to avoid TLB misses.

However, if the key and CONTEXT are contiguous areas and the majority of the performance contributes, we abandon this proposal as a cost.

tkanteck commented 7 years ago

Looking into least invasive but elegant solution to this problem. A few solutions possible. I'll come back with some proposals.

tkanteck commented 7 years ago

Considered options:

Very much like the pull request https://github.com/01org/intel-ipsec-mb/pull/8
Mix of option 1 & current state so that applications using current API will not be broken
- new gcm_keys and gcm_context structures are created
- gcm_data doesn't change in terms of the content but uses the two new structures
- new API's added to work with the new structures (like 1 but different names)
- old API stays but is re-implemented to use the new API underneath

Option 2 only makes sense if porting existing code to new API (extra pointer) would be problematic. Option 1 is less code and would be preferred but I am looking for your feedback here.

deadcafe, you need new API for your application so the old one is not important to you. Is it correct? thbtcllt, since your pull request doesn't preserve old API then it is not important to you either. Is it correct? sgmonroy, how the things look like in your case? Is preserving old API important to you?

Thanks, Tomasz

thbtcllt commented 7 years ago

Yes it is correct. For me preserve old API is not important.

deadcafe commented 7 years ago

Me too, old API is unnecessary.

sgmonroy commented 7 years ago

No, it is not important to preserve the old API.

tkanteck commented 7 years ago

Many thanks for your replies.

I am currently working on GCM optimizations which change GCM algorithmic code. It would be much easier to commit optimizations first and then apply the API change (~2 weeks time).

deadcafe, as to wrapping GCM under JOB_AES_HMAC API, since you applied this change in your fork could you measure performance degradation vs direct API model? This data would be helpful in taking the right decision here.

Thanks, Tomasz

deadcafe commented 7 years ago

OK, but now I am busy with private affairs and may take time. Is it enough to simply measure the TSC/Bytes ?

tkanteck commented 7 years ago

Sure, take your time. Perhaps, LibPerfApp code will come handy in measuring both API's - it reports cycles per buffer size. ipsec_diff_tool.py can be used then to translate this data into linear equation which makes comparing bit easier. Thanks, Tomasz

deadcafe commented 7 years ago

Hi tkanteck, I pushed the GCM_w_MB branch corresponding to ipsec_perf. https://github.com/deadcafe/intel-ipsec-mb/tree/GCM_w_MB/PoC/LibPerfApp

It seems to be somewhat slow as it passes parameters with JOB structure.

tkanteck commented 7 years ago

Hi deadcafe, Thanks! it is expected to be slower - the question is how much. I'll run on it on a few systems to determine the cycle increase.

tkanteck commented 7 years ago

Thanks for your inputs. GCM API has been modified to decouple extended keys from the context. See https://github.com/01org/intel-ipsec-mb/commit/46a9d9e4e116f7f183ac1458f920441a6dc28a4e

I am still due to run tests for code wrapping AES-GCM under job API.

In the meantime, one question about IV in AES-GCM. There was an idea to remove need for an application to allocate 16 bytes for IV and pad from 12 to 16. GCM_INIT would automatically extend it to 16 bytes. Any cons against this idea? Would it break any of the use cases?

Thanks, Tomasz

tkanteck commented 7 years ago

Hi deadcafe,

As to wrapping GCM under job API. This what I see on one of my test systems using your fork:

NO ARCH CIPHER DIR HASH KEYSZ SLOPE A INTERCEPT A SLOPE B INTERCEPT B 5 AVX2 GCM ENCRYPT GCM AES-128 0.80453 249.75258 1.08232 342.51895 6 AVX2 GCM ENCRYPT GCM AES-192 0.94058 251.00849 1.15474 356.84141 7 AVX2 GCM ENCRYPT GCM AES-256 1.05036 267.30364 1.23642 379.86626 8 AVX2 GCM DECRYPT GCM AES-128 0.87138 211.18430 1.10187 319.06275 9 AVX2 GCM DECRYPT GCM AES-192 1.03768 208.38300 1.16702 333.78875 10 AVX2 GCM DECRYPT GCM AES-256 1.08966 233.55340 1.24291 357.49102

Legend:

Measured cycles translated into linear equation (slope & intercept)
A is direct GCM interface
B is GCM wrapped under job API

I expected to see increase in intercept but unfortunately increase in slope is seen (cost per byte). It is also quite substantial.

~100 cycles increase in intercept sounds reasonable to me having in mind that job structure has to be filled in, validated and then dispatched. I have no explanation for the slope increase and I think this should not take place. Could you investigate it please?

Thanks, Tomasz

deadcafe commented 7 years ago

ok

deadcafe commented 7 years ago

Hi tkanteck,

I also tried it(ipsec_perf --no-avx512 --shani-off on E5-2697v3). master vs PoC

589 SSE GCM ENCRYPT GCM AES-128 1.20788 218.80770 1.19862 293.48376 590 SSE GCM ENCRYPT GCM AES-192 1.33525 233.74176 1.33454 305.21961 591 SSE GCM ENCRYPT GCM AES-256 1.45537 248.10876 1.45610 325.81902 592 SSE GCM DECRYPT GCM AES-128 1.21061 196.94562 1.20771 276.20251 593 SSE GCM DECRYPT GCM AES-192 1.33536 208.39444 1.33216 285.47281 594 SSE GCM DECRYPT GCM AES-256 1.45602 221.80032 1.45489 304.86823 595 AVX GCM ENCRYPT GCM AES-128 1.20186 215.66634 1.19890 290.35531 596 AVX GCM ENCRYPT GCM AES-192 1.31252 229.67815 1.31237 308.90256 597 AVX GCM ENCRYPT GCM AES-256 1.43237 242.66523 1.48014 315.51649 598 AVX GCM DECRYPT GCM AES-128 1.20855 198.52473 1.20719 274.12168 599 AVX GCM DECRYPT GCM AES-192 1.32267 207.52768 1.32180 288.48388 600 AVX GCM DECRYPT GCM AES-256 1.48432 211.79958 1.49549 293.84498 601 AVX2 GCM ENCRYPT GCM AES-128 1.16401 155.21346 1.16459 222.64592 602 AVX2 GCM ENCRYPT GCM AES-192 1.32148 159.98314 1.32202 229.23401 603 AVX2 GCM ENCRYPT GCM AES-256 1.46517 171.15576 1.46536 241.91400 604 AVX2 GCM DECRYPT GCM AES-128 1.17726 144.43012 1.18001 211.50689 605 AVX2 GCM DECRYPT GCM AES-192 1.31758 154.11356 1.32276 225.00517 606 AVX2 GCM DECRYPT GCM AES-256 1.48110 161.34633 1.48191 235.47761

tkanteck commented 7 years ago

Hi deadcafe,

This data looks reasonable to me. Did you change anything? or were my measurements simply off? Could you create a patch that enables GCM under job API (not in PoC folder but on the main job_aes_hmac.h & others)? I'd like to merge it if no major performance impacts to non-GCM are detected.

Thanks, Tomasz

deadcafe commented 7 years ago

Hi tkanteck,

I fixed the following "PoC/LibPerfApp". The previous code is buggy. I'm sorry. I think that it is necessary to divide the patch into several (ARCH handler, initializer, job scheduler).

tkanteck commented 7 years ago

Let me close this issue as GCM key data has been separated and added under JOB API.

intel / intel-ipsec-mb

the KEY information in gcm_data is changed to pointer reference from substance #7

It is totally unacceptable to use the packet area for AUTH_TAG validation during decoding of the DPDK implementation.

At the moment gcm_data is integrated into the MB API as it is.

Since we are using hyperthreading, we need to save huge pages to avoid TLB misses.