ovpanait / zynq-aes

AES hardware engine for Xilinx Zynq platform
MIT License

AXI communication cycles #6

Open 55-AA opened 4 years ago

55-AA commented 4 years ago

Hi, I tested the hardware algorithm on 4.14.0-xilinx and compared it with the software algorithm. From the results below, I think the AXI communication consumes too many cycles. Maybe they could be reduced, especially for small packets of less than 512 bytes.

"ecb(aes)" : software algorithm; "ecb(AES)" : hardware algorithm; The test module was built form 'xilinx-linux/drivers/crypto/tcrypt.c'.

root@box:~# modprobe tcrypt alg="ecb(aes)"
testing speed of ecb(aes) async (ecb(aes-generic)) encryption
test  0 (128 bit key,    16 byte blocks): 1 operation in       630 cycles (   16 bytes)
test  1 (128 bit key,    64 byte blocks): 1 operation in      1769 cycles (   64 bytes)
test  2 (128 bit key,   128 byte blocks): 1 operation in      3377 cycles (  128 bytes)
test  3 (128 bit key,   256 byte blocks): 1 operation in      6537 cycles (  256 bytes)
test  4 (128 bit key,   512 byte blocks): 1 operation in     12884 cycles (  512 bytes)
test  5 (128 bit key,   768 byte blocks): 1 operation in     19235 cycles (  768 bytes)
test  6 (128 bit key,  1024 byte blocks): 1 operation in     26872 cycles ( 1024 bytes)
test  7 (128 bit key,  1536 byte blocks): 1 operation in     38228 cycles ( 1536 bytes)
test  8 (128 bit key,  2048 byte blocks): 1 operation in     50951 cycles ( 2048 bytes)
test  9 (128 bit key,  4096 byte blocks): 1 operation in    103058 cycles ( 4096 bytes)
test 10 (128 bit key,  8192 byte blocks): 1 operation in    203478 cycles ( 8192 bytes)
test 11 (128 bit key, 16384 byte blocks): 1 operation in    408923 cycles (16384 bytes)
test 12 (128 bit key, 32768 byte blocks): 1 operation in    821824 cycles (32768 bytes)
test 13 (128 bit key, 65536 byte blocks): 1 operation in   1656473 cycles (65536 bytes)

root@box:~# modprobe tcrypt alg="ecb(AES)"
testing speed of ecb(AES) async (hwcrypto-ecb) encryption
test  0 (128 bit key,    16 byte blocks): 1 operation in     14731 cycles (   16 bytes)
test  1 (128 bit key,    64 byte blocks): 1 operation in     14554 cycles (   64 bytes)
test  2 (128 bit key,   128 byte blocks): 1 operation in     15877 cycles (  128 bytes)
test  3 (128 bit key,   256 byte blocks): 1 operation in     14026 cycles (  256 bytes)
test  4 (128 bit key,   512 byte blocks): 1 operation in     13528 cycles (  512 bytes)
test  5 (128 bit key,   768 byte blocks): 1 operation in     14063 cycles (  768 bytes)
test  6 (128 bit key,  1024 byte blocks): 1 operation in     16675 cycles ( 1024 bytes)
test  7 (128 bit key,  1536 byte blocks): 1 operation in     20741 cycles ( 1536 bytes)
test  8 (128 bit key,  2048 byte blocks): 1 operation in     23422 cycles ( 2048 bytes)
test  9 (128 bit key,  4096 byte blocks): 1 operation in     36390 cycles ( 4096 bytes)
test 10 (128 bit key,  8192 byte blocks): 1 operation in     54359 cycles ( 8192 bytes)
test 11 (128 bit key, 16384 byte blocks): 1 operation in     95723 cycles (16384 bytes)
test 12 (128 bit key, 32768 byte blocks): 1 operation in    172679 cycles (32768 bytes)
test 13 (128 bit key, 65536 byte blocks): 1 operation in    331273 cycles (65536 bytes)
ovpanait commented 4 years ago

Hi,

I know that for small payloads the transfer overhead is very large and the performance really low, but I am currently not aware of any faster way of transferring data from PS to PL than AXI DMA (while keeping a standard interface, i.e. the kernel crypto API and the Xilinx AXI DMA soft controller). Any suggestions are appreciated :)

Also, I think the overhead mostly comes from the software side (Linux kernel crypto API + AXI DMA controller driver + interrupts from PL to PS + Linux scheduling non-determinism). I used to have an HDL version clocked at 150 MHz and it showed no performance improvement over the 100 MHz design, so the bottleneck does not seem to be the HDL engine.
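For reference, every request currently pays for the standard dmaengine submission path, which looks roughly like this (a sketch assuming the usual dmaengine slave-sg flow; the helper name and callback wiring are illustrative, not the exact driver code):

```c
#include <linux/dmaengine.h>
#include <linux/scatterlist.h>

/* Per-request cost: descriptor setup, queueing, channel kick-off,
 * then a completion interrupt from PL back to PS. */
static int submit_tx(struct dma_chan *chan, struct scatterlist *sg,
		     unsigned int nents, dma_async_tx_callback done, void *arg)
{
	struct dma_async_tx_descriptor *desc;
	dma_cookie_t cookie;

	/* Build a descriptor for a memory-to-device transfer */
	desc = dmaengine_prep_slave_sg(chan, sg, nents, DMA_MEM_TO_DEV,
				       DMA_PREP_INTERRUPT | DMA_CTRL_ACK);
	if (!desc)
		return -ENOMEM;

	/* Completion is signalled by an interrupt from the PL side */
	desc->callback = done;
	desc->callback_param = arg;

	/* Queue the descriptor and start the channel */
	cookie = dmaengine_submit(desc);
	if (dma_submit_error(cookie))
		return -EIO;
	dma_async_issue_pending(chan);

	return 0;
}
```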

55-AA commented 4 years ago

Hi, I've done a lot of experiments recently and found that triggering a DMA transfer costs a large number of cycles. So I think that if the hardware could process multiple packets in one DMA transfer, efficiency should be greatly improved. For this purpose, the cmd DWORD could carry the packet length in its high 2 bytes, so the controller can derive a soft TLAST internally and then continue with the next packet (see the sketch below). On the Linux side, the kernel module would only need a queue to fit the crypto-engine framework, and the SG list can easily be extended to cover multiple packets.
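Something like this for the command word encoding (just a sketch of the idea; the field names, bit positions, and opcode width are illustrative, not the current register layout):

```c
#include <stdint.h>

/* Proposed cmd DWORD: opcode in the low 16 bits, packet length in the
 * high 2 bytes so the engine can raise a soft TLAST after each packet
 * and move straight on to the next one within the same DMA transfer. */
#define CMD_OP_MASK	0x0000FFFFu
#define CMD_LEN_SHIFT	16
#define CMD_LEN_MASK	0xFFFF0000u

static inline uint32_t cmd_pack(uint16_t op, uint16_t pkt_len)
{
	return (uint32_t)op | ((uint32_t)pkt_len << CMD_LEN_SHIFT);
}

static inline uint16_t cmd_len(uint32_t cmd)
{
	return (uint16_t)((cmd & CMD_LEN_MASK) >> CMD_LEN_SHIFT);
}
```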