intel / QAT_Engine

Intel QuickAssist Technology (QAT) OpenSSL Engine (an OpenSSL Plug-In Engine) which provides cryptographic acceleration for both hardware and optimized software using Intel QuickAssist Technology enabled Intel platforms. https://developer.intel.com/quickassist
BSD 3-Clause "New" or "Revised" License

questions about QAT instances and nginx performance #34

Closed minyili-fortinet closed 6 years ago

minyili-fortinet commented 7 years ago

Hi, I've installed QAT and QAT_Engine to test the performance of the new Lewisburg SSL accelerator card. We first tested it with the patched nginx-1.10.3 + openssl-1.1.0f + qat_engine in async mode, and later we will try to port it to a multi-threaded platform. Below is the testing server's info:

CPU: E5-2650 v3 @ 2.30GHz * 2 with hyperthreading
SSL card: c6xx (PCIe card)
System: Linux ubuntu 3.19.0-15-generic
QAT driver version: 1.7 (L.1.0.3-42)

The nginx test was run with 32 worker processes to match the qat configuration, and we got 41000+ CPS with the cipher suite RSA2k + AES256-GCM-SHA384. Overall CPU usage is around 60% (the cores bound to the worker processes are about 25% idle), but the CPS couldn't go any higher. We also tried changing the qat config file to turn on LimitDevAccess.

[SHIM]
NumberCyInstances = 1
NumberDcInstances = 0
# we will set up 40 nginx worker processes, so set this to 14 (14 per endpoint x 3 endpoints = 42, enough to cover 40 workers)
NumProcesses = 14
LimitDevAccess = 1

Cy0Name = "UserCY0"
Cy0IsPolled = 1
Cy0CoreAffinity = 1

With the config file above we tested nginx with 40 worker processes and got a 39000+ CPS result in the same testing environment with around 60% total CPU usage, which was lower than the 32-worker-process case. It seems the performance gets worse with more worker processes. Here are some questions: First of all, is this performance result normal? The bottleneck does not seem to be the CPU, and I got a 110k RSA2k-sign ops result in the openssl speed test, so maybe the SSL card still has headroom for the cipher jobs. Is there any way to check whether the SSL card is running at full load?

I wonder if the LimitDevAccess setting hurts performance, but I have to enable it to increase NumProcesses. Is there another way to increase NumProcesses to fit nginx's 40 worker processes? I've checked the qat documentation and found that the qat programmer's guide describes an approach to increase the maximum number of processes in section 4.3.2.2. I tried setting "ServicesEnabled" to "asym;sym", but an error occurred when starting the qat_service.

Thanks

stevelinsell commented 7 years ago

Hi @minyili-fortinet,

There are a few things to check before we delve too deeply into this:

Firstly, make sure nginx is shut down and cycle the qat service by doing:

/etc/init.d/qat_service restart

Then bring nginx back up, run your CPS test, and follow it with this command:

cat /sys/kernel/debug/qat_c6xx_*:00.0/fw_counters

This will dump the counters showing the number of offload requests to each of the 3 endpoints. If everything is working well you should be getting a fairly even spread of requests across all 3 endpoints. If you can paste that output here it would be great; if not, a description of what you see will suffice. Based on the config file you have posted I would expect the traffic to be spread reasonably evenly (probably a tiny bit lower on the third endpoint, but that would be expected).

Secondly, can you check the "ServicesEnabled" setting you mention in each of the 3 driver config files? In each file it should be set as follows to enable just the crypto service:

[GENERAL]
ServicesEnabled = cy

Also ensure each of the 3 driver config files is set up identically in terms of the [SHIM] section. As per your comment, 'NumProcesses = 14' is a good setting for 40 nginx workers and should give a good split of traffic across the 3 endpoints.

Thirdly, can you check your nginx conf file and see if you are configuring the listen directive with the 'reuseport' option? If you do not do this you can run out of ports, which limits the number of connections per second you can make. In the past this has caused me to max out at 40-42K CPS, so it is possible this is your issue.

If all of the above is ok then we can look further at the tuning of other things like how many file descriptors are available on your system etc.

Kind Regards,

Steve.

minyili-fortinet commented 7 years ago

Hi @stevelinsell, thanks for your advice. It turns out our testing environment was limiting the CPS; sorry for the mistaken results I gave earlier. After fixing the Avalanche tester, we got better performance: nginx can reach 60k CPS with 40 worker processes and 51k CPS with 32 worker processes. I deleted "dc" from ServicesEnabled, and below is the output of fw_counters on the 3 endpoints when running nginx with 40 worker processes. I've shortened the output to one engine per endpoint because all ten engines on the same endpoint do almost the same amount of work.

/sys/kernel/debug/qat_c6xx_04\:00.0/fw_counters
+------------------------------------------------+
| FW Statistics for Qat Device |
+------------------------------------------------+
| Firmware Requests [AE 0]: 7642988 |
| Firmware Responses[AE 0]: 7642988 |
+------------------------------------------------+
...

/sys/kernel/debug/qat_c6xx_05\:00.0/fw_counters
+------------------------------------------------+
| FW Statistics for Qat Device |
+------------------------------------------------+
| Firmware Requests [AE 0]: 7643673 |
| Firmware Responses[AE 0]: 7643673 |
+------------------------------------------------+
...

/sys/kernel/debug/qat_c6xx_06\:00.0/fw_counters
+------------------------------------------------+
| FW Statistics for Qat Device |
+------------------------------------------------+
| Firmware Requests [AE 0]: 6578674 |
| Firmware Responses[AE 0]: 6578674 |
+------------------------------------------------+
...

I think the result corresponds to my qat configuration. My nginx is configured as a reverse proxy; I'll attach the nginx config file for you: nginx.txt

Nginx is not our final product, but we will treat it as a performance baseline. Should I take 60k CPS as our platform's target value, or can I expect more?

I'm integrating the qat engine into our multi-threaded proxy, and multi-threading will lower performance due to lock contention. I wonder whether the qat engine or qat itself has lock contention problems in multi-threaded applications.

stevelinsell commented 7 years ago

Hi @minyili-fortinet,

Great that your performance has improved. Your split of traffic is as expected: due to the way endpoints are assigned, the 3rd endpoint will have slightly fewer nginx processes using it, which is why it has fewer requests. I would expect you to be able to achieve higher CPS when running that many cores; 60K CPS is still pretty low. How is your CPU utilization, are you hitting 90%+? The testing I have done has been with nginx as a straight webserver and I have easily been able to hit 100K CPS with that many cores. Obviously a reverse proxy configuration has the additional CPU overhead of dealing with the backend http servers, but I would be surprised if you could not get close to 100K CPS. I will check later today with one of my colleagues who may have run some testing with a reverse proxy setup.

I've had a look at your nginx configuration file. It looks good to me; the only comment I would have is that worker_connections is possibly a little low at 10240 when you are pushing for a lot of handshake performance. I would suggest increasing it to around 64K and seeing if that improves your CPS. You could then tune it to an acceptable trade-off between how much memory you want nginx to use allocating connections up front and the performance you want to achieve.

In terms of multi-threading, it almost certainly has an impact on the performance you'll be able to achieve. We don't run a lot of multi-threaded applications, but it appears most of the lock contention comes from locking in the OpenSSL error stack and from locking around the way OpenSSL checks for engines when dealing with crypto operations, both of which are out of our control. There is a conversation on the topic here: #32, although that is based on OpenSSL 1.0.1 and the QAT Engine for OpenSSL 1.0.1 rather than the 1.1.0 version that is open-sourced here. I think a little more understanding is needed around lock contention when running multi-threaded; for instance, Apache runs multi-threaded with OpenSSL, so do they not see similar lock contention issues? Also note that to run multi-threaded you'll need to reconfigure your QAT driver config files; there are examples in the project here: https://github.com/01org/QAT_Engine/tree/master/qat/config/c6xx/multi_thread_optimized

Kind Regards,

Steve.

minyili-fortinet commented 7 years ago

Hi @stevelinsell, thanks for your reply. Great news to hear the CPS should be 100k on our platform. However, we now see nearly 95% CPU usage when nginx is running at 60k CPS with 40 worker processes. I've tried changing worker_connections to 64k but unfortunately the result is the same. I've gathered the top and perf results and attached them for you: intel-nginx-performance-data.txt

On the multi-threading side: I wrote a Cavium SSL hardware engine after openssl-1.1.0 was released. That engine also uses the async mechanism, and I managed to resolve the lock contention in OpenSSL by patching the OpenSSL engine module. The OpenSSL error stack in 1.1.0 has already been optimized using thread-local storage. The Cavium async engine and my patched OpenSSL work well, and locking consumes less than 2% of the CPU when running a CPS test at around 42k. I expected the Intel engine to behave the same, but it turned out _raw_spin_lock took around 17% of the CPU when CPS is also around 42k, so I think there is still room for improvement. As far as I know, the engine sends jobs to the hardware through specific rings. If I bind each thread to one ring and enable LimitDevAccess in the qat configuration, each ring would have a single producer and a single consumer, which would make it lockless.

stevelinsell commented 7 years ago

Hi @minyili-fortinet,

I'll concentrate on the performance side to start with. I'm trying to understand what is different about your setup and what we are running here.

1) Firstly, would it be possible for you to configure your nginx to be a straight webserver and see if you are still limited to 60K? This would help us understand whether it is the reverse proxy configuration that is contributing to the 60K limit, or whether something in the OS/network stack/hardware configuration is proving to be the bottleneck.

2) You are using 2 x E5-2650's which provide 20 hyperthreads on each processor. Am I correct in assuming that you are running 1 nginx process on each hyperthread so that you are using both processors fully? From the stats you provided, the c6xx device is attached to the PCIe bus of the first processor. This means any offload requests coming from hyperthreads on the second processor will need to traverse the QuickPath (QPI) link to reach the acceleration card. For asymmetric crypto offload (RSA 2K in your case) this shouldn't be excessive, so I doubt it is the issue, but in our benchmarking we tend to use processors with more hyperthreads each, so we tend to hit 100K using only the first processor.

3) How much memory are you using and how are your DIMM slots populated? If all the slots are not populated then that may reduce the available memory bandwidth, although again I would expect that to have more impact on symmetric crypto offload than on asymmetric offload.

4) How much network bandwidth do you have? Which PCIe bus(es) are the network adapters connected to? How are your network adapters configured in terms of queue depths etc.? How do you spread the interrupts from the network adapters?

5) What are your OS settings for max number of file descriptors per process (ulimit -n)? This should be much greater than 1024, say 64K for instance.

Let me know your answers and if any of the suggestions are beneficial.

Steve.

minyili-fortinet commented 7 years ago

Hi @stevelinsell , Thanks for your reply. Let's focus on the nginx performance, I'll try to fix the lock contention problems later.

Firstly would it be possible for you to configure your nginx to be a straight webserver and see if you are still limited to 60K.

I've tried configuring nginx as a straight webserver and got 73k CPS with CPU usage at 95%. That's better than the reverse proxy, but still not hitting the c6xx hardware bottleneck. I assume you are using an E5-2699 v3 to hit the 100k CPS. What frequency was the CPU running at during the CPS test? I disabled the turbo-boost feature when doing the test.

Am I correct in assuming that you are running 1 nginx process on each hyperthread so that you are using both processors fully?

Yes, I'm running 40 nginx processes and I can see each process is scheduled onto one hyperthread. I installed the latest ixgbe module and spread the IRQs evenly across 40 CPUs, and both processors are running fully after the adjustment.

How much memory are you using and how are your DIMM slots populated?

I'm using 8 x 8GB memory, populating half of the DIMM slots. All memory is installed in the same-colored slots.

How much network bandwidth do you have? Which PCIe bus(es) are the network adapters connected to? How are your network adapters configured in terms of queue depths etc? How do you spread the interrupts from the network adapters?

The CPS test network traffic is around 1.6Gbps. There are two Intel 10G controllers, each connected to one CPU. Each controller provides two 10G SFP+ ports and I used all 4 SFP+ ports in my test. I set RSS to 10,10,10,10 and spread the interrupts evenly across the 40 hyperthreads.

What are your OS settings for max number of file descriptors per process (ulimit -n)

ulimit -n is set to 200000.

stevelinsell commented 7 years ago

Hi @minyili-fortinet,

It looks to me like you're doing a great job of tuning your system; I can't see any issues there. My personal experience of running 100K CPS has been using a 2x E5-2699 v4 system running at 2.2GHz: https://ark.intel.com/products/91317/Intel-Xeon-Processor-E5-2699-v4-55M-Cache-2_20-GHz. That's a 14nm Broadwell-based system with a total of 88 available hyperthreads, so it's hard to compare with your setup.

One thing I forgot to ask you about is the requests you are submitting to nginx. Are you measuring pure handshakes, i.e. you handshake and then disconnect before requesting a file, or are you requesting either a 0-byte file or a very small file and then disconnecting? This has an impact on CPU usage. Based on the details so far it looks increasingly like you are genuinely CPU bound, so if you are not doing a pure handshake you may be able to move to that to give more cycles to the handshake and push up the measured performance.

Steve.

minyili-fortinet commented 7 years ago

Hi @stevelinsell ,

Thanks for your reply.

I measured CPS with an SSL handshake, requesting a small page and disconnecting without a close notify to simulate real connections. Unfortunately, I can't set up a handshake-only test on our Avalanche tester.

I think the processing of SSL connections is highly CPU bound. Even though we offload almost all of the ciphers to the accelerator, the processing still consumes most of the CPU cycles, partly because of the extensive memory operations in the OpenSSL implementation and protocol security concerns. Should I take 73k CPS for the nginx webserver and 60k CPS for the nginx reverse proxy as the baseline performance on our platform? By the way, can I ask what the CPU utilization was on your platform when the CPS test hit 100k? That data will help us calculate the CPU cycles per connection.

If our CPS data is acceptable compared with yours on your platform, I'll turn my attention to fixing the lock contention in qat.

stevelinsell commented 7 years ago

Hi @minyili-fortinet,

Sorry for the slow response! Unfortunately I cannot post my performance/CPU utilization numbers here; all that kind of data needs to be approved. I am still enquiring whether this information has been published yet for the c6xx product, and I'll let you know as soon as I get an answer. In the meantime, as your platform seems well configured, I would suggest you take the numbers you have measured as a baseline for now and move on to the lock contention issues.

I was thinking about your suggestion about trying to make each set of rings associated with a single producer and consumer. Here is my take on it, which may or may not be helpful:

In order to do what you wanted you would need one crypto instance for each thread. Assuming you are only running one process, you'd want to set each of your QAT driver config files to declare a certain number of crypto instances (say 12, for example). Then you could set LimitDevAccess = 0 in all the config files to ensure you can access instances from all endpoints within your process (you'd want to set NumProcesses = 1 as well). This would result in your application getting 36 crypto instances (sets of rings). Then in each thread you would need to use the SET_INSTANCE_FOR_THREAD engine-specific control message to associate the thread with a specific instance.

Your biggest problem is then polling for responses on the rings. If you leave the default polling method, you'll have 1 polling thread which, every time around the polling loop, polls all 36 sets of rings. This is probably going to hurt your performance and may have locking implications. If you can (i.e. your threads are not blocking), then I would suggest you move to external polling for each thread by using the engine-specific control message ENABLE_EXTERNAL_POLLING. You then use the engine-specific control message POLL to poll from each thread. Doing that effectively moves control of the polling to the application. This has the advantage that the POLL code in the QAT Engine already looks only at the instance that was set with SET_INSTANCE_FOR_THREAD (which is stored thread-local) and will only poll that instance (set of rings). You then have complete control of producing and consuming on a single set of rings from each thread.
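
To make that flow concrete, here is a minimal C sketch of the recipe described above. It is not taken from this thread: it assumes the engine id is "qat", that the control messages behave as described in the QAT_Engine documentation, and that struct worker_args and do_crypto_work() are hypothetical application-side names.

#include <openssl/engine.h>

struct worker_args {
    ENGINE *engine;
    long instance;   /* 0..35 in the 3 x 12 instance example above */
};

void do_crypto_work(ENGINE *e);  /* hypothetical: the application's async SSL/crypto submission */

static ENGINE *qat_engine_load(void)
{
    ENGINE *e = ENGINE_by_id("qat");   /* engine id assumed; check your build/config */
    if (e == NULL)
        return NULL;

    /* Must be sent before ENGINE_init(): the application takes over
       responsibility for polling the accelerator for responses. */
    ENGINE_ctrl_cmd(e, "ENABLE_EXTERNAL_POLLING", 0, NULL, NULL, 0);

    if (!ENGINE_init(e)) {
        ENGINE_free(e);
        return NULL;
    }
    ENGINE_set_default(e, ENGINE_METHOD_ALL);
    return e;
}

static void *worker_thread(void *arg)
{
    struct worker_args *wa = arg;
    int poll_status = 0;

    /* Bind this thread to its own crypto instance (set of rings). */
    ENGINE_ctrl_cmd(wa->engine, "SET_INSTANCE_FOR_THREAD", wa->instance,
                    NULL, NULL, 0);

    for (;;) {
        do_crypto_work(wa->engine);   /* submit async jobs from this thread */

        /* Reap responses: with SET_INSTANCE_FOR_THREAD in effect, only this
           thread's instance is polled, so producer and consumer stay on the
           same set of rings. */
        ENGINE_ctrl_cmd(wa->engine, "POLL", 0, &poll_status, NULL, 0);
    }
    return NULL;
}

Because each thread only ever submits to and polls its own instance, each set of rings ends up with a single producer and a single consumer, which is the lockless arrangement discussed above.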

Another alternative way of polling is to use the event-driven polling feature. This involves setting Cy{X}IsPolled = 2 for each instance within the driver config files. You then enable event-driven polling in the QAT Engine using the engine-specific control message ENABLE_EVENT_DRIVEN_POLLING_MODE. Once enabled, when the engine gets initialized a single thread will be created containing an epoll loop that monitors a file descriptor associated with each crypto instance. When an event occurs on a file descriptor, the corresponding crypto instance will be polled for a response. Although this still uses a single thread for polling like the default polling, it is far more efficient as it only polls when there is a response waiting. Strangely, we haven't seen good results with this method when using only a few instances, but it might be ideally suited to a multi-threaded scenario where there are a lot of instances. Obviously in that case you'll have a different thread consuming from the one that is producing, which may have more locking implications and may not be what you're after.
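
For comparison, a similarly hedged sketch of selecting event-driven polling from the application (again assuming the "qat" engine id and the control-message behaviour described above; Cy{X}IsPolled = 2 must also be set in the driver config files):

#include <openssl/engine.h>

/* Sketch: select event-driven polling instead of external polling.
   Must be done before any crypto work is offloaded to the engine. */
static ENGINE *qat_engine_load_event_driven(void)
{
    ENGINE *e = ENGINE_by_id("qat");   /* engine id assumed */
    if (e == NULL)
        return NULL;

    /* Must be sent before ENGINE_init(): once initialized, the engine
       creates its own epoll-based polling thread, so the application
       never sends POLL itself. */
    ENGINE_ctrl_cmd(e, "ENABLE_EVENT_DRIVEN_POLLING_MODE", 0, NULL, NULL, 0);

    if (!ENGINE_init(e)) {
        ENGINE_free(e);
        return NULL;
    }
    ENGINE_set_default(e, ENGINE_METHOD_ALL);
    return e;
}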

Hope some of that is useful,

Steve.

minyili-fortinet commented 7 years ago

Hi @stevelinsell ,

Thanks for your advice. So 60k will be our goal on our platform. I've configured the engine for external polling and bound each thread to one set of rings. With some patches to openssl, I successfully raised the CPS to 55k+ on our multi-threaded platform, which I think is an acceptable number.

archerbroler commented 6 years ago

Hi @stevelinsell, could you give me sample code for using ENABLE_EXTERNAL_POLLING in nginx? I am using this solution in an nginx environment, but I found I cannot enable ENABLE_EXTERNAL_POLLING without modifying the code.

kaleb-himes commented 6 years ago

Informational:

wolfSSL has an open-source implementation of this.

We are going to be adding additional async support for offloading to other threads, i.e. a software-only async solution that does just the crypto work.

In addition to the software only solution we will be leveraging QAT to take FULL advantage of asynchronous capabilities.

We expect to see far greater numbers from leveraging both software and QAT async simultaneously.

If you're interested in seeing the results, shoot us a note at info@wolfssl.com or keep an eye on our blog for updates: wolfSSL Blog

Once we finish the software abstraction work we'll also be posting updated numbers for Intel QuickAssist along with our current numbers here: wolfSSL Async with QAT Benchmark numbers

To your note @minyili-fortinet

Overall CPU usage is around 60% (the cores bound to the worker processes are about 25% idle), but the CPS couldn't go any higher.

We expect to be able to take FULL advantage of the CPUs and QAT, with minimal idle time, using this solution.