inverse-inc / packetfence

PacketFence is a fully supported, trusted, Free and Open Source network access control (NAC) solution. Boasting an impressive feature set including a captive-portal for registration and remediation, centralized wired and wireless management, powerful BYOD management options, 802.1X support, layer-2 isolation of problematic devices; PacketFence can be used to effectively secure networks small to very large heterogeneous networks.
https://packetfence.org
GNU General Public License v2.0
1.27k stars 274 forks source link

Perfomance hit added by Active Directory bypass #1093

Closed foscarini closed 8 years ago

foscarini commented 8 years ago

We use packetfence as a wireless Captive Portal solution in a University (UFRGS), which has more than 10.000 active users each day, using 6 packetfence hosts.

We updated one of our gateways to version 5.4 last november, and detected a huge performance loss and decrease in the number of successful registered users, as seen in the graphs below.

screen shot 2016-01-04 at 10 35 16 pm

After some investigation, we detected that the pfdns process was using 100% CPU most of the time, and that several DNS queries were being lost, or taking several seconds to be answered. We activated the DEBUG level in the logs, and realized that the CHI L1 cache (memcache) was taking almost 40 milliseconds to deliver an answer to each query of unregistered clients.

Nov 17 ... pfdns ... key='??', cache='File:l1_cache', time='39ms': MISS (not in cache) (CHI::Driver::_log_get_result)
Nov 17 ... pfdns ... key='??', cache='File', time='0ms': MISS (not in cache) (CHI::Driver::_log_get_result)

Being an open wifi with a Captive Portal, we always have hundreds of unregistered users, and these "rogue" clients were delaying the DNS resolution of "good" clients, sometimes not redirecting them to the Captive Portal.

In version 5.4 a new feature was introduced, allowing Active Directory access from registration network, and was implemented as an allowed bypass domain saved to a CHI cache object. We patched pfdns, to perform a CHI cache get only if Active Directory access is enabled. This test solved our performance problem. The patch for version 5.4 is attached bellow.

I believe that should have a better way to discover if the Active Directory bypass is active, perhaps explicitly enabling it in the configuration menu.

foscarini commented 8 years ago
220,232c220,236
<             #Check if the answer already exist
<             my ($ttl, $rdata);
<             my $query = $CHI_CACHE->get($qname);
<             if ($query) {
<                 foreach my $rr ($query->answer) {
<                     next unless $rr->type eq "A";
<                     $rdata= $rr->address;
<                     push @ans, new Net::DNS::RR("$qname $TTL $qclass $qtype $rdata");
<                 }
<                 if(@ans) {
<                     $rcode = "NOERROR";
<                 } else {
<                     $rcode = "NXDOMAIN";
---
>             #Check if the answer already exist in cache only if Active Directory access is allowed in registration vlan (domain.conf)
>             my $size = keys %domain_dns_servers;
>             if( $size > 0) {
>                 my ($ttl, $rdata);
>                 my $query = $CHI_CACHE->get($qname);
>
>                 if ($query) {
>                     foreach my $rr ($query->answer) {
>                         next unless $rr->type eq "A";
>                         $rdata= $rr->address;
>                         push @ans, new Net::DNS::RR("$qname $TTL $qclass $qtype $rdata");
>                     }
>                     if(@ans) {
>                         $rcode = "NOERROR";
>                     } else {
>                         $rcode = "NXDOMAIN";
>                     }
233a238,239
>
>             #Provide the default DNS answer to unregistered clients
julsemaan commented 8 years ago

Can you give us some more of the log around the misses you put. The ones that are there are hitting the file cache and not memcached which is slower.

Also, please provide the version of your 5 other hosts

I will also verify with a colleague if the miss time includes the computation of the value that will be cached. In that case, that would explain why it is 40ms.

Also, PacketFence 5.6 will bring a multi process pfdns.

You can either wait and upgrade to it when it is released next week or try the patch below : https://patch-diff.githubusercontent.com/raw/inverse-inc/packetfence/pull/1047.diff

foscarini commented 8 years ago

We have 5 gateways running pf 5.0.1 on ubuntu 12.04 installed from packetfence repository. The new gateway is running pf 5.4.0 on CentOS 6, also installed from packetfence repository.

With TRACE debug we had a little bit more of information, but the TRACE level didn't log elapsed time for each function call of CHI library, so I couldn't detect where the 40ms were spent.

We are upgrading one of our other nodes to pf5.5 next week, and we will try to run without our patch and see if the problem is solved.

Nov 17 20:49:06 pfdns(10156) TRACE: Requested Domain: winphone.wns.windows.com (main::regzone)
Nov 17 20:49:06 pfdns(10156) TRACE: Memory configuration is still valid for key config::Pf in local cached_hash (pfconfig::cached::is_valid) (pfconfig::cached::is_valid)
Nov 17 20:49:06 pfdns(10156) TRACE: Memory configuration is still valid for key config::Pf in local cached_hash (pfconfig::cached::is_valid) (pfconfig::cached::is_valid)
Nov 17 20:49:06 pfdns(10156) DEBUG: cache get for namespace='pfdns', key='winphone.wns.windows.com', cache='File:l1_cache', time='40ms': MISS (not in cache) (CHI::Driver::_log_get_result)
Nov 17 20:49:06 pfdns(10156) DEBUG: cache get for namespace='pfdns', key='winphone.wns.windows.com', cache='File', time='0ms': MISS (not in cache) (CHI::Driver::_log_get_result)
Nov 17 20:49:06 pfdns(10156) TRACE: Memory configuration is still valid for key config::Pf in local cached_hash (pfconfig::cached::is_valid) (pfconfig::cached::is_valid)
Nov 17 20:49:06 pfdns(10156) TRACE: Memory configuration is still valid for key config::Pf in local cached_hash (pfconfig::cached::is_valid) (pfconfig::cached::is_valid)
Nov 17 20:49:06 pfdns(10156) TRACE: Requested Domain: skydrive.wns.windows.com (main::regzone)
Nov 17 20:49:06 pfdns(10156) TRACE: Memory configuration is still valid for key config::Pf in local cached_hash (pfconfig::cached::is_valid) (pfconfig::cached::is_valid)
Nov 17 20:49:06 pfdns(10156) TRACE: Memory configuration is still valid for key config::Pf in local cached_hash (pfconfig::cached::is_valid) (pfconfig::cached::is_valid)
Nov 17 20:49:06 pfdns(10156) DEBUG: cache get for namespace='pfdns', key='skydrive.wns.windows.com', cache='File:l1_cache', time='39ms': MISS (not in cache) (CHI::Driver::_log_get_result)
Nov 17 20:49:06 pfdns(10156) DEBUG: cache get for namespace='pfdns', key='skydrive.wns.windows.com', cache='File', time='0ms': MISS (not in cache) (CHI::Driver::_log_get_result)
Nov 17 20:49:06 pfdns(10156) TRACE: Memory configuration is still valid for key config::Pf in local cached_hash (pfconfig::cached::is_valid) (pfconfig::cached::is_valid)
Nov 17 20:49:06 pfdns(10156) TRACE: Memory configuration is still valid for key config::Pf in local cached_hash (pfconfig::cached::is_valid) (pfconfig::cached::is_valid)
Nov 17 20:49:06 pfdns(10156) TRACE: Requested Domain: amer.dell.com (main::regzone)
Nov 17 20:49:06 pfdns(10156) TRACE: Requested Domain: clients4.google.com (main::regzone)
Nov 17 20:49:06 pfdns(10156) TRACE: Memory configuration is still valid for key config::Pf in local cached_hash (pfconfig::cached::is_valid) (pfconfig::cached::is_valid)
Nov 17 20:49:06 pfdns(10156) TRACE: Memory configuration is still valid for key config::Pf in local cached_hash (pfconfig::cached::is_valid) (pfconfig::cached::is_valid)
Nov 17 20:49:06 pfdns(10156) DEBUG: cache get for namespace='pfdns', key='clients4.google.com', cache='File:l1_cache', time='40ms': MISS (not in cache) (CHI::Driver::_log_get_result)
Nov 17 20:49:06 pfdns(10156) DEBUG: cache get for namespace='pfdns', key='clients4.google.com', cache='File', time='0ms': MISS (not in cache) (CHI::Driver::_log_get_result)
Nov 17 20:49:06 pfdns(10156) TRACE: Memory configuration is still valid for key config::Pf in local cached_hash (pfconfig::cached::is_valid) (pfconfig::cached::is_valid)
Nov 17 20:49:06 pfdns(10156) TRACE: Memory configuration is still valid for key config::Pf in local cached_hash (pfconfig::cached::is_valid) (pfconfig::cached::is_valid)
Nov 17 20:49:06 pfdns(10156) TRACE: Requested Domain: cmdts.ksmobile.com (main::regzone)
Nov 17 20:49:06 pfdns(10156) TRACE: Memory configuration is still valid for key config::Pf in local cached_hash (pfconfig::cached::is_valid) (pfconfig::cached::is_valid)
Nov 17 20:49:06 pfdns(10156) TRACE: Memory configuration is still valid for key config::Pf in local cached_hash (pfconfig::cached::is_valid) (pfconfig::cached::is_valid)
Nov 17 20:49:06 pfdns(10156) DEBUG: cache get for namespace='pfdns', key='cmdts.ksmobile.com', cache='File:l1_cache', time='40ms': MISS (not in cache) (CHI::Driver::_log_get_result)
Nov 17 20:49:06 pfdns(10156) DEBUG: cache get for namespace='pfdns', key='cmdts.ksmobile.com', cache='File', time='0ms': MISS (not in cache) (CHI::Driver::_log_get_result)
Nov 17 20:49:06 pfdns(10156) TRACE: Memory configuration is still valid for key config::Pf in local cached_hash (pfconfig::cached::is_valid) (pfconfig::cached::is_valid)
Nov 17 20:49:06 pfdns(10156) TRACE: Memory configuration is still valid for key config::Pf in local cached_hash (pfconfig::cached::is_valid) (pfconfig::cached::is_valid)
Nov 17 20:49:06 pfdns(10156) TRACE: Requested Domain: www.google.com (main::regzone)
Nov 17 20:49:06 pfdns(10156) TRACE: Memory configuration is still valid for key config::Pf in local cached_hash (pfconfig::cached::is_valid) (pfconfig::cached::is_valid)
Nov 17 20:49:06 pfdns(10156) TRACE: Memory configuration is still valid for key config::Pf in local cached_hash (pfconfig::cached::is_valid) (pfconfig::cached::is_valid)
Nov 17 20:49:06 pfdns(10156) DEBUG: cache get for namespace='pfdns', key='www.google.com', cache='File:l1_cache', time='40ms': MISS (not in cache) (CHI::Driver::_log_get_result)
Nov 17 20:49:06 pfdns(10156) DEBUG: cache get for namespace='pfdns', key='www.google.com', cache='File', time='0ms': MISS (not in cache) (CHI::Driver::_log_get_result)
Nov 17 20:49:06 pfdns(10156) TRACE: Memory configuration is still valid for key config::Pf in local cached_hash (pfconfig::cached::is_valid) (pfconfig::cached::is_valid)
Nov 17 20:49:06 pfdns(10156) TRACE: Memory configuration is still valid for key config::Pf in local cached_hash (pfconfig::cached::is_valid) (pfconfig::cached::is_valid)
Nov 17 20:49:06 pfdns(10156) TRACE: Requested Domain: www.google.com (main::regzone)
Nov 17 20:49:06 pfdns(10156) TRACE: Memory configuration is still valid for key config::Pf in local cached_hash (pfconfig::cached::is_valid) (pfconfig::cached::is_valid)
Nov 17 20:49:06 pfdns(10156) TRACE: Memory configuration is still valid for key config::Pf in local cached_hash (pfconfig::cached::is_valid) (pfconfig::cached::is_valid)
Nov 17 20:49:06 pfdns(10156) DEBUG: cache get for namespace='pfdns', key='www.google.com', cache='File:l1_cache', time='40ms': MISS (not in cache) (CHI::Driver::_log_get_result)
Nov 17 20:49:06 pfdns(10156) DEBUG: cache get for namespace='pfdns', key='www.google.com', cache='File', time='0ms': MISS (not in cache) (CHI::Driver::_log_get_result)
julsemaan commented 8 years ago

Can you send us the content of /usr/local/pf/conf/chi.conf

It just seems like you are always missing the cache and that means pfdns needs to resolve the name its given everytime (and if its a passthrough then it is expensive)

5.6 will bring the multi-process pfdns This, along with a proper caching configuration should fix your issue since from the logs I saw, the AD code doesn't seem to be causing your issue.

foscarini commented 8 years ago

It misses every time because these domains are not cacheable... I tested the memcached, and it's response time for each query was (and is) near 2ms.

In my understanding, the AD code added this cache get for each DNS type A query...

[foscarini@semfio02 ~]$ ls -alh /usr/local/pf/conf/chi.conf
-rw-r--r--. 1 pf pf 545 Oct  1 16:19 /usr/local/pf/conf/chi.conf

[foscarini@semfio02 ~]$ cat /usr/local/pf/conf/chi.conf
[storage DEFAULT]
storage=file

[storage DEFAULT l1_cache]
storage=memcached

[storage ldap_auth]
expires_in=10m

[storage ldap_auth l1_cache]
expires_on_backend=1

[storage httpd.admin]
expires_in=1d

[storage httpd.admin l1_cache]
expires_on_backend=1

[storage httpd.portal]
expires_in=10m

[storage httpd.portal l1_cache]
expires_on_backend=1

[storage memcached]
driver = Memcached::libmemcached
servers = 127.0.0.1:11211
compress_threshold = 10000
behavior_binary_protocol = 1

[storage file]
driver=File
root_dir=/usr/local/pf/var/cache
foscarini commented 8 years ago

I've tested a fresh pf-5.5.2 installation in CentOS 6 with 10 fake clients, at a rate of 300 DNS queries per second, and the latency is 0ms.

pf-5.5 uses only redis as cache storage, so I repeated the test using the same configuration from 5.4, (including the memcache driver), and in both cases I couldn't reproduce the performance problem...

Possibly the performance problem can be some misconfiguration in our installation. We will upgrade the other wifi gateways in the following weeks, and if I find out any new information, I will reopen this issue.

Thanks for your attention @julsemaan !

Jan 07 00:58:49 pfdns(5529) DEBUG: cache get for namespace='pfdns', key='inbox.google.com', cache='File:l1_cache', time='0ms': MISS (not in cache) (CHI::Driver::_log_get_result)
Jan 07 00:58:49 pfdns(5529) DEBUG: cache get for namespace='pfdns', key='inbox.google.com', cache='File', time='0ms': MISS (not in cache) (CHI::Driver::_log_get_result)
Jan 07 00:58:49 pfdns(5529) DEBUG: cache get for namespace='pfdns', key='ssp.lkqd.net', cache='File:l1_cache', time='0ms': MISS (not in cache) (CHI::Driver::_log_get_result)
Jan 07 00:58:49 pfdns(5529) DEBUG: cache get for namespace='pfdns', key='ssp.lkqd.net', cache='File', time='0ms': MISS (not in cache) (CHI::Driver::_log_get_result)
Jan 07 00:58:49 pfdns(5529) DEBUG: cache get for namespace='pfdns', key='mconflive2.ufpe.br.ufrgs.br', cache='File:l1_cache', time='0ms': MISS (not in cache) (CHI::Driver::_log_get_result)
Jan 07 00:58:49 pfdns(5529) DEBUG: cache get for namespace='pfdns', key='mconflive2.ufpe.br.ufrgs.br', cache='File', time='0ms': MISS (not in cache) (CHI::Driver::_log_get_result)
Jan 07 00:58:49 pfdns(5529) DEBUG: cache get for namespace='pfdns', key='e15.whatsapp.net', cache='File:l1_cache', time='0ms': MISS (not in cache) (CHI::Driver::_log_get_result)
Jan 07 00:58:49 pfdns(5529) DEBUG: cache get for namespace='pfdns', key='e15.whatsapp.net', cache='File', time='0ms': MISS (not in cache) (CHI::Driver::_log_get_result)
Jan 07 00:58:49 pfdns(5529) DEBUG: cache get for namespace='pfdns', key='notify3.dropbox.com', cache='File:l1_cache', time='0ms': MISS (not in cache) (CHI::Driver::_log_get_result)
Jan 07 00:58:49 pfdns(5529) DEBUG: cache get for namespace='pfdns', key='notify3.dropbox.com', cache='File', time='0ms': MISS (not in cache) (CHI::Driver::_log_get_result)
Jan 07 00:58:49 pfdns(5529) DEBUG: cache get for namespace='pfdns', key='code.jquery.netdna-cdn.com', cache='File:l1_cache', time='0ms': MISS (not in cache) (CHI::Driver::_log_get_result)
Jan 07 01:00:50 pfdns(7008) DEBUG: cache get for namespace='pfdns', key='dr.skype.net', cache='Redis', time='0ms': MISS (not in cache) (CHI::Driver::_log_get_result)
Jan 07 01:00:50 pfdns(7008) DEBUG: cache get for namespace='pfdns', key='sued5.deussemprefiel.com.br', cache='Redis', time='0ms': MISS (not in cache) (CHI::Driver::_log_get_result)
Jan 07 01:00:50 pfdns(7008) DEBUG: cache get for namespace='pfdns', key='c-0001.c-msedge.net', cache='Redis', time='0ms': MISS (not in cache) (CHI::Driver::_log_get_result)
Jan 07 01:00:50 pfdns(7008) DEBUG: cache get for namespace='pfdns', key='d.pzkysq.pink', cache='Redis', time='0ms': MISS (not in cache) (CHI::Driver::_log_get_result)
Jan 07 01:00:50 pfdns(7008) DEBUG: cache get for namespace='pfdns', key='drive.google.com', cache='Redis', time='0ms': MISS (not in cache) (CHI::Driver::_log_get_result)
Jan 07 01:00:50 pfdns(7008) DEBUG: cache get for namespace='pfdns', key='fbcdn-photos-e-a.akamaihd.net', cache='Redis', time='0ms': MISS (not in cache) (CHI::Driver::_log_get_result)
julsemaan commented 8 years ago

I would agree with you that this seems related to a misconfiguration of the cache

Will close this, feel free to reopen if you end up being right for the AD latency :)