inverse-inc / packetfence

PacketFence is a fully supported, trusted, Free and Open Source network access control (NAC) solution. Boasting an impressive feature set including a captive portal for registration and remediation, centralized wired and wireless management, powerful BYOD management options, 802.1X support, and layer-2 isolation of problematic devices, PacketFence can be used to effectively secure networks, from small to very large heterogeneous ones.
https://packetfence.org
GNU General Public License v2.0

Error 3221225762 in AD authentication #8345

Closed MarioSpenc closed 2 weeks ago

MarioSpenc commented 1 month ago

We have an ISO-based Debian 12 PF V14.0 installation with data imported from a 13.1 installation that was running fine.

It works about 50% of the time; in between, we get the following rejects in the RADIUS audit log:

Called-Station-Id = "80:e8:6f:9c:18:87",
Calling-Station-Id = "2c:ea:7f:0a:4e:e2",
EAP-Message = "0x0225004a1a022500453148f97cfe5c4a9343727243c911c0e7530000000000000000e0e7bb18a4b6a3a7b773bb861700bdfae85e72f9074a838f004555524f50455c616c746868656765",
EAP-Type = "MSCHAPv2",
Event-Timestamp = "Oct 10 2024 13:07:26 CEST",
Framed-MTU = "1500",
FreeRADIUS-Proxied-To = "127.0.0.1",
MS-CHAP-Challenge = "0x778fda51024c231f00c350884d331b62",
MS-CHAP-User-Name = "xxx",
MS-CHAP2-Response = "0x255548f97cfe5c4a9343727243c911c0e7530000000000000000e0e7bb18a4b6a3a7b773bb861700bdfae85e72f9074a838f",
Module-Failure-Message = "chrooted_mschap: Program returned code (1) and output 'NT Error: code: 3221225762
message: (3221225762
'Indicates a name that was specified as a remote computer name is syntactically invalid.')'",
Module-Failure-Message = "chrooted_mschap: External script says: NT Error: code: 3221225762
message: (3221225762
'Indicates a name that was specified as a remote computer name is syntactically invalid.')",
Module-Failure-Message = "chrooted_mschap: MS-CHAP2-Response is incorrect",
NAS-IP-Address = "10.xxx",
NAS-Identifier = "xxx",
NAS-Port = "50107",
NAS-Port-Id = "xxx",
NAS-Port-Type = "Ethernet",
PacketFence-Domain = "xxx",
PacketFence-KeyBalanced = "xx",
PacketFence-NTLM-Auth-Host = "100.64.0.1",
PacketFence-NTLM-Auth-Port = "5000",
PacketFence-Outer-User = "xx\x",
PacketFence-Radius-Ip = "xxx",
Realm = "default",
Service-Type = "Framed-User",
State = "0xbaa4429eba815889078d36d6c0de94e9",
Stripped-User-Name = "xxx",
User-Name = "EURxxxOPE\xxx",
User-Password = "******"

RADIUS Reply
EAP-Message = "0x04250004",
MS-CHAP-Error = "%!E(MISSING)=691 R=0 C=20dd2031319e98badf2b398597145013 V=3 M=Authentication rejected",
Message-Authenticator = "0x00000000000000000000000000000000"

MarioSpenc commented 1 month ago

As we ran into too many problems with AD integration in V14, we had to go back to V13.1. I think we will wait for a more stable version ... ;-)

stgmsa commented 1 month ago

Do you have a PacketFence cluster? Or is your AD running in HA mode with some of the nodes not working?

MarioSpenc commented 1 month ago

No cluster; the AD nodes are all working, and I also tried different ones ...

Note: PF V13.1 works like a charm with exactly the same configuration!

stgmsa commented 1 month ago

In 14.0, the config structure for domains changed. How did you migrate the old settings to 14.0?

MarioSpenc commented 1 month ago

SQL-based export/import (not mariadb-backup!)

E-ThanG commented 1 month ago

Same issue here. Brand new ISO 14.0 install with nothing imported. I'm building from scratch on this VM just to make sure there isn't weirdness from importing or earlier troubleshooting of unrelated issues.

It's this function call in the ntlm-auth-api rpc.py module that is crashing. The "remote computer name" that it is referring to appears to be the "server_name" variable:


ntlm-auth-api-domain[65796]: [2024-10-15 14:44:05,300] ERROR in app: Exception on /ntlm/auth [POST]
ntlm-auth-api-domain[65796]: Traceback (most recent call last):
ntlm-auth-api-domain[65796]:  File "/usr/local/pf/bin/pyntlm_auth/rpc.py", line 140, in transitive_login
ntlm-auth-api-domain[65796]:    result = global_vars.s_secure_channel_connection.netr_LogonSamLogonWithFlags(server_name, workstation,
ntlm-auth-api-domain[65796]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ntlm-auth-api-domain[65796]: samba.NTSTATUSError: (3221225762, 'Indicates a name that was specified as a remote computer name is syntactically invalid.')
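
For reference, the decimal code in the traceback can be converted to the hexadecimal form in which NTSTATUS values are usually listed. The sketch below splits the 32-bit value into its severity, facility, and code fields; the helper name `decode_ntstatus` is made up for illustration. The result, 0xC0000122, should correspond to STATUS_INVALID_COMPUTER_NAME in Microsoft's NTSTATUS list, which matches the message text above.

```python
def decode_ntstatus(code: int) -> dict:
    """Split a 32-bit NTSTATUS value into its bit fields (hypothetical helper)."""
    code &= 0xFFFFFFFF
    return {
        "hex": f"0x{code:08X}",
        "severity": code >> 30,            # 0=success, 1=informational, 2=warning, 3=error
        "customer": (code >> 29) & 0x1,    # 1 for customer-defined codes
        "facility": (code >> 16) & 0xFFF,  # 0 is the default facility
        "code": code & 0xFFFF,             # status code within the facility
    }


fields = decode_ntstatus(3221225762)
print(fields["hex"], fields["severity"], fields["facility"], hex(fields["code"]))
```

Searching for the hexadecimal form tends to give more useful results than the decimal number reported by the logs.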

I noticed that in rpc.py the function init_secure_connection() uses a global to initially populate the server_name. Using that server it finds a random DC to start the secure connection with. That random selection isn't saved back to the global though.

def init_secure_connection():

    # <SNIP>

    server_name = global_vars.c_server_name  # FQDN of Domain Controller  <-- global

    domain_controller_records = utils.find_ldap_servers(global_vars.c_realm, global_vars.c_dns_servers)
    if len(domain_controller_records) > 0:
        idx = random.randint(0, len(domain_controller_records) -1)
        record = domain_controller_records[idx]
        server_name = record.get('target')  # <-- local only, not saved back to the global

The transitive_login() function populates its own server_name from the global and then uses the secure connection that was initialized with the randomly selected server_name. That might be a problem: it looks like it initiates a connection to server A but then tries to talk to server B. Because the selection is random, sometimes the secure connection goes to the same server that is configured in the global, and other times it goes to some other server. If you only have one DC, you wouldn't see an issue.

def transitive_login(account_username, challenge, nt_response):
    server_name = global_vars.c_server_name  # <-- global
    domain = global_vars.c_domain
    workstation = global_vars.c_workstation
    global_vars.s_secure_channel_connection, global_vars.s_machine_cred, global_vars.s_connection_id, error_code, error_message = get_secure_channel_connection()

    # <SNIP>

        try:
            result = global_vars.s_secure_channel_connection.netr_LogonSamLogonWithFlags(server_name, workstation,  # <-- still using the global value
                                                                                         current, subsequent,
                                                                                         logon_level, logon,
                                                                                         validation_level,
                                                                                         netr_flags)
            (return_auth, info, foo, bar) = result
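
The suspected mismatch can be illustrated with a small stand-alone sketch. Nothing here is PacketFence or Samba code: `FakeSecureChannel`, `State`, and the two login helpers are hypothetical stand-ins, and the fake channel simply rejects any call that names a server other than the one it was opened to. The point is only that remembering the randomly selected DC, and passing that same name to the logon call, keeps the two in sync.

```python
import random


class FakeSecureChannel:
    """Hypothetical stand-in for the secure channel (not the Samba API)."""

    def __init__(self, connected_to: str):
        self.connected_to = connected_to

    def logon(self, server_name: str) -> str:
        # Assumed behavior: reject calls that name a different server than
        # the one this channel was opened to.
        if server_name != self.connected_to:
            raise ValueError(
                f"channel is open to {self.connected_to}, not {server_name}"
            )
        return "ok"


class State:
    """Hypothetical replacement for the global_vars module."""

    def __init__(self, configured_server: str):
        self.configured_server = configured_server  # value from the config
        self.selected_server = configured_server    # DC actually in use
        self.channel = None


def init_secure_connection(state, candidates, rng=random):
    server_name = state.configured_server
    if candidates:
        server_name = rng.choice(candidates)
    state.selected_server = server_name  # write the choice back
    state.channel = FakeSecureChannel(server_name)


def login_with_configured_name(state):
    # Mirrors the reported pattern: always pass the configured name.
    return state.channel.logon(state.configured_server)


def login_with_selected_name(state):
    # Passes the name the channel was actually opened to.
    return state.channel.logon(state.selected_server)


state = State("dc1.example.com")
init_secure_connection(state, ["dc2.example.com"])
print(login_with_selected_name(state))
```

With a single candidate that differs from the configured server, `login_with_configured_name(state)` raises, while `login_with_selected_name(state)` succeeds.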

13.2 doesn't have the random selection process, that's the only change I see in rpc.py. In 13.2 the global value is the only server_name that is ever used. Also, doesn't this break the concept of a sticky DC as well? I may be confused on the difference between an AD authentication source and the domain configuration, it seems like 2 sides of the same coin.

Lastly, it is very hard to get information out of this Docker container. I added a bunch of print statements, but I only see them printing ~10% of the time, even when the authentication is successful. My best guess is that there is a race between the 2 threads started in app.py. I always see config_load() printing at startup, but later, nothing. I was trying to troubleshoot that aspect, but I've broken everything now and I can't get ntlm-auth-api to start at all. Never fear, I made a VM snapshot before tinkering.

salamander555 commented 1 month ago

We are also experiencing an issue with Active Directory authentication in the new PacketFence 14.0. It occurs regardless of whether it is a fresh installation (ISO or appliance) or an imported configuration from our existing version 13.2.

The main issue we identified is that it randomly selects a Domain Controller in the AD. Since the required ports are blocked towards many of our branch offices, this leads to failures. If the correct Domain Controller is chosen by chance, there is no error. The configuration is correct and Sticky DC is set (but appears to be ignored).

E-ThanG describes a similar behavior in the previous post.

stgmsa commented 1 month ago

Thanks @E-ThanG @salamander555 we're investigating, we'll make a patch once confirmed.

E-ThanG commented 1 month ago

I did some testing of my own, and there are a few issues. The first is the theory I mentioned above: it does indeed fail when the secure communication is opened to one server_name and netr_LogonSamLogonWithFlags is called with a different server_name.

Additionally, our AD domain has 6 DCs, but only 2 are on our primary campus. The on-campus DCs are preferred for the best performance.

IMO AD authentication sources should use the configured DCs only, not all of the potential DCs. The configuration is called "Host", so we should configure FQDN hostnames of DCs, not the domain name that contains all DCs. Also, it should only choose a random DC if "Shuffle" is enabled, and it should only pick from the configured DC list. I suppose there's an argument for the configuration being "DC Hostname or Domain DNS": if a DC hostname is configured (a DNS name with a single IP), it doesn't try to find others; if it's a domain DNS entry that contains multiple DC host IPs, it uses all DCs found. I'd prefer that to be an additional configuration option though.
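
The selection policy proposed above could look roughly like the following. This is only a sketch of the suggestion, not PacketFence's behavior: `pick_domain_controller`, `configured_hosts`, and `shuffle` are hypothetical names standing in for the "Host" and "Shuffle" settings.

```python
import random
from typing import Sequence


def pick_domain_controller(configured_hosts: Sequence[str],
                           shuffle: bool,
                           rng=random) -> str:
    """Pick a DC from the configured list only (hypothetical helper).

    Without shuffle the first configured host is always used; with
    shuffle a random configured host is used. DCs that are not in the
    configured list are never considered.
    """
    if not configured_hosts:
        raise ValueError("no domain controllers configured")
    if shuffle:
        return rng.choice(list(configured_hosts))
    return configured_hosts[0]


on_campus = ["dc1.example.com", "dc2.example.com"]
print(pick_domain_controller(on_campus, shuffle=False))
```

Keeping the candidate list limited to configured hosts means off-campus or firewalled DCs are never picked by accident.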

Regarding logging of debug/error messages, I copied the method from config_load() and used simple print statements. Adding the second argument "file=sys.stderr" lets me see all of the print statements all of the time. Since this is run in a Flask app, it'd be ideal to use Flask's logging function instead of print though.
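
A minimal sketch of the two approaches, using only the standard library: `debug_print` and `make_stderr_logger` are hypothetical helper names. The first writes to stderr and flushes immediately, which avoids output sitting in a stdout buffer (one possible reason the plain print statements showed up only intermittently, though that is a guess). The second sets up a standard `logging.Logger`; Flask's `app.logger` is a standard `logging.Logger` as well, so the same `.debug()` / `.error()` calls would apply there.

```python
import logging
import sys


def debug_print(message, stream=None):
    """Print to stderr (or a given stream) and flush right away."""
    target = stream if stream is not None else sys.stderr
    print(message, file=target, flush=True)


def make_stderr_logger(name="ntlm_auth_debug", stream=None):
    """Return a logger that writes formatted records to stderr (or a given stream)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.StreamHandler(stream if stream is not None else sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    logger.propagate = False  # keep records from being emitted twice via the root logger
    return logger


debug_print("transitive_login: entering")
make_stderr_logger().error("transitive_login: logon call failed")
```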