lithnet / ad-password-protection

Active Directory password filter featuring breached password checking and custom complexity rules
MIT License
496 stars 52 forks

Changing passwords is suddenly very slow #104

Closed darkpixel closed 1 year ago

darkpixel commented 1 year ago

Changing passwords is extremely slow after deploying ad-password-protection.

As you can see from the screenshot, the password change took just over 4 minutes to process:

[Screenshot from 2023-03-23 11-29-45]

On the client side they enter their password and Windows says it needs to be changed. It pops up the change password box with their "old" password pre-filled and asks them for a "new" password and to "confirm" the password.

The moment they submit the form, the server logs that it is "Processing a password change request for..." and the client hangs for a while "spinning", saying the password is being changed. After a minute or two the client sorta "times out" and says something about "bad password"--not that the new password is bad, but implying the old password is bad.

A few minutes later the event log on the server shows the new password was finally accepted (The password change request for user (user full name) was approved) and AD shows that the account no longer has the "User must change password at next logon" flag set.

...but the user can't sign in with the "old" password or the "new" password.

One strange thing I noticed while trying to troubleshoot this is that our Linux servers have trouble accessing Active Directory during the password change. The Linux boxes are AD Members (not Domain Controllers)...and the moment a user clicks the button to change their password, they stop being able to retrieve information from AD.

i.e. if we have a Seattle office and a Portland office, and someone in Seattle changes their password, Linux boxes in Seattle hang trying to resolve user information (the winbindd process)...but if I point the Linux box in Seattle to look at Portland it never hangs...unless someone in Portland also tries changing their password. The hang lasts about as long as the Windows Server takes to log that the password changed.

ryannewington commented 1 year ago

@darkpixel this is a very strange one.

Can you tell me a bit more about the password policy settings you have in place? Are you using the compromised password/banned word facility? How are your DCs accessing the store?

darkpixel commented 1 year ago

Yeah, we were using the compromised password list and the banned word part. Disabling the compromised list fixed the issue. I'm betting it has something to do with the number of files / size of the v3/p folder being 13 gigs. ;) I disabled it, and passwords are changing much faster now.

I'm going to delete the store and try again with a smaller list of banned words instead of compromised passwords.

ryannewington commented 1 year ago

That shouldn't be an issue. The store scales up to billions of passwords.

It uses a very efficient binary search algorithm. No matter the number of passwords in the store, on average, there are only 12 x 14-byte disk reads required to find a matching password.
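To illustrate the kind of lookup described above, here is a minimal Python sketch of a binary search over a file of sorted, fixed-width records. It is not the project's actual code: the flat single-file layout, the `contains_record` and `build_demo_store` names, and the demo data are all assumptions made for illustration. Only the 14-byte record size comes from the comment above.

```python
import io

RECORD_SIZE = 14  # bytes per record, matching the 14-byte reads mentioned above


def contains_record(f, target, record_size=RECORD_SIZE):
    """Binary-search a seekable file of sorted fixed-width records.

    Returns (found, reads), where reads counts the record-sized reads issued.
    """
    if len(target) != record_size:
        raise ValueError("target must be exactly one record long")
    f.seek(0, io.SEEK_END)
    count = f.tell() // record_size
    lo, hi = 0, count - 1
    reads = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        f.seek(mid * record_size)      # jump straight to the middle record
        record = f.read(record_size)   # one small read per probe
        reads += 1
        if record == target:
            return True, reads
        if record < target:            # bytes compare lexicographically
            lo = mid + 1
        else:
            hi = mid - 1
    return False, reads


def build_demo_store(n):
    """Hypothetical in-memory store: n sorted records holding even integers."""
    records = [i.to_bytes(RECORD_SIZE, "big") for i in range(0, 2 * n, 2)]
    return io.BytesIO(b"".join(records))


store = build_demo_store(4096)
found, reads = contains_record(store, (10).to_bytes(RECORD_SIZE, "big"))
print(found, reads)
```

Each probe is one seek plus one 14-byte read, so the number of reads grows with the base-2 logarithm of the record count: doubling the size of the file adds only one more read.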

Is it a local store or on a network share? In either case there could be latency issues with the underlying storage.

It might be worth trying to exclude the store from AV.

darkpixel commented 1 year ago

Sorry for the delay. I tested it on a machine that didn't have our AV installed. I even disabled Windows Defender.

That makes sense about the search algo. It's sitting on a network share...but based on the architecture there shouldn't be a network performance penalty. The DC is a virtual machine with a virtual NIC that's inside a bridge on the hypervisor. The bridge also has the hypervisor's IP. While I can pull files at over 200 MB/sec between the hypervisor and the DC VM, I'm starting to suspect it's something in Samba.

While there don't appear to be any CPU issues, the winbind process locks up for a bit.

I'm going to do some more testing this weekend, but I'm suspecting it's NOT an issue with lithnet.

ryannewington commented 1 year ago

Thanks for the update @darkpixel

If your winbind investigations don't turn up anything, I'd recommend you consider switching to a local store, replicated with DFS-R. It combines the best of both worlds in that you still only have a single place that needs to be updated, but eliminates any network issues, outages, or latency during lookups.

darkpixel commented 1 year ago

Ugh. I hate DFS-R. A Samba share and syncthing are the way to go to keep things replicated between offices. We avoid Windows at all costs. ;) In 2023 we're migrating our last Windows-based app to a web-based app and then we're ditching most of our infrastructure and switching to stateless PXE-boot Linux boxes that run Chrome.

darkpixel commented 1 year ago

So...I ran a few tests. I completely deleted the password store (the v3 folder and everything under it)...and we still see the performance issue. i.e. if the store path is \\customer.local\corp\deployment\pwstore, I leave the pwstore folder, but delete everything under it.

Adding ~500,000 records to the banned words store makes changing passwords time out more frequently, but get-passwordfilterresult still returns in under 1 second.

I'm assuming get-passwordfilterresult is using the same method to check the store as the password filter DLL.

Running get-passwordfilterresult on the server returns almost immediately (like under 1 second), but it took just over 4 minutes between: Processing a password change request for username (User Full Name). and: The password change request for username (User Full Name) was approved. when the user tried to change their password from a workstation.
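For anyone wanting to measure that gap across many changes rather than eyeballing the event log, here is a rough Python sketch that pairs the two messages quoted above and reports the elapsed time per user. It assumes the events have already been exported as (timestamp, message) pairs in chronological order. The `change_durations` name, the `jdoe` user, and the timestamps are made up for illustration, and the regexes are based only on the two message texts quoted in this thread.

```python
import re
from datetime import datetime

# Patterns built from the two event messages quoted in this thread.
START = re.compile(r"^Processing a password change request for (?P<user>\S+)")
DONE = re.compile(r"^The password change request for (?P<user>\S+) .*was approved")


def change_durations(events):
    """Pair 'Processing' and 'approved' events per user.

    events: iterable of (datetime, message) in chronological order.
    Returns a list of (username, seconds_elapsed).
    """
    pending = {}   # username -> time the request started
    results = []
    for ts, msg in events:
        m = START.match(msg)
        if m:
            pending[m.group("user")] = ts
            continue
        m = DONE.match(msg)
        if m and m.group("user") in pending:
            started = pending.pop(m.group("user"))
            results.append((m.group("user"), (ts - started).total_seconds()))
    return results


# Hypothetical sample data, not taken from a real log.
sample = [
    (datetime(2023, 3, 23, 11, 25, 30),
     "Processing a password change request for jdoe (John Doe)."),
    (datetime(2023, 3, 23, 11, 29, 45),
     "The password change request for jdoe (John Doe) was approved."),
]
for user, seconds in change_durations(sample):
    print(f"{user}: {seconds:.0f} s")
```

Sorting the results by duration would make it easier to see whether the slow changes cluster around particular times or particular DCs.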

jemmiegod commented 1 year ago

It could be possible that you're experiencing latency or replication issues between the domain controllers. If you run a ping from the workstation to the domain controller, what's the latency? Anything over 30ms and you'll start to see issues, particularly with login times and folder redirection.

It might be worth running repadmin and dcdiag to see if that uncovers any underlying issues. Do you have multiple Active Directory sites or have you manually configured replication partners to override KCC?

Are you able to change a password from a server without delay or are you seeing the issue from both servers and workstations? If you're just seeing problems from certain machines, the problem is not with LPP, but sounds like you've got network issues at play.

You might consider logging a support case with Microsoft to further assist.


ryannewington commented 1 year ago

Another point to consider is that get-passwordfilterresult runs in your user context, whereas actual password changes run in the DC's system context. It could be that the DC is having issues authenticating itself to the remote share.

My guess is something is wrong with the interaction between the SMB server and Windows. A Wireshark trace could reveal more information.

darkpixel commented 1 year ago

I'm going to keep digging to try to track things down...but to respond to a few of the questions:

We have multiple AD sites. One DC per site. No latency between workstations and the DC at their site. The latency between any two random sites is typically under 25msec.

Repadmin /replsum is clear. Replication between sites happens every ~15 minutes. We have monitoring systems that verify this every hour.

I can reset passwords for users from the server without issues, but that's probably because I have "Enable for password set operations" unchecked for "Reject normalized passwords found in the banned words store" because HR hands out an easy password that everyone knows for new employees.

The share that hosts the pwstore allows anyone and anything to connect. It holds not only the banned words store, but also tons of installers that get deployed to workstations. At any random time there are authenticated users, authenticated computers, and guest (anonymous) users connected, so I don't think it's an authentication issue at play. Permissions on all files are "everyone full control", but the share itself is read-only. Nothing in the Windows environment can make changes to it.

I highly doubt there are any network issues involved.

I tested get-passwordfilterresult from the machine's system account about 50 times across ~24 servers and didn't see any timeouts.

Is there any additional logging I can enable for the password filter DLL to see why it's taking so long?

I'll try to get a wireshark capture some time this week when things settle down.

darkpixel commented 1 year ago

It's been difficult to get a wireshark capture. The delay/timeout appears to be somewhat random. I changed my password somewhere around 10 times over the weekend while capturing with wireshark. Zero issues.

Then I said "Huh..oh well", closed wireshark and then decided to set my password back....it timed out.

I re-opened wireshark, to capture again, tried resetting my password again....and it went through immediately.

I tried a few more times, then reset it back. It didn't have an issue.
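Since the hang only shows up some of the time, one option is to leave a small probe running that times the same operation over and over and records only the slow attempts. The Python sketch below is one way to do that, with everything in it being an assumption: the `probe` and `read_first_record` helpers are hypothetical, the 14-byte read is just a stand-in for a store lookup, and a plain file read from a user session may well not reproduce whatever the password filter hits in the DC's system context.

```python
import time


def probe(operation, attempts, slow_after_s, clock=time.perf_counter):
    """Run operation repeatedly and return the slow attempts.

    Returns a list of (attempt_index, seconds) for every attempt that took
    longer than slow_after_s seconds.
    """
    slow = []
    for i in range(attempts):
        started = clock()
        operation()
        elapsed = clock() - started
        if elapsed > slow_after_s:
            slow.append((i, elapsed))
    return slow


def read_first_record(path, size=14):
    """Build an operation that opens a file and reads its first few bytes."""
    def operation():
        with open(path, "rb") as f:
            f.read(size)
    return operation


if __name__ == "__main__":
    # Replace the no-op with read_first_record(<a file under the store path>)
    # to time reads against the share instead.
    print(probe(lambda: None, attempts=5, slow_after_s=1.0))
```

If the probe never records a slow attempt while users are still seeing hangs, that would point away from raw file access on the share and toward something specific to the password change path.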

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs.

darkpixel commented 1 year ago

Due to all the chaos, I disabled checking for common words temporarily. It's intermittent on my end. I'm very roughly estimating around 20% of the password changes hang. It always seems to happen to the user, then they call me, then I connect in and get wireshark spooled up, then have them try again and it's instant.

Until I can come up with some useful debugging info or someone else is able to confirm this, I think the bug should be closed.