neos-sdi / adfsmfa

MFA for ADFS 2022/2019/2016/2012r2
MIT License

Timeouts if primary ADFS member is not online (Event ID 2011/1011) #317

Closed derSchweiger closed 10 months ago

derSchweiger commented 10 months ago

Hi,

we've deployed a redundant ADFS farm across two countries, using a classic WID setup. Both Active Directory sites have multiple domain controllers. For multi-factor authentication we use this plugin. During failover testing we noticed long delays when the primary ADFS member server is not reachable: loading and presenting the second factor takes around 15 seconds longer. This delay only occurs if the application enforces MFA (and therefore loads the MFA plugin). If the application only requires a username and password (no MFA), we do not see any delay at all. My guess is therefore that the MFA plugin is responsible for this behaviour.

Authentication works just fine after these ~15s. My guess is that the secondary ADFS member server (the only one online) tries to communicate with the primary (which is offline) and runs into a timeout after a certain period of time. Once this timeout has occurred, the secondary ADFS server takes over the authentication process and validates the first and second factor on its own.

We can see the following events:

Error 2011 Error calling DispachTheme method : ADFS01.TEST.LOCAL => Could not connect to net.tcp://ADFS01.TEST.LOCAL:5987/WebThemesService. The connection attempt lasted for a time span of 00:00:21.0323783. TCP error code 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 10.61.242.20:5987. .

Error 1011 Error on Check Remote Service method : ADFS01.TEST.LOCAL => Could not connect to net.tcp://ADFS01.TEST.LOCAL:5987/ReplayService. The connection attempt lasted for a time span of 00:00:15. TCP error code 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 10.61.242.20:5987. .

Of course, we could promote the secondary ADFS server to primary, but that would only make sense if we expected a longer downtime of the primary server. For a short outage (or a restart) this is not the behaviour we want.

Do you have any other suggestions on how to fix this issue? Is it possible to tweak the timeout values? Thank you in advance!

redhook62 commented 10 months ago

Hi, @derSchweiger

Yes, of course: TCP port 5987 must be open between the ADFS servers in the farm. Synchronization between the servers when the configuration is modified, as well as the Anti Replay service, uses this port.

https://github.com/neos-sdi/adfsmfa/wiki/01-Installation#configure-windows-firewall-rules

https://github.com/neos-sdi/adfsmfa/wiki/04-System%20Management#manage-firewall-between-adfs-servers
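To check quickly whether the port is reachable from a farm member, a small TCP probe with a short timeout can help. The following is only a minimal sketch, not part of the plugin: `probe_tcp` is a hypothetical helper name and the 3-second default timeout is an arbitrary assumption. The host name and port in the usage note are the ones from the events above.

```python
import socket
import time
from typing import Tuple


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> Tuple[bool, float]:
    """Try a single TCP connection; return (reachable, seconds elapsed).

    Hypothetical helper for illustration only. The timeout default is an
    assumption, not a value taken from the plugin.
    """
    start = time.monotonic()
    try:
        # create_connection resolves the name and connects, giving up
        # after `timeout` seconds.
        with socket.create_connection((host, port), timeout=timeout):
            reachable = True
    except OSError:
        # Covers connect timeouts, refused connections and DNS failures.
        reachable = False
    return reachable, time.monotonic() - start
```

Calling `probe_tcp("ADFS01.TEST.LOCAL", 5987)` from the secondary server returns `(False, ...)` when the primary is down or a firewall drops the traffic, and the elapsed time shows how long the attempt blocked before giving up.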

regards

derSchweiger commented 10 months ago

Hi @redhook62,

yes - I do understand that all ADFS member servers need to communicate with the primary MFA notification hub under normal circumstances. But what if the primary server is not reachable for a short amount of time (let's say 30 minutes)? I would expect the MFA plugin to handle this without any delays or interruptions. That's why you build a redundant ADFS infrastructure: to handle situations in which one ADFS member server is down (for whatever reason) without any downtime for the user.

So there is no possibility to improve this behaviour?

redhook62 commented 10 months ago

Hi, @derSchweiger

I understand your problem. However, the calls between the servers are made over TCP with encrypted flows. A few features are replicated, including one in particular that may affect users at logon: the Replay Service. This is a security function, so strictly speaking it can only be deactivated explicitly in the configuration (yes, on a primary server only). It is not acceptable to disable these checks on the fly and re-enable them once things improve; a competent security manager cannot accept that. Furthermore, we cannot know whether the cause is a server that is down, stopped services, or a network problem, for example. On the Replay Service the timeouts are as follows:

So yes, technically it is possible to handle this situation, but strictly speaking the correct functioning of the ADFS farm must be restored. Running without synchronization between the servers (especially in a WID configuration) is neither recommended nor supported.

What you can do now, on the primary server, is to deactivate Anti Replay and then synchronize your servers (restarting the MFA Service).

In the future, it will be possible to explicitly suspend server-by-server replication for a defined period of time (5 minutes, 15 minutes, 1 hour) to allow, for example, updating of machines.

Any unexpected communication problem must be resolved as a priority.

regards

derSchweiger commented 10 months ago

Hi @redhook62,

thank you very much for the explanation. I totally agree with you: we do not want to disable a security feature (replay detection) to mitigate this delay. What is your recommendation in the following scenario? We have 2 ADFS servers in two different sites. The primary ADFS server has a problem and goes down. It is not possible to recover this server in the short term, so we need to promote the secondary ADFS server to primary. Will this also migrate the ADFS MFA primary role to this server? And if not, what do we have to do to achieve this?

The question might be dumb but we need to know how to proceed in such an event. We rely heavily on ADFS and MFA as our primary source of authentication.

redhook62 commented 10 months ago

Hi @derSchweiger

Emergency

Set-AdfsSyncProperties -Role PrimaryComputer

When you have control of the old server again (currently the Primary):

Set-AdfsSyncProperties -Role SecondaryComputer -PrimaryComputerName <FQDN_ADFS_Primary>

Then if you have a problem with the signature or encryption certificates, you will have to move them by hand.

Now, if you want a more permanent solution, please provide more details.

Why not run the platform on a single site (redundant NLB, and maybe even without SQL AlwaysOn high availability)? For ADFS authentication it doesn't matter: it's web traffic, so the ADFS servers don't need to be on the same LAN as the applications.

On 2 sites, yes: I have a very large customer who manages all transport in the Ile-de-France region (Paris region), around 10 million daily travelers. Because of flood risk, the platform is distributed over two remote sites where everything is duplicated (ADFS, ADFS proxies, SQL servers (because of the SQL configuration for ADFS), load balancers, etc.). One machine room is on the ground floor (site A), the other on the 4th floor (site B).

Another solution: 2 distinct platforms (not the same URL, not the same domain name, not the same ADDS, etc.) joined by co-federation, i.e. a trust between two federation entities. Basically, you authorize a user from platform A to access a resource on platform B.

And one last point: why is this server permanently unavailable?

regards

derSchweiger commented 10 months ago

Hi @redhook62.

our company operates globally and runs multiple datacenters around the world. Currently, we host ADFS servers in two different sites. If this deployment works well, we plan to expand it to one more site. In each site we have at least two domain controllers to provide redundant Active Directory services. All domain controllers are connected to each other and replicate their data.

We do this to ensure that our corporate IT works even if one site (datacenter) is temporarily unavailable. Let's assume that one datacenter (our primary site) goes down. In this case, we would automatically route all ADFS traffic to the second site and ensure that users are still able to authenticate. Of course, this situation is just temporary, and our team would work to fix the issue as fast as possible and bring all services back online. But within this time frame (let's say 1h) authentication with ADFS should work without any problem. While testing, I've experienced this short delay of ~20s while providing the first and second factor. As you said, this is related to the replay detection service. If we can mitigate this delay by promoting the secondary (and working) ADFS site to primary, then this is fine for us.

I'll test this scenario later on in our staging environment.
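The trade-off discussed in this thread (ride out a short outage, promote the secondary only for a longer one) can be framed as a simple threshold on how long the primary has been unreachable. This is a minimal sketch under stated assumptions: `should_promote` is a hypothetical helper, and the 30-minute default is only an illustrative value, not a recommendation from the plugin's author.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_promote(primary_down_since: Optional[datetime],
                   now: datetime,
                   threshold: timedelta = timedelta(minutes=30)) -> bool:
    """Return True once the primary has been unreachable for at least `threshold`.

    `primary_down_since` is None while the primary is reachable.
    The default threshold is an assumption chosen for illustration.
    """
    if primary_down_since is None:
        return False
    return now - primary_down_since >= threshold
```

When the function returns True, an operator would run the `Set-AdfsSyncProperties -Role PrimaryComputer` command quoted earlier in the thread on the surviving server. Whether to automate that step at all is a separate decision.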

redhook62 commented 10 months ago

@derSchweiger

Thanks for this information. However, I can tell you as a senior architect that a WID configuration is not a serious option, and especially not in line with the size and constraints of your company.

If you are doing ADDS replication, why not do the same with SQL Server, which is perfectly suited to this? Well, it's up to you; here, I'm only redhook.

regards

derSchweiger commented 10 months ago

@redhook62 yep, totally agree on that. In the mid-term we want to use MSSQL with high availability. This simple WID setup is more or less there to evaluate and prove that ADFS will be our primary authentication service and that your plugin will give us the ability to implement the second factor (thank you very much for that!).

redhook62 commented 10 months ago

In mid-term... I sincerely think that you should do it as soon as possible...

Thanks

regards