ReCaptcha Enterprise high TCP usage

Rassell commented 3 years ago

Hi all,

We are integrating reCaptcha enterprise solution on our proyect and the service is crashing with the following when there is a high usage of the reCaptcha client:

Status(StatusCode="Unavailable", Detail="failed to connect to all addresses", DebugException="Grpc.Core.Internal.CoreErrorDetailException: {"created":"@1613648642.332000000","description":"Failed to pick subchannel","file":"T:\src\github\grpc\workspace_csharp_ext_windows_x86\src\core\ext\filters\client_channel\client_channel.cc","file_line":4090,"referenced_errors":[{"created":"@1613648642.332000000","description":"failed to connect to all addresses","file":"T:\src\github\grpc\workspace_csharp_ext_windows_x86\src\core\ext\filters\client_channel\lb_policy\pick_first\pick_first.cc","file_line":394,"grpc_status":14}]}")

And the following image is the tcp usage duing our performance tests on a 3xS2 webapp (Azure), you can see the difference between normal execution vs recatpcha gets enabled/working

This is our implementation of the code (following the recomendation of the documentation to use the same client, we have tried different ones but we still getting the same error)

public class ReCaptchaValidator : IReCaptchaValidator
    {
        private readonly ILogger<ReCaptchaValidator> _logger;
        private readonly ReCaptchaOptions _reCaptchaOptions;
        private readonly string _host;
        private readonly RecaptchaEnterpriseServiceClient _client;

        public ReCaptchaValidator(IOptionsMonitor<ReCaptchaOptions> optionsMonitor, ILogger<ReCaptchaValidator> logger, IConfiguration configuration)
        {
            _logger = logger;
            _reCaptchaOptions = optionsMonitor.CurrentValue;
            _host = configuration.GetValue<string>("Host");
            var builder = new RecaptchaEnterpriseServiceClientBuilder { JsonCredentials = _reCaptchaOptions.CredentialsJson };
            _client = builder.Build();
        }

        public async Task<bool> IsValid(string token, string action, CancellationToken cancellationToken = default)
        {
            if (!_reCaptchaOptions.Enabled)
            {
                return true;
            }

            // Initialize request argument(s)
            var createAssessmentRequest = new CreateAssessmentRequest
            {
                ParentAsProjectName = ProjectName.FromProject(_reCaptchaOptions.ProjectId),
                Assessment = new Assessment()
                {
                    Event = new Google.Cloud.RecaptchaEnterprise.V1.Event()
                    {
                        SiteKey = _reCaptchaOptions.SiteKey,
                        Token = token
                    }
                },
            };

            try
            {
                // Make the request
                var response = await _client.CreateAssessmentAsync(createAssessmentRequest, cancellationToken);

                _logger.LogInformation($"ReCaptcha response {JsonConvert.SerializeObject(response)}");

                return response != null
                    && response.TokenProperties.Hostname.Equals(_host.Replace("https://", string.Empty))
                    && response.TokenProperties.Valid
                    && response.TokenProperties.Action.Equals(action)
                    && response.RiskAnalysis?.Score >= _reCaptchaOptions.Threshold;
            }
            catch (Exception e)
            {
                _logger.LogError(e, "ReCaptcha: Error getting response");
                return true;
            }
        }
    }

jskeet commented 3 years ago

Note for future issues: it's a lot easier to read and work with code if it's copy/pasted as text rather than provided as a screenshot.

The code looks fine to me, and I'd expect this to just work.

A few questions:

I can't read the graph, in terms of either the key or the values... are you able to provide a less blurry version?
Do you experience any other network issues around the time that these calls fail?
Does the service recover after a while using the same client objects, or is it permanently broken once it's started failing?

I strongly suspect I'll need to get some of our gRPC folks involved (as the gRPC layer is one that I don't have detailed knowledge of) but the answers to those questions would be useful for them, I suspect.

Rassell commented 3 years ago

Note for future issues: it's a lot easier to read and work with code if it's copy/pasted as text rather than provided as a screenshot.

The code looks fine to me, and I'd expect this to just work.

A few questions:

I can't read the graph, in terms of either the key or the values... are you able to provide a less blurry version?

Do you experience any other network issues around the time that these calls fail?

Does the service recover after a while using the same client objects, or is it permanently broken once it's started failing?

I strongly suspect I'll need to get some of our gRPC folks involved (as the gRPC layer is one that I don't have detailed knowledge of) but the answers to those questions would be useful for them, I suspect.

Hi @jskeet

sorry for the blurry images, I will update the screenshot with a clear one and put the code as it is rather than screenshot.

No, no other network issues were found during these tests, we were testing this functionality after getting the error on production, because as you said we expected the code to just work. So lets say its a isolated env to test this.
It goes up until no tcp is available and causes the app to crash

jskeet commented 3 years ago

Lots more questions I'm afraid - please don't read any hostility into the level of questions here; there's just a lot I don't understand at the moment :)

Okay, now I can read the graphs, I'm afraid I still don't fully understand what they're saying. You said "you can see the difference between normal execution vs recatpcha gets enabled/working" - do you mean different instances are configured differently? Or that you enabled it at a different time?

Are these just graphs of established TCP connections (both directions? incoming? outgoing?)? Do you have graphs of error rates etc? The more we can understand what's going on, the better.

When you say no other network issues were found during these tests - is anything else trying to talk to any GCP services? (For example, if you try to make an API call to another service once a minute just as a sort of heartbeat, does that work?) Have you conducted the tests multiple times and seen similar results? If it works for a while and then starts failing, is that timing consistent?

I'm out of the office right now, but will contact a gRPC-based colleague later today to ask for assistance.

amanda-tarafa commented 3 years ago

I will add another question, just to make sure.

When you call the ReCaptchaValidator.IsValid method, are you awaiting the call (await validator.IsValid(...)) or blocking (validator.IsValid(...)Result). If you are blocking, that could very well explain increased latency and possibly resource depletion.

Rassell commented 3 years ago

Hi! fast answer @amanda-tarafa , we are not blocking it:

if (request.ReCaptchaEnabled && !await _reCaptchaValidator.IsValid(request.ReCaptchaToken, ReCaptchaActions.LOGIN, cancellationToken))

I will get more info @jskeet and answer you ;) (lot of work today on another areas! ahhaha)

Rassell commented 3 years ago

Lots more questions I'm afraid - please don't read any hostility into the level of questions here; there's just a lot I don't understand at the moment :)

Okay, now I can read the graphs, I'm afraid I still don't fully understand what they're saying. You said "you can see the difference between normal execution vs recatpcha gets enabled/working" - do you mean different instances are configured differently? Or that you enabled it at a different time?

Are these just graphs of established TCP connections (both directions? incoming? outgoing?)? Do you have graphs of error rates etc? The more we can understand what's going on, the better.

When you say no other network issues were found during these tests - is anything else trying to talk to any GCP services? (For example, if you try to make an API call to another service once a minute just as a sort of heartbeat, does that work?) Have you conducted the tests multiple times and seen similar results? If it works for a while and then starts failing, is that timing consistent?

I'm out of the office right now, but will contact a gRPC-based colleague later today to ask for assistance.

No, I mean that the TCP connections go to high (at least for my point of view) when the code starts connecting to the google services

Let me ask my collegues for more info about the graphs as im on a very slow network and its imposible to send images.

No there is no other talk to any GCP (we are only using ReCaptcha, for now). Yes those tests were run multiple times with the sames results, we didnt check the timing.

jskeet commented 3 years ago

Just coming back to this again - could you add logging for your ReCaptchaValidator constructor? I wonder whether some DI configuration issue means it's being created on every request rather than acting as the singleton I'd expect it to be. If you're creating a new RecaptchaEnterpriseServiceClient on every request (and with a specific JSON credential rather than the default credential, which allows channel reuse) then that could explain things.

Rassell commented 3 years ago

hi @jskeet

find attached a screenshot with a sample text setted on the constructor (on a 250k calls, more or less, only 1 log was done):

jskeet commented 3 years ago

@Rassell: Okay, that's good to know.

I'm currently lost as to why you'd end up with so many TCP connections then. I wonder whether the gRPC library is initializing more connections in order to meet demand... do you know what request rate you're achieving, roughly?

Also, you've talked about this happening on Azure - are you able to try reproducing it using local machines, potentially just issuing recaptcha requests? (I wonder whether Azure is rate limiting the outward bound connections somehow?) I'm not quite sure where to go next in diagnosing this, although I'll refer this question to a gRPC team member...

Rassell commented 3 years ago

Hi @jskeet

we are on 478 more or less per second during our tests but with number of threads (users) 50 , using jmeter for our tests

we didnt test anything on local at the moment, we will take a look to double check

also we are getting this error during our tests (not in all requests):

Status(StatusCode="Unavailable", Detail="GOAWAY received", DebugException="Grpc.Core.Internal.CoreErrorDetailException: {"created":"@1614697987.477000000","description":"Error received from peer ipv4:172.217.17.42:443","file":"T:\src\github\grpc\workspace_csharp_ext_windows_x86\src\core\lib\surface\call.cc","file_line":1062,"grpc_message":"GOAWAY received","grpc_status":14}")

Rassell commented 3 years ago

Hi all,

We are doing some tests and we have found that using "normal" httpclients the cpu usage and tcp are "stables":

using Google.Apis.Auth.OAuth2;
using Google.Apis.Util.Store;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Options;
using Newtonsoft.Json;
using Newtonsoft.Json.Serialization;
using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Security.Cryptography.X509Certificates;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

namespace Lidl.Sso.ReCaptcha
{

    public class ReCaptchaValidator : IReCaptchaValidator
    {
        private readonly ILogger<ReCaptchaValidator> _logger;
        private readonly ReCaptchaOptions _reCaptchaOptions;
        private readonly string _host;
        private readonly HttpClient _httpClient;

        public ReCaptchaValidator(IOptionsMonitor<ReCaptchaOptions> optionsMonitor, ILogger<ReCaptchaValidator> logger, IConfiguration configuration, HttpClient client)
        {
            _logger = logger;
            _reCaptchaOptions = optionsMonitor.CurrentValue;
            _host = configuration.GetValue<string>("Host");
            _httpClient = client;
        }

        public async Task<bool> IsValid(string token, string action, CancellationToken cancellationToken = default)
        {
            if (!_reCaptchaOptions.Enabled)
            {
                return true;
            }

            // Initialize request argument(s)
            var createAssessmentRequest = new Request
            {
                Event = new Event
                {
                    SiteKey = _reCaptchaOptions.SiteKey,
                    Token = token
                }
            };

            try
            {
                var stringContent = new StringContent(JsonConvert.SerializeObject(createAssessmentRequest), Encoding.UTF8, "application/json");

                // Make the request
                _httpClient.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", "");
                var response = await _httpClient.PostAsync($"https://recaptchaenterprise.googleapis.com/v1/projects/{_reCaptchaOptions.ProjectId}/assessments", stringContent, cancellationToken);

                if (!response.IsSuccessStatusCode)
                {
                    return false;
                }

                var json = await response.Content.ReadAsStringAsync();
                var result = JsonConvert.DeserializeObject<Response>(json);

                _logger.LogInformation($"ReCaptcha response {JsonConvert.SerializeObject(response)}");

                return result != null
                    && result.TokenProperties.Hostname.Equals(_host.Replace("https://", string.Empty))
                    && result.TokenProperties.Valid
                    && result.TokenProperties.Action.Equals(action)
                    && result.RiskAnalysis?.Score >= _reCaptchaOptions.Threshold;
            }
            catch (Exception e)
            {
                _logger.LogError(e, "ReCaptcha: Error getting response");
                return true;
            }
        }
    }

    class Event
    {
        [JsonProperty("token")]
        public string Token { get; set; }

        [JsonProperty("siteKey")]
        public string SiteKey { get; set; }
    }

    class Request
    {
        [JsonProperty("event")]
        public Event Event { get; set; }
    }

    class Response
    {
        public string Name { get; set; }

        public Event Event { get; set; }

        public TokenProperties TokenProperties { get; set; }

        public Google.Cloud.RecaptchaEnterprise.V1.RiskAnalysis RiskAnalysis { get; set; }
    }

    public class TokenProperties
    {
        public bool Valid { get; set; }
        public string InvalidReason { get; set; }
        public string Hostname { get; set; }
        public string Action { get; set; }
    }
}

services.AddHttpClient<IReCaptchaValidator, ReCaptchaValidator>();
services.AddTransient<IReCaptchaValidator, ReCaptchaValidator>();

the only issue with this is the auth_token that we are not able to generate this, only with powershell with the gcloud console, is there any way to generate this token?

jskeet commented 3 years ago

You can certainly use GoogleCredential to generate auth tokens (and refresh them automatically). You'll need a dependency on Google.Apis.Auth (which you'll already have transitively if you've still got Google.Cloud.RecaptchaEnterprise.V1 installed.

// In the constructor (assigning to a field)
ICredential credential = GoogleCredential.FromJson(_reCaptchaOptions.CredentialsJson);

// On each request:
string accessToken = await credential.GetAccessTokenForRequestAsync();

However, doing the serialization yourself like this has significant risks - the generated protobuf classes can sometimes have property names which may not deserialize directly using Json.NET. It would be somewhat better to just use the protobuf classes directly, e.g.:

var request = new Assessment
{
    Event = new Event { ... }
};
var stringContent = new StringContent(request.ToString(),  Encoding.UTF8, "application/json");
...
var result = Assessment.Parser.ParsJson(json);

(You can then get rid of your TokenProperties, Event, Request and Response classes.)

That still assumes the JSON used by the service exactly matches the proto3 representation. That's probably true - and it's a better approach than using Json.NET with the protobuf classes - but using gRPC would definitely be better. I'm hoping we can resolve why there's such a high usage level here.

Have you had any luck reproducing this outside Azure, ideally with just a console app firing requests?

Rassell commented 3 years ago

Hi @jskeet

Thanks for the fast answers, about the

// In the constructor (assigning to a field)
ICredential credential = GoogleCredential.FromJson(_reCaptchaOptions.CredentialsJson);

// On each request:
string accessToken = await credential.GetAccessTokenForRequestAsync();

we are getting:

Error [string]:"invalid_scope"
ErrorDescription [string]:"Invalid OAuth scope or ID token audience provided."

I have also tried to put "cloud-platform" on the GetAccessTokenForRequestAsync but got a 401

Also thanks for pointing the issue with the json.

I will try to reproduce this on local asap

jskeet commented 3 years ago

Ah, okay. Fortunately that's easy to fix. I'd hoped it would just work with a self-signed JWT, but I'm hopeful that's easy to fix:

ICredential credential = GoogleCredential
    .FromJson(_reCaptchaOptions.CredentialsJson)
    .CreateScoped(RecaptchaEnterpriseServiceClient.DefaultScopes);

Rassell commented 3 years ago

Ah, okay. Fortunately that's easy to fix. I'd hoped it would just work with a self-signed JWT, but I'm hopeful that's easy to fix:
ICredential credential = GoogleCredential
    .FromJson(_reCaptchaOptions.CredentialsJson)
    .CreateScoped(RecaptchaEnterpriseServiceClient.DefaultScopes);

working! thanks!

jskeet commented 3 years ago

Great! I'll keep this issue open though, as I still want to understand what's leading to the high TCP count. Hopefully this has at least unblocked you for now though.

Rassell commented 3 years ago

@jskeet hi! we tested "localy" (only library, one endpoint on a rest api calling on the reCaptcha nuget) and see that the problem reproduces

jskeet commented 3 years ago

Okay, thanks. That's good to know. Is that still with your full web app, or have you reproduced it in a minimal way I'd be able to test here? Not to worry if not - just thought I'd ask. Still hoping to get the help of the gRPC folks here.

jskeet commented 3 years ago

I've managed to reproduce this to at least a reasonable extent with the following code. @Rassell could you have a look and see if this seems like a fair representation to you? It's quick and dirty, but does seem to provoke the library to create more TCP connections than I'd expect. Not into the hundreds, but definitely more than I'd normally expect. I may have inappropriate expectations about the number of concurrent calls a single gRPC network connection "wants" to handle before the underlying library decides to use a new connection.

(Unsurprisingly, under heavy traffic the servers start responding with GOAWAY messages, so there's only so far we can push this...)

using Google.Api.Gax.ResourceNames;
using Google.Cloud.RecaptchaEnterprise.V1;
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;

namespace Issue5978
{
    class Program
    {
        private const int Threads = 50;
        private const int CallsPerThread = 1000;
        private const int CallsPerLog = 10;
        private static readonly TimeSpan DelayBetweenCalls = TimeSpan.FromMilliseconds(50);
        private static readonly TimeSpan DelayAfterFailure = TimeSpan.FromSeconds(10);
        private static RecaptchaEnterpriseServiceClient client;
        private static CreateAssessmentRequest request;

        static void Main(string[] args)
        {
            if (args.Length != 2)
            {
                Console.WriteLine("Arguments: <project ID> <site key>");
                return;
            }
            // We're issuing the exact same request all the time.
            request = new CreateAssessmentRequest
            {
                ParentAsProjectName = ProjectName.FromProject(args[0]),
                Assessment = new Assessment
                {
                    // Note: no token, deliberately to avoid us having to create them.
                    Event = new Event { SiteKey = args[1] }
                }
            };

            client = RecaptchaEnterpriseServiceClient.Create();
            var threads = Enumerable.Range(0, Threads).Select(_ => new Thread(MakeCalls)).ToList();
            threads.ForEach(t => t.Start());
            threads.ForEach(t => t.Join());
        }

        static void MakeCalls()
        {
            var threadId = Thread.CurrentThread.ManagedThreadId;
            Stopwatch sw = new Stopwatch();
            int failures = 0;
            for (int i = 0; i < CallsPerThread; i++)
            {
                if (i % CallsPerLog == 0)
                {
                    int latency = i > 0 ? (int) (sw.ElapsedMilliseconds / i) : 0;
                    Console.WriteLine($"Thread ID {threadId} making call {i}; average latency={latency}ms; failures {failures}");
                }
                sw.Start();
                try
                {
                    client.CreateAssessment(request);
                    sw.Stop();
                }
                catch (Exception ex)
                {
                    sw.Stop();
                    Console.WriteLine($"Failure in thread ID {threadId}: {ex.Message}");
                    failures++;
                    Thread.Sleep(DelayAfterFailure);
                }
                Thread.Sleep(DelayBetweenCalls);
            }
        }
    }
}

jtattermusch commented 3 years ago

This looks interesting. Have you tried running with extra tracing on? How to get extra debug logs & traces: https://github.com/grpc/grpc/blob/master/TROUBLESHOOTING.md

I'd try running with GRPC_VERBOSITY=debug GRPC_TRACE=subchannel (which should show you TCP connections/disconnections). There might be other traces that might reveal something interesting - see the list here https://github.com/grpc/grpc/blob/master/doc/environment_variables.md

From the logs it looks like you're running a 32bit process on windows.

which .NET framework are you using? (.NET Core or .NET Framework)?
can you try switching your app to 64bit and see if that helps at all?
if you had the opportunity to run under linux (I understand that might be tricky depending on your setup), this might provide valuable insights on whether this is a windows-stack related problem or if it's agnostic.
one thing you might want try is to configure the client library to use the https://github.com/grpc/grpc-dotnetgRPC client instead of Grpc.Core - just to see if that changes anything. You'd need to be on .NET Core 3+ in order to do that.

jtattermusch commented 3 years ago

I've managed to reproduce this to at least a reasonable extent with the following code. @Rassell could you have a look and see if this seems like a fair representation to you? It's quick and dirty, but does seem to provoke the library to create more TCP connections than I'd expect. Not into the hundreds, but definitely more than I'd normally expect. I may have inappropriate expectations about the number of concurrent calls a single gRPC network connection "wants" to handle before the underlying library decides to use a new connection.

(Unsurprisingly, under heavy traffic the servers start responding with GOAWAY messages, so there's only so far we can push this...)

using Google.Api.Gax.ResourceNames;
using Google.Cloud.RecaptchaEnterprise.V1;
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;

namespace Issue5978
{
    class Program
    {
        private const int Threads = 50;
        private const int CallsPerThread = 1000;
        private const int CallsPerLog = 10;
        private static readonly TimeSpan DelayBetweenCalls = TimeSpan.FromMilliseconds(50);
        private static readonly TimeSpan DelayAfterFailure = TimeSpan.FromSeconds(10);
        private static RecaptchaEnterpriseServiceClient client;
        private static CreateAssessmentRequest request;

        static void Main(string[] args)
        {
            if (args.Length != 2)
            {
                Console.WriteLine("Arguments: <project ID> <site key>");
                return;
            }
            // We're issuing the exact same request all the time.
            request = new CreateAssessmentRequest
            {
                ParentAsProjectName = ProjectName.FromProject(args[0]),
                Assessment = new Assessment
                {
                    // Note: no token, deliberately to avoid us having to create them.
                    Event = new Event { SiteKey = args[1] }
                }
            };

            client = RecaptchaEnterpriseServiceClient.Create();
            var threads = Enumerable.Range(0, Threads).Select(_ => new Thread(MakeCalls)).ToList();
            threads.ForEach(t => t.Start());
            threads.ForEach(t => t.Join());
        }

        static void MakeCalls()
        {
            var threadId = Thread.CurrentThread.ManagedThreadId;
            Stopwatch sw = new Stopwatch();
            int failures = 0;
            for (int i = 0; i < CallsPerThread; i++)
            {
                if (i % CallsPerLog == 0)
                {
                    int latency = i > 0 ? (int) (sw.ElapsedMilliseconds / i) : 0;
                    Console.WriteLine($"Thread ID {threadId} making call {i}; average latency={latency}ms; failures {failures}");
                }
                sw.Start();
                try
                {
                    client.CreateAssessment(request);
                    sw.Stop();
                }
                catch (Exception ex)
                {
                    sw.Stop();
                    Console.WriteLine($"Failure in thread ID {threadId}: {ex.Message}");
                    failures++;
                    Thread.Sleep(DelayAfterFailure);
                }
                Thread.Sleep(DelayBetweenCalls);
            }
        }
    }
}

Is this example using a grpc channel pool? How many distinct grpc channels get created under the hood probably has influence on how many TCP connections get created (the relationship between the two is not trivial as grpc has some logic for subchannel sharing, but there is some relationship nevertheless). Does setting GRPC_VERBOSITY=debug GRPC_TRACE=subchannel give any useful insights?

jskeet commented 3 years ago

Is this example using a grpc channel pool?

It's using a single gRPC Channel as far as the client library is concerned - we only create at most a single Grpc.Core.Channel per client, and all the calls here are using a single client. What that Channel does internally is an implementation detail we assume we can ignore - and it's the important implementation detail here, of course :)

Does setting GRPC_VERBOSITY=debug GRPC_TRACE=subchannel give any useful insights?

Will give it a try, although I expect there to be a lot of output to go through.

jskeet commented 3 years ago

Well this is interesting... I've just tried it again, both with and without the extra logging - and the behavior has changed. I get a steady single network connection for gRPC, even increasing to 100 threads.

I wonder whether there's been a network configuration change that has fixed this. @Rassell are you able to retest this yourself relatively easily? It would be great to find out the issue has just been resolved for us :)

jskeet commented 3 years ago

Hi @Rassell: are you able to retest this? I believe the problem has been resolved, probably due to a network reconfiguration somewhere. It would be good to hear that you're seeing the same change though.

jskeet commented 3 years ago

Closing now as I believe it's fine, and we haven't heard any more information for 12 days.

googleapis / google-cloud-dotnet

ReCaptcha Enterprise high TCP usage #5978