Closed Rassell closed 3 years ago
Note for future issues: it's a lot easier to read and work with code if it's copy/pasted as text rather than provided as a screenshot.
The code looks fine to me, and I'd expect this to just work.
A few questions:
I strongly suspect I'll need to get some of our gRPC folks involved (as the gRPC layer is one that I don't have detailed knowledge of) but the answers to those questions would be useful for them, I suspect.
Note for future issues: it's a lot easier to read and work with code if it's copy/pasted as text rather than provided as a screenshot.
The code looks fine to me, and I'd expect this to just work.
A few questions:
- I can't read the graph, in terms of either the key or the values... are you able to provide a less blurry version?
- Do you experience any other network issues around the time that these calls fail?
- Does the service recover after a while using the same client objects, or is it permanently broken once it's started failing?
I strongly suspect I'll need to get some of our gRPC folks involved (as the gRPC layer is one that I don't have detailed knowledge of) but the answers to those questions would be useful for them, I suspect.
Hi @jskeet
sorry for the blurry images, I will update the screenshot with a clear one and put the code as it is rather than screenshot.
Lots more questions I'm afraid - please don't read any hostility into the level of questions here; there's just a lot I don't understand at the moment :)
Okay, now I can read the graphs, I'm afraid I still don't fully understand what they're saying. You said "you can see the difference between normal execution vs recatpcha gets enabled/working" - do you mean different instances are configured differently? Or that you enabled it at a different time?
Are these just graphs of established TCP connections (both directions? incoming? outgoing?)? Do you have graphs of error rates etc? The more we can understand what's going on, the better.
When you say no other network issues were found during these tests - is anything else trying to talk to any GCP services? (For example, if you try to make an API call to another service once a minute just as a sort of heartbeat, does that work?) Have you conducted the tests multiple times and seen similar results? If it works for a while and then starts failing, is that timing consistent?
I'm out of the office right now, but will contact a gRPC-based colleague later today to ask for assistance.
I will add another question, just to make sure.
When you call the ReCaptchaValidator.IsValid
method, are you awaiting the call (await validator.IsValid(...)
) or blocking (validator.IsValid(...)Result
). If you are blocking, that could very well explain increased latency and possibly resource depletion.
Hi! fast answer @amanda-tarafa , we are not blocking it:
if (request.ReCaptchaEnabled && !await _reCaptchaValidator.IsValid(request.ReCaptchaToken, ReCaptchaActions.LOGIN, cancellationToken))
I will get more info @jskeet and answer you ;) (lot of work today on another areas! ahhaha)
Lots more questions I'm afraid - please don't read any hostility into the level of questions here; there's just a lot I don't understand at the moment :)
Okay, now I can read the graphs, I'm afraid I still don't fully understand what they're saying. You said "you can see the difference between normal execution vs recatpcha gets enabled/working" - do you mean different instances are configured differently? Or that you enabled it at a different time?
Are these just graphs of established TCP connections (both directions? incoming? outgoing?)? Do you have graphs of error rates etc? The more we can understand what's going on, the better.
When you say no other network issues were found during these tests - is anything else trying to talk to any GCP services? (For example, if you try to make an API call to another service once a minute just as a sort of heartbeat, does that work?) Have you conducted the tests multiple times and seen similar results? If it works for a while and then starts failing, is that timing consistent?
I'm out of the office right now, but will contact a gRPC-based colleague later today to ask for assistance.
No, I mean that the TCP connections go to high (at least for my point of view) when the code starts connecting to the google services
Let me ask my collegues for more info about the graphs as im on a very slow network and its imposible to send images.
No there is no other talk to any GCP (we are only using ReCaptcha, for now). Yes those tests were run multiple times with the sames results, we didnt check the timing.
Just coming back to this again - could you add logging for your ReCaptchaValidator
constructor? I wonder whether some DI configuration issue means it's being created on every request rather than acting as the singleton I'd expect it to be. If you're creating a new RecaptchaEnterpriseServiceClient
on every request (and with a specific JSON credential rather than the default credential, which allows channel reuse) then that could explain things.
hi @jskeet
find attached a screenshot with a sample text setted on the constructor (on a 250k calls, more or less, only 1 log was done):
@Rassell: Okay, that's good to know.
I'm currently lost as to why you'd end up with so many TCP connections then. I wonder whether the gRPC library is initializing more connections in order to meet demand... do you know what request rate you're achieving, roughly?
Also, you've talked about this happening on Azure - are you able to try reproducing it using local machines, potentially just issuing recaptcha requests? (I wonder whether Azure is rate limiting the outward bound connections somehow?) I'm not quite sure where to go next in diagnosing this, although I'll refer this question to a gRPC team member...
Hi @jskeet
we are on 478 more or less per second during our tests but with number of threads (users) 50 , using jmeter for our tests
we didnt test anything on local at the moment, we will take a look to double check
also we are getting this error during our tests (not in all requests):
Status(StatusCode="Unavailable", Detail="GOAWAY received", DebugException="Grpc.Core.Internal.CoreErrorDetailException: {"created":"@1614697987.477000000","description":"Error received from peer ipv4:172.217.17.42:443","file":"T:\src\github\grpc\workspace_csharp_ext_windows_x86\src\core\lib\surface\call.cc","file_line":1062,"grpc_message":"GOAWAY received","grpc_status":14}")
Hi all,
We are doing some tests and we have found that using "normal" httpclients the cpu usage and tcp are "stables":
using Google.Apis.Auth.OAuth2;
using Google.Apis.Util.Store;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Options;
using Newtonsoft.Json;
using Newtonsoft.Json.Serialization;
using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Security.Cryptography.X509Certificates;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
namespace Lidl.Sso.ReCaptcha
{
public class ReCaptchaValidator : IReCaptchaValidator
{
private readonly ILogger<ReCaptchaValidator> _logger;
private readonly ReCaptchaOptions _reCaptchaOptions;
private readonly string _host;
private readonly HttpClient _httpClient;
public ReCaptchaValidator(IOptionsMonitor<ReCaptchaOptions> optionsMonitor, ILogger<ReCaptchaValidator> logger, IConfiguration configuration, HttpClient client)
{
_logger = logger;
_reCaptchaOptions = optionsMonitor.CurrentValue;
_host = configuration.GetValue<string>("Host");
_httpClient = client;
}
public async Task<bool> IsValid(string token, string action, CancellationToken cancellationToken = default)
{
if (!_reCaptchaOptions.Enabled)
{
return true;
}
// Initialize request argument(s)
var createAssessmentRequest = new Request
{
Event = new Event
{
SiteKey = _reCaptchaOptions.SiteKey,
Token = token
}
};
try
{
var stringContent = new StringContent(JsonConvert.SerializeObject(createAssessmentRequest), Encoding.UTF8, "application/json");
// Make the request
_httpClient.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", "");
var response = await _httpClient.PostAsync($"https://recaptchaenterprise.googleapis.com/v1/projects/{_reCaptchaOptions.ProjectId}/assessments", stringContent, cancellationToken);
if (!response.IsSuccessStatusCode)
{
return false;
}
var json = await response.Content.ReadAsStringAsync();
var result = JsonConvert.DeserializeObject<Response>(json);
_logger.LogInformation($"ReCaptcha response {JsonConvert.SerializeObject(response)}");
return result != null
&& result.TokenProperties.Hostname.Equals(_host.Replace("https://", string.Empty))
&& result.TokenProperties.Valid
&& result.TokenProperties.Action.Equals(action)
&& result.RiskAnalysis?.Score >= _reCaptchaOptions.Threshold;
}
catch (Exception e)
{
_logger.LogError(e, "ReCaptcha: Error getting response");
return true;
}
}
}
class Event
{
[JsonProperty("token")]
public string Token { get; set; }
[JsonProperty("siteKey")]
public string SiteKey { get; set; }
}
class Request
{
[JsonProperty("event")]
public Event Event { get; set; }
}
class Response
{
public string Name { get; set; }
public Event Event { get; set; }
public TokenProperties TokenProperties { get; set; }
public Google.Cloud.RecaptchaEnterprise.V1.RiskAnalysis RiskAnalysis { get; set; }
}
public class TokenProperties
{
public bool Valid { get; set; }
public string InvalidReason { get; set; }
public string Hostname { get; set; }
public string Action { get; set; }
}
}
services.AddHttpClient<IReCaptchaValidator, ReCaptchaValidator>();
services.AddTransient<IReCaptchaValidator, ReCaptchaValidator>();
the only issue with this is the auth_token that we are not able to generate this, only with powershell with the gcloud console, is there any way to generate this token?
You can certainly use GoogleCredential
to generate auth tokens (and refresh them automatically). You'll need a dependency on Google.Apis.Auth
(which you'll already have transitively if you've still got Google.Cloud.RecaptchaEnterprise.V1
installed.
// In the constructor (assigning to a field)
ICredential credential = GoogleCredential.FromJson(_reCaptchaOptions.CredentialsJson);
// On each request:
string accessToken = await credential.GetAccessTokenForRequestAsync();
However, doing the serialization yourself like this has significant risks - the generated protobuf classes can sometimes have property names which may not deserialize directly using Json.NET. It would be somewhat better to just use the protobuf classes directly, e.g.:
var request = new Assessment
{
Event = new Event { ... }
};
var stringContent = new StringContent(request.ToString(), Encoding.UTF8, "application/json");
...
var result = Assessment.Parser.ParsJson(json);
(You can then get rid of your TokenProperties
, Event
, Request
and Response
classes.)
That still assumes the JSON used by the service exactly matches the proto3 representation. That's probably true - and it's a better approach than using Json.NET with the protobuf classes - but using gRPC would definitely be better. I'm hoping we can resolve why there's such a high usage level here.
Have you had any luck reproducing this outside Azure, ideally with just a console app firing requests?
Hi @jskeet
Thanks for the fast answers, about the
// In the constructor (assigning to a field)
ICredential credential = GoogleCredential.FromJson(_reCaptchaOptions.CredentialsJson);
// On each request:
string accessToken = await credential.GetAccessTokenForRequestAsync();
we are getting:
Error [string]:"invalid_scope"
ErrorDescription [string]:"Invalid OAuth scope or ID token audience provided."
I have also tried to put "cloud-platform" on the GetAccessTokenForRequestAsync but got a 401
Also thanks for pointing the issue with the json.
I will try to reproduce this on local asap
Ah, okay. Fortunately that's easy to fix. I'd hoped it would just work with a self-signed JWT, but I'm hopeful that's easy to fix:
ICredential credential = GoogleCredential
.FromJson(_reCaptchaOptions.CredentialsJson)
.CreateScoped(RecaptchaEnterpriseServiceClient.DefaultScopes);
Ah, okay. Fortunately that's easy to fix. I'd hoped it would just work with a self-signed JWT, but I'm hopeful that's easy to fix:
ICredential credential = GoogleCredential .FromJson(_reCaptchaOptions.CredentialsJson) .CreateScoped(RecaptchaEnterpriseServiceClient.DefaultScopes);
working! thanks!
Great! I'll keep this issue open though, as I still want to understand what's leading to the high TCP count. Hopefully this has at least unblocked you for now though.
@jskeet hi! we tested "localy" (only library, one endpoint on a rest api calling on the reCaptcha nuget) and see that the problem reproduces
Okay, thanks. That's good to know. Is that still with your full web app, or have you reproduced it in a minimal way I'd be able to test here? Not to worry if not - just thought I'd ask. Still hoping to get the help of the gRPC folks here.
I've managed to reproduce this to at least a reasonable extent with the following code. @Rassell could you have a look and see if this seems like a fair representation to you? It's quick and dirty, but does seem to provoke the library to create more TCP connections than I'd expect. Not into the hundreds, but definitely more than I'd normally expect. I may have inappropriate expectations about the number of concurrent calls a single gRPC network connection "wants" to handle before the underlying library decides to use a new connection.
(Unsurprisingly, under heavy traffic the servers start responding with GOAWAY messages, so there's only so far we can push this...)
using Google.Api.Gax.ResourceNames;
using Google.Cloud.RecaptchaEnterprise.V1;
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
namespace Issue5978
{
class Program
{
private const int Threads = 50;
private const int CallsPerThread = 1000;
private const int CallsPerLog = 10;
private static readonly TimeSpan DelayBetweenCalls = TimeSpan.FromMilliseconds(50);
private static readonly TimeSpan DelayAfterFailure = TimeSpan.FromSeconds(10);
private static RecaptchaEnterpriseServiceClient client;
private static CreateAssessmentRequest request;
static void Main(string[] args)
{
if (args.Length != 2)
{
Console.WriteLine("Arguments: <project ID> <site key>");
return;
}
// We're issuing the exact same request all the time.
request = new CreateAssessmentRequest
{
ParentAsProjectName = ProjectName.FromProject(args[0]),
Assessment = new Assessment
{
// Note: no token, deliberately to avoid us having to create them.
Event = new Event { SiteKey = args[1] }
}
};
client = RecaptchaEnterpriseServiceClient.Create();
var threads = Enumerable.Range(0, Threads).Select(_ => new Thread(MakeCalls)).ToList();
threads.ForEach(t => t.Start());
threads.ForEach(t => t.Join());
}
static void MakeCalls()
{
var threadId = Thread.CurrentThread.ManagedThreadId;
Stopwatch sw = new Stopwatch();
int failures = 0;
for (int i = 0; i < CallsPerThread; i++)
{
if (i % CallsPerLog == 0)
{
int latency = i > 0 ? (int) (sw.ElapsedMilliseconds / i) : 0;
Console.WriteLine($"Thread ID {threadId} making call {i}; average latency={latency}ms; failures {failures}");
}
sw.Start();
try
{
client.CreateAssessment(request);
sw.Stop();
}
catch (Exception ex)
{
sw.Stop();
Console.WriteLine($"Failure in thread ID {threadId}: {ex.Message}");
failures++;
Thread.Sleep(DelayAfterFailure);
}
Thread.Sleep(DelayBetweenCalls);
}
}
}
}
This looks interesting. Have you tried running with extra tracing on? How to get extra debug logs & traces: https://github.com/grpc/grpc/blob/master/TROUBLESHOOTING.md
I'd try running with GRPC_VERBOSITY=debug GRPC_TRACE=subchannel (which should show you TCP connections/disconnections). There might be other traces that might reveal something interesting - see the list here https://github.com/grpc/grpc/blob/master/doc/environment_variables.md
From the logs it looks like you're running a 32bit process on windows.
I've managed to reproduce this to at least a reasonable extent with the following code. @Rassell could you have a look and see if this seems like a fair representation to you? It's quick and dirty, but does seem to provoke the library to create more TCP connections than I'd expect. Not into the hundreds, but definitely more than I'd normally expect. I may have inappropriate expectations about the number of concurrent calls a single gRPC network connection "wants" to handle before the underlying library decides to use a new connection.
(Unsurprisingly, under heavy traffic the servers start responding with GOAWAY messages, so there's only so far we can push this...)
using Google.Api.Gax.ResourceNames; using Google.Cloud.RecaptchaEnterprise.V1; using System; using System.Diagnostics; using System.Linq; using System.Threading; namespace Issue5978 { class Program { private const int Threads = 50; private const int CallsPerThread = 1000; private const int CallsPerLog = 10; private static readonly TimeSpan DelayBetweenCalls = TimeSpan.FromMilliseconds(50); private static readonly TimeSpan DelayAfterFailure = TimeSpan.FromSeconds(10); private static RecaptchaEnterpriseServiceClient client; private static CreateAssessmentRequest request; static void Main(string[] args) { if (args.Length != 2) { Console.WriteLine("Arguments: <project ID> <site key>"); return; } // We're issuing the exact same request all the time. request = new CreateAssessmentRequest { ParentAsProjectName = ProjectName.FromProject(args[0]), Assessment = new Assessment { // Note: no token, deliberately to avoid us having to create them. Event = new Event { SiteKey = args[1] } } }; client = RecaptchaEnterpriseServiceClient.Create(); var threads = Enumerable.Range(0, Threads).Select(_ => new Thread(MakeCalls)).ToList(); threads.ForEach(t => t.Start()); threads.ForEach(t => t.Join()); } static void MakeCalls() { var threadId = Thread.CurrentThread.ManagedThreadId; Stopwatch sw = new Stopwatch(); int failures = 0; for (int i = 0; i < CallsPerThread; i++) { if (i % CallsPerLog == 0) { int latency = i > 0 ? (int) (sw.ElapsedMilliseconds / i) : 0; Console.WriteLine($"Thread ID {threadId} making call {i}; average latency={latency}ms; failures {failures}"); } sw.Start(); try { client.CreateAssessment(request); sw.Stop(); } catch (Exception ex) { sw.Stop(); Console.WriteLine($"Failure in thread ID {threadId}: {ex.Message}"); failures++; Thread.Sleep(DelayAfterFailure); } Thread.Sleep(DelayBetweenCalls); } } } }
Is this example using a grpc channel pool? How many distinct grpc channels get created under the hood probably has influence on how many TCP connections get created (the relationship between the two is not trivial as grpc has some logic for subchannel sharing, but there is some relationship nevertheless).
Does setting GRPC_VERBOSITY=debug GRPC_TRACE=subchannel
give any useful insights?
Is this example using a grpc channel pool?
It's using a single gRPC Channel
as far as the client library is concerned - we only create at most a single Grpc.Core.Channel
per client, and all the calls here are using a single client. What that Channel
does internally is an implementation detail we assume we can ignore - and it's the important implementation detail here, of course :)
Does setting GRPC_VERBOSITY=debug GRPC_TRACE=subchannel give any useful insights?
Will give it a try, although I expect there to be a lot of output to go through.
Well this is interesting... I've just tried it again, both with and without the extra logging - and the behavior has changed. I get a steady single network connection for gRPC, even increasing to 100 threads.
I wonder whether there's been a network configuration change that has fixed this. @Rassell are you able to retest this yourself relatively easily? It would be great to find out the issue has just been resolved for us :)
Hi @Rassell: are you able to retest this? I believe the problem has been resolved, probably due to a network reconfiguration somewhere. It would be good to hear that you're seeing the same change though.
Closing now as I believe it's fine, and we haven't heard any more information for 12 days.
Hi all,
We are integrating reCaptcha enterprise solution on our proyect and the service is crashing with the following when there is a high usage of the reCaptcha client:
And the following image is the tcp usage duing our performance tests on a 3xS2 webapp (Azure), you can see the difference between normal execution vs recatpcha gets enabled/working
This is our implementation of the code (following the recomendation of the documentation to use the same client, we have tried different ones but we still getting the same error)