googleapis / google-cloud-dotnet

Google Cloud Client Libraries for .NET
https://cloud.google.com/dotnet/docs/reference/
Apache License 2.0

Grpc.Core.RpcException: Status(StatusCode=Unauthenticated, Detail="Deadline Exceeded") ? #1903

Closed art-grig closed 6 years ago

art-grig commented 6 years ago

We are deploying an ASP.NET Core 2.0 project to App Engine flex (while the previous version is under load: about 200 req/sec), and in some cases we get "Grpc.Core.RpcException: Status(StatusCode=Unauthenticated, Detail="Deadline Exceeded")" while trying to run a query on Datastore with the standard Datastore client. Sometimes the errors go away and sometimes they come back; we're not sure whether it depends on the request rate. Here is our app.yaml:

env: flex
runtime: aspnetcore
resources:
 cpu: 16
 memory_gb: 14.4

automatic_scaling:
  min_num_instances: 8
  max_num_instances: 20
  cool_down_period_sec: 180
  cpu_utilization:
    target_utilization: 0.5 

And here is full stack trace:

Grpc.Core.RpcException: Status(StatusCode=Unauthenticated, Detail="Deadline Exceeded")
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification
at Grpc.Core.Internal.AsyncCall`2.UnaryCall
at Grpc.Core.DefaultCallInvoker.BlockingUnaryCall[TRequest,TResponse]
at Grpc.Core.Internal.InterceptingCallInvoker.BlockingUnaryCall[TRequest,TResponse]
at Google.Cloud.Datastore.V1.Datastore.DatastoreClient.RunQuery
at Google.Api.Gax.Grpc.ApiCall.<>c__DisplayClass0_0`2.<Create>b__1
at Google.Api.Gax.Grpc.ApiCallRetryExtensions.<>c__DisplayClass1_0`2.<WithRetry>b__0
at Google.Cloud.Datastore.V1.QueryStreamer.<Sync>d__7.MoveNext
at System.Collections.Generic.List`1.AddEnumerable
at System.Linq.Enumerable.ToList[TSource]
at Google.Cloud.Datastore.V1.LazyDatastoreQuery.GetAllResults

@vuukle @jskeet

jskeet commented 6 years ago

@jtattermusch: Does this sound like it's anything to do with gRPC, or might it be from Datastore itself? It does seem like an odd combination of error status/message.

evildour commented 6 years ago

@jskeet, we're seeing something similar on a Jenkins slave running endurance tests for the BigtableClient: Status(StatusCode=Unavailable, Detail="Deadline Exceeded")

I haven't tried reproducing it yet locally, so I can't give much more information, but it sounds like they may be related.

vuukle commented 6 years ago

Further to what @art-grig posted above: we deployed our service live (it receives approx. 200-250 req/sec) and, without making any changes, the above error keeps happening every few hours.

Is there any timeout on Datastore? Does it have any restriction on concurrent requests? Initially we thought this was due to our use of background tasks to fetch data from Datastore, but we removed those, re-deployed, and noticed the issue reappear after a few hours.

This is happening on our production systems; we'd appreciate any help.

vuukle commented 6 years ago

@jskeet any help on this would be much appreciated; we are receiving almost 100K errors every hour.

image

vuukle commented 6 years ago

According to the stack trace, this is where the exception happens:

    public IEnumerable<Entity> GetByProperties(params Tuple<string, object>[] props)
    {
        IEnumerable<Entity> res = null;
        if (res == null)
        {
            var q = new Query(Kind);
            q.Filter = ConstructFilter(props);

            res = _db.RunQuery(q).Entities?.AsEnumerable(); // <-- exception thrown here
        }
        return res;
    }

jskeet commented 6 years ago

@vuukle: I'll see if I get this when running my integration tests. Interesting that you're both seeing it with Datastore.

Does explicitly using a particular version of Grpc.Core help? It'll be using 1.7.1 by default, but you could add a dependency to get 1.9.0. At the moment I have very little idea why this is happening, but I'll ask Datastore folks internally too.
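
Concretely, that's just a direct package reference in your csproj alongside the existing ones:

    <PackageReference Include="Grpc.Core" Version="1.9.0" />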

jskeet commented 6 years ago

(Having seen again that @evildour saw this with Bigtable, it sounds like it's more likely to be a gRPC issue than a Datastore issue. Too early to say for sure though.)

vuukle commented 6 years ago

@jskeet I'll try your suggestion. Also, I have noticed that sometimes on re-deploy the error vanishes for X hours.

jskeet commented 6 years ago

@vuukle That's good information, thanks. For deployments that are failing, do you have any idea what the failure rate is compared with the success rate? (We can see that there are lots of failures, but we don't know whether that's 1%, 50% or 100% of the calls you're making...)

vuukle commented 6 years ago

@jskeet hope this helps

image

vuukle commented 6 years ago

For example, I just did a re-deploy a few minutes ago and the errors have stopped; I'm sure if I do another re-deploy it will cause the errors to re-appear.

vuukle commented 6 years ago

@jskeet

I just added Grpc.Core 1.9.0 and deployed; now I am getting a ton of these:

System.InvalidOperationException: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
at Google.Apis.Auth.OAuth2.DefaultCredentialProvider.d__11.MoveNext

jskeet commented 6 years ago

Yeesh - I'm really sorry about that. I don't understand why changing to gRPC 1.9.0 would cause problems with that at all.

So just to get the timeline with what you're seeing:

Is that accurate?

Can you let me know how many servers you're deploying to? If you've got 6 instances, your earlier graph would suggest that one of them may be failing all requests.

vuukle commented 6 years ago

Yes, and on re-deploying two times the Application Default Credentials are gone.

I also increased the instances to 10

env: flex
runtime: aspnetcore
resources:
  cpu: 16
  memory_gb: 14.4

automatic_scaling:
  min_num_instances: 10
  max_num_instances: 20
  cool_down_period_sec: 180
  cpu_utilization:
    target_utilization: 0.5

vuukle commented 6 years ago

Also, upgrading to 1.9.0 has not helped; with the latest deploy the "Deadline Exceeded" error is back.

image

jskeet commented 6 years ago

Okay, thanks for the details. This is very bizarre :( Does it seem to affect all operations? I'll try to reproduce with an application that simply fetches a Datastore entity, if it looks like that should fail.

jskeet commented 6 years ago

Can I just check - when using Grpc.Core 1.9.0, you've still got an explicit dependency on Google.Apis.Auth 1.32.2, yes?

vuukle commented 6 years ago

@jskeet image

For now we are only doing Datastore read operations. Could you please simulate it with 200 concurrent users per second for a long period of time? We tried loader.io for testing and it was fine, but when we rolled it out to our production environment it throws the error above.

I am also on Skype (ravikan2) if you need to see the errors on a screenshare.

jskeet commented 6 years ago

When you were using loader.io, how did you specify credentials? If this is all about authentication, it could well be that it makes a huge difference whether it's running on GCP or not. I wouldn't expect the load to be relevant to this, as once the credentials are established, it should be fine. (Not having used loader.io myself, I'm not quite sure what you're testing or how.)

Are you able to deploy your app to a second AppEngine service (e.g. just with one instance) where you can hit it manually? That would isolate whether or not it's load-related, as well as making it more convenient to experiment - while I try to reproduce it at the same time, of course.

vuukle commented 6 years ago

It's a public API - https://api.vuukle.com/api/v1/emote_ratings/getEmoteRatings?host=nbc-2.com&articleId=37535211 . I simply added the above API in loader.io to simulate up to 1000 concurrent connections per second for 5 minutes, and everything seemed fine; there were timeouts, but only about 0.1%.

(By the way, I am not getting the authentication error anymore.)

The only error I get now is RpcException: Status(StatusCode=DeadlineExceeded, Detail="Deadline Exceeded") at Grpc.Core.Internal.AsyncCall`2.UnaryCall

In order to do a proper test we have added our API call as a hidden request to see if it will work in production. As you can see, the above would give you a 200 OK response when hitting it manually, but if I deploy it to just one instance I am afraid it might crash because of the current load I mentioned (200 req/sec).

jskeet commented 6 years ago

I was suggesting deploying as a separate service - so you wouldn't be getting 200 qps, you'd only be getting whatever traffic you explicitly sent to it.

I'm not quite sure what you mean about adding an API call "as a hidden request". I'm still clearing my backlog from a week's vacation, but should be able to try this soon. I'd really like to isolate the load testing part from anything else though - it's particularly tricky as there are the two different errors as well. I'd expect the authentication error to be permanently fatal for any service, as that's basically due to default credentials not being available at all - something I thought we'd fixed in Google.Apis.Auth 1.32.2. If you still saw it with that dependency, that's worrying, and I'd definitely like to hear about it if you see it again.

(As a complete aside, it's nice to see the Noda Time dependency :)

jskeet commented 6 years ago

Oh, hang on, I've just noticed - that's a different error. So we've now seen three errors reported in this issue:

- Status(StatusCode=Unauthenticated, Detail="Deadline Exceeded")
- System.InvalidOperationException: The Application Default Credentials are not available...
- Status(StatusCode=DeadlineExceeded, Detail="Deadline Exceeded")

Can I check - are you now only seeing the last of these, and is it at the same rate as you had before you updated to gRPC 1.9.0?

I wonder whether the earlier status code was misleading, and it really is "just" a deadline being exceeded. At that point, it could potentially be due to the traffic you're putting through a single channel. Different operations would report that at different times due to having different deadlines - but the initial deadline for RunQuery is a minute, which is a pretty long time.

Do you have any information about the timing of the failed calls? If they're actually failing after a minute (or the associated deadline for the operation that's failing in each case) then that would lead us in one direction - if they're failing much earlier, it's probably a different deadline (e.g. one associated with auth instead).

jskeet commented 6 years ago

One final question for now - @art-grig and @vuukle, are you both working on the same project? If this is happening in completely different projects, that would be good to know, as it suggests a more varied set of conditions under which it can occur.

vuukle commented 6 years ago

@jskeet same project, will answer your earlier question shortly

vuukle commented 6 years ago

@jskeet I did a redeploy and changed the config to 32 CPUs and 4 instances this time, and the error rate is 100%

image

vuukle commented 6 years ago

I also get 502 Bad Gateway; not sure if this information helps.

image

jskeet commented 6 years ago

No idea where the bad gateway error would come from, I'm afraid - but looking in the AppEngine logs might help diagnose that.

If fewer instances causes the problem to get much worse, that's suggestive of being channel-bound as I mentioned before as a possibility. That would only be seen under load - but I'd hope it should be fairly easily reproducible.

Note that just using a different DatastoreClient (or DatastoreDb) wouldn't help with that as they use a single pooled channel by default.

This should be reasonably straightforward for me to reproduce, I think - simply by running a lot of queries in parallel. If this is the problem, a workaround should be relatively straightforward, but we'll need a bit more thought to work out a long term solution.
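
Something like this, perhaps - a sketch only, with a placeholder project ID and kind; every query here goes through the single default pooled channel:

using System.Linq;
using System.Threading.Tasks;
using Google.Cloud.Datastore.V1;

public static class ChannelLoadRepro
{
    // Fires many concurrent queries through the default (single, pooled) channel
    // to see whether DeadlineExceeded/Unavailable errors appear under load.
    public static Task RunAsync()
    {
        var db = DatastoreDb.Create("my-project");       // placeholder project ID
        var query = new Query("MyKind") { Limit = 10 };  // placeholder kind
        var tasks = Enumerable.Range(0, 200).Select(async _ =>
        {
            for (int i = 0; i < 100; i++)
            {
                await db.RunQueryAsync(query);
            }
        });
        return Task.WhenAll(tasks);
    }
}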

vuukle commented 6 years ago

@jskeet Increasing the number of instances helps somewhat, but the only two errors happening right at this moment are:

Grpc.Core.RpcException: Status(StatusCode=Unavailable, Detail="Deadline Exceeded") OR Grpc.Core.RpcException: Status(StatusCode=DeadlineExceeded, Detail="Deadline Exceeded")

Does it make any difference if the Datastore table we are querying has 10 million rows?

Please let us know if you are able to reproduce the error under load.

jskeet commented 6 years ago

Okay, so you're still seeing both errors? That's good to know (though saddening, of course).

In terms of the Datastore table size - that depends on whether it's making the queries slow and/or returning a lot of data. (The more data that's being returned, the more the channel will be a bottleneck, I suspect. Will need to verify that though.)

Do you have any information about how long the failed requests are taking?

vuukle commented 6 years ago

@jskeet Can you guide me on how to check the duration of failed requests, please?

jskeet commented 6 years ago

If you go into the AppEngine dashboard, select "latency" and you should be able to see the 50th/95th/99th percentile for all calls - I'd expect that to give a good first pass.

If you have Stackdriver Trace support enabled, that would probably help too - or add your own timing and logging around the problematic call. (That's what I'm going to do while I'm trying to reproduce this.)
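
For the manual approach, a rough wrapper like this would do it (just a sketch - substitute your own logging):

using System;
using System.Diagnostics;
using Google.Cloud.Datastore.V1;
using Grpc.Core;

public static class TimedQueries
{
    // Runs the query and logs the elapsed time, so failures near the 60-second
    // default deadline can be told apart from much earlier failures (e.g. auth).
    public static DatastoreQueryResults RunQueryTimed(DatastoreDb db, Query query)
    {
        var stopwatch = Stopwatch.StartNew();
        try
        {
            var results = db.RunQuery(query);
            Console.WriteLine($"RunQuery succeeded after {stopwatch.Elapsed}");
            return results;
        }
        catch (RpcException ex)
        {
            Console.WriteLine($"RunQuery failed after {stopwatch.Elapsed}: {ex.Status}");
            throw;
        }
    }
}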

vuukle commented 6 years ago

image

jskeet commented 6 years ago

Okay, so 5% of your calls are taking at least 7 minutes and 1% of them are taking 15 minutes.

Would you expect these calls to take a long time? Are you making many Datastore calls within each of your requests?

vuukle commented 6 years ago

@jskeet Yes, we are making about 3-4 calls to Datastore in each HTTP request. Would that be an issue? They are mostly just fetching a few records in each Datastore request.

jskeet commented 6 years ago

Making 3 or 4 calls per request should be fine - but 5% of requests taking 7 minutes is pretty surprising, unless you're expecting some of them to be really slow. If you look at the same graph but include 5:05 when all the requests were failing, does the latency graph look much worse then?

vuukle commented 6 years ago

image

vuukle commented 6 years ago

@jskeet We are using memcache to store the results, but there is a good chance that a lot of new requests will skip memcache and go directly to Datastore. Is there a limit on the number of concurrent connections to Datastore?

Other than reading from Datastore, I don't expect these requests to be slow.

jskeet commented 6 years ago

Unless you're creating new channels explicitly, you'll be using a single channel to Datastore per instance, as that's what the channel pooling does.

If you'd like me to put together a sample for how you could use multiple channels, I'd be happy to do that - it's just a matter of whether I do that before or after working on the repro locally.

If you're able to create a separate service that skips memcache and has a simplified flow, just so that we can reproduce the problem, that may be simpler - otherwise I'm going to be asking you for an awful lot of details of exactly what you're doing (what your entities and queries look like etc).

vuukle commented 6 years ago

@jskeet

  1. Please let us know how to create a multi-channel setup, to see if that helps.
  2. I can make the service skip memcache and do a simple GET request - it should take me a few minutes to do this; will update you ASAP.

jskeet commented 6 years ago

Just to be clear, I'm not suggesting you make any change to any production service - only a test one for the purposes of diagnostics.

I'll work up a sample for multi-channel clients. I assume you're using DatastoreDb rather than just DatastoreClient - do you use the same database name for all requests?

vuukle commented 6 years ago

@jskeet Yes, i understood that part :)

Yes we are using protected readonly DatastoreDb _db;

vuukle commented 6 years ago

@jskeet Adding this has stopped the errors:

    _db.RunQuery(q, null,
        Google.Api.Gax.Grpc.CallSettings.FromCallTiming(
            Google.Api.Gax.Grpc.CallTiming.FromTimeout(TimeSpan.FromMinutes(5))))
        .Entities?.AsEnumerable();

Does this information help?

vuukle commented 6 years ago

As you can see, the moment I did the deploy with the above change, the errors vanished.

image

jskeet commented 6 years ago

That's very interesting. The default settings for RunQuery are:

CallSettings.FromCallTiming(
    CallTiming.FromRetry(new RetrySettings(
        retryBackoff: GetDefaultRetryBackoff(),
        timeoutBackoff: GetDefaultTimeoutBackoff(),
        totalExpiration: Expiration.FromTimeout(TimeSpan.FromMilliseconds(600000)),
        retryFilter: IdempotentRetryFilter
    )));

Various bits of that: the totalExpiration of 600,000 ms means a call (including retries) can take up to 10 minutes overall, while timeoutBackoff controls the deadline for each individual attempt, which for RunQuery starts at one minute and increases on retries.

So my guess (and at the moment it is just a guess) is that your requests are taking more than a minute - the failures would be due to the queries being retried, but each attempt still taking more than a minute.

I'm still going to work up the multi-channel version, as if the problem is the channel being too busy, that may well improve your latency.

If you run the same Datastore query on an otherwise idle machine, does it take long?
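
If it's useful, a user-side version of those settings with a longer per-attempt deadline would look something like this - just a sketch, with illustrative backoff values rather than the library defaults, and using FilterForStatusCodes where the generated code above uses IdempotentRetryFilter:

using System;
using Google.Api.Gax;
using Google.Api.Gax.Grpc;
using Grpc.Core;

// Sketch: retry Unavailable/DeadlineExceeded, give each attempt 2 minutes
// initially (doubling, capped at 4), within a 10-minute overall budget.
var callSettings = CallSettings.FromCallTiming(
    CallTiming.FromRetry(new RetrySettings(
        retryBackoff: new BackoffSettings(
            delay: TimeSpan.FromMilliseconds(100),
            maxDelay: TimeSpan.FromSeconds(5),
            delayMultiplier: 1.3),
        timeoutBackoff: new BackoffSettings(
            delay: TimeSpan.FromMinutes(2),
            maxDelay: TimeSpan.FromMinutes(4),
            delayMultiplier: 2.0),
        totalExpiration: Expiration.FromTimeout(TimeSpan.FromMinutes(10)),
        retryFilter: RetrySettings.FilterForStatusCodes(
            StatusCode.Unavailable, StatusCode.DeadlineExceeded))));

var results = _db.RunQuery(q, null, callSettings).Entities;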

vuukle commented 6 years ago

When we run the query from our local machines it never takes more than 60 seconds, but yes, we have seen the query sometimes take longer than average. Can large indexes cause a problem?

We have about 20 indexes, averaging about 1 GB per index.

jskeet commented 6 years ago

I'm afraid I don't know details about Datastore itself - I work on a lot of different APIs, so don't know very much about each of them. I just make sure you can get requests from C# code to the server...

Here's a small sample for you for the pooling. You'd create an instance on start-up, then keep that instance everywhere, calling var db = pool.GetNextInstance(); each time you wanted to make a request. If you have a chance to test that out and see if it improves your latency, that would be really interesting.

using Google.Apis.Auth.OAuth2;
using Google.Cloud.Datastore.V1;
using Grpc.Auth;
using Grpc.Core;
using System.Linq;

namespace DatastoreTest
{
    // Sample code only; needs more work for productionization.

    /// <summary>
    /// Pool of DatastoreDb instances which are returned in round-robin order (they're never
    /// "returned" to the pool). Assumptions: default credentials, a single DB, and no need to
    /// release channels.
    /// </summary>
    public class RoundRobinDatastoreDbPool
    {
        private readonly DatastoreDb[] _instances;
        private int _nextIndex = 0;
        private readonly object _lock = new object();

        public RoundRobinDatastoreDbPool(string projectId, string namespaceId, int count)
        {
            var credentials = GoogleCredential.GetApplicationDefault()
                .CreateScoped(DatastoreClient.DefaultScopes) // Not necessary as of Grpc.Core 1.8
                .ToChannelCredentials();
            string host = DatastoreClient.DefaultEndpoint.Host;
            int port = DatastoreClient.DefaultEndpoint.Port;
            _instances = Enumerable.Range(0, count)
                .Select(_ => new Channel(host, port, credentials))
                .Select(channel => DatastoreClient.Create(channel))
                .Select(client => DatastoreDb.Create(projectId, namespaceId, client))
                .ToArray();
        }

        public DatastoreDb GetNextInstance()
        {
            lock (_lock)
            {
                DatastoreDb instance = _instances[_nextIndex++];
                if (_nextIndex == _instances.Length)
                {
                    _nextIndex = 0;
                }
                return instance;
            }
        }
    }
}
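
Usage would be something like this (placeholder project ID, namespace and pool size):

// Once, at start-up (e.g. held in a singleton):
var pool = new RoundRobinDatastoreDbPool("my-project", "", count: 4);

// Then for each request:
var db = pool.GetNextInstance();
var results = db.RunQuery(new Query("MyKind"));
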
evildour commented 6 years ago

Not sure if it's important, but just so it doesn't go unnoticed: the original error was

Status(StatusCode=Unauthenticated, Detail="Deadline Exceeded")

But in this comment, after adding the dependency to Grpc.Core 1.9.0, @vuukle said they are now seeing

Status(StatusCode=Unavailable, Detail="Deadline Exceeded")

This is the error we're seeing with the Bigtable tests, which also have a dependency on 1.9.0. Those tests round-robin through about 15 BigtableClient instances, each with their own manually created Channel. For the Bigtable settings, the initial and max timeouts are 20 seconds each. Also, the error happens very rarely. It hasn't happened in the last 20 runs of the test.

vuukle commented 6 years ago

After increasing the timeout I didn't get any errors until I made another deploy, but this time the error was different:

Grpc.Core.RpcException: Status(StatusCode=Unavailable, Detail="Getting metadata from plugin failed with error: Exception occurred in metadata credentials plugin.")

at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification
at Grpc.Core.Internal.AsyncCall`2.UnaryCall
at Grpc.Core.DefaultCallInvoker.BlockingUnaryCall[TRequest,TResponse]
at Grpc.Core.Internal.InterceptingCallInvoker.BlockingUnaryCall[TRequest,TResponse]
at Google.Cloud.Datastore.V1.Datastore.DatastoreClient.RunQuery
at Google.Api.Gax.Grpc.ApiCall.<>c__DisplayClass0_0`2.<Create>b__1
at Google.Api.Gax.Grpc.ApiCallRetryExtensions.<>c__DisplayClass1_0`2.<WithRetry>b__0
at Google.Cloud.Datastore.V1.QueryStreamer.<Sync>d__7.MoveNext
at System.Collections.Generic.List`1.AddEnumerable
at System.Linq.Enumerable.ToList[TSource]
at Google.Cloud.Datastore.V1.LazyDatastoreQuery.GetAllResults

jskeet commented 6 years ago

@evildour: This comment suggests they're now seeing both :(

And now yet another different one... it's very hard to work out what's going on with so many different problems.

@vuukle: Has that instance failed consistently? It would be really good to know which failure mode we're looking at with each of these errors, between:

jskeet commented 6 years ago

@vuukle: Was there an InnerException in the RpcException at all, when you saw the "Getting metadata from plugin failed" error? I'd expect some more information than that.
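
If it's not in your logs already, wrapping the failing call along these lines should surface the full chain (a sketch, using the _db/q names from your earlier snippet):

try
{
    var entities = _db.RunQuery(q).Entities;
}
catch (Grpc.Core.RpcException ex)
{
    // Exception.ToString() includes any InnerException, e.g. the underlying
    // failure from the metadata credentials plugin.
    Console.WriteLine(ex.ToString());
    throw;
}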