googleapis / nodejs-datastore

Node.js client for Google Cloud Datastore: a highly-scalable NoSQL database for your web and mobile applications.
https://cloud.google.com/datastore/
Apache License 2.0

Improve cold start of Cloud Datastore for Cloud Functions #9

Closed: stephenplusplus closed this issue 4 years ago

stephenplusplus commented 6 years ago
Copied from original issue: https://github.com/GoogleCloudPlatform/google-cloud-node/issues/2374

@richardowright
June 10, 2017 1:12 PM

Environment details

Steps to reproduce

I experience high latency (~1 to 2 seconds) with pretty much every action.

Simple example (run through Babel prior to deploy):

// Assumes `const Datastore = require('@google-cloud/datastore');` and a
// `projectId` constant in the surrounding module (both elided in the original).
static async addPerson() {
    try {
      const datastore = Datastore({
        projectId: projectId
      });
      // Incomplete key: Datastore allocates the numeric id on save.
      const key = datastore.key('person');
      const person = {
        key: key,
        data: [
          { name: 'last_name', value: 'Wright' },
          { name: 'last_name_UPPER', value: 'WRIGHT' },
          { name: 'first_name', value: 'Richard' },
          { name: 'first_name_UPPER', value: 'RICHARD' },
          { name: 'email', value: 'mygmail@gmail.com' },
          { name: 'address_street', value: 'My Place', excludeFromIndexes: true },
          { name: 'address_city', value: 'City' },
          { name: 'address_state', value: 'State' },
          { name: 'address_zip', value: '12345' },
          { name: 'phone', value: '123.456.7890' },
          // Leading zeros removed: `02`/`03` read as octal-style literals.
          { name: 'date_of_birth', value: new Date(1901, 2, 3) },
          { name: 'create_time', value: new Date(), excludeFromIndexes: true }
        ]
      };

      const saveResponse = await datastore.save(person);

      // The allocated id lives on the mutation result's key path.
      const person_id = saveResponse[0].mutationResults[0].key.path[0].id;
      return person_id;
    } catch (err) {
      console.log(err);
      return;
    }
  }
stephenplusplus commented 6 years ago
June 10, 2017 1:26 PM

Can you see if subsequent calls also have a lag? The first request has to sort out the authentication details, so it's known to be slower.
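
A minimal sketch for timing this, assuming a placeholder projectId (the key and labels are arbitrary):

// Hedged sketch: time a first call against warm ones with a reused client.
// 'my-project' and the key are placeholders.
const datastore = require('@google-cloud/datastore')({ projectId: 'my-project' });

async function timedLookup(label) {
  console.time(label);
  await datastore.get(datastore.key(['person', 12345])); // any cheap call exercises auth + transport
  console.timeEnd(label);
}

timedLookup('first')
  .then(() => timedLookup('second'))
  .then(() => timedLookup('third'));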

stephenplusplus commented 6 years ago

@richardowright
June 10, 2017 2:20 PM

If I call the Datastore API repeatedly within the same function invocation, I do see a reduction in subsequent call time:

db_add_person: 1409.076ms
5668753656250368
db_add_person: 363.180ms
5645784439586816
db_add_person: 235.747ms
5637489247125504

I've also discovered that using a key file (i.e. creating a service account and specifying keyFilename) decreases the first call time from ~1.4 s to ~0.85 s.

So it looks like your suggestion that it's related to first-request auth is correct. But since, when using Cloud Functions, the penalty is incurred on each HTTP invocation, the overhead is still very high (~0.8 s per HTTP call).
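
For reference, the key file mentioned above is just a constructor option; a sketch with placeholder values (not a suggestion to bundle keys with your source):

// Hedged sketch: explicit service-account credentials via a key file,
// which avoids part of the default-credentials lookup on the first request.
// projectId and keyFilename are placeholders.
const datastore = require('@google-cloud/datastore')({
  projectId: 'my-project',
  keyFilename: '/path/to/service-account-key.json'
});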

stephenplusplus commented 6 years ago
June 12, 2017 12:40 PM

since, when using Cloud Functions, the penalty is incurred on each HTTP invocation, the overhead is still very high (~0.8 s per HTTP call)

Yes, unfortunately. I'll close the issue, as we can't do anything within the client library. If any ideas come to mind for how this can be addressed, feel free to discuss.

stephenplusplus commented 6 years ago

@ro-savage
August 7, 2017 10:45 AM

@stephenplusplus We are running into the same issue. Datastore and Functions seem like the perfect combination, paying only a per-request cost. However, the latency is much too high to be usable in the real world.

As was mentioned, it is related to the auth.

This problem must be solvable, as other similar systems, such as Lambda + DynamoDB, don't run into this issue.

It seems unlikely the fix will be implemented in Node; more likely it's some setup on the function/lambda side that is 'pre-authenticated' (AWS IAM roles?).

Who would be the correct team/repo to work on this issue?

stephenplusplus commented 6 years ago

@reco
August 28, 2017 3:36 PM

@ro-savage We were/are using App Engine with Datastore. We also ran into big issues with Datastore latency. Would love to hear some updates from @google-admin about this issue.

stephenplusplus commented 6 years ago

@ahume
September 26, 2017 11:05 PM

A few weeks into development, we're finding Datastore not to be a viable solution for use in Functions, due to consistently poor performance on even very simple queries: 4000ms on the first (authenticating) request, and 800-1200ms afterwards.

stephenplusplus commented 6 years ago

@ignaciovazquez
October 3, 2017 4:44 AM

We are also experiencing 60ms-200ms latencies on a lookup by key (runQuery is even worse) from GAE Flexible in Node.
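
For clarity, the two call shapes being compared, as a sketch (the datastore client, kind, and id are placeholders):

// Lookup by key (placeholder kind/id).
const key = datastore.key(['Person', 12345]);
datastore.get(key).then(([entity]) => { /* single-entity lookup */ });

// runQuery (placeholder query).
const query = datastore.createQuery('Person').limit(10);
datastore.runQuery(query).then(([entities]) => { /* query results */ });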

stephenplusplus commented 6 years ago
November 6, 2017 5:07 PM

Also reported in #2727

@lukesneeringer this seems to be a common problem. The most recent report shows 10 seconds spent on a query that should take tens of milliseconds. Worth considering: our Datastore API uses the REST endpoint (correction: our Datastore API uses common-grpc / manual proto loading with gRPC). There is a GAPIC layer that we will likely convert to during our repo extraction.

stephenplusplus commented 6 years ago

@EJB3927
November 6, 2017 5:08 PM

Thank you @stephenplusplus. I can confirm I'm having this issue with the following environment:

Additionally, I'm noticing the issue on Cloud Functions with the same Node and npm versions.

stephenplusplus commented 6 years ago

@lukesneeringer
November 6, 2017 9:18 PM

If this is something that is fixed by switching to the gRPC transport we use elsewhere, then this should be a reasonably quick fix (a week or so).

stephenplusplus commented 6 years ago

@ChrisBartonLC
November 8, 2017 4:47 PM

We also see the same high latency issues mentioned by @ahume with Cloud Functions and Datastore Node.js client v1.1.0.

stephenplusplus commented 6 years ago

@ChrisBartonLC
November 8, 2017 4:57 PM

I guess if you convert to gRPC then when used in a CloudFunction the Datastore comms will start having this problem like Spanner and PubSub #2438. If that happens then at least the problem will be more visible.

stephenplusplus commented 6 years ago

@ChrisBartonLC
December 7, 2017 2:25 AM

I get a significant reduction in latency in calls from Cloud Functions to Datastore by embedding a service account. For warm functions, latency was previously several seconds (mostly around 2-3 seconds, sometimes much higher) and is now mostly ~1 second with the occasional 2 seconds. These calls are a mixture of CRUD operations.

FYI, I am using App Engine cron to trigger a job every minute that hits all functions several times, in order to keep them warm. We only have indexes on a small subset of attributes in the Datastore entities.
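
A sketch of such a cron config; the /warm URL and the handler that fans requests out to the functions are hypothetical:

# cron.yaml (sketch): hit a hypothetical /warm handler every minute.
cron:
- description: keep the functions warm
  url: /warm
  schedule: every 1 minutes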

We have around 60 functions in all. Every function has a health mode where it immediately returns instead of calling Datastore; latency in this mode is around 200-400ms, with the occasional 500-600ms.

Old code: client = { datastore: require('@google-cloud/datastore')() }

New code: client = { datastore: require('@google-cloud/datastore')({ credentials: service_account }) }

The service account is encrypted with GCP KMS, so we load and cache it before the function body runs, basically wrapping the entry point in a promise.
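
A sketch of that pattern: decryptServiceAccount() is a hypothetical helper around the KMS decrypt API, and the point is that the decryption promise is created once at module load and reused across warm invocations:

// Hedged sketch: decrypt and cache the service account before handling requests.
// decryptServiceAccount() is a hypothetical KMS wrapper, not a real API.
const credentialsPromise = decryptServiceAccount('service-account.json.enc')
  .then(JSON.parse);

let datastore; // cached client, reused while the function instance stays warm

exports.entryPoint = (req, res) => {
  credentialsPromise
    .then((service_account) => {
      if (!datastore) {
        datastore = require('@google-cloud/datastore')({ credentials: service_account });
      }
      // ... normal request handling using `datastore` ...
      res.status(200).send('ok');
    })
    .catch((err) => res.status(500).send(err.message));
};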

stephenplusplus commented 6 years ago

@ChrisBartonLC
December 7, 2017 2:35 AM

Does anyone know what the expected latency of Datastore is? We are considering moving to Go and App Engine (which we used before, with AWS and DynamoDB); or will that have the same issues? Another person in this thread (@ignaciovazquez) mentioned that App Engine and Datastore also have issues, but perhaps that is just the Node.js library? Would it be better for us to look at CloudSQL if we want low latency?

stephenplusplus commented 6 years ago

@ChrisBartonLC
December 10, 2017 4:10 AM

Today I deployed our code to App Engine Flex using the same Node.js version used by GCP Functions. In both cases (same code) auth is an embedded service account decrypted (and cached) using GCP KMS.

Pings average around 250ms, compared to 350ms in Functions. However, the latency is more erratic from Functions, with a share of responses falling into the 500-1000ms window. So when not calling out to other integrations, Functions are only slightly slower, but more erratic.

When the code also reads from Datastore, there is a more pronounced difference in mean latency: App Engine 589ms versus Functions 924ms. Function latency is also much more erratic, e.g. a share of requests >> 1000ms.

Data is attached.

perf-appengine.txt perf-fn.txt

stephenplusplus commented 6 years ago

@ro-savage
December 10, 2017 6:15 AM

Just a side note to maintainers / Google: we raised this issue in a few places, including in person at Google Sydney, and were pretty much told there is no solution and no ETA.

We ended up leaving Google Cloud and moving to AWS. Not only was it much faster, but it also costs significantly less, because all our functions run in under 100ms on AWS with Dynamo, compared to multiple hundreds of milliseconds (sometimes 1-2s) on Google Cloud.

This issue is one of the things that kills the ability to use Cloud Functions and Datastore together in a production app.

stephenplusplus commented 6 years ago

@ChrisBartonLC
December 10, 2017 6:07 PM

Thanks @ro-savage, I was coming to the same conclusion, and it is not just Functions: latency when using App Engine Flex and Node.js (but not Go) is not that great either.

To be fair, Datastore and DynamoDB are not drop-in replacements. Datastore has transactions, which are occasionally very useful and help keep the code clean, because you do not need to cope with inconsistent data where that can happen. DynamoDB is very fast, since it's basically key-value, and has proper support for updating documents, unlike Datastore, where an update replaces the entire entity from the developer's point of view. On the other hand, Datastore supports many more query indexes. I have used both extensively and would prefer Datastore if it were faster.

Lambda is of course far ahead of Functions, which is unsurprising given the latter is beta. AWS support is incredible; I get AWS engineers phoning me to ask how they can help out. Firebase hosting and auth etc. are great though, much better than CloudFront, S3 and Cognito. So both clouds have strengths: AWS is flexible, more work, and fast; GCP is easy to use, less ops overhead, and fast for most things too.

stephenplusplus commented 6 years ago

@ChrisBartonLC
December 11, 2017 3:54 AM

I just did the same tests, but using App Engine Flex with Go 1.9.2 instead of Node.js (App Engine Flex and Functions): latency for pings averages around 240ms, and reads 390ms. Both ping and read latency are much less erratic compared to Node.js (App Engine Flex and Functions). We are on the other side of the world from the infra (but still on fiber), so it's the delta between the ping and the read that is important, not the absolute number. So we will stick with Datastore, but just use Go and App Engine Flex.

FYI, when using Node.js it is easy to support both Functions and Node.js App Engine Flex as deployment targets; if it weren't, I wouldn't have done this testing in the first place. GCP NoOps is very addictive...

Data is attached. perf-appengine-go.txt

stephenplusplus commented 6 years ago
December 11, 2017 8:33 PM

I'm marking this as blocked as there are 2 places I can see that could be responsible:

1) Cloud Functions. Something in the environment doesn't interact quickly with something in the call stack. I would guess the auth layer (google-auth-library) or the transport layer (grpc).

2) Our library needs to use the GAPIC API. Currently, our library communicates with Datastore through the transport layer, gRPC, directly. gRPC exposes many knobs for configuration, but we are using the defaults that it provides. The GAPIC API stands between us and gRPC and changes some of those settings.

We are almost switched over to the GAPIC API. We are working on the final steps of splitting the Datastore code out from this repo and into its own. After that, it would be great if anyone with patience remaining could test it out for us.

If it doesn't have any effect, we will need to make sure this feedback gets noticed by the Cloud Functions team and we start a proper search for the problem.

stephenplusplus commented 6 years ago

@ChrisBartonLC
December 11, 2017 8:38 PM

Happy to test when you're ready

stephenplusplus commented 6 years ago

@richardowright
December 11, 2017 8:44 PM

Thanks. A few things I've found that might be relevant to help diagnose:

stephenplusplus commented 6 years ago

@ChrisBartonLC
December 11, 2017 9:01 PM

We use --memory=1024MB in order to get a faster machine (as recommended in the docs). We haven't tried any higher memory settings.

We see similar behaviour to @richardowright, and it's important to note that mean latency is not the only problem: some requests can still take multiple seconds even when auth is hardcoded to give a lower mean latency. Similar behaviour occurs on App Engine Flex with Node.js, i.e. some requests take > 1 sec even though the mean for reads is lower, ~600ms.

stephenplusplus commented 6 years ago

@ChrisBartonLC
December 14, 2017 2:06 AM

I just ran the same test against DynamoDB using a 1024MB Lambda behind API Gateway, using Go and eawsy's aws-lambda-go-shim; latency was ~490ms and pretty stable. This is all very rough and ready, but it seems there is no big difference between Go/App Engine/Datastore and Go/API Gateway/Lambda/DynamoDB, so once Cloud Functions gets Go support I will be happy :)

perf-lambda-go.txt

stephenplusplus commented 6 years ago

Hey all, we have just released the new Datastore using GAPIC. Please test it out!

lc-chrisbarton commented 6 years ago

OK, still using a cached service account (will try without tomorrow), I tested a set of 20 CRUD functions using v1.2 of the Datastore client.

TLDR: performance is similar to Node.js on App Engine Flex, but Go on App Engine is still quicker and more consistent.

1) Ran Postman integration tests 3 times against our current functions, which use v1.1 of the Datastore client.
2) Deleted all our current functions.
3) Updated the package.json (v1.2 of the Datastore client), re-generated the package-lock.json, and committed these two files.
4) The new commit fired off a build which deployed all functions.
5) Ran Postman integration tests 3 times against the new functions, which use v1.2 of the Datastore client.

See the attached Postman files for the raw data before and after; rename them to .json to load into Postman. 'totalRequestTime' stores the latency. BTW, ignore 'config/country' and 'documents'; they call out to geonames or generate PDFs. And obviously ignore calls to 'https://postman-echo.com/delay/10', which are used when necessary to cope with eventual consistency.

lc-api-customer.postman_test_run_21_07_WithDatastore1_2.txt lc-api-customer.postman_test_run_21_05_WithDatastore1_2.txt lc-api-customer.postman_test_run_21_03_WithDatastore1_2.txt lc-api-customer.postman_test_run_19_54_WithDatastore1_1.txt lc-api-customer.postman_test_run_19_57_WithDatastore1_1.txt lc-api-customer.postman_test_run_19_59_WithDatastore1_1.txt

ghost commented 6 years ago

Thanks Stephen!

My initial tests with this look much better. Total call times are averaging about 140 ms for my cases.

lc-chrisbarton commented 6 years ago

Well, I did some more tests this morning, and although latency can be low, it is not consistently low. Running around 50 different CRUD operations via Postman: two thirds average 750ms (non-CRUD health checks are around 600ms, which confirms what @richardowright sees), but one third of the requests take 1000-1500ms, and a few even longer, e.g. one API took 5468ms during one run but 1176ms during another.

lc-chrisbarton commented 6 years ago

I guess this issue needs to change to a bug now, because this unresolved issue from July (#2489, #2728, #2438) is now frequently occurring with Datastore client 1.2:

{ insertId: "000000-153ac460-572b-465b-a85b-79e140656175"
labels: {…}
logName: "projects/xxx/logs/cloudfunctions.googleapis.com%2Fcloud-functions"
receiveTimestamp: "2017-12-16T20:20:38.383564358Z"
resource: {…}
severity: "ERROR"
textPayload: "{ Error: Stream removed at /user_code/node_modules/@google-cloud/datastore/node_modules/grpc/src/client.js:554:15 code: 2, metadata: Metadata { _internal_repr: {} }, note: 'Exception occurred in retry method that was not classified as transient' }"
timestamp: "2017-12-16T20:20:29.146Z"
}

I found this issue after doing 5 runs of our integration test suite, which comprises around 50 CRUD APIs.

FYI I mentioned this possibility above: "I guess if you convert to gRPC [sic should read GAPIC API] then when used in a CloudFunction the Datastore comms will start having this problem like Spanner and PubSub #2438. If that happens then at least the problem will be more visible."

References:
https://github.com/GoogleCloudPlatform/google-cloud-node/issues/2489
https://github.com/GoogleCloudPlatform/google-cloud-node/issues/2438
https://github.com/GoogleCloudPlatform/google-cloud-node/issues/2728

lc-chrisbarton commented 6 years ago

We have a separate repo with around 30 functions exposing around 100 CRUD APIs, and the Postman test suite is failing there too with the same error after the upgrade to v1.2:

{ insertId: "000000-51a5fbb5-105b-46d8-b47a-c84779872916"
labels: {…}
logName: "projects/xxx/logs/cloudfunctions.googleapis.com%2Fcloud-functions"
receiveTimestamp: "2017-12-16T20:55:19.488940692Z"
resource: {…}
severity: "ERROR"
textPayload: "{ Error: Stream removed at /user_code/node_modules/@google-cloud/datastore/node_modules/grpc/src/client.js:554:15 code: 2, metadata: Metadata { _internal_repr: {} }, note: 'Exception occurred in retry method that was not classified as transient' } '{"code":2,"metadata":{"_internal_repr":{}},"note":"Exception occurred in retry method that was not classified as transient"}'"
timestamp: "2017-12-16T20:55:10.179Z"
}

lc-chrisbarton commented 6 years ago

The fix is to pin to "1.1.0" of the Datastore client, just like we have to pin to "0.14.0" of the PubSub client; see https://github.com/GoogleCloudPlatform/google-cloud-node/issues/2728, though I believe this really dates from July and https://github.com/GoogleCloudPlatform/google-cloud-node/issues/2489.
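
The corresponding package.json pins would look like this (exact versions, no caret, so npm cannot float forward):

{
  "dependencies": {
    "@google-cloud/datastore": "1.1.0",
    "@google-cloud/pubsub": "0.14.0"
  }
}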

I believe the real fix for this issue is better testing around Cloud Functions integration, since we found this and other issues (Spanner and PubSub) pretty quickly with our integration tests. However, perhaps this is out of scope, since Cloud Functions is a beta product.

lc-chrisbarton commented 6 years ago

Also see this bug report 'Lots of "Could not load the default credentials" showing up' in https://groups.google.com/forum/#!forum/cloud-functions-beta-testers

lc-chrisbarton commented 6 years ago

Something also seems to have happened to the Firebase Admin API:

{ insertId: "000000-5f365da9-2d58-4ce6-9eef-8e6a3c259fe5"
labels: {…}
logName: "projects/xxx/logs/cloudfunctions.googleapis.com%2Fcloud-functions"
receiveTimestamp: "2017-12-18T03:54:35.709677164Z"
resource: {…}
severity: "ERROR"
textPayload: "{ Error: Stream removed at /user_code/node_modules/firebase-admin/node_modules/grpc/src/client.js:554:15 code: 2, metadata: Metadata { _internal_repr: {} }, note: 'Exception occurred in retry method that was not classified as transient' }"
timestamp: "2017-12-18T03:54:29.462Z"
}

lc-chrisbarton commented 6 years ago

https://github.com/firebase/firebase-admin-node/releases

Is this note from v5.4.3 relevant, since Firestore is Datastore under the hood?

"Fixed a regression in module loading that prevented using the Admin SDK in environments like AWS Lambda. This regression was introduced in the 5.4.0 release, which added a new dependency to Firestore and gRPC. This fix lazily loads Firestore and gRPC, thus enabling Admin SDK usage in the affected environments as long as no explicit attempts are made to use the Firestore API."

lazharichir commented 6 years ago

Facing the very same issue with Google Cloud Functions and Firestore... the first call takes several seconds, generally 3 to 6 seconds. Subsequent calls halve the latency. I have separate Functions talking to a MongoDB VM in Compute Engine, and the latency is < 1s, except on the first call, where it's around 3-5 seconds.

I've moved some functions to AWS Lambda with DynamoDB, where the latency is at most 1.5s, which is definitely doable in an end-user-facing situation... Sad.

lc-chrisbarton commented 6 years ago

It is not all doom and gloom for GCP though: we've moved from Cloud Functions to App Engine Flex with Go, and our APIs using Datastore are consistently fast with little variability:

→ Create Order POST https://cover.dev.localcover.cc/cover/v1/orders [201 Created, 460B, 383ms] ┌ │ 'received body', { ... } └

→ Read Cover GET https://cover.dev.localcover.cc/cover/v1/covers/3b...a772c [200 OK, 1.84KB, 359ms] ┌ │ 'received body', { ... } └

lc-chrisbarton commented 6 years ago

FYI, it's quite simple and quick to move a group of Cloud Functions to App Engine Flex while we wait for performance improvements in Cloud Functions; it's a beta product after all. Our tests show good performance after doing that (< 1s), though not as good as Go and with much more variability.

lazharichir commented 6 years ago

The problem is that I do not code in Go, and App Engine would defeat the point of serverless infrastructure. The main issue is not with Cloud Functions but rather with Firestore and Datastore. These two are slow; Cloud Functions is definitely fine, since it works perfectly with MongoDB instances.

michaelmior commented 6 years ago

Not sure if this is related, but I'm also seeing several seconds of added latency with a Dialogflow app using Google Cloud Functions and Firebase Realtime Database, on the order of 5+ seconds for a cold start.

lc-chrisbarton commented 6 years ago

Well, App Engine Flex is serverless insofar as the machines are managed by Google, so there is no infra to worry about, but yes, it is not truly abstracted away, because one needs to monitor how the scaling is working. It is also a lot more expensive. Serverless is definitely the way to go, especially now that AWS is beta testing low-latency connectivity to relational databases via Lambda.

I take your point about Mongo. My point about App Engine was really to show that there is nothing intrinsically slow about using Datastore; the issue is when it is used from a Function. There are also some performance issues when it is used from JavaScript, because even on App Engine the JS performance is 2x slower compared to Go. JS latency compared to Go is also very variable, which is just as important. I guess it is a combination of the client library and establishing the initial connection.

lazharichir commented 6 years ago

Yup, and for my ambitious project I am alone, developing dozens of microservices and doing sales and marketing, so caring about servers and infrastructure is a no-go. Hence the serverless and Cloud Functions requirement.

Will evaluate options and see what steps to take. Thanks @ChrisBartonLC for the input! App Engine will run my API gateway, so yes, love it!

lc-chrisbarton commented 6 years ago

One could argue that DynamoDB is serverless storage in the same way that App Engine is serverless compute, since you need to monitor how the scaling is working. Its new checkboxes for auto-scaling are similar to the options one provides in app.yaml. I know DDB well; we used to have 40 microservices each using it as their backend. Hopefully there is no architectural reason why the first connection to Datastore needs to be slow. If there is, it could be a long time before the issue gets solved, if it ever does.

lazharichir commented 6 years ago

@ChrisBartonLC I haven't yet properly tested DynamoDB or AWS Lambda to make a final call, and to be honest, a core service I need is Google Natural Language, and at that, Google is the best. So ideally I would love to keep it all in the same ecosystem, but the Google team is very silent about these latency issues, despite their being raised on the Cloud Functions Google group, the Firestore Google group, the Datastore Google group, Firebase groups, Stack Overflow, GitHub, etc.

It's not a good sign, unfortunately...

lc-chrisbarton commented 6 years ago

@lazharichir DDB is fast, supports proper updates, and has good querying capabilities, excellent filtering, and conditional updates (which are very useful). Datastore is not as fast, but fast enough for us; it doesn't support updates properly (you need to rewrite the entire row), and conditional updates require a transaction, which is basically slower than the DDB approach but easier to use. In Datastore you can query on as many indexes as you like, but filtering is more limited. However, the killer Datastore feature is transactions, e.g. create an order and issue a product as an atomic action (sketched below). When doing something similar in DDB, the code is more complicated, and one needs scheduled cleanup logic looking for orphans.
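
A sketch of the atomic create-order-and-issue-product pattern mentioned above (keys and entity data are placeholders):

// Hedged sketch: a Datastore transaction making two writes atomic.
const transaction = datastore.transaction();
transaction.run()
  .then(() => {
    transaction.save([
      { key: datastore.key(['Order']), data: { status: 'created' } },
      { key: datastore.key(['Product', 123]), data: { issued: true } }
    ]);
    return transaction.commit(); // both writes land, or neither does
  })
  .catch(() => transaction.rollback());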

We tried Spanner for a couple of months, and it is better than both DDB and Datastore IMO, but it is far too expensive to use for DEV/TEST (costs are fine for production); the real killer for us was the initial connection being too slow. Frankly, serverless Aurora on PostgreSQL is the perfect solution for us, but it is not available yet. Having said that, we would prefer Spanner if it were cheap to run in DEV/TEST and, like Datastore, the initial connection latency were fixed.

lazharichir commented 6 years ago

@ChrisBartonLC Yup, likewise... my ideal database is MongoDB in terms of querying, so I've got it for some key microservices, but I need something that requires zero maintenance and just works, even if the tradeoff is weaker querying abilities (e.g. Firestore, and Datastore, which I had before).

However, with such latency, the tradeoff becomes problematic for real-life scenarios in production... especially when the most demanding services work with MongoDB, do way more processing, yet return way faster than Firestore/Datastore.

Let's just hope the Google Cloud / Firebase teams fix these issues asap. I guess Functions is still beta, Firestore too, and I think Datastore may soon be replaced by Firestore for new users... so let's give it some time...

lc-chrisbarton commented 6 years ago

Yep, that's what we are planning, mainly because the other killer feature of GCP is that we can focus on the product much more instead of on config management. If anything, that seems to be the GCP philosophy, so I believe calling their services from a function will become their preferred method, and hence they will make it work at almost any cost.

charly37 commented 6 years ago

I face this from time to time: the response time will increase and even time out. Here is the code I use:

// Assumes a bunyan-style `log` and an `aUuid` variable in the surrounding scope;
// the 'XXXX...' values are placeholders from the original report.
var datastore = require('@google-cloud/datastore')({
    projectId: 'XXXXXXXXX',
    keyFilename: 'XXXXXXXXXX'
});

// The original snippet called resolve/reject without a surrounding Promise;
// wrapping it in `new Promise` restores the intended behaviour.
var aReturnPromise = new Promise(function (resolve, reject) {
  var query = datastore.createQuery('XXXXXXX');
  var caliQuery = query.filter('_XXXXXX', '=', XXXXXXX);
  log.info('Going to query Google DataStore Service with uuid: ', aUuid);
  datastore.runQuery(caliQuery)
    .then((results) => {
      log.info('findOne Google DataStore over', results, 'for uuid: ', aUuid);
      const tasks = results[0];
      if (tasks.length > 0) {
        log.info('resolve');
        resolve(tasks[0]);
      } else {
        log.info('We did not find the user in GDS');
        resolve(null);
      }
    })
    .catch(function (err) {
      log.error('Error querying Google DataStore Service. Error detail: ', err, ' for uuid: ', aUuid);
      reject(new Error('Error querying Google DataStore Service: ' + err.message));
    });
});

And a log example of a failure leading to a response timeout (RTO) after 10 min:

{"name":"XXXX","hostname":"XXXXXXX-1-fb1jk","pid":1,"level":30,"msg":"Going to querry Google DataStore Service for XXXXXXX with uuid:  6f894fff-7de6-4186-8752-d1315ff2123b","time":"2018-01-18T19:37:44.338Z","v":0}
...
{"name":"XXXXX","hostname":"XXXXXXX-1-fb1jk","pid":1,"level":50,"msg":"Error with Going to querry Google DataStore Service. Error detail:  Error: Retry total timeout exceeded before anyresponse was received\n    at repeat (/var/www/WebServerNodeJs/node_modules/google-gax/lib/api_callable.js:224:18)\n    at Timeout._onTimeout (/var/www/WebServerNodeJs/node_modules/google-gax/lib/api_callable.js:256:13)\n    at ontimeout (timers.js:386:11)\n    at tryOnTimeout (timers.js:250:5)\n    at Timer.listOnTimeout (timers.js:214:5)  for uuid:  6f894fff-7de6-4186-8752-d1315ff2123b","time":"2018-01-18T19:47:44.834Z","v":0}

So in the worst case it times out after 10 min. This happens randomly (the previous/next calls work fine). I will see if we can migrate to Firestore and hope it works better.

plaisted commented 6 years ago

To give a different perspective: I have a Cloud Function using the Node Datastore library. The function receives a Pub/Sub message, runs a Datastore query based on the message, creates some complex objects, and stores two entities in Datastore using a transaction. Typical completion times after "warmup", pasted from logs:

Function execution took 300 ms, finished with status: 'ok'
Function execution took 329 ms, finished with status: 'ok' 
Function execution took 495 ms, finished with status: 'ok'
Function execution took 245 ms, finished with status: 'ok' 
Function execution took 254 ms, finished with status: 'ok'
Function execution took 286 ms, finished with status: 'ok'
Function execution took 294 ms, finished with status: 'ok'
Function execution took 390 ms, finished with status: 'ok'

Not amazing response times; running a similar setup with Postgres / C# or Go I'd expect something well under 100 ms. But considering this is essentially ops-free, billed per use, and scales essentially endlessly, I'm pretty happy. I will note that creating the Datastore object tends to take a while, so it needs to be re-used (see the sketch below).
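
That reuse pattern, as a sketch: the client lives at module scope so warm invocations skip construction (the kind and handler body are placeholders):

// Hedged sketch: module-scope client, reused across warm invocations.
const datastore = require('@google-cloud/datastore')();

// Background function triggered by Pub/Sub (event, callback signature).
exports.processMessage = (event, callback) => {
  const query = datastore.createQuery('Task').limit(1); // placeholder query
  datastore.runQuery(query)
    .then(() => callback())
    .catch(callback);
};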

lc-chrisbarton commented 6 years ago

We do not have a problem with Datastore on App Engine using Go. Performance has improved over the 6 months this issue has been open (https://github.com/GoogleCloudPlatform/google-cloud-node/issues/2374), and I'm sure it will improve more before Functions goes GA (all this stuff is beta). The main issues for us (apart from occasional crashes due to "Endpoint read failed") are that latency is fairly variable between requests and over periods of hours. Below is a recent integration test from CircleCI showing very good performance at times. All the code is pretty much the same (in one repo) and often uses transactions. Health checks don't hit Datastore. We have even deployed the same JS code to App Engine and get more consistent results.

→ Health - Auth Token GET [200 OK, 1.02KB, 177ms]

→ Health - Config Country GET [200 OK, 1.02KB, 198ms]

→ Health - Profile GET [200 OK, 1.01KB, 307ms]

→ Health - Profile Addresses GET [200 OK, 1.02KB, 171ms]

→ Health - Profile Addresses By Id GET [200 OK, 1.03KB, 175ms]

→ Health - Profile Emails GET [200 OK, 1.02KB, 254ms]

→ Health - Profile Emails By Id GET [200 OK, 1.03KB, 114ms]

→ Health - Profile Phone Numbers GET [200 OK, 1.03KB, 172ms]

→ Health - Profile Phone Numbers By Id GET [200 OK, 1.04KB, 174ms]

→ Health - Claims GET [200 OK, 1.01KB, 165ms]

→ Health - Cover GET [200 OK, 1.01KB, 276ms]

→ Health - Cover By Id GET [200 OK, 1.02KB, 178ms]

→ Health - Cover Meta-data GET [200 OK, 1.02KB, 195ms]

→ Health - Cover By Id Documents GET [200 OK, 1.03KB, 191ms]

→ Health - Cover By Id Product Character GET [200 OK, 1.04KB, 310ms]

→ Health - Cover By Id Claims GET [200 OK, 1.03KB, 258ms]

→ Health - Cover By Id Claims By Id GET [200 OK, 1.04KB, 244ms]

→ Health - Cover By Id Claims By Id Status GET [200 OK, 1.05KB, 89ms]

→ Login Customer POST [200 OK, 1.96KB, 169ms]

→ Login CSR POST [200 OK, 1.96KB, 158ms]

→ Auth Token GET [200 OK, 1.79KB, 1312ms]

→ Config Country - Read GET [200 OK, 97.67KB, 1930ms]

→ Config Country - Read By Code GET [200 OK, 1.34KB, 590ms]

→ Profile - Read GET [200 OK, 1.91KB, 1051ms]

→ Profile - Update PUT [204 No Content, 957B, 780ms]

→ Profile - Update - Name PUT [204 No Content, 953B, 510ms]

→ Profile Address - Create POST [201 Created, 1014B, 990ms]

→ Profile Address - Read by Id GET [200 OK, 1.41KB, 1007ms]

→ Profile Address - Update by Id PUT [204 No Content, 953B, 623ms]

→ Profile Email - Create POST [201 Created, 1018B, 877ms]

→ Profile Email - Create Expecting Conflict POST [409 Conflict, 968B, 457ms]

→ Profile Email - Read by Id GET [200 OK, 1.11KB, 975ms]

→ Profile Email - Issue Verification Token PUT [204 No Content, 953B, 353ms]

→ Profile Phone Number - Create POST [201 Created, 1018B, 980ms]

→ Profile Phone Number - Create Expecting Conflict POST [409 Conflict, 972B, 611ms]

→ Profile Phone Number - Read by Id GET [200 OK, 1.13KB, 925ms]

→ Cover - Read Meta-data GET [200 OK, 1.05KB, 812ms]

→ Cover - Create POST [201 Created, 1014B, 1860ms]

→ Cover By Id - Read GET [200 OK, 2.07KB, 864ms]

→ Cover By Id - Update - product_serial_number PUT [204 No Content, 953B, 709ms]

→ Cover By Id - Update - customer_approved PUT [204 No Content, 953B, 622ms]

→ Cover By Id - Update - customer_rejected PUT [204 No Content, 953B, 637ms]

→ Cover By Id Product Character - Read GET [200 OK, 1.47KB, 1595ms]

→ Cover By Id Claims - Create POST [201 Created, 1014B, 2.1s]

→ Cover By Id Claims By Id - Read GET [200 OK, 1.81KB, 1019ms]

→ Profile Phone Number - Read All GET [200 OK, 1.29KB, 526ms]

→ Proifle Address - Read All GET [200 OK, 1.75KB, 466ms]

→ Proifle Email - Read All GET [200 OK, 1.25KB, 460ms]

→ Cover - Read All GET [200 OK, 15.3KB, 793ms]

→ Cover By Id Claims - Read All GET [200 OK, 1.83KB, 503ms]

→ Claims - Read All GET [200 OK, 1.83KB, 1151ms]

→ Cover By Id Claims By Id - Delete DELETE [204 No Content, 957B, 594ms]

→ Cover By Id - Delete DELETE [204 No Content, 957B, 593ms]

→ Profile Address - Delete by Id DELETE [204 No Content, 957B, 692ms]

→ Profile Email - Delete by Id DELETE [204 No Content, 957B, 441ms]

→ Profile Phone Number - Delete by Id DELETE [204 No Content, 957B, 871ms]

plaisted commented 6 years ago

@ChrisBartonLC

It could be added latency from "cold" functions, if they vary that much. I think all the serverless implementations suffer from this currently. It would occur both when a function hasn't been used in a while (maybe the case for integration tests) and when usage spikes and additional concurrent workers are spun up.

lc-chrisbarton commented 6 years ago

See the comments above: when posting earlier in this issue we ran several tests. The integration test also starts with a health check for each function. In addition, all functions are continuously hit several times a minute by an App Engine cron job: the job starts every minute, issues several simultaneous requests to each function in parallel, and repeats this sequence 3 times. We also cache the Datastore client, and we use a cached service account. Have you tried testing until you start getting crashes due to "Endpoint read failed", caused by a warm function being moved to another machine?