googleapis / nodejs-datastore

Node.js client for Google Cloud Datastore: a highly-scalable NoSQL database for your web and mobile applications.
https://cloud.google.com/datastore/
Apache License 2.0

Improve cold start of Cloud Datastore for Cloud Functions #9

Closed stephenplusplus closed 4 years ago

stephenplusplus commented 6 years ago
Copied from original issue: https://github.com/GoogleCloudPlatform/google-cloud-node/issues/2374

@richardowright
June 10, 2017 1:12 PM

Environment details

Steps to reproduce

I experience high latency (~1 to 2 seconds) with pretty much every action.

Simple example (runs through Babel prior to deploy):

// (Import omitted in the original snippet; the 1.x client's exported factory is
// callable without `new`: const Datastore = require('@google-cloud/datastore');)
static async addPerson() {
    try {
      const datastore = Datastore({
        projectId: projectId
      });
      const key = datastore.key('person');
      const person = {
        key: key,
        data: [
          { name: 'last_name', value: 'Wright' },
          { name: 'last_name_UPPER', value: 'WRIGHT' },
          { name: 'first_name', value: 'Richard' },
          { name: 'first_name_UPPER', value: 'RICHARD' },
          { name: 'email', value: 'mygmail@gmail.com' },
          { name: 'address_street', value: 'My Place', excludeFromIndexes: true },
          { name: 'address_city', value: 'City' },
          { name: 'address_state', value: 'State' },
          { name: 'address_zip', value: '12345' },
          { name: 'phone', value: '123.456.7890' },
          { name: 'date_of_birth', value: new Date(1901, 2, 3) }, // octal-style literals (02, 03) are a syntax error in strict mode
          { name: 'create_time', value: new Date(Date.now()), excludeFromIndexes: true }
        ]
      };

      const saveResponse = await datastore.save(person);

      // The auto-allocated numeric id is returned on the mutation result's key path.
      const person_id = saveResponse[0].mutationResults[0].key.path[0].id;
      return person_id;
    } catch (err) {
      console.log(err);
      return;
    }
  }
lc-chrisbarton commented 6 years ago

Apart from the "Endpoint read failed" issue (see above, re: GAPIC), we are probably hitting some connection cache on the Datastore side itself that doesn't affect App Engine. I think that's what this ticket is about: improving how Datastore handles connections from clients it appears to have never seen before. Maybe, maybe not. I think that's what AWS is working on now with their serverless Aurora. I have worked with serverless for a couple of years now and I am certain it is the future; anything that cannot support these kinds of clients will have difficulty competing.

charly37 commented 6 years ago

Just a quick update regarding my previous post about the many timeouts like:

{"name":"GoogleDataStoreTester","hostname":"gdst-3-xz6re","pid":1,"level":30,"msg":"Going to read test data - TestID: 73e12c80-09cc-4e68-af6d-78e8ebce68a2","time":"2018-01-31T16:30:00.143Z","v":0}
...
{"name":"GoogleDataStoreTester","hostname":"gdst-3-xz6re","pid":1,"level":50,"msg":"It fails. Error detail:  Error: Retry total timeout exceeded before anyresponse was received\n    at repeat (/var/www/node_modules/google-gax/lib/api_callable.js:224:18)\n    at Timeout._onTimeout (/var/www/node_modules/google-gax/lib/api_callable.js:256:13)\n    at ontimeout (timers.js:386:11)\n    at tryOnTimeout (timers.js:250:5)\n    at Timer.listOnTimeout (timers.js:214:5)  for uuid:  73e12c80-09cc-4e68-af6d-78e8ebce68a2","time":"2018-01-31T16:40:01.058Z","v":0}

So I made a simple pinger that reads from GDS every minute, let it run for a few hours on our servers, and noticed that this issue occurs in 20% of cases! I then ran the same test from a VM on a cloud provider (not our infra) and saw a failure rate of 0%.

So in the end... the lib is much more stable than I thought, and the issue is our infra, which loses 20% of messages :(

lobesoftllc commented 6 years ago

require('google-cloud/datastore')() sometimes takes more than 20 seconds (cold start). Is there a way to improve this?

GCF, 30s timeout, 128MB memory, HTTP trigger.

kgunnerud commented 6 years ago

Any updates on this? Kind of a showstopper for our project, so we're considering Lambda just because of this 👎

stephenplusplus commented 6 years ago

cc @alexander-fenster. GCF + Datastore = slow.

JustinBeckwith commented 6 years ago

I renamed this issue to match what we're actually tracking here 😆

elyobo commented 6 years ago

We're seeing very slow Datastore performance (high hundreds of milliseconds to multiple seconds) on the Node.js App Engine standard environment and in Cloud Shell, even after the cold start. Running the same code on my dev box gives request times under 100ms, and that, presumably, has a lot more network latency involved. Sample Python calls take 100-200ms in the same Cloud Shell that's slow for the Node lib, so it seems like it's a combination of the environment and the library.

For a simple query for something that doesn't match anything (see the timing sketch after this list):

node8 on dev box = fast (<100ms)
node8 on nodejs standard = slow (800ms to many seconds, various queries)
node8 on nodejs flexible = fast (10-20ms)
python on cloud shell = fast (100-200ms)
node8 on cloud shell = slow (1.5-2 seconds)
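For reference, here is a minimal sketch of the kind of no-match query being timed above; the kind name ('Task') and the filter value are placeholders, and it assumes the current {Datastore} export:

const {Datastore} = require('@google-cloud/datastore');
const datastore = new Datastore();

async function timeNoMatchQuery() {
  // Query for a value no entity has, so the result set is empty and the time
  // measured is mostly round trip plus library overhead.
  const query = datastore
    .createQuery('Task')
    .filter('name', '=', 'no-such-value')
    .limit(1);

  const start = Date.now();
  const [results] = await datastore.runQuery(query);
  console.log(`returned ${results.length} entities in ${Date.now() - start}ms`);
}

timeNoMatchQuery().catch(console.error);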

Edit

Deployed the slow app from standard into flex and ran some real-world queries; things that were taking multiple seconds in standard take tens of milliseconds in flex. It's probably not the fault of this lib, but rather Node.js on standard (and in Cloud Shell); I'm not sure where else to report this, so I'll leave it here in case anyone else stumbles in with similar issues.

kgunnerud commented 6 years ago

@l3roken: Any numbers? 100ms? 200ms? 20ms? Just for reference.

l3roken commented 6 years ago

@Lg87 I was mistaken. I think I caught a lucky string of warmed-up functions. Still seeing comparable numbers. I'm removing my previous comment. Moving to Node.js 8 doesn't seem to help.

lobesoftllc commented 6 years ago

@l3roken @Lg87 I just tried before/after with Node 6 and 8, and I don't see much difference (more cold boots on Node 8, and slightly slower on Node 8). Overall, the response time (on Node 6) seems roughly improved since X months ago, though I still see some spikes. Node 8 seems a lot more unstable.

kgunnerud commented 5 years ago

No news on this one? It's labeled priority:1, but there have been no comments since September 2, and it's been ages since someone from Google said anything.

JustinBeckwith commented 5 years ago

👋 Hey there. Most of the causes of slower module load time are hard to solve, and we've been working on them piece by piece. This mostly means removing modules from our dependency chain, and we've actually seen some pretty significant results so far.

Do you have any data on cold start times specific to this library? I'd love to see that benchmark and work through it with y'all :). Where I can't help is the overall cold start time of Cloud Functions themselves.

kgunnerud commented 5 years ago

@JustinBeckwith: Thanks for the rapid reply :) We put all our Cloud Functions development on hold (mostly because of this and because it's not in a stable release), so I don't have any updated performance numbers. I asked because, depending on this, we might start the project up again with Cloud Functions, at least on a trial basis.

Maybe someone else has something running that can provide some numbers?

chadbr commented 5 years ago

Could this be related to the bug I reported earlier?

https://github.com/googleapis/nodejs-datastore/issues/263

JustinBeckwith commented 5 years ago

Possibly, but I don't think that's the case. The earlier reported bug had to do with creating a new client object in a loop, each of which has to pay the startup cost of authentication. Any cold start is, by definition, going to incur that auth cost as well. Now, if we were talking about warm starts, it would be very possible :)
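To illustrate the distinction (a sketch of the pattern described above, not code from #263): a client created per call re-pays the auth cost on its first request, while a single shared client pays it once.

const {Datastore} = require('@google-cloud/datastore');

// Anti-pattern: every new client lazily authenticates on its first request.
async function perCallClients(keys) {
  for (const key of keys) {
    const datastore = new Datastore(); // startup + auth cost on every iteration
    await datastore.get(key);
  }
}

// Better: one client, created once, reused for every call.
const datastore = new Datastore();
async function sharedClient(keys) {
  for (const key of keys) {
    await datastore.get(key); // auth cost is paid only on the first call
  }
}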

JustinBeckwith commented 5 years ago

Greetings folks! I have some good news. I think we've solved most of the issues that have been contributing to these kinds of problems. Like any perf issue, it's rarely just one thing, so I want to tell y'all what we're measuring, and throw out a few best practices.

For all of this, I am using Cloud Functions with Node.js 10.x and @google-cloud/datastore 4.x. It's very important that you upgrade to the latest version of the module to pick up these improvements.

First, the code I am using for this benchmarking:

console.log('COLD START');
const startTime = Date.now();

// Measure how long it takes just to load the module.
const startModuleLoadTime = Date.now();
const {Datastore} = require('@google-cloud/datastore');
const moduleLoadTime = Date.now() - startModuleLoadTime;

const datastore = new Datastore();

exports.main = async (req, res) => {
  const query = datastore.createQuery('Task').order('created');
  const [tasks] = await datastore.runQuery(query);
  const endTime = Date.now();
  const totalTime = endTime - startTime;
  const result = {
    tasks: tasks.length,
    moduleLoadTime,
    totalTime,
  };
  res.json(result);
  // Kill the instance so the next request is another cold start.
  console.log('GOOD NIGHT');
  process.exit(0);
};

Tip 1

The first thing to call out, and probably the most important: make sure to require the library and instantiate the Datastore object outside of the HTTP handler.

const {Datastore} = require('@google-cloud/datastore');
const datastore = new Datastore();

exports.hello = async (req, res) => { 
  // do stuff ...
};

This is super important because the first call on the Datastore instance does lazy Application Default Credentials authentication, which can add a few network round trips in the path of your function call.
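If you want to pay that auth cost at instance start-up instead of inside the first real request, one option is to fire a cheap lookup at module load. This is just a sketch, not an official recommendation; the 'Warmup' kind and key name are arbitrary.

const {Datastore} = require('@google-cloud/datastore');
const datastore = new Datastore();

// Throwaway lookup kicked off at load time so credentials are fetched before
// the first real request; errors are ignored because this is purely a warm-up.
const warmup = datastore.get(datastore.key(['Warmup', 'noop'])).catch(() => {});

exports.hello = async (req, res) => {
  await warmup; // by now authentication has usually already completed
  // do stuff ...
  res.send('ok');
};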

What to expect

Using this code, I am reliably seeing the following. TL;DR: the worst-case scenario, a cold start with just Datastore, is about 1s; subsequent calls to the warmed instance are around 150ms.

Reducing the load time of the module

One of the biggest culprits for cold start times is module load. We've been steadily working on improving that over the last year: [chart: module load time trend]

You can check out the trend here, and even try out other modules. We went from a time that was hovering around ~850ms down to ~250ms.

Tip 2

One of the biggest things you can do to reduce your cold start time is to require less stuff. Every module you require in your app maps to a set of synchronous calls that load files from disk. No matter what environment you're in, that's going to slow you down.
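A quick way to see what each require costs at cold start, plus a pattern for deferring heavy, rarely used dependencies. This is a sketch; @google-cloud/pubsub is only an example of a dependency you might defer.

// Time an individual require during cold start (requires are cached, so this
// only measures the first load of each module).
function timedRequire(name) {
  const start = Date.now();
  const mod = require(name);
  console.log(`${name} loaded in ${Date.now() - start}ms`);
  return mod;
}

const {Datastore} = timedRequire('@google-cloud/datastore');
const datastore = new Datastore();

// Defer a heavy, rarely used dependency so it doesn't add to cold start.
let pubsub;
function getPubSub() {
  if (!pubsub) {
    const {PubSub} = require('@google-cloud/pubsub'); // loaded only on first use
    pubsub = new PubSub();
  }
  return pubsub;
}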

What now

For now - I just wanted to get this data out there, and get a little feedback. Do folks feel like this is getting better? Anything I'm missing here?

stephenplusplus commented 4 years ago

Given the time without any reports of this issue persisting, I'm going to close this, as it sounds like it was fixed! If I'm wrong, please let me know so we can dig back in (in which case, the more reproduction details the better!).

ollyde commented 4 years ago

I know this is closed, but here are my 2 cents.

I would pay to have an instance running permanently in a selected region that scales with extra requests, just so the rarely used endpoints respond instantly.

The cold boot is really painful when demoing apps, new functionality, and rarely used endpoints.

I'm considering abandoning Cloud Functions altogether because of this.

chadbr commented 4 years ago

It has kept me from trying cloud functions out...

JustinBeckwith commented 4 years ago

@ollydixon this exists! The min_instances setting in App Engine lets you do this sort of thing, though I recognize it isn't specifically Cloud Functions.

ollyde commented 4 years ago

@JustinBeckwith awesome! But how do I configure this with Firebase / GCP? There are no docs, hehe.

For example, I don't see any option to apply min_instances to export const uploadUserProfilePhotoV1 = functions.https.onCall(async (data, context): Promise<UserV1> => {})

bcjordan commented 2 years ago

In my experiments, increasing the minInstances argument did not help with the Datastore cold starts. Increasing the CPU allocation (by increasing the memory parameter) helped quite a bit.
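For reference, a sketch of where those knobs live in the firebase-functions (1st gen) API; the values are illustrative, not recommendations:

const functions = require('firebase-functions');

// minInstances keeps warm instances around; a larger memory tier also gets
// more CPU, which is what helped in the experiment above.
exports.uploadUserProfilePhotoV1 = functions
  .runWith({minInstances: 1, memory: '1GB'})
  .https.onCall(async (data, context) => {
    // do stuff ...
    return {ok: true};
  });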