parse-community / parse-server

Parse Server for Node.js / Express
https://parseplatform.org
Apache License 2.0

FR: Record slow cloud functions / triggers and cloud function timeout #7038

Open dblythy opened 3 years ago

dblythy commented 3 years ago

Is your feature request related to a problem? Please describe. Although analytics aren't supported, it would be nice if we could track "inefficient" cloud functions and triggers, which could be displayed in the dashboard with a message on how to 🚀 speed up functions and maximise efficiency.

Describe the solution you'd like It would be nice if we could specify a "timeout" value (e.g. 5000 ms), so that if a function runs for longer than this, it's recorded as a slow function to be optimised, and a timeout is returned to the client by the server.

It might also help to have some sort of timing trace provided, to help work out where the bottleneck is happening, so we can more easily isolate whether performance issues are related to Parse Server decoding / encoding, or to the cloud code itself.
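As a rough illustration of the kind of per-phase timing trace described here, the sketch below records how long each phase takes relative to the previous mark. All names (createTrace, runWithTrace, the event labels) are hypothetical and only mirror the trace format proposed in this issue; they are not real Parse Server APIs.

```javascript
// Hypothetical trace recorder: each mark() stores the time elapsed since
// the previous mark, producing an events array like the one in this issue.
function createTrace() {
  const events = [];
  let last = Date.now();
  return {
    mark(name) {
      const now = Date.now();
      events.push({ name, time: now - last }); // elapsed since previous mark
      last = now;
    },
    events,
  };
}

// Wrap a cloud function so its phases show up as timed events.
async function runWithTrace(cloudFn) {
  const trace = createTrace();
  trace.mark('(Parse) function called');
  const result = await cloudFn();
  trace.mark('cloud function');
  return { result, events: trace.events };
}
```

A real implementation would add marks for decoding, validation, and response encoding around the corresponding Parse Server internals.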

Describe alternatives you've considered Manually coding into cloud functions

I've worked on a conceptual PR that tracks the timings of the function call, function decoding, the cloud validator, the cloud function, and response encoding. It's not ready for merging yet; it's more of a concept to help understand the idea behind this FR. Here is the link. This could easily be extended to triggers as well.

Here's what a slow function looks like:

{
  params: {},
  error: { message: 'Script timed out.', code: 141 },
  events: [
    { name: '(Parse) function called', time: 0 },
    { name: '(Parse) function decoding', time: 0 },
    { name: 'cloud validator', time: 0 },
    { name: 'cloud function', time: 4794 },
    { name: '(Parse) response encoding', time: 0 }
  ],
  context: {},
  functionName: 'cloudFunction',
  timeTaken: 4794.519228000194,
  master: false,
}

In this example, a cloud function takes nearly 5 s to complete against a Parse Server timeout of 3 s (slowTracking.timeout: 3000). Judging by "events", the bottleneck comes entirely from the developer's cloud function.
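The timeout half of the proposal could work as a race between the cloud function's promise and a timer. This is only a sketch: withTimeout is a hypothetical helper, and the error code 141 with message "Script timed out." is taken from the example output above.

```javascript
// Hypothetical timeout wrapper: if the function's promise does not settle
// within timeoutMs, reject with the Parse "Script timed out." error (141).
function withTimeout(promise, timeoutMs) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => {
      const err = new Error('Script timed out.');
      err.code = 141; // error code shown in the example trace above
      reject(err);
    }, timeoutMs);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

A server-side version would additionally record the run as a slow function before returning the error to the client.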

mtrezza commented 3 years ago

Thanks for suggesting.

This is an interesting feature. There are many sophisticated commercial (e.g. NewRelic) and open-source (e.g. OpenTelemetry) app monitoring services available that provide a plethora of metrics and diagnostics. Some are also interoperable with services that monitor underlying infrastructure and clients, to achieve 360° monitoring. We should keep in mind what such a feature is competing with when we think about its viability.

Some thoughts:

if a function runs for longer than this, it's recorded as a slow function to be optimised

This could be similar to a MongoDB slow query report. Reporting slow functions out of the box could be attractive to beginning developers who do not integrate a 3rd party monitoring service. Thinking about the future of Parse Dashboard, a "Slow Query" stats page would be a great addition.

a timeout is returned to the client from the server

This seems to be a completely separate feature from "tracing", because a timeout could also be function specific. It seems to call for a separate discussion to come up with a concept that is applicable to many use cases.

It might also help to have some sort of timing trace to be provided to help work out where the bottleneck is happening

A slow Cloud Code function is still a "black box". To provide more value and enable developers to do the next step, it could be interesting to allow markers in Cloud Code. For example:

Parse.Cloud.define("myFunc", async (request) => {
  Parse.Tracing.setMarker("DB call");
  // Execute DB query

  Parse.Tracing.setMarker("send SMS");
  // Call 3rd party API
});

So that the tracing shows:

events: [
  { name: 'cloud function', time: 4794,
    markers: [
      { marker: 'a', time: 10 },
      { marker: 'b', time: 4784 }
    ]
  }
]
dblythy commented 3 years ago

Thanks for the suggestions. I could maybe add the timeout separately to the validator object and apply that feature from there. I agree that it is a separate feature; I was just concerned that functions that take a long time to resolve would never show up in the "slow query" stats page while they were still resolving (is that even possible?).

I like the idea of the markers, and I agree that without them it doesn't offer much in terms of diagnosis. I think combining markers and slow queries could provide good value to developers trying to speed up their servers, without needing to look through logs or write that much new code.
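Combining markers with slow-function tracking could be sketched along these lines, assuming a hypothetical traceCloudFunction wrapper that passes a setMarker callback into the handler; none of this is existing Parse Server API.

```javascript
// Hypothetical wrapper that runs a handler and produces a single trace
// event carrying nested markers, like the format proposed above.
function traceCloudFunction(fn) {
  const markers = [];
  const start = Date.now();
  let last = start;
  const setMarker = (marker) => {
    const now = Date.now();
    markers.push({ marker, time: now - last }); // time since previous marker
    last = now;
  };
  return (async () => {
    await fn(setMarker);
    return { name: 'cloud function', time: Date.now() - start, markers };
  })();
}
```

With markers recorded this way, the slow query report could point at the specific sub-step (DB call, third-party API, etc.) instead of the whole function.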

mtrezza commented 3 years ago

functions that take a long time to resolve would never show up in the "slow query" stats page as they were still resolving

Not sure what you are referring to; a function can only show up in the slow query report after it has resolved, or after the Node.js server timeout.

Another aspect to consider is that monitoring that covers only Cloud Code functions captures only a fraction of Parse Server performance, or not much at all if Cloud Code functions are not used. To give a better picture of Parse Server, it may make sense to cover jobs, triggers (where often a lot of magic happens), and direct endpoint calls, or at least to consider the possibility of a future extension when deciding on a concept now. That may require rethinking the concept of monitoring. A solution that is conceptually restricted to Cloud Code functions may not be viable.

dblythy commented 3 years ago

I planned on extending it to all the features mentioned; I just didn't want to jump the gun and build / conceptualise it if it wasn't in line with this project's goals. Thanks for the feedback!

dblythy commented 3 years ago

@mtrezza is it okay to overhaul some of the promise callbacks and move towards async / await? It will be easier for me to add the functionality.

mtrezza commented 3 years ago

Any code changes that are not directly related to this feature should be done in a separate PR. Also, I think we are still at the stage of a conceptual discussion.

dblythy commented 3 years ago

Ok. I was thinking of having an option for the cloud validator where you can specify a slowQuery threshold for a trigger. If the trigger exceeds the slowQuery threshold, all relevant information is stored in the _SlowQuery DB. Such as:

Parse.Cloud.beforeSave('cloudFunction', async (req) => {
  await hold(500); // placeholder for slow work
  Parse.Cloud.setMarker('db', req);
  await hold(500);
  Parse.Cloud.setMarker('api', req);
}, {
  slowQuery: 1000
});

It might also require a default slowQuery config option for tracking slow queries that don't go through a cloud trigger.
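A conceptual sketch of the slowQuery threshold described above: wrap a trigger handler, measure its duration, and hand the record to a callback standing in for a write to the proposed _SlowQuery collection. All names here (wrapWithSlowQuery, saveSlowQuery) are hypothetical.

```javascript
// Hypothetical wrapper: run the handler, and if it exceeds the slowQuery
// threshold (ms), report it via saveSlowQuery (a stand-in for persisting
// the record to the proposed _SlowQuery collection).
function wrapWithSlowQuery(handler, { slowQuery }, saveSlowQuery) {
  return async (req) => {
    const start = Date.now();
    try {
      return await handler(req);
    } finally {
      const timeTaken = Date.now() - start;
      if (timeTaken > slowQuery) {
        saveSlowQuery({ timeTaken, threshold: slowQuery });
      }
    }
  };
}
```

The finally block ensures that functions which throw (or time out) are still recorded, which matters since slow functions are often the ones that fail.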

Ideally, I was thinking of adding the "tracing" all the way through the function handler from the express routes, such as here. Again, this could help distinguish internal performance issues from external ones, as well as pick up inefficient queries that don't necessarily depend on a trigger (e.g. count).

Alternatively, all the work and tracing could be done in triggers.js, but then we'd only be able to trace function start / end, as well as any custom markers. It wouldn't capture any "internal" decoding / encoding or other operations.

What are your thoughts?

lybaiyy123 commented 3 years ago

This is a great idea, a feature that can be quickly operated / analyzed in the control panel, similar to what LeanCloud offers.

mtrezza commented 2 years ago

There have been discussions around similar non-core features (analytics, endpoint rate limiting, etc). The fundamental question is always whether a proposed feature should be considered a core functionality of Parse Server and should be:

To determine (a):

To determine (b):

Btw, what @lybaiyy123's screenshot is showing is likely the MongoDB cluster stats, which are the basis for DB query optimization via indices. A server-side query performance measurement seems a rather vague, merely indicative starting point for investigations, because performance is influenced by various factors (server resources, DB resources, network latency, etc.).

Moumouls commented 2 years ago

Here is some thinking about this feature: https://github.com/parse-community/parse-server/pull/7867#issuecomment-1073041545

mtrezza commented 2 years ago

https://github.com/parse-community/parse-server/pull/7867#issuecomment-1073041545:

Here a good idea may be to add some interface to allow developers to plug in their favorite performance tool.

I think so too. The optional adapter concept is really at the heart of Parse Server. Easily plugging a 3rd party tool into Parse Server and collecting specific Parse Server metrics that way is lightweight for Parse Server (low maintenance effort) and gives users the flexibility to integrate metrics into a sophisticated tool with little effort.

For example, if you add the default NewRelic driver to Node.js, it already collects useful metrics about transaction times, including from the MongoDB Node.js driver, so that is a load of deep-insight metrics you get by adding just one line of code. However, these are only general metrics, and a new "Parse Server Metrics Adapter" could provide additional Parse Server specific metrics to NewRelic.

But I don't want to preempt the outcome of a discussion around the questions mentioned in https://github.com/parse-community/parse-server/issues/7038#issuecomment-1073040276.

parse-github-assistant[bot] commented 2 years ago

Thanks for opening this issue!

dblythy commented 2 years ago

Thanks for your thoughts @mtrezza and @Moumouls!

Personally, I don’t see the problem with extending Parse Server with additional analytics. We love Parse Server because we get auth integration, database and file adapters out of the box, so why can’t we add analytics out of the box too? Sure, users can add additional authentication adapters, like many other things they could code for themselves with Express. I think if Parse Server was built on the ideology of “they can get it with third party modules or code it themselves”, we wouldn’t have half of the features that we do today.

Perhaps this opens the discussion around providing appropriate documentation to direct users towards appropriate third party tools to help them understand their Parse Server queries like Parse.com could.

Moumouls commented 2 years ago

I don’t see the problem with extending additional analytics to Parse Server

Me too, but I think it would be better here to see it as a new interface of Parse Server.

So developers could choose to use the simple one provided by Parse Server or implement a new one. Database, Cache, and Files work that way, and it works well; I'm strongly in favor of keeping the clean hexagonal Parse Server architecture.

What do you think @dblythy? @mtrezza, can you maybe confirm that this is the right way?

A first simple POC could be to just add a ParseMetricsController to track REST writes in and out :)
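Such a ParseMetricsController / adapter pair could be sketched roughly like this, following the controller/adapter pattern discussed above. All names and signatures here are illustrative, not existing Parse Server code.

```javascript
// Hypothetical metrics adapter interface: the default is a no-op, while a
// real adapter could forward events to NewRelic, OpenTelemetry, Sentry, etc.
class MetricsAdapter {
  recordEvent(name, durationMs) {} // intentionally a no-op
}

// Hypothetical controller delegating to the pluggable adapter, mirroring
// how Parse Server's Files/Cache controllers wrap their adapters.
class ParseMetricsController {
  constructor(adapter = new MetricsAdapter()) {
    this.adapter = adapter;
  }

  // Time an async operation (e.g. a REST write) and report its duration.
  async time(name, fn) {
    const start = Date.now();
    try {
      return await fn();
    } finally {
      this.adapter.recordEvent(name, Date.now() - start);
    }
  }
}
```

A POC could wrap the REST write path in controller.time('restWrite', ...) while leaving the choice of backend entirely to the adapter.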

dblythy commented 2 years ago

@Moumouls it seems like you want me to do more work 😉

I’m in favour of either adding analytics, or at the very least providing users instructions on how to add slow query analytics to their own servers.

However I’m open to discussion and if it’s something that can be already achieved by a third party module, @mtrezza feel free to let me know and I’ll be more than happy with your support to write a blog post about isolating slow requests with Parse Server.

Moumouls commented 2 years ago

@Moumouls it seems like you want me to do more work 😉

Not intended @dblythy ahaha, your time is precious, but yes, it will need some additional work to follow the hexagonal architecture that has kept Parse Server maintainable and extensible since 2016. (I'm convinced that we need to stick to the controller/adapter system 🙂)

Here is an example of how to set up Sentry Express tracing: https://docs.sentry.io/platforms/node/guides/express/performance/

mtrezza commented 2 years ago

However I’m open to discussion and if it’s something that can be already achieved by a third party module, @mtrezza feel free to let me know and I’ll be more than happy with your support to write a blog post about isolating slow requests with Parse Server.

You may want to try out NewRelic's free plan. It will give you insight and extensibility on a level that a Parse Server feature is unlikely to ever achieve. Just consider that making sense of the data your PR writes requires metrics management, display, indexing and aggregation to make metrics queries efficient. That is essentially a product in itself. That's why such metrics are usually stored in a separate DB, as writing and reading metrics can have a significant DB impact. It's also a data island: how do you relate the data to DB metrics or other infrastructure metrics? I think a PoC is an interesting experiment, but to get this to a practical solution that works in a production environment, the gap is quite a big one to bridge.

A NewRelic adapter for Parse Server could incorporate these settings, making it even easier to set this up with Parse Server, maybe by only providing the NewRelic credentials:

https://stackoverflow.com/questions/62238021/how-to-track-class-names-of-parse-server-with-new-relic

And such an adapter / controller could provide methods to track custom metrics and also define the interface for a Parse Dashboard metrics page. Now, if you want to write a custom adapter that stores everything in the Parse Server DB, as your PR currently suggests, then that would be just another 3rd party adapter; I would not bake this concept into the core of Parse Server. We already have some things in the core that should not be there and are difficult to get out, like the GraphQL API, and there is already a discussion going on about moving that out of the core.

I like @Moumouls's suggestion, and I think it can be developed in stages. The interface itself is a bit additional work, but I think it will also help you to find a solid structure for your custom adapter as a side benefit.

dblythy commented 2 years ago

Thanks for your thoughts @mtrezza. Quite the sales pitch for NewRelic 😅.

Perhaps this is a good opportunity to leverage growth in the best practices guides. If I, as someone rather experienced with Parse Server, struggle with this, I can only imagine how difficult it is for new users.

mtrezza commented 2 years ago

I mentioned NewRelic as an example, but there are other popular (open-source) products out there, like OpenTelemetry or netdata, that offer similar functionality. I mention these so we have something to compare against, and to make the point that there are sophisticated monitoring products with a huge feature scope out there. Integrating these into Parse Server via an adapter / controller will probably bring much higher value, especially for devops who are already monitoring, than developing a much simpler version from scratch that is unlikely to ever catch up with existing products.

dblythy commented 2 years ago

Related: https://github.com/parse-community/docs/issues/811