[QUERY] Could the function execution timeout be increased or configurable?

tomas-zijdemans-vipps commented 1 year ago

Question

I often run into the "error: function execution exceeded 15.0s timeout" issue: if a function takes longer than 15 seconds, the workflow stops.

For various functions, the execution time will be more than 15 seconds, and breaking the function into smaller parts gets quite ugly and complicated.

Could this timeout please be configurable?

filmaj commented 1 year ago

We are having many discussions on this topic internally. Let me state my view of this.

First, it's important to know why Slack put into place these limitations to begin with. It boils down to user experience. When an end-user does something in the Slack client, we want them to get some feedback that something happened within a reasonable timeframe. That's why in the original release of interactivity support for Slack apps, we almost always required a response to the interactivity event within 3 seconds.

Just increasing the timeout doesn't address this (quite reasonable, IMO) constraint.

I think what you're asking for, @tomas-zijdemans-vipps , is an ability to run long-running compute tasks. Increasing the timeout addresses this, but perhaps a different way of solving this would be a combination of:

- an easier way of modularizing your application into functions, and
- being able to invoke these functions from each other, and
- differentiating between functions that respond to user interactions and functions that do not

At least, that's how I would think about it.

Are you able to share your use case @tomas-zijdemans-vipps ? The more developer use cases I can collect, the easier it is to convince teams to work on this problem.

tomas-zijdemans-vipps commented 1 year ago

Thanks for looking into it!

We are a fin tech that offers a wide array of public APIs (e.g. ecommerce payment, pos payment, recurring payment, login, etc.). We have thousands of customers, big and small, and we want to notify them whenever we detect issues (wrong use of the APIs, payments not being processed, etc.). All errors are logged in Splunk. All customer data is in Salesforce. (Almost) all customers communicate with us on a Slack connect channel. Perfect setup for a Slack bot :)

Here are some use cases: We now have 5 different workflows consisting of 6-8 steps each, the main steps are

Step 1: Get errors from Splunk (this one is problem free) about errors that have been logged
Step 2: Get the accounts from Salesforce for the relevant errors (this one gives us trouble, more details below)
Step 3: Send slack messages to our partners, integrators, plugin developers, internal key account managers (this sometimes gives us errors).

Getting data from Salesforce

When retrieving data from Salesforce, it's usually a very small GraphQL query ("give me 10 accounts" == 50 lines of JSON in and out). This usually takes 1-3 seconds. But a few times per day, it can take 30 seconds to get a response. We don't know exactly why, and Salesforce can't give us an exact answer. The function is trimmed down to the bone: it only fires 1 request, and it's the response that takes time. No data processing, nothing else. It can't be made more modular. I imagine this problem can happen for other external APIs and use cases.

My main gripe with this problem is that Slack simply terminates the workflow. If this execution limit is non-negotiable, a better developer experience would be for the workflow to continue to the next step and log a timeout error (and maybe there could even be a way to detect that a previous function did not execute successfully). The only workaround I can imagine right now is to have a new workflow that figures out whether the first one did its job. Not pretty.
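
One way to approximate the "log a timeout and carry on" behaviour from inside the function is to race the slow request against a deadline set below the platform limit, then return an ordinary result that records the timeout. This is only a minimal sketch: `withDeadline` and `Outcome` are hypothetical names, not part of deno-slack-sdk, and whether the workflow then continues depends on how the function reports that result.

```typescript
// Hypothetical helper, not part of deno-slack-sdk.
type Outcome<T> =
  | { ok: true; value: T }
  | { ok: false; error: string };

// Race a task against a deadline so the caller can return a normal
// "timed out" result before the platform's own limit is reached.
async function withDeadline<T>(
  task: (signal: AbortSignal) => Promise<T>,
  deadlineMs: number,
): Promise<Outcome<T>> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  const timedOut = new Promise<Outcome<T>>((resolve) => {
    timer = setTimeout(() => {
      controller.abort(); // lets fetch-style tasks cancel the request
      resolve({ ok: false, error: `timed out after ${deadlineMs} ms` });
    }, deadlineMs);
  });

  // Convert both success and failure into an Outcome so nothing rejects.
  const finished = task(controller.signal).then(
    (value): Outcome<T> => ({ ok: true, value }),
    (err): Outcome<T> => ({ ok: false, error: String(err) }),
  );

  try {
    return await Promise.race([finished, timedOut]);
  } finally {
    clearTimeout(timer);
  }
}
```

A request made with `fetch` can take part in the cancellation by passing the signal through, for example `withDeadline((signal) => fetch(url, { signal }), 12_000)`, where `url` and the 12-second budget are placeholders.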

Posting a lot of messages

One of our customer groups is quite large, so we need to send out quite a few notifications. We have 1 function that would like to send messages to about 30 customers (so only using the built-in client, no data processing or anything else).

1. Send a main message to a customer, check if ok
2. Send a thread response to the main message with more details (to not clog up the channel), check ok
3. Get the permalink to the threaded message, check ok
4. Update the main message with a button that links to the threaded message.
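
The four calls per customer can be sketched roughly as below. The method names (`chat.postMessage`, `chat.getPermalink`, `chat.update`) are Slack Web API methods; the narrowed `ChatClient` interface, the helpers `check` and `required`, and the button label are assumptions made to keep the example self-contained, not the code actually used in this workflow.

```typescript
// Minimal stand-in for the parts of the Slack client used here (assumed shape).
interface SlackResponse {
  ok: boolean;
  error?: string;
  ts?: string;
  permalink?: string;
}
interface ChatClient {
  chat: {
    postMessage(args: { channel: string; text: string; thread_ts?: string }): Promise<SlackResponse>;
    getPermalink(args: { channel: string; message_ts: string }): Promise<SlackResponse>;
    update(args: { channel: string; ts: string; text: string; blocks: unknown[] }): Promise<SlackResponse>;
  };
}

// Hypothetical helper: fail fast when a call reports ok: false.
function check(step: string, res: SlackResponse): SlackResponse {
  if (!res.ok) throw new Error(`${step} failed: ${res.error ?? "unknown error"}`);
  return res;
}

// Hypothetical helper: fail when an expected response field is absent.
function required<T>(value: T | undefined, name: string): T {
  if (value === undefined) throw new Error(`missing ${name} in response`);
  return value;
}

async function notifyCustomer(
  client: ChatClient,
  channel: string,
  summary: string,
  details: string,
): Promise<string> {
  // 1. Main message.
  const main = check("main message", await client.chat.postMessage({ channel, text: summary }));
  const mainTs = required(main.ts, "ts");

  // 2. Threaded reply carrying the details.
  const reply = check(
    "thread reply",
    await client.chat.postMessage({ channel, text: details, thread_ts: mainTs }),
  );
  const replyTs = required(reply.ts, "ts");

  // 3. Permalink to the threaded reply.
  const link = check(
    "permalink",
    await client.chat.getPermalink({ channel, message_ts: replyTs }),
  );
  const permalink = required(link.permalink, "permalink");

  // 4. Update the main message with a button that links to the reply.
  check(
    "update",
    await client.chat.update({
      channel,
      ts: mainTs,
      text: summary,
      blocks: [
        { type: "section", text: { type: "mrkdwn", text: summary } },
        {
          type: "actions",
          elements: [
            { type: "button", text: { type: "plain_text", text: "View details" }, url: permalink },
          ],
        },
      ],
    }),
  );
  return permalink;
}
```

Each step awaits the previous one, which is why the elapsed time grows linearly with the number of customers.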

This was very well received as a user-friendly interaction by commercial users. But as you can see, that's 4*30=120 calls to Slack. The function can get through about 17 customers before it quits due to the 15 sec limit. If you skip checking that the call was ok (not good ofc), or do other smart parallel tricks, you run into rate limiting. The solution we now have is:

- Remove steps 3 and 4 to halve the number of calls (not great for user experience, but it works).
- Have a function that splits the customers into small chunks and then run the above function 4 separate times in the same workflow (works, but the code is really ugly :D )
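
The chunking step in the second workaround could look something like the sketch below. `chunk` is a hypothetical helper and the chunk size of 8 is an assumption, chosen only because it splits 30 customers into the 4 runs mentioned above.

```typescript
// Hypothetical helper: split a list into consecutive chunks of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// 30 placeholder customer IDs split into chunks of 8.
const customers = Array.from({ length: 30 }, (_, i) => `customer-${i + 1}`);
const batches = chunk(customers, 8);
console.log(batches.map((b) => b.length)); // [8, 8, 8, 6]
```

Each batch would then be passed to a separate run of the messaging function, so that no single run has to make all 120 calls.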

Chatting with an AI running on Azure

On a separate note, we have an LLM running on Azure that allows our users to get help when copywriting. However, Azure will put the function "to sleep" when inactive for some time. So the first user in the morning who starts chatting with the AI has to wait for a "warm up". Sometimes this takes too long.

To sum up, yes sometimes you can make functions more modular. But not always. And your workflow gets really complex when you make functions super tiny. I totally agree with the point on interactivity. I remember when developing on the old platform, you had to send an acknowledgement within 3 seconds. That makes sense. But what about when there is an event that triggers the workflow? Or when it's a scheduled trigger? There is no user waiting for a response.

filmaj commented 1 year ago

Thanks for sharing this! These are great examples and I think your points make a lot of sense.

The reason I am thinking about differentiating interactivity handlers vs. other functions is exactly so that developers will be given more freedom (e.g. time) to do stuff in the latter compared to the former. Coupled with an ability to invoke one such function from another easily (without having to create a trigger and encapsulate into separate workflows), I think this would address many of these problems.

The team is discussing extending the timeout from 15 seconds to something more, but I worry that inevitably, this same issue will be filed again asking to bump that timeout. That is why I am trying to think about it from an architectural angle, as it seems the design and current abstractions for functions and workflows are insufficient to meet needs.

filmaj commented 1 year ago

What timeout would be acceptable for your use case, @tomas-zijdemans-vipps ?

tomas-zijdemans-vipps commented 1 year ago

@filmaj I think a 60s timeout would do the trick!

k-farrell commented 9 months ago

Hi! Just wanted to see if there are any updates on extending the timeout or making function timeout configurable? Similarly to the use cases above, we have some API calls we use in our responses that are just very slow and sometimes exceed the 15 seconds causing our bot to not respond even though it got the response from the API post-15-second-limit.

filmaj commented 9 months ago

Hi! Update: the team is actively working on extending the timeouts, but the second-order effects from this are rather wide ranging so we are taking a conservative approach.

We are trying to organize something like a pilot / test for this, where we could roll out the extended timeout to your workspace, understand your use cases and gather feedback from you. If you are interested in being involved with this, Jagdeep, the Slack Product Manager leading this initiative, wants to hear from you in our Community Workspace. If you are not a member of the Community Workspace yet, you can sign up here. In the welcome e-mail will be a link to join the workspace. Within the workspace, there is a #slackapi channel - find Jagdeep's post in that channel from yesterday and get in touch with him!

tsarni commented 8 months ago

Thanks filmaj, I am also using external API calls that sometimes just take time, and for my use case, the longer response times are ok for the users. 15 secs is a bit aggressive.

filmaj commented 7 months ago

Update here: we are eyeing extending this timeout to 60 seconds but are still working on rolling this out.

devjoes commented 6 months ago

60s would be a great improvement, but could this be customizable? We integrate with Chat GPT4 where latency can vary anywhere between 10 and 300+ secs, depending on load and user input.

filmaj commented 6 months ago

BTW 60 seconds is now the allotted timeout, so I will close this.

I do not think we will be considering extending it beyond 60 seconds, as the function timeouts directly contribute to the perceived performance of workflows for end-users. The original platform had a 3 second timeout for this exact reason. Even pushing it to 60 seconds violates this principle.

devjoes commented 6 months ago

Ok, but if people need to call a slow API then they will just jump through hoops to call the API and then dispatch a function to respond to the user over 60 seconds later. So from a UX point of view it's exactly the same experience as just increasing the timeout, it's just additional work and complexity.

filmaj commented 6 months ago

While I agree with you generally, not all APIs are created equal and ones that take a minute+ to respond should put pressure on consumers to employ special UX considerations, such as letting the user know that a long-running task is being performed to set expectations on responsiveness appropriately.
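
As a rough illustration of that kind of UX consideration, a function could post a placeholder message before starting the slow call and replace it once the result arrives. This is a sketch under assumptions: `MessageClient` is a narrowed stand-in for the Slack client (the method names `chat.postMessage` and `chat.update` are Slack Web API methods), `runWithProgress` is a hypothetical name, and the pattern only sets expectations, it does not lift any timeout.

```typescript
// Narrowed stand-in for the Slack client (assumed shape).
interface Res {
  ok: boolean;
  ts?: string;
  error?: string;
}
interface MessageClient {
  chat: {
    postMessage(args: { channel: string; text: string }): Promise<Res>;
    update(args: { channel: string; ts: string; text: string }): Promise<Res>;
  };
}

// Hypothetical helper: tell the user work has started, then replace the
// placeholder with the final result (or an error note) when the task ends.
async function runWithProgress(
  client: MessageClient,
  channel: string,
  slowTask: () => Promise<string>,
): Promise<void> {
  const placeholder = await client.chat.postMessage({
    channel,
    text: "Working on it, this can take up to a minute...",
  });
  if (!placeholder.ok || !placeholder.ts) {
    throw new Error(`could not post placeholder: ${placeholder.error ?? "unknown error"}`);
  }

  let text: string;
  try {
    text = await slowTask();
  } catch (err) {
    text = `Sorry, the request failed: ${String(err)}`;
  }

  const updated = await client.chat.update({ channel, ts: placeholder.ts, text });
  if (!updated.ok) {
    throw new Error(`could not update message: ${updated.error ?? "unknown error"}`);
  }
}
```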

slackapi / deno-slack-sdk

[QUERY] Could the function execution timeout be increased or configurable? #227