alexellis opened 6 years ago
I spoke with @iyovcheva about this yesterday. My initial thoughts are that jobs are orthogonal to functions, but they are a common ask from the community around longer-running batch processing and CI jobs. Is there an opportunity to add value here?
Nomad and Kubernetes both have these primitives, but I don't believe Swarm does. Judging by the issue, it looks like Swarm may get the primitive in the near future, but I'm not sure how long that will take.
I can see this being a larger ask from the community going forward. I have been thinking a lot about this myself for personal and work-related tasks. It would be nice to have something that can be scheduled to run at a specific time without being locked into tight time constraints. I also feel it is a slippery slope, and we should be careful not to turn the project into something like Rundeck. It may be worth waiting to see what Swarm does before building anything; leveraging the primitives of something already built is probably a smaller lift.
> It looks like Swarm may get the primitive in the near future from the issue but not sure how long that will be.
Having read and followed the thread, I think this is unlikely to happen in the near future.
Kubernetes can support this use-case natively; Swarm may need a separate controller written to make this possible. I made a start with a CLI tool called JaaS, which has a few users.
The Rundeck project looks interesting btw.
There are many such scenarios in real business, and containers handle this kind of work well: spin up a container when it's needed, and destroy it once the work completes. We are doing this based on openfaas. When a function is invoked, we put the request in a queue and deploy the function, recording the number of call requests and completions. If there are no new requests within a certain period of time and all operations have completed, we delete the function and wait for the next invocation.
Hello -- just commenting here as well to share our science use-case. "Black box" functions that are expensive to evaluate are a common setup in optimization problems. Often a machine-learning-based strategy decides at which parameters the function is evaluated, in order to reduce the number of evaluations: see e.g. https://scikit-optimize.github.io/ or https://github.com/diana-hep/excursion. Function evaluations can easily take multiple hours. My ideal user interface would be:
```sh
> faas submit --parameters '{"a": 1, "b": "Hello World"}'
http://some.url/to/future

> faas ready http://some.url/to/future
false

> faas ready http://some.url/to/future
true

> faas retrieve http://some.url/to/future
{"value": 1.23}
```
Thanks for adding your use-case.
What if instead of checking for the result in a stateful way, you specified a callback URL in an event-driven way?
Is "job checking" a hard requirement too?
I think you could do this basic flow with the existing asynchronous processing, using a long enough max timeout. Each async function call returns a call ID, which you get back on the callback.
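For example, a rough sketch of that flow with curl (function name and callback URL are placeholders):

```sh
# Invoke asynchronously: the gateway accepts the request and returns
# HTTP 202 along with an X-Call-Id header for correlation
curl -i http://127.0.0.1:8080/async-function/long-task \
  --data '{"a": 1, "b": "Hello World"}' \
  -H "X-Callback-Url: http://receiver.example.com/done"

# When the invocation completes, the queue-worker POSTs the result to
# the callback URL, including the same X-Call-Id header
```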
Alex
Hi -- I think a callback would work equally well; it's definitely not a hard requirement. Are you thinking of one callback for all function calls (where the callback specifies some invocation ID), or a unique callback URL per call?
Is there any update?
I am using csharp functions. If a function receives the request on a single thread, I can start a new thread with a unique key and return that key in the response. On a subsequent request, I can send the key and check the status of the thread.
I haven't tried to implement this yet because I am new to OpenFaaS and still testing it for our needs; I don't know how it will behave with threading, and I'm searching for a solution that already exists.
I was planning to replace our Windows services with OpenFaaS. These services run scheduled tasks which usually take 3 to 10 minutes, but now we also need to run those tasks on demand.
Hi @sheryever you can run for 3-5 mins, no problems. Threads are also fair game. 👍
I don't think you need what I'm calling long-running jobs for that.
Alex
Some of the requirements/constraints I'm hearing from users:
Assumed, but need users to confirm:
A thin wrapper around a Kubernetes Job, or clear documentation covering this use-case with Kubernetes Jobs, may be enough for a large percentage of the people asking for the above, but this is unclear. I think it would be worth exploring with 1-2 of the people who need this.
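As a quick illustration of how little a raw Job needs (image name is hypothetical):

```sh
# A plain Kubernetes Job already provides run-to-completion, retries
# and status reporting
kubectl create job big-data-process --image=ghcr.io/example/processor:latest

# Check on progress and collect the output
kubectl get job big-data-process
kubectl logs job/big-data-process
```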
I'd like to add my use case. We have many teams that use different automation frameworks in different languages.
To schedule the tests, we'd like to treat each framework as a function. When there's a new available build, we can just invoke all the functions, passing related parameters. Each test may run for several hours.
When a code change to a framework is merged, it should update the function without breaking on-going requests: new requests are routed to the latest function, while old on-going requests run to completion before the function is upgraded.
For async functions, auto-scaling could be supported based on request-count limits in the function.
Thanks @alexellis and thanks @OpenFaaS.
This is an intriguing problem... there's more than a couple of use cases being described here, but as Alex pointed out, there are some high-level similarities/requirements.
Just thinking out loud about this, I can't think of how we would "know" that a function is still executing. Perhaps a "status" function available in the OF core to receive messages from functions? I'm thinking of setting a variable on a function to mark it as a long-running job (i.e. `functionType: job`) that would then start a routine pushing updates to the main status function, so that the status can be queried/reported on. Just something simple with an in-memory map of `[functionName]status` that would be updated on POST from the function's watchdog.
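A sketch of what that could look like over HTTP - to be clear, none of this API exists today, it's just the proposal above:

```sh
# The watchdog could POST periodic status updates to the core
curl -X POST http://gateway:8080/system/status \
  --data '{"function": "big-data-process", "status": "running"}'

# ...and operators could query the in-memory map
curl "http://gateway:8080/system/status?function=big-data-process"
```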
To summarize, the list of things in my head to accomplish this would be:

- a `functionType: job` marker on the function
- a status endpoint, e.g. `openfaas/system/status?function=big-data-process`, where `big-data-process` is the function name as defined in the yml file

Some questions I haven't thought of a way to answer yet:
What about a `batch` mode in the of-watchdog? This would run the function method to completion and then stop, making it easy to use the same image both as a function and as the image in an Argo workflow / k8s Job / etc.
By default the watchdog could just send an "empty" request, but it could also accept a file path and pass each of the files as a request, one at a time, to the method. The response would either be dropped or saved (one per file) to a file path. This would be very Argo-friendly. Alternatively, it could accept an S3-compatible server, bucket, and path, and read/write from there.
I don't think openfaas needs to be the batch job runner, but if we make it really easy to just use a function in a batch job system, that would go a long way.
Technically this would all be possible without any changes from us: you can write a template that contains the watchdog and some other init script, and just use a different command in your container when you run it as a batch job. But documenting this and making it an approved workflow instead of a workaround would probably make people happy - see the sketch below.
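A sketch of that workaround under current primitives (image name and handler entrypoint are hypothetical):

```sh
# Reuse a function's image as a one-shot Kubernetes Job by overriding the
# command, so the handler runs once to completion instead of the watchdog
kubectl create job my-fn-batch --image=ghcr.io/example/my-fn:latest \
  -- python3 /home/app/handler.py
```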
@alexellis is the goal that "functions" should be compatible with batch jobs, e.g. taking an openfaas function and running it in Argo / pure k8s Jobs / kubeflow / etc.? Or do we want to create another job system, in which we take any docker image that exposes the "function interface" of a server on port 8080 and run it as a one-time job?
That is a good question. Maybe it will be both?
What do our users need?
Hi, I've been working on different use cases involving long-running batch jobs and these are my thoughts.
After trying different scaling configurations with asynchronous invocations, mainly based on CPU consumption and increasing the number of replicas of the queue-worker, my colleagues and I came to the conclusion that it was more convenient to use Kubernetes Jobs. However, we wanted to take advantage of openfaas' ability to invoke functions through the gateway, so we decided to create oscar-worker as a substitute for nats-queue-worker. Its goal is to convert invocations that reach NATS through the `/async-function/` route into k8s Jobs. It is not a very elegant solution, since the NATS queue wouldn't be necessary for this purpose, but it does its job.
My idea of a better integration for long-running jobs in openfaas would be to add a tag to functions indicating that they are long-running functions/jobs. These functions would get a new route in the gateway, for example `/job/`. When a request is sent to this route, the gateway would convert the request into a Kubernetes Job. The result could be exposed in the logs or sent via callback using a sidecar (or init-container + container) in the Job.
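As a sketch, an invocation of that hypothetical route might look like:

```sh
# The gateway would convert this request into a Kubernetes Job rather
# than forwarding it to a long-lived function (proposed, not existing)
curl http://gateway:8080/job/video-transcode --data @input.json
```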
I think this approach wouldn't be so hard to implement and would cover the needs of a huge number of users.
Also potentially useful / interesting relating to workflows - https://github.com/s8sg/faas-flow
Hi everyone! I've been looking for an ideal solution to run long-running ETL jobs that take a few hours; they mostly involve database-to-database data transfer. Currently we're just running on some VMs with an in-house scheduler, but we'd like to do better than this. OpenFaas seemed like a possible solution, as we could run "functions" on demand. However, I am unsure of the suitability of something like OpenFaas for running work that takes that long. There has been some great discussion here about what is currently missing from OpenFaas in this domain. Namely:
Something I was not clear about was whether you could run something like this on OpenFaas at all. I know that AWS Lambda has a hard timeout of 15 minutes per invocation. Does OF have something similar, or is it just that running a function for that long may not be reliable?
If someone could give me a high-level idea of what Kubernetes Jobs offer that OF does not currently, and what putting an OF layer on top of Jobs would gain us, that would help me immensely.
Thanks!
For us, even for things that only run let's say 2-3 minutes, I think we would quite appreciate having the 3 features @zhl146 mentioned: status, retries and cancelling.
You get all of that by simply using a Kubernetes `Job` (@zhl146), which for us is pretty viable, since most of our long-running jobs don't have the exact semantics of a function - e.g. they are usually pure side-effects and don't need to return anything.
However, one of the reasons OpenFaas is attractive to us is that we can deploy each job as "a piece of code that can be triggered by an HTTP request", which helps decouple the job itself from the means of running it. For example, you can have a `CronJob` calling the function every 15 minutes, while at the same time being able to call it manually/reactively, without deploying the business logic of the job twice or creating separate container images.
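For example, assuming the OpenFaaS cron-connector is installed, a sketch of a deployment that is both scheduled and HTTP-invocable (names are placeholders):

```sh
# The cron-connector invokes the function per the schedule annotation,
# while the same endpoint stays available for manual/reactive calls
faas-cli deploy --name etl-job --image ghcr.io/example/etl:latest \
  --annotation topic=cron-function \
  --annotation schedule="*/15 * * * *"
```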
Therefore, I think it would be beneficial to be able to use the OpenFaas API to reuse the code it already has in the container to spawn the K8S job, instead of having a separate flow for it.
As an alternative, I also see value in making it possible to re-use the OF-built image and just run the function without the watchdog. That way we could use the K8S API separately to run the jobs while still reusing the same container image. But I think we would prefer the OF-integrated option.
I am very interested in this and would love to help push this forward. For my use case, our team uses Airflow for our ETL processes and OpenFaas functions for the actual processing of files. We have found this to be a really nice combo as we can more easily test each of the different processes without having to bloat our Airflow code. Airflow then simply wires up the different functions and handles the retries and failures.
Right now we have an OpenFaas function called `record-function` that records when another function has started/completed/failed by storing the status in Redis. We use it by first calling `function/record-function/{unique-id}/start`, where `unique-id` is just a UUID we use to identify the run. We then call our actual function via `async-function/my-long-running-function` and pass `function/record-function/{unique-id}/stop` as the callback URL. Finally, an Airflow sensor polls the Redis database to see when the function has completed, and then the rest of the workflow continues.
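A sketch of that flow with curl (gateway URL is a placeholder):

```sh
UNIQUE_ID=$(uuidgen)

# 1. Record the start of the run
curl http://gateway:8080/function/record-function/$UNIQUE_ID/start

# 2. Kick off the long-running function asynchronously; on completion
#    the queue-worker POSTs the result to the stop endpoint
curl http://gateway:8080/async-function/my-long-running-function \
  --data @payload.json \
  -H "X-Callback-Url: http://gateway:8080/function/record-function/$UNIQUE_ID/stop"
```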
It would be great if we could instead kick off a long-running function as a k8s Job and then poll its status by calling the OpenFaas gateway with something like `system/function/my-long-running-function`, getting back the k8s Job status in the response.
I have started using OF for my project and ended up at this issue (#657). I have a requirement for long-running functions using async, but is there any other way?
Is it possible to use OpenFaaS with Argo Workflows? This would give users much more flexibility to build complex flow-processing capabilities.
A few requests have come up on Slack recently:
These all seem like job semantics that would fit in with the discussion on this issue.
An approach which may work with the existing primitives, without changing OpenFaaS, is:

- For each request, create a `$RANDOM_UID`, then run `faas-cli deploy --image function/image --name $RANDOM_UID` with an async callback to a "done" function.
- Set the function not to scale to zero, and give it one replica.
- Have the "done" function delete the function: `faas-cli remove $RANDOM_UID`.
- Failed invocations still come back to the "done" function.
- For status checking, the "done" function could write to some storage like a database table, which would allow for in-progress detection, fetching the result, and cancellation.
None of this would require Kubernetes Jobs or limit the feature to only working on K8s, though there will be some edge cases. If anyone here is still interested in "jobs for openfaas", I'd suggest prototyping the above and seeing how well it works for you.
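A minimal sketch of that loop, assuming a local gateway and a placeholder image (the scale labels are my reading of how to pin one always-on replica):

```sh
RANDOM_UID=$(uuidgen | tr 'A-Z' 'a-z' | cut -c1-8)

# 1. Deploy a single-use function for this request
faas-cli deploy --image function/image --name "job-$RANDOM_UID" \
  --label com.openfaas.scale.min=1 \
  --label com.openfaas.scale.zero=false

# 2. Invoke it asynchronously, reporting completion to the "done" function
curl http://127.0.0.1:8080/async-function/job-$RANDOM_UID \
  --data @input.json \
  -H "X-Callback-Url: http://127.0.0.1:8080/function/done"

# 3. The "done" handler records the result, then cleans up with:
#    faas-cli remove job-$RANDOM_UID
```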
There are some other areas that may need further probing, like identity and request signing, so that Mallory cannot simply invoke the "done" function with custom function names and use that to abuse the system.
@koladilip sure, go ahead. You can invoke a function endpoint via HTTP from an Argo workload or run it as a container and use a sidecar to invoke it (I created an example for @csakshaug for this last year, but cannot find it right now). How far did you get with what you were trying?
The CD project Tekton has also been popularised since this thread was created; whilst it's aimed at Continuous Deployment, it has a "Pipeline" mechanism that may be interesting to some users -> https://github.com/tektoncd/pipeline
cc @aledbf @tmiklas
I would welcome use-cases and examples of your current job workflows, how you would see them working differently in openfaas, and what would make things easier for you.
Hi! For long-running and complex workflows I am learning about https://temporal.io/ from the creators of Uber's Cadence, and I wrote a starting tutorial with a golang function: https://sergiotm87.github.io/blog/post/temporalio-workflows-with-openfaas-functions/
Mitchell Hashimoto recently said they are running Temporal to orchestrate HashiCorp Cloud Platform.
A couple of resources people might find interesting:
- Quick PoC to run a Kubernetes Job and print out the logs -> https://github.com/alexellis/lavoro
- An openfaas template to make puppeteer on Kubernetes easy -> https://github.com/alexellis/openfaas-puppeteer-template
Whilst working on "lavoro", I had a question about how jobs in openfaas would differ from our current functions vision, and whether they are the same thing:
- Jobs such as processing a video will need a file injected as input, and collected as output, unless the code itself manages that.
- Jobs may not have an HTTP server, since they only process one request; they may just be a container with a "CMD" that runs to completion.
- Jobs won't necessarily have an API in the same way our current functions do, so it's hard to interface with them. What is the lowest common denominator? It's no longer an HTTP request/response exchange.
Hi, we are currently researching file processing using Kubernetes Jobs. As you pointed out, this kind of processing must manage input and output files, so there must be a component that takes charge of obtaining/saving files from/to a data storage provider.
In our case we have developed OSCAR2, which depends on a MinIO deployment in the same cluster and is in charge of invoking the functions/jobs. Our tool is able to create and configure MinIO's bucket notifications from the job spec. The component in charge of the input and output of files is FaaS Supervisor, a binary that is automatically mounted via a volume in the jobs. To support synchronous invocations we have integrated it with OpenFaaS (redirecting the requests to the gateway), and in addition we have added a log-retrieval service to check the status of the jobs. Workflows can be achieved by linking input/output buckets of different functions.
If anyone is interested in using it, do not hesitate to contact us. We are currently updating the documentation, but we already have a helm chart ready to install on any Kubernetes cluster.
Hi @srisco I am aware of your project and have taken a look at the approach of replacing the asynchronous NATS worker.
Since we added multiple-queue support, you no longer need to take away the regular asynchronous invocations; you can additively run your OSCAR queue-worker on another "queue name".
See also: Multiple named queue support
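For reference, a function opts into a named queue with an annotation at deploy time (image name is a placeholder):

```sh
# Invocations of this function via /async-function/ are handled by the
# queue-worker subscribed to the "oscar" queue
faas-cli deploy --name oscar-fn --image ghcr.io/example/oscar-fn:latest \
  --annotation com.openfaas.queue=oscar
```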
What limitations have you found with the use of Kubernetes jobs? And if you took your learnings and wanted to see them applied upstream in the original project, how would you go about that now? What would it look like to suit your needs?
I also saw that you've written your own OpenFaaS UI which looks very similar to ours in some respects. We're also considering rebuilding a new UI with React or Angular. Have you thought about what it would take to release a version of your UI that could be used with the upstream project?
Feel free to chat with us on OpenFaaS Slack
We welcome contributions from users of the project, and also have an open call for sponsors. If you can think of a way to support the upstream project in some way, that would be appreciated.
Glad you have found value in OpenFaaS for your solution, I hope that we can collaborate in some way going forward?
Alex
@Sergiotm87 thanks for pointing us at Temporal. Is that product open-source, or paid-for only?
I noticed on your blog that the code examples are collapsed; I visited it twice and skipped over them both times. Is there a way you can stop them from collapsing? I think you'll be missing out on people having an "aha" moment because they can't see the code.
If anyone in the community cares: I've been pushing for a number of relatively small items that work together, and I'm very close on my end to being able to support arbitrarily long-running jobs handled in a gracefully autoscaled fashion.
Hi all, I'll be happy to share my use case with you.
We have a library with a lot of scientific functions; each function can be called through a CLI, and each is CPU/memory-intensive and long-running.
We want to give our data scientists access to these functions in a k8s cluster, so the aim was to convert the library into an openfaas image, to be able to call each function (with parameters) through HTTP and get the result back through the async callback... this is pretty simple and "openfaas"-easy.
But for security reasons, we need to run only one function invocation per container, like a batch or job. Unfortunately I haven't found a way to do that with openfaas, which is why I am pretty interested in this thread.
@srisco thanks, I will take a look at OSCAR. And @kevin-lindsay-1, I am pretty interested in your work - please share some information.
I wrote up the changes we made for Surge (where @kevin-lindsay-1 works) here:
Improving long-running jobs for OpenFaaS users
Commercial users can get in touch with us immediately via https://openfaas.com/support instead of waiting for this to come up on the roadmap or in a triage call.
@alexellis thanks, this is very interesting. I was playing with a pre-stop hook to keep my downscaled pod from being trashed while still computing.
Now I will have to wait for your change - we're not a commercial user there :(
Do you have anything in mind for running 1 container per request?
There are ways to do this already, but why do you want that?
We do scientific processing for different projects, and within one container we are not authorized to process data for 2 different projects, i.e. we must not have data from different projects in the same container.
Who came up with the boundary of a container? Why not a VM? Why not a process?
The scientific processing isn't commercial? Someone must fund it in some way. You're welcome to speak to that person and suggest they book a call with us; you'll find a link on the page I shared.
Happy to walk you through how this would work that way. Of course you have the docs and all the readme files on GitHub that can be read freely too.
@alexellis let's take this to email so we don't pollute this thread.
- Should we support long-running batch jobs?
- Are these in the scope of OpenFaaS functions, which are typically several seconds in duration?
- Are there primitives in Kubernetes, such as Jobs, which we can leverage?
Kubernetes jobs example
Edit 29 Sept 2019
Since long-running jobs and workflows are related, I've added workflows to the title. If you're looking for workflows, please feel free to comment with your use-case and whether it's for business purposes or for fun.