alexellis opened 6 years ago
I spoke with @iyovcheva about this yesterday. My initial thoughts are that jobs are orthogonal to functions, but they are a common ask from the community around longer-running batch processing and CI jobs. Is there an opportunity to add value here?
Nomad and Kubernetes both have these primitives, but I don't believe Swarm does. Judging by the issue, it looks like Swarm may get the primitive in the near future, but I'm not sure how long that will take.
I can see this being a larger ask from the community going forward. I have been thinking a lot about this myself for personal and work-related tasks. It would be nice to have something that can be scheduled to run at a specific time without being locked into tight time constraints. I also feel it is a slippery slope, and we should be careful not to turn the project into something like Rundeck. It may be worth waiting to see what Swarm does before building anything; leveraging the primitives of something already built is probably a smaller lift.
> It looks like Swarm may get the primitive in the near future from the issue but not sure how long that will be.
Having read and followed the thread, I think this is unlikely to happen in the near future.
Kubernetes can support this use-case natively; Swarm may need a separate controller written to make this possible. I made a start with a CLI tool called JaaS, which has a few users.
The Rundeck project looks interesting btw.
There are many such scenarios in real business, and containers handle this kind of work well: spin up a container when it's needed, and destroy it once the work completes. We are doing this based on openfaas. When a function is invoked, we put the request in a queue and deploy the function, recording the number of call requests and completions. If there are no new requests within a certain period of time and all operations have completed, we delete the function and wait for the next invocation.
Hello -- just commenting here as well to share our science use-case. "Black box" functions that are expensive to evaluate are a common setup in optimization problems. Often a machine-learning-based strategy decides at which parameters the function is evaluated, in order to reduce the number of evaluations: see e.g. https://scikit-optimize.github.io/ or https://github.com/diana-hep/excursion. Function evaluations can easily take multiple hours. My ideal user interface would be:
```sh
> faas submit --parameters '{"a": 1, "b": "Hello World"}'
http://some.url/to/future

> faas ready http://some.url/to/future
false

> faas ready http://some.url/to/future
true

> faas retrieve http://some.url/to/future
{"value": 1.23}
```
Thanks for adding your use-case.
What if instead of checking for the result in a stateful way, you specified a callback URL in an event-driven way?
Is "job checking" a hard requirement too?
I think you could do this basic flow with the existing asynchronous processing, using a long enough max timeout. Each async function call returns a call ID, which you get back on the callback.
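For example, a rough sketch of that flow with curl (function name and callback URL are placeholders):

```sh
# Invoke asynchronously: the gateway accepts the request and returns
# HTTP 202 along with an X-Call-Id header for correlation
curl -i http://127.0.0.1:8080/async-function/long-task \
  --data '{"a": 1, "b": "Hello World"}' \
  -H "X-Callback-Url: http://receiver.example.com/done"

# When the invocation completes, the queue-worker POSTs the result to
# the callback URL, including the same X-Call-Id header
```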
Alex
Hi -- I think a callback would work equally well; it's definitely not a hard requirement. Are you thinking of one callback for all function calls (where the callback specifies some invocation ID), or a unique callback URL per call?
Is there any update?
I am using csharp functions. If a function receives the request on a single thread, I can start a new thread with a unique key and return that key in the response. On a subsequent request, I can send the key and check the status of the thread.
I haven't tried to implement this yet because I am new to OpenFaaS and still testing it for our needs; I don't know how it will behave with threading, and I'm searching for a solution that already exists.
I was planning to replace our Windows services with OpenFaaS. These services run scheduled tasks which usually take 3 to 10 minutes, but now we also need to run those tasks on demand.
Hi @sheryever you can run for 3-5 mins, no problems. Threads are also fair game. 👍
I don't think you need what I'm calling long-running jobs for that.
Alex
Some of the requirements/constraints I'm hearing from users:
Assumed, but need users to confirm:
A thin wrapper around a Kubernetes Job, or clear documentation covering this use-case with Kubernetes Jobs, may be enough for a large percentage of the people asking for the above, but this is unclear. I think it would be worth exploring with 1-2 of the people who need this.
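As a quick illustration of how little a raw Job needs (image name is hypothetical):

```sh
# A plain Kubernetes Job already provides run-to-completion, retries
# and status reporting
kubectl create job big-data-process --image=ghcr.io/example/processor:latest

# Check on progress and collect the output
kubectl get job big-data-process
kubectl logs job/big-data-process
```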
I'd like to add my use case. We have many teams that use different automation frameworks in different languages.
To schedule the tests, we'd like to treat each framework as a function. When there's a new available build, we can just invoke all the functions, passing related parameters. Each test may run for several hours.
When a code change to a framework is merged, it should update the function without breaking on-going requests: new requests are routed to the latest function, while old on-going requests run to completion before the function is upgraded.
For async functions, auto-scaling could be supported based on request-count limits in the function.
Thanks @alexellis and thanks @OpenFaaS.
This is an intriguing problem... there's more than a couple of use cases being described here, but as Alex pointed out, there are some high-level similarities/requirements.
Just thinking out loud about this, I can't think of how we would "know" that a function is still executing. Perhaps a "status" function available in the OF core to receive messages from functions? I'm thinking of setting a variable on a function to mark it as a long-running job (i.e. `functionType: job`) that would then start a routine pushing updates to the main status function, so that the status can be queried/reported on. Just something simple with an in-memory map of `[functionName]status` that would be updated on POST from the function's watchdog.
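A sketch of what that could look like over HTTP - to be clear, none of this API exists today, it's just the proposal above:

```sh
# The watchdog could POST periodic status updates to the core
curl -X POST http://gateway:8080/system/status \
  --data '{"function": "big-data-process", "status": "running"}'

# ...and operators could query the in-memory map
curl "http://gateway:8080/system/status?function=big-data-process"
```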
To summarize, the list of things in my head to accomplish this would be:

- a `functionType: job` marker on the function
- a status endpoint, e.g. `openfaas/system/status?function=big-data-process`, where `big-data-process` is the function name as defined in the yml file

Some questions I haven't thought of a way to answer yet:
What about a `batch` mode in the of-watchdog? This would run the function method to completion and then stop, making it easy to use the same image both as a function and as the image in an Argo workflow / k8s Job / etc.
By default the watchdog could just send an "empty" request, but it could also accept a file path and pass each of the files as a request, one at a time, to the method. The response would either be dropped or saved (one per file) to a file path. This would be very Argo-friendly. Alternatively, it could accept an S3-compatible server, bucket, and path, and read/write from there.
I don't think openfaas needs to be the batch job runner, but if we make it really easy to just use a function in a batch job system, that would go a long way.
Technically this would all be possible without any changes from us: you can write a template that contains the watchdog and some other init script, and just use a different command in your container when you run it as a batch job. But documenting this and making it an approved workflow instead of a workaround would probably make people happy - see the sketch below.
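A sketch of that workaround under current primitives (image name and handler entrypoint are hypothetical):

```sh
# Reuse a function's image as a one-shot Kubernetes Job by overriding the
# command, so the handler runs once to completion instead of the watchdog
kubectl create job my-fn-batch --image=ghcr.io/example/my-fn:latest \
  -- python3 /home/app/handler.py
```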
@alexellis is the goal that "functions" should be compatible with batch jobs, e.g. taking an openfaas function and running it in Argo / pure k8s Jobs / kubeflow / etc.? Or do we want to create another job system, in which we take any docker image that exposes the "function interface" of a server on port 8080 and run it as a one-time job?
That is a good question. Maybe it will be both?
What do our users need?
Hi, I've been working on different use cases involving long-running batch jobs and these are my thoughts.
After trying different scaling configurations with asynchronous invocations, mainly based on CPU consumption and increasing the number of replicas of the queue-worker, my colleagues and I came to the conclusion that it was more convenient to use Kubernetes Jobs. However, we wanted to take advantage of openfaas' ability to invoke functions through the gateway, so we decided to create oscar-worker as a substitute for nats-queue-worker. Its goal is to convert invocations that reach NATS through the `/async-function/` route into k8s Jobs. It is not a very elegant solution, since the NATS queue wouldn't be necessary for this purpose, but it does its job.
My idea of a better integration for long-running jobs in openfaas would be to add a tag to functions indicating that they are long-running functions/jobs. These functions would get a new route in the gateway, for example `/job/`. When a request is sent to this route, the gateway would convert the request into a Kubernetes Job. The result could be exposed in the logs or sent via callback using a sidecar (or init-container + container) in the Job.
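As a sketch, an invocation of that hypothetical route might look like:

```sh
# The gateway would convert this request into a Kubernetes Job rather
# than forwarding it to a long-lived function (proposed, not existing)
curl http://gateway:8080/job/video-transcode --data @input.json
```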
I think this approach wouldn't be so hard to implement and would cover the needs of a huge number of users.
Also potentially useful / interesting relating to workflows - https://github.com/s8sg/faas-flow
Hi everyone! I've been looking for an ideal solution to run long-running ETL jobs that take a few hours; they mostly involve database-to-database data transfer. Currently we're just running on some VMs with an in-house scheduler, but we'd like to do better than this. OpenFaas seemed like a possible solution, as we could run "functions" on demand. However, I am unsure of the suitability of something like OpenFaas for running work that takes that long. There has been some great discussion here about what is currently missing from OpenFaas in this domain. Namely:
Something I was not clear about was whether you could run something like this on OpenFaas at all. I know that AWS Lambda has a hard timeout of 15 minutes per invocation. Does OF have something similar, or is it just that running a function for that long may not be reliable?
If someone could give me a high-level idea of what Kubernetes Jobs offer that OF does not currently, and what putting an OF layer on top of Jobs would gain us, that would help me immensely.
Thanks!
For us, even for things that only run let's say 2-3 minutes, I think we would quite appreciate having the 3 features @zhl146 mentioned: status, retries and cancelling.
You get all of that by simply using a Kubernetes `Job` (@zhl146), which for us is pretty viable, since most of our long-running jobs don't have the exact semantics of a function - e.g. they are usually pure side-effects and don't need to return anything.
However, one of the reasons OpenFaas is attractive to us is that we can deploy each job as "a piece of code that can be triggered by an HTTP request", which helps decouple the job itself from the means of running it. For example, you can have a `CronJob` calling the function every 15 minutes, while at the same time being able to call it manually/reactively, without deploying the business logic of the job twice or creating separate container images.
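For example, assuming the OpenFaaS cron-connector is installed, a sketch of a deployment that is both scheduled and HTTP-invocable (names are placeholders):

```sh
# The cron-connector invokes the function per the schedule annotation,
# while the same endpoint stays available for manual/reactive calls
faas-cli deploy --name etl-job --image ghcr.io/example/etl:latest \
  --annotation topic=cron-function \
  --annotation schedule="*/15 * * * *"
```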
Therefore, I think it would be beneficial to be able to use the OpenFaas API to reuse the code it already has in the container to spawn the K8S job, instead of having a separate flow for it.
As an alternative, I also see value in making it possible to re-use the OF-built image and just run the function without the watchdog. That way we could use the K8S API separately to run the jobs while still reusing the same container image. But I think we would prefer the OF-integrated option.
I am very interested in this and would love to help push this forward. For my use case, our team uses Airflow for our ETL processes and OpenFaas functions for the actual processing of files. We have found this to be a really nice combo as we can more easily test each of the different processes without having to bloat our Airflow code. Airflow then simply wires up the different functions and handles the retries and failures.
Right now we have an OpenFaas function called `record-function` that records when another function has started/completed/failed by storing the status in Redis. We use it by first calling `function/record-function/{unique-id}/start`, where `unique-id` is just a UUID we use to identify the run. We then call our actual function via `async-function/my-long-running-function` and pass `function/record-function/{unique-id}/stop` as the callback URL. Finally, an Airflow sensor polls the Redis database to see when the function has completed, and then the rest of the workflow continues.
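A sketch of that flow with curl (gateway URL is a placeholder):

```sh
UNIQUE_ID=$(uuidgen)

# 1. Record the start of the run
curl http://gateway:8080/function/record-function/$UNIQUE_ID/start

# 2. Kick off the long-running function asynchronously; on completion
#    the queue-worker POSTs the result to the stop endpoint
curl http://gateway:8080/async-function/my-long-running-function \
  --data @payload.json \
  -H "X-Callback-Url: http://gateway:8080/function/record-function/$UNIQUE_ID/stop"
```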
It would be great if we could instead kick off a long-running function as a k8s Job and then poll its status by calling the OpenFaas gateway with something like `system/function/my-long-running-function`, getting back the k8s Job status in the response.
I have started using OF for my project and ended up at this issue (#657). I have a requirement for long-running functions using async, but is there any other way?
Is it possible to use OpenFaaS with Argo Workflows? This would give users much more flexibility to build complex flow-processing capabilities.
A few requests have come up on Slack recently:
These all seem like job semantics that would fit in with the discussion on this issue.
An approach which may work with the existing primitives, without changing OpenFaaS, is:

- For each request, create a `$RANDOM_UID`, then run `faas-cli deploy --image function/image --name $RANDOM_UID` with an async callback to a "done" function.
- Set the function not to scale to zero, and give it one replica.
- Have the "done" function delete the function: `faas-cli remove $RANDOM_UID`.
- Failed invocations still come back to the "done" function.
- For status checking, the "done" function could write to some storage like a database table, which would allow for in-progress detection, fetching the result, and cancellation.
None of this would require Kubernetes Jobs or limit the feature to only working on K8s, though there will be some edge cases. If anyone here is still interested in "jobs for openfaas", I'd suggest prototyping the above and seeing how well it works for you.
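A minimal sketch of that loop, assuming a local gateway and a placeholder image (the scale labels are my reading of how to pin one always-on replica):

```sh
RANDOM_UID=$(uuidgen | tr 'A-Z' 'a-z' | cut -c1-8)

# 1. Deploy a single-use function for this request
faas-cli deploy --image function/image --name "job-$RANDOM_UID" \
  --label com.openfaas.scale.min=1 \
  --label com.openfaas.scale.zero=false

# 2. Invoke it asynchronously, reporting completion to the "done" function
curl http://127.0.0.1:8080/async-function/job-$RANDOM_UID \
  --data @input.json \
  -H "X-Callback-Url: http://127.0.0.1:8080/function/done"

# 3. The "done" handler records the result, then cleans up with:
#    faas-cli remove job-$RANDOM_UID
```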
There are some other areas that may need further probing, like identity and request signing, so that Mallory cannot simply invoke the "done" function with custom function names and use that to abuse the system.
@koladilip sure, go ahead. You can invoke a function endpoint via HTTP from an Argo workload or run it as a container and use a sidecar to invoke it (I created an example for @csakshaug for this last year, but cannot find it right now). How far did you get with what you were trying?
The CD project Tekton has also been popularised since this thread was created; whilst it's aimed at Continuous Deployment, it has a "Pipeline" mechanism that may be interesting to some users -> https://github.com/tektoncd/pipeline
cc @aledbf @tmiklas
I would welcome use-cases and examples of your current job workflows, how you would see them working differently in openfaas, and what would make things easier for you.
Hi! For long-running and complex workflows I am learning about https://temporal.io/ from the creators of Uber's Cadence, and I wrote a starting tutorial with a golang function: https://sergiotm87.github.io/blog/post/temporalio-workflows-with-openfaas-functions/
Mitchell Hashimoto recently said they are running Temporal to orchestrate HashiCorp Cloud Platform.
A couple of resources people might find interesting:
- Quick PoC to run a Kubernetes Job and print out the logs -> https://github.com/alexellis/lavoro
- An openfaas template to make puppeteer on Kubernetes easy -> https://github.com/alexellis/openfaas-puppeteer-template
Whilst working on "lavoro", I had a question about how jobs in openfaas would differ from our current functions vision, and whether they are the same thing:
- Jobs such as processing a video will need a file injected as input, and collected as output, unless the code itself manages that.
- Jobs may not have an HTTP server, since they only process one request; they may just be a container with a "CMD" that runs to completion.
- Jobs won't necessarily have an API in the same way our current functions do, so it's hard to interface with them. What is the lowest common denominator? It's no longer an HTTP request/response exchange.
Hi, we are currently researching file processing using Kubernetes Jobs. As you pointed out, this kind of processing must manage input and output files, so there must be a component that takes charge of obtaining/saving files from/to a data storage provider.
In our case we have developed OSCAR2, which depends on a MinIO deployment in the same cluster and is in charge of invoking the functions/jobs. Our tool is able to create and configure MinIO's bucket notifications from the job spec. The component in charge of the input and output of files is FaaS Supervisor, a binary that is automatically mounted via a volume in the jobs. To support synchronous invocations we have integrated it with OpenFaaS (redirecting the requests to the gateway), and in addition we have added a log-retrieval service to check the status of the jobs. Workflows can be achieved by linking input/output buckets of different functions.
If anyone is interested in using it, do not hesitate to contact us. We are currently updating the documentation, but we already have a helm chart ready to install on any Kubernetes cluster.
Hi @srisco I am aware of your project and have taken a look at the approach of replacing the asynchronous NATS worker.
Since we added multiple-queue support, you no longer need to take away the regular asynchronous invocations; you can additively run your OSCAR queue-worker on another "queue name".
See also: Multiple named queue support
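For reference, a function opts into a named queue with an annotation at deploy time (image name is a placeholder):

```sh
# Invocations of this function via /async-function/ are handled by the
# queue-worker subscribed to the "oscar" queue
faas-cli deploy --name oscar-fn --image ghcr.io/example/oscar-fn:latest \
  --annotation com.openfaas.queue=oscar
```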
What limitations have you found with the use of Kubernetes jobs? And if you took your learnings and wanted to see them applied upstream in the original project, how would you go about that now? What would it look like to suit your needs?
I also saw that you've written your own OpenFaaS UI which looks very similar to ours in some respects. We're also considering rebuilding a new UI with React or Angular. Have you thought about what it would take to release a version of your UI that could be used with the upstream project?
Feel free to chat with us on OpenFaaS Slack
We welcome contributions from users of the project, and also have an open call for sponsors. If you can think of a way to support the upstream project in some way, that would be appreciated.
Glad you have found value in OpenFaaS for your solution, I hope that we can collaborate in some way going forward?
Alex
@Sergiotm87 thanks for pointing us at Temporal. Is that product open-source, or paid-for only?
I noticed on your blog that the code examples are collapsed; I visited it twice and skipped over them both times. Is there a way you can stop them from collapsing? I think you'll be missing out on people having an "aha" moment because they can't see the code.
If anyone in the community cares: I've been pushing for a number of relatively small items that work together, and I'm very close on my end to being able to support arbitrarily long-running jobs handled in a gracefully autoscaled fashion.
Hi all, I'll be happy to share my use case with you.
We have a library with a lot of scientific functions; each function can be called through a CLI, and each is CPU/memory-intensive and long-running.
We want to give our data scientists access to these functions in a k8s cluster, so the aim was to convert the library into an openfaas image, to be able to call each function (with parameters) through HTTP and get the result back through the async callback... this is pretty simple and "openfaas"-easy.
But for security reasons, we need to run only one function invocation per container, like a batch or job. Unfortunately I haven't found a way to do that with openfaas, which is why I am pretty interested in this thread.
@srisco thanks, I will take a look at OSCAR. And @kevin-lindsay-1, I am pretty interested in your work - please share some information.
I wrote up the changes we made for Surge (where @kevin-lindsay-1 works) here:
Improving long-running jobs for OpenFaaS users
Commercial users can get in touch with us immediately via https://openfaas.com/support instead of waiting for this to come up on the roadmap or in a triage call.
@alexellis thanks, this is very interesting. I was playing with a pre-stop hook to keep my downscaled pod from being trashed while still computing.
Now I will have to wait for your change - we're not a commercial user there :(
Do you have anything in mind for running 1 container per request?
There are ways to do this already, but why do you want that?
We do scientific processing for different projects, and within one container we are not authorized to process data for 2 different projects, i.e. we must not have data from different projects in the same container.
Who came up with the boundary of a container? Why not a VM? Why not a process?
The scientific processing isn't commercial? Someone must fund it in some way. You're welcome to speak to that person and suggest they book a call with us; you'll find a link on the page I shared.
Happy to walk you through how this would work that way. Of course you have the docs and all the readme files on GitHub that can be read freely too.
@alexellis let's take this to email so we don't pollute this thread.
- Should we support long-running batch jobs?
- Are these in the scope of OpenFaaS functions, which are typically several seconds in duration?
- Are there primitives in Kubernetes, such as Jobs, which we can leverage?
Kubernetes jobs example
Edit 29 Sept 2019
Since long-running jobs and workflows are related, I've added workflows to the title. If you're looking for workflows, please feel free to comment with your use-case and whether it's for business purposes or for fun.