johndpjr / AgTern


Set up Celery for task automation #144

Closed johndpjr closed 10 months ago

johndpjr commented 11 months ago

Context

Right now, scraping jobs are run manually. We can automate this with Celery, a distributed task queue framework that lets us manage and execute tasks asynchronously (a task being some function). For our use case, Celery will automate web scraper and data processing jobs so that we can easily manage multiple web scrapers at once. Eventually, jobs will run on their own at specified times. This greatly speeds up the web-scraping and data analysis portions of AgTern and removes the need to start jobs manually.

TODO

Notes

This story does not require you to actually run any jobs (i.e., just install Celery; don't run any tasks right now). Using Celery to run jobs will be more complex and is out of scope for this story.

rutaceae commented 11 months ago

Celery requires the use of a message broker to queue tasks. Celery highly recommends RabbitMQ. Before Celery can be implemented, we will need to set up a message broker.

Celery also requires a backend in which to store the result of a task. A Celery application on its own cannot store or provide the return value of any given function; more precisely, Celery reports where in the backend it has stored a result.

To create a basic Celery app in a module:

```python
from celery import Celery

app = Celery(__name__, broker='broker url', backend='backend url')
```
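For example, with RabbitMQ as the broker, the URLs could look roughly like this (the hostname, credentials, and choice of result backend here are illustrative defaults, not settled decisions for AgTern):

```python
from celery import Celery

# RabbitMQ broker over AMQP using the default local credentials; the RPC
# result backend sends results back over AMQP as well, so no extra
# datastore is required for a first pass.
app = Celery(
    __name__,
    broker="amqp://guest:guest@localhost:5672//",
    backend="rpc://",
)
```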

Then, any function that should run asynchronously is decorated with @app.task. The decorated function can be sent to a worker using function.delay(args), where the arguments are passed through to the original function.
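A minimal sketch of that flow (the task name and body are placeholders, not actual AgTern code):

```python
from celery import Celery

app = Celery(__name__, broker="amqp://localhost//", backend="rpc://")

@app.task
def add(x, y):
    # Placeholder task; a real task would wrap a scraping or processing function.
    return x + y

# .delay() queues the task for a worker instead of running it in-process;
# .get() blocks until the worker has stored the return value in the backend.
result = add.delay(2, 3)
print(result.get(timeout=10))
```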

A Celery worker can be run using celery -A filename worker --loglevel=info where filename is a module with a celery app. In production this would need to be daemonized.

It seems as though integrating Celery into the existing code will be relatively simple. A Celery app can be created for the necessary scraping functions, and those functions can be sent to the queue with the arguments for any given website.
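A rough sketch of what that could look like (the function names, arguments, and app name below are hypothetical, not the actual AgTern scraper API):

```python
from celery import Celery

app = Celery("agtern", broker="amqp://localhost//", backend="rpc://")

@app.task
def scrape_company(company_name: str, careers_url: str) -> int:
    """Placeholder: scrape one company's careers page and return the job count."""
    # In the real integration this would call into the existing scraping code.
    jobs = []  # e.g. the postings returned by the scraper for careers_url
    return len(jobs)

# Queue one scraping task per target site; workers pick them up concurrently.
for name, url in [("ExampleCo", "https://example.com/careers")]:
    scrape_company.delay(name, url)
```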

rutaceae commented 10 months ago

Closed with PR 154.