
Overhaul Scraping Backend #159

Closed JeremyEastham closed 9 months ago

JeremyEastham commented 10 months ago

A Flexible Solution

When we first created AgTern, we faced several engineering challenges that called for carefully architected solutions. One of these solutions was scraping_config.json, a single file that defined how every company we supported would be scraped. It contained the URLs to visit, the XPaths of the elements that held the data we needed, and a description of how that data would be saved to the database.
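
For illustration, a single company's entry in that file might have looked roughly like the sketch below. The field names here are assumptions modeled on the later examples in this issue, not the original schema:

{
    "company": "Example",
    "link": "https://example.com/internships",
    "properties": [
        {
            "name": "title",
            "xpath": "//xpath/to/internship/title"
        },
        {
            "name": "description",
            "xpath": "//xpath/to/internship/description"
        }
    ]
}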

This solution solved several problems:

We successfully implemented configuration parsing! However, this is also when the limitations of this solution started to become evident.

Evolution Of scraping_config.json

The first hurdle we faced was the size of the file. At first, we were only scraping one company, but that number soon grew to over a dozen, and we intend to add even more (at least 30 as of writing). The file became cumbersome to work with, so we split it into separate files, one per company. This worked well for a while, and it is how we currently store scraping information (see data/companies). However, this change only solved one of the issues with the system.

Our first implementation was simple: look up the XPaths for the data we wanted, get the text from the website, and save it to the database. Not all websites are this simple, though. On most sites, page navigation is required. On others, a button may need to be clicked to reveal internship details. Sometimes, the page must be scrolled to load all of the internships. To solve all of these problems and more, I created Scrape Actions. Scrape actions are very powerful: each action object in the config file calls a predefined function that is passed arguments from the config.
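
As a rough sketch of the idea (the function names and signatures below are hypothetical, not AgTern's actual API), each action entry is looked up in a table of predefined functions, and the remaining keys of the action object are forwarded as arguments:

# Hypothetical sketch of scrape-action dispatch; names are illustrative only.

def scrape(driver, properties=None, link_property=None):
    """Look up each property's XPath on the current page and store its text."""
    ...

def click(driver, xpath=None):
    """Click the element located by the given XPath."""
    ...

# Each "action" string in the config maps to one predefined function.
ACTIONS = {"scrape": scrape, "click": click}

def run_actions(driver, company_config):
    for action in company_config["actions"]:
        kwargs = {key: value for key, value in action.items() if key != "action"}
        ACTIONS[action["action"]](driver, **kwargs)  # forward config values as arguments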

To achieve this, I added several systems:

Overengineering At Its Finest

I kept adding features to fill functionality gaps as we encountered more unusual websites. However, there is still one huge problem: these configurations are inherently linear. One common task is to scrape a list of links, click on each of them, and then retrieve the data from each page. To solve this problem, I added a link_property argument to the scrape action:

{
    "company": "Example",
    "link": "https://example.com",
    "actions": [
        {
            "action": "scrape",
            "properties": [
                {
                    "name": "link",
                    "xpath": "//xpath/to/internship/links"
                }
            ]
        },
        {
            "action": "scrape",
            "link_property": "link",
            "properties": [
                {
                    "name": "title",
                    "xpath": "//xpath/to/internship/title"
                },
                {
                    "name": "apply_link",
                    "xpath": "//xpath/to/internship/apply/link"
                }
            ]
        }
    ]
}

This worked, but it was a hack. Configurations have no concept of loops, and they are difficult to read and write. The example above is a simple configuration representing common behavior; it should not be this hard to remember which JSON properties are required, and in which arrangement, to achieve behavior that is supposedly "built-in" to the system. The system is also difficult to debug and modify. How would we scrape just a few internships for testing? What happens if part of an action fails? What if part of the config is invalid? How do we unit test any of this?

A New Solution

I am tired of trying to make JSON files behave like code. We need to get rid of the scrape action system. Our scraping configuration files should hold links and XPaths, but behavior should live in code, and that code should be modular so that each company's scraper stays as simple as possible:

{
    "company": {
        "name": "Example",
        "category": "Tech",
        "keywords": [ "a", "b", "c" ]
    },
    "links": {
        "home": "https://example.com",
        "internships": "https://example.com/internships"
    },
    "xpaths": {
        "link": "//xpath/to/internship/link",
        "title": "//xpath/to/internship/title",
        "description": "//xpath/to/internship/description",
        "info_button": "//xpath/to/internship/info/button",
        "apply_link": "//xpath/to/internship/apply/link"
    }
}

@scrape_company("Example")
def scrape_example():
    scroll_to_bottom()
    for link in scrape("link"):
        goto(link)
        scrape("title", "apply_link")
        click("info_button")
        scrape("description")

@process_internship("Example")
def process_example():
    extract_keywords("title", "description")

Both the configuration and the code are easy to read, easy to modify, and easy to debug. Our data is separated from our logic, and each utility function should be easily unit-testable. This new solution represents a major architectural shift in the backend scraping system of AgTern, but it will lead to a system that is much more robust and easier to work with.
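
As a rough sketch of how the decorator-based registration could work (an assumption about the eventual implementation, not a finished design), each decorated function would simply be recorded in a registry keyed by company name:

# Assumed sketch of decorator-based registration; not the final implementation.

SCRAPERS = {}
PROCESSORS = {}

def scrape_company(name):
    """Register a scraping function for the given company."""
    def register(func):
        SCRAPERS[name] = func
        return func
    return register

def process_internship(name):
    """Register a post-processing function for the given company."""
    def register(func):
        PROCESSORS[name] = func
        return func
    return register

def run_company(name):
    # Scrape a company, then post-process what was scraped.
    SCRAPERS[name]()
    PROCESSORS[name]()

Because each scraper is just a plain function in a registry, a test can import it and call it directly (or exercise the individual utilities like scrape and goto) without running the entire pipeline.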

JeremyEastham commented 10 months ago

Work on these changes will be done in the backend/159/overhaul-scraping-backend branch.

Phase 1

Phase 2

To-Do Pre-Merge

To-Do Post-Merge