Determine a Good Scraper Data Format

AlpacaFur commented 1 year ago

Summary

Before we start implementing anything, we want to settle on a data format we're happy with. That way, we can implement the Scraper output code and the Tooling ingest code in parallel without blocking each other.

Ideally, this structure will be easy for us (as humans) to navigate and also easy for our Tooling (and Graduate's backend) to read.

Tasks

[x] Determine the output format for the flat files. This should include:
- [x] Directory structure (where are the major files located? within a year? within a folder for that specific major?)
- [x] Where do the individual steps fall within that structure? (HTML, tokens, parsed)
- [x] Where does metadata live in this structure? (separate file? per major? per college? etc)
- [x] How many copies of the same file do we keep? (just one? one "current" and one "new" if something has changed? more than two?)
- [x] What kind of metadata do we need to store? (last updated time, major review status, etc) (this can and likely will change over time)
[x] Make sure the data format will be easily readable by Graduate's backend too! (ideally Graduate should be able to just pull our repo and reload the majors at runtime).
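
To make the questions above concrete, here is a rough sketch of one layout that would answer them. Everything in it is an assumption for illustration only: the `year/college/major` nesting, the file names, the `MajorMetadata` fields, and the helper names are all hypothetical, not the format that was decided on.

```python
from dataclasses import dataclass, asdict
from pathlib import PurePosixPath
import json

# Hypothetical stage names, borrowed from the "HTML, tokens, parsed" steps in the task list.
STAGES = ("html", "tokens", "parsed")


@dataclass
class MajorMetadata:
    # Hypothetical fields, based on the examples mentioned in the task list.
    last_updated: str   # ISO 8601 timestamp
    review_status: str  # e.g. "unreviewed" or "approved"


def major_dir(root, year, college, major):
    """Folder holding every file for one major (assumed nesting: year/college/major)."""
    return PurePosixPath(root) / str(year) / college / major


def stage_path(root, year, college, major, stage):
    """Path of the file produced by one pipeline stage for one major."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    # Raw HTML keeps its extension; later stages are assumed to be JSON.
    ext = "html" if stage == "html" else "json"
    return major_dir(root, year, college, major) / f"{stage}.{ext}"


def metadata_path(root, year, college, major):
    """Path of the per-major metadata file (assumes metadata lives next to the major)."""
    return major_dir(root, year, college, major) / "metadata.json"


def dump_metadata(meta):
    """Serialize metadata with sorted keys so diffs between runs stay small."""
    return json.dumps(asdict(meta), indent=2, sort_keys=True)
```

With a layout like this, a consumer such as Graduate's backend could walk the folders for a given year and read each `parsed.json` without knowing anything about the earlier stages.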

clue4 commented 1 year ago

https://www.notion.so/sandboxnu/The-Ultimate-Scraper-Docs-c9d9bba6e0cd4c46aa7742eaea2c1e67

(still a wip, but our thoughts so far)

clue4 commented 1 year ago

edit: decisions made in doc

clue4 commented 1 year ago

closing as resolved!

sandboxnu / major-scraper

Determine a Good Scraper Data Format #3