icssc / anteater-api

API that provides easy access to public data from UC Irvine. Developed for Anteaters, by Anteaters.

https://anteaterapi.com/reference

GNU Affero General Public License v3.0

feat: scraper enhancements #20

Closed by ecxyzzy 2 weeks ago

ecxyzzy commented 2 weeks ago

Description

Changes to the study room scraper:

Update the database schema to enforce uniqueness of each (study room ID, start, end) tuple. This keeps the study_room_slot table from ballooning to many millions of rows.
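
To illustrate why a uniqueness constraint stops the table from growing on every run, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for the project's actual database. Only the table name study_room_slot comes from this PR; the column names and the helper functions are assumptions made for the example, not the real schema or scraper code.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical study_room_slot table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE study_room_slot (
            study_room_id TEXT NOT NULL,
            start_time    TEXT NOT NULL,
            end_time      TEXT NOT NULL,
            is_available  INTEGER NOT NULL,
            -- One row per (room, start, end); re-scraping cannot add duplicates.
            UNIQUE (study_room_id, start_time, end_time)
        )
        """
    )
    return conn


def upsert_slots(conn: sqlite3.Connection, slots: list[tuple]) -> None:
    """Insert slots, updating availability when the slot already exists."""
    conn.executemany(
        """
        INSERT INTO study_room_slot (study_room_id, start_time, end_time, is_available)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (study_room_id, start_time, end_time)
        DO UPDATE SET is_available = excluded.is_available
        """,
        slots,
    )
    conn.commit()


def count_slots(conn: sqlite3.Connection) -> int:
    """Same check as the manual test: SELECT COUNT(1) FROM study_room_slot."""
    return conn.execute("SELECT COUNT(1) FROM study_room_slot").fetchone()[0]
```

With this shape, running the same scrape twice leaves the row count unchanged, which is the behavior the testing steps in this PR check for.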

Changes to the WebSoc scraper:

Remove the DEFAULT NOW directive from the updated_at columns to ensure they are updated by the scraper on an upsert.
Add a column to the websoc_meta table that tracks the last department that was scraped successfully. If this column is not null for any term eligible to be scraped, the scraper will prioritize that term and resume where it left off.
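
The resume behavior can be sketched roughly as follows. This is an illustration under stated assumptions, not the scraper's actual code: it uses sqlite3 in place of the real database, the column name last_dept_scraped and the department list are hypothetical, and clearing the checkpoint once a term finishes is an assumed design choice that the PR does not describe.

```python
import sqlite3
from typing import Callable, Optional

# Hypothetical, fixed scrape order; the real scraper's department list is not shown in this PR.
DEPARTMENTS = ["ANTHRO", "BIO SCI", "COMPSCI", "I&C SCI"]


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # last_dept_scraped is an assumed name for the new websoc_meta column.
    conn.execute(
        "CREATE TABLE websoc_meta (term TEXT PRIMARY KEY, last_dept_scraped TEXT)"
    )
    return conn


def pick_term(conn: sqlite3.Connection, eligible: list[str]) -> tuple[str, Optional[str]]:
    """Prefer an eligible term that has a non-null checkpoint."""
    for term in eligible:
        row = conn.execute(
            "SELECT last_dept_scraped FROM websoc_meta WHERE term = ?", (term,)
        ).fetchone()
        if row is not None and row[0] is not None:
            return term, row[0]
    return eligible[0], None  # no interrupted scrape: start fresh


def scrape_term(
    conn: sqlite3.Connection,
    term: str,
    last_dept: Optional[str],
    scrape_dept: Callable[[str, str], None],
) -> None:
    """Scrape departments after the checkpoint, recording each success."""
    start = DEPARTMENTS.index(last_dept) + 1 if last_dept is not None else 0
    for dept in DEPARTMENTS[start:]:
        scrape_dept(term, dept)  # may raise if interrupted
        conn.execute(
            """
            INSERT INTO websoc_meta (term, last_dept_scraped) VALUES (?, ?)
            ON CONFLICT (term) DO UPDATE SET last_dept_scraped = excluded.last_dept_scraped
            """,
            (term, dept),
        )
        conn.commit()
    # Assumption: a completed term clears its checkpoint.
    conn.execute("UPDATE websoc_meta SET last_dept_scraped = NULL WHERE term = ?", (term,))
    conn.commit()
```

Committing the checkpoint after each department means an interruption loses at most the department that was in progress.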

Misc fixes:

Explicitly close the connection upon scraper/migration termination to avoid possible deadlocks.
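
As a rough illustration of closing the connection on every exit path, the sketch below wraps the work in try/finally. It uses sqlite3 only so that it is runnable as-is; run_with_connection is a hypothetical helper, and the project's real database client and shutdown hooks are not shown in this PR.

```python
import sqlite3
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_connection(work: Callable[[sqlite3.Connection], T], db_path: str = ":memory:") -> T:
    """Run `work` with a connection that is closed on success, error, or Ctrl-C."""
    conn = sqlite3.connect(db_path)
    try:
        return work(conn)
    finally:
        # Runs for normal returns and for exceptions (including KeyboardInterrupt),
        # so the connection is never left open when the process ends.
        conn.close()
```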

Related Issue

Closes #4.

How Has This Been Tested?

For the study location scraper:

Run scraper once locally.
Run SELECT COUNT(1) FROM study_room_slot against the local dev database.
Run scraper again.
Run the above query again and verify that the number is the same, or close to the same. It may differ if run before/after the half-hour mark, since that's usually when additional availability is revealed.

For the WebSoc scraper:

Run scraper once locally.
Ctrl-C it before it finishes a full scrape.
Run scraper again and verify that it picks up where you stopped it last.
Note the value of the updated_at column for an arbitrary row belonging to the term.
Run scraper again.
Verify that the updated_at column was updated properly.
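
The updated_at check above can be reproduced in miniature. A column default only applies when a row is first inserted, so the timestamp changes on a conflict only if the upsert sets it explicitly; the sketch below does exactly that. It uses sqlite3 as a stand-in, and the websoc_section table, its columns, and the helper names are hypothetical rather than taken from the real schema.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # No DEFAULT on updated_at: the caller must always supply the timestamp.
    conn.execute(
        """
        CREATE TABLE websoc_section (
            section_code TEXT PRIMARY KEY,
            status       TEXT NOT NULL,
            updated_at   TEXT NOT NULL
        )
        """
    )
    return conn


def upsert_section(conn: sqlite3.Connection, section_code: str, status: str, now: str) -> None:
    """Insert or update a section, always writing the scrape timestamp."""
    conn.execute(
        """
        INSERT INTO websoc_section (section_code, status, updated_at)
        VALUES (?, ?, ?)
        ON CONFLICT (section_code)
        DO UPDATE SET status = excluded.status, updated_at = excluded.updated_at
        """,
        (section_code, status, now),
    )
    conn.commit()


def get_updated_at(conn: sqlite3.Connection, section_code: str) -> str:
    return conn.execute(
        "SELECT updated_at FROM websoc_section WHERE section_code = ?", (section_code,)
    ).fetchone()[0]
```

Passing the timestamp in as a parameter, instead of reading the clock inside the function, keeps the example deterministic and easy to verify.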

Types of changes

[x] Bug fix (non-breaking change which fixes an issue)
[ ] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

[x] My code involves a change to the database schema.
[ ] My code requires a change to the documentation.