We've figured out how to scrape fresh internships, but the way we update internships now is less than ideal: we ignore duplicates instead of updating them (see the TODO in scraper.py). If we had functionality for updating all internships for a job site, our data and workflow would be a lot cleaner.
TODO
[ ] Add a CLI argument --add-new-jobs that enables the following algorithm.
[ ] Basic algorithm: scrape a site; for each scraped job, if no job with the same company and posting_link already exists, add it to the database. Don't delete old jobs and don't update existing ones; that behavior will be implemented later.
[ ] Implement the update algorithm in the scraper.
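The add-only algorithm above could be sketched roughly as follows. This is a minimal illustration, not the final implementation: the jobs table schema, the title column, and the add_new_jobs function name are assumptions, and the real scraper may use a different storage layer.

```python
import argparse
import sqlite3

def add_new_jobs(conn, scraped_jobs):
    """Insert only jobs whose (company, posting_link) pair isn't already stored.

    Existing rows are never updated or deleted; that is a later ticket.
    Table/column names here are hypothetical.
    """
    cur = conn.cursor()
    added = 0
    for job in scraped_jobs:
        cur.execute(
            "SELECT 1 FROM jobs WHERE company = ? AND posting_link = ?",
            (job["company"], job["posting_link"]),
        )
        if cur.fetchone() is None:
            cur.execute(
                "INSERT INTO jobs (company, posting_link, title) VALUES (?, ?, ?)",
                (job["company"], job["posting_link"], job["title"]),
            )
            added += 1
    conn.commit()
    return added

# Sketch of the CLI flag; parse_args is given an explicit argv here
# to simulate `python scraper.py --add-new-jobs`.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--add-new-jobs",
    action="store_true",
    help="insert jobs not already stored; never update or delete",
)
args = parser.parse_args(["--add-new-jobs"])
```

Keying duplicates on (company, posting_link) rather than posting_link alone guards against aggregator sites where distinct companies can share a link format.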
Notes
Don't implement or worry about running these updates periodically, that will be another ticket.