An issue I ran into when webscraping is being blocked by the website we are trying to scrape. When scraping hundreds of jobs, the site blocks us temporarily (for a few minutes). To fix this, we need to implement proxy rotation. Another issue is querying speed: we can use the puppeteer-cluster library to scrape links in parallel (similar to multithreading).
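A minimal sketch of the proxy-rotation idea: round-robin through a pool of proxies and pass the current one to Puppeteer via Chromium's `--proxy-server` flag. The proxy URLs and the `nextProxy` helper are placeholders, not a real proxy service.

```javascript
// Placeholder proxy pool -- swap in real proxy endpoints.
const PROXIES = [
  'http://proxy-a.example.com:8080',
  'http://proxy-b.example.com:8080',
  'http://proxy-c.example.com:8080',
];
let proxyIndex = 0;

// Round-robin through the pool so consecutive scrapes use different IPs.
function nextProxy() {
  const proxy = PROXIES[proxyIndex % PROXIES.length];
  proxyIndex += 1;
  return proxy;
}

// Launch a browser routed through the next proxy in the pool.
async function launchWithProxy() {
  const puppeteer = require('puppeteer'); // lazy require; sketch only
  return puppeteer.launch({
    args: [`--proxy-server=${nextProxy()}`], // Chromium proxy flag
  });
}
```

Relaunching (or using a fresh proxy per batch of pages) spreads requests across IPs so no single one trips the site's rate limit.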
In addition to webscraping, we will use this API. Rather than scraping on every user search, we will scrape jobs weekly and store them in our database using a CRON job.
Each week, the CRON job will delete all the stored jobs in the database and repopulate it with updated listings.
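The weekly delete-and-repopulate step could look like the sketch below, assuming node-cron for scheduling; `scrapeAllJobs` and the `jobs` collection handle are hypothetical stand-ins for the real scraper entry point and MongoDB connection.

```javascript
// Cron expression: minute hour day-of-month month day-of-week
const WEEKLY_SCHEDULE = '0 3 * * 0'; // Sundays at 03:00

// Wipe last week's listings and repopulate with a fresh scrape.
async function refreshJobs(jobsCollection, scrapeAllJobs) {
  await jobsCollection.deleteMany({});   // delete all stored jobs
  const fresh = await scrapeAllJobs();   // re-scrape everything
  if (fresh.length > 0) {
    await jobsCollection.insertMany(fresh);
  }
  return fresh.length;
}

// Register the weekly job (sketch; assumes node-cron is installed).
function startWeeklyRefresh(jobsCollection, scrapeAllJobs) {
  const cron = require('node-cron');     // lazy require; sketch only
  return cron.schedule(WEEKLY_SCHEDULE, () =>
    refreshJobs(jobsCollection, scrapeAllJobs)
  );
}
```

Deleting before inserting keeps the collection free of stale or removed postings, at the cost of a brief window with no jobs; inserting into a temp collection and renaming would avoid that if it becomes a problem.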
Whenever a user searches a keyword and location on our website, we will check MongoDB to see if this has been searched before. If it hasn't, we will scrape the jobs, store them in MongoDB, and then display them (this path will still be slow). However, if the keyword/location pair has been searched before (and recently, within the past week), we will just return the stored results.
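The check-then-scrape flow can be sketched like this. An in-memory Map stands in for the MongoDB collection, and `getCachedOrScrape`, `ONE_WEEK_MS`, and the injected `scrape` function are all assumed names, not part of the real codebase.

```javascript
const ONE_WEEK_MS = 7 * 24 * 60 * 60 * 1000;

// Stand-in for the MongoDB collection: key -> { jobs, scrapedAt }
const cache = new Map();

// Return cached jobs if scraped within the past week; otherwise
// scrape now (slow path), store the result, and return it.
async function getCachedOrScrape(keyword, location, scrape, now = Date.now()) {
  const key = `${keyword.toLowerCase()}|${location.toLowerCase()}`;
  const hit = cache.get(key);
  if (hit && now - hit.scrapedAt < ONE_WEEK_MS) {
    return hit.jobs;                            // fresh cache hit
  }
  const jobs = await scrape(keyword, location); // slow path
  cache.set(key, { jobs, scrapedAt: now });
  return jobs;
}
```

With MongoDB, the same logic becomes a `findOne` on the keyword/location pair plus a timestamp comparison, with an upsert on the slow path.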
The idea above is essentially 'caching', keyed on keyword and location.
Resources: 1) Puppeteer webscrape proxy, 2) Puppeteer parallel webscraping: https://github.com/thomasdondorf/puppeteer-cluster
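A sketch of the parallel-scraping idea using puppeteer-cluster: each queued URL gets its own browser context, and `maxConcurrency` pages run at once. The job-board URL format and the `.job-title` selector are assumptions for illustration.

```javascript
// Hypothetical job-board search URL -- adjust for the real site.
function buildSearchUrl(keyword, location) {
  const base = 'https://example-jobs.com/search';
  return `${base}?q=${encodeURIComponent(keyword)}&l=${encodeURIComponent(location)}`;
}

// Scrape a list of URLs in parallel with puppeteer-cluster.
async function scrapeJobLinks(urls, maxConcurrency = 4) {
  const { Cluster } = require('puppeteer-cluster'); // lazy require; sketch only
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT, // isolated context per task
    maxConcurrency,                           // pages scraped in parallel
  });

  const results = [];
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    // '.job-title' is a placeholder selector for the listing titles.
    const titles = await page.$$eval('.job-title', els =>
      els.map(el => el.textContent.trim())
    );
    results.push({ url, titles });
  });

  urls.forEach(url => cluster.queue(url));
  await cluster.idle();  // wait for every queued page to finish
  await cluster.close();
  return results;
}
```

Compared with scraping links one at a time in a single page, this overlaps the network waits, which is where most of the querying time goes.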