realCalvin / CmpEndium

https://cmpendium.herokuapp.com/

Webscraping Utils (Optional, if we have time) #27

Open realCalvin opened 3 years ago

realCalvin commented 3 years ago

One issue I ran into while webscraping is getting blocked by the website we are trying to scrape. After scraping hundreds of jobs, the site blocks us for a couple of minutes. To fix this, we need to implement proxy rotation. Another issue is query speed. We can use a Puppeteer library that scrapes links in parallel (similar to multithreading).

1) Puppeteer Webscrape Proxy
2) Puppeteer Parallel Webscraping: https://github.com/thomasdondorf/puppeteer-cluster
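A rough sketch of how the two ideas could fit together, without committing to a specific library yet. Everything here is an assumption for illustration: the proxy URLs, the `createProxyRotator`, `mapWithConcurrency`, and `scrapeAll` names, and the `scrapePage(url, proxy)` callback, which stands in for whatever Puppeteer code actually loads the page (for example, launching the browser with the Chromium flag `--proxy-server=<proxy>` in `args`).

```javascript
// Hypothetical proxy endpoints; replace with real ones.
const PROXIES = [
  'http://proxy-a.example.com:8000',
  'http://proxy-b.example.com:8000',
  'http://proxy-c.example.com:8000',
];

// Returns a function that hands out proxies in round-robin order.
function createProxyRotator(proxies) {
  if (proxies.length === 0) {
    throw new Error('need at least one proxy');
  }
  let next = 0;
  return () => {
    const proxy = proxies[next];
    next = (next + 1) % proxies.length;
    return proxy;
  };
}

// Runs `worker` over `items` with at most `limit` calls in flight at once.
// Results come back in the same order as `items`.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let cursor = 0;

  // Each lane keeps pulling the next unclaimed item until none are left.
  async function runLane() {
    while (cursor < items.length) {
      const index = cursor++;
      results[index] = await worker(items[index], index);
    }
  }

  const laneCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}

// Scrapes every URL, giving each request the next proxy in the rotation.
// `scrapePage` is a placeholder for the real Puppeteer logic.
async function scrapeAll(urls, scrapePage, { proxies = PROXIES, limit = 3 } = {}) {
  const nextProxy = createProxyRotator(proxies);
  return mapWithConcurrency(urls, limit, (url) => scrapePage(url, nextProxy()));
}
```

If we go with puppeteer-cluster, it would probably replace `mapWithConcurrency`, and the rotation part could stay as is. Keeping `limit` small should also help with getting blocked less often.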

In addition to webscraping, we will use this api. Rather than webscraping on every user search, we will webscrape jobs weekly and store them in our database using a cron job. Each week, the cron job will delete all stored jobs in the database and repopulate it with updated jobs.
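A minimal sketch of the weekly refresh, assuming an in-memory store in place of our real database and a `fetchJobs()` function that wraps the scraper and the API calls. The names `createJobStore`, `refreshJobs`, and `scheduleWeeklyRefresh` are made up for illustration, and `setInterval` is only a stand-in for whatever cron mechanism we end up using.

```javascript
const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

// In-memory stand-in for the jobs collection (hypothetical interface).
function createJobStore() {
  let jobs = [];
  return {
    deleteAll: async () => { jobs = []; },
    insertMany: async (newJobs) => { jobs.push(...newJobs); },
    findAll: async () => [...jobs],
  };
}

// Replaces every stored job with a freshly fetched batch.
// The fetch happens first, so if scraping fails the old jobs stay in place.
async function refreshJobs(store, fetchJobs) {
  const fresh = await fetchJobs();
  await store.deleteAll();
  await store.insertMany(fresh);
  return fresh.length;
}

// Runs the refresh once a week for as long as the process stays up.
// Returns the timer so the caller can stop it with clearInterval.
function scheduleWeeklyRefresh(store, fetchJobs) {
  return setInterval(() => {
    refreshJobs(store, fetchJobs).catch((err) => {
      console.error('weekly job refresh failed', err);
    });
  }, WEEK_MS);
}
```

One caveat: an in-process timer resets whenever the server restarts, so an external scheduler is likely the safer choice on Heroku. The `refreshJobs` part should carry over either way.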

realCalvin commented 3 years ago

Ideas to improve rendering/speed for jobs page: