With 27K+ sites that we want to crawl, possibly even repeatedly, it is worthwhile to look into which cloud providers we can use. In principle there are three:
- Amazon Web Services (AWS)
- Google Cloud Platform (GCP)
- Microsoft Azure
All of these can get pretty complicated, especially AWS. So, my sense is that GCP or Azure would be easier. For example, there is Azure Virtual Desktop. It is probably not the best fit for our use case, but I am mentioning it because of the Headless vs. Non-headless Mode point below. The platform I know best is Firebase, which is also part of GCP, but it is probably not what we are looking for either.
Heroku may also be an option, but there the issue is that we do not immediately have an OS to run a browser on.
The basic question is: what do people use for crawling the web at scale?
One important point that @Jocelyn0830 brought up today is that we may run into an issue with headless vs. non-headless mode. If we are switching to some cloud architecture, can we still crawl in non-headless mode? If so, do we need to make any modifications, and which ones? If we are not able to use non-headless mode, can we modify our crawler to work in headless mode?
A less important point, which we can look into once we have figured out the previous two, is how to deploy. For example, Puppeteer mentions Docker; see the sketch below.
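One way to keep the non-headless option open in the cloud would be to run the browser inside a container with a virtual display (Xvfb). The Dockerfile below is only a rough, untested sketch under that assumption; the base image, package names, and the `crawler.js` entry point are placeholders that would need to be checked against our actual setup.

```dockerfile
# Rough sketch, not tested: run a "headful" Firefox crawler without a real display.
FROM node:18-slim

# Firefox ESR plus Xvfb (virtual framebuffer) so a non-headless browser can start.
RUN apt-get update && apt-get install -y --no-install-recommends \
        firefox-esr xvfb \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /crawler
COPY package*.json ./
RUN npm ci
COPY . .

# xvfb-run provides the virtual X display; crawler.js is a placeholder entry point.
CMD ["xvfb-run", "--auto-servernum", "node", "crawler.js"]
```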
The main point that @Jocelyn0830 mentioned in our call is that in headless mode it may not be possible to install extensions. So, is there a workaround for this? The other issue (using keyword commands to trigger running and stopping the analysis) seems less of a problem. @Jocelyn0830 will continue exploring ...
I read over the documentation of different cloud platforms, and most of them only support headless automated browsers. It seems to me that the easiest way forward is to develop a crawler that works in headless mode.
Here are the problems if we want to make a headless crawler:
Our current crawler works in headful mode, so we could do all of the above manually. However, a headless crawler means that everything needs to be done programmatically.
I got stuck on problem 1 described above when using Puppeteer. I tried many approaches in Puppeteer, but they all failed, so I decided to explore other options. After experimenting with different webdrivers, Selenium seems to be the one that works for us:

1. Selenium supports JavaScript and provides functions similar to those used in our current crawler.
2. Selenium supports different versions of Firefox, while Puppeteer only supports Firefox Nightly.
3. Selenium provides an API that can programmatically install a Firefox add-on in .xpi format.
Several problems I encountered:
I successfully implemented a headless crawler using Selenium.
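For reference, here is a minimal sketch of what a headless Selenium crawl with a programmatically installed add-on can look like using the JavaScript bindings. This is not our actual crawler code; the .xpi path and target URL are placeholders, and the exact method names (`addArguments`, `installAddon`) should be verified against the installed selenium-webdriver version.

```javascript
// Minimal sketch: headless Firefox via selenium-webdriver with a temporary add-on.
const { Builder } = require('selenium-webdriver');
const firefox = require('selenium-webdriver/firefox');

async function crawlOne(url) {
  const options = new firefox.Options();
  options.addArguments('--headless'); // run Firefox without a display

  const driver = await new Builder()
    .forBrowser('firefox')
    .setFirefoxOptions(options)
    .build();

  try {
    // Install our extension as a temporary add-on (second argument = temporary).
    await driver.installAddon('/path/to/extension.xpi', true); // placeholder path

    await driver.get(url);
    // ... trigger the analysis and collect the results here ...
  } finally {
    await driver.quit();
  }
}

crawlOne('https://example.com').catch(console.error);
```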
As discussed today, @Jocelyn0830 will continue exploration of the backend. This can be simple (e.g., Firebase) or local (just MacMini as before).
We do not need much of a UI. We can just issue a command, the crawler starts crawling, and the data is saved. There does not need to be much, if any, display via the browser.
We may also need a different backend for storing the data, e.g., results stored in a database instead of in a CSV file; see the sketch below.
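Since Firebase came up above, here is a rough sketch of what writing a crawl result to Firestore could look like with the firebase-admin Node SDK. The collection name `crawl_results` and the fields are hypothetical, purely for illustration of the shape such a backend could take.

```javascript
// Rough sketch: store one crawl result in Firestore instead of appending to a CSV.
const admin = require('firebase-admin');

admin.initializeApp(); // assumes credentials are provided via the environment
const db = admin.firestore();

async function saveResult(result) {
  // 'crawl_results' and these fields are hypothetical, just to illustrate the idea.
  await db.collection('crawl_results').add({
    site: result.site,
    analysis: result.analysis,
    crawledAt: admin.firestore.FieldValue.serverTimestamp(),
  });
}

saveResult({ site: 'https://example.com', analysis: { /* ... */ } })
  .catch(console.error);
```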
@Jocelyn0830 will create a list of features and make a first call on which ones go into the crawler, which go into the extension, and which can be removed from one or the other.
Then, we need to think about organizing our codebase. We can use this repo or create new ones.
And if you have any one-off expenses, let me know, @Jocelyn0830, and I will reimburse you. Once we settle on a definite solution, I will set up an account and credit card to handle the monies.
We can explore the possibility of doing cloud crawling.