hudsonbay / google_scraper_live_view

Application for extracting large amounts of data from the Google search results page

Crawly usage #3

olivierobert closed this issue 3 years ago

olivierobert commented 3 years ago

GoogleScraper makes use of the package crawly.

Upon checking the documentation for the package, it seems to be a great solution for deeply crawling an entire website. However, the current usage seems limited to making a request to a single page. The current implementation does not make use of Crawly.Spider, for instance 🤔

Why not use other more popular HTTP libraries such as httpoison or tesla?

hudsonbay commented 3 years ago

Yeah, why not? In fact, crawly uses httpoison as the HTTP client under the hood, so the layer we're using to call HTTPoison functions is crawly. But that has a potential problem: you don't have control over HTTPoison responses, because crawly is doing everything for you.

For example, you cannot pattern match an error response from HTTPoison like {:error, :nxdomain}, because crawly assumes that you always have internet access or that the domain being crawled will always be up.

So, since that case is not pattern matched inside the crawly implementation, you will receive an exception instead of an error message. I think it will be fixed at some point, or maybe I should try to collaborate and open a PR for that. The crawly community and its creator are very friendly.
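As a hedged sketch of the difference, this is how a failed lookup can be matched when calling HTTPoison directly (the URL is a placeholder; note that HTTPoison actually wraps the reason in an %HTTPoison.Error{} struct rather than returning a bare {:error, :nxdomain}):

```elixir
# Calling HTTPoison directly lets us match every outcome, including
# an unreachable domain, instead of letting an exception propagate.
case HTTPoison.get("https://www.google.com/search?q=elixir") do
  {:ok, %HTTPoison.Response{status_code: 200, body: body}} ->
    {:ok, body}

  {:ok, %HTTPoison.Response{status_code: status}} ->
    # Non-200 responses are still {:ok, response} tuples.
    {:error, {:unexpected_status, status}}

  {:error, %HTTPoison.Error{reason: :nxdomain}} ->
    # DNS lookup failed: no internet, or the domain is down.
    {:error, :nxdomain}

  {:error, %HTTPoison.Error{reason: reason}} ->
    {:error, reason}
end
```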

We agree that crawly is not a bad decision, since it could be useful if you want to support more crawling functionality.

crawly is a great library, but I agree with you that HTTPoison is a good option for this use case, parsing the body with floki.
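A minimal sketch of that alternative, fetching with HTTPoison and parsing with Floki; the module name and the "h3" selector are illustrative assumptions, not the repo's actual code:

```elixir
defmodule SearchFetcher do
  # Fetch a page with HTTPoison and extract result titles with Floki.
  # Every failure mode surfaces as a tagged {:error, reason} tuple.
  def fetch_titles(url) do
    with {:ok, %HTTPoison.Response{status_code: 200, body: body}} <- HTTPoison.get(url),
         {:ok, document} <- Floki.parse_document(body) do
      titles =
        document
        |> Floki.find("h3")
        |> Enum.map(&Floki.text/1)

      {:ok, titles}
    else
      {:ok, %HTTPoison.Response{status_code: status}} -> {:error, {:status, status}}
      {:error, %HTTPoison.Error{reason: reason}} -> {:error, reason}
      {:error, reason} -> {:error, reason}
    end
  end
end
```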

My question is, do you want me to open a PR with a different implementation (using httpoison, for example)?

If I were working on a real project, I'd definitely do it.

olivierobert commented 3 years ago

Thank you for detailing that httpoison is used under the hood by crawly. So in effect, this dependency has a larger implementation surface than httpoison, i.e. it can do more things.

From my perspective, however, this extra power is not used at the moment. So, in line with picking the right tool for the job, I find that crawly is like a hammer while a screwdriver is needed (sorry for the average metaphor 😅). I guess I also come from the Elixir mindset of limiting dependencies as much as possible. Were it in a Ruby/Rails environment, it might not cause such a fuss.

There is no need to make a fix for this change.

hudsonbay commented 3 years ago

> like a hammer while a screwdriver is needed (sorry for the average metaphor 😅)

😅