hudsonbay / google_scraper_live_view

Application for extracting large amounts of data from the Google search results page

Control flow in returning scraping details #4

Open · olivierobert opened this issue 2 years ago

olivierobert commented 2 years ago

The code below seems to imply that if no data is found for total_results, no scraping info is returned at all:

https://github.com/hudsonbay/google_scraper_live_view/blob/22ee211c8df4d802fe0879824e005b05ebb4aa95/lib/google_scraper.ex#L66-L79

Why structure the control flow this way? Wouldn't it be better to return scraping information as long as the Google search result page is parseable?
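
For reference, the questioned pattern has roughly this shape. This is a hedged sketch only: the function name, the use of Floki as the HTML parser, and the map keys are my assumptions, not the actual code at the linked lines.

# Rough sketch of the pattern under discussion, not the repository's code.
# Assumes Floki as the HTML parser.
defp parse_scraping_info(document) do
  case Floki.find(document, "#result-stats") do
    # No #result-stats element: no scraping info is returned at all
    [] ->
      nil

    [stats | _] ->
      %{
        total_results: Floki.text(stats),
        total_links: parse_links(document)
      }
  end
end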

hudsonbay commented 2 years ago

There is a reason for it. For example, if you search for something like jsdfg0isdgijseg09sdfgoidsfg9, you will get no results from Google, and the #result-stats element won't exist in the HTML. With no results, the result page won't be parseable. Try it in your browser with that word to see what I mean.

[Screenshot: Google's "no results" page for the gibberish query]

What I was basically trying to do was support that edge case.
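
Assuming Floki is the HTML parser, the edge case shows up as an empty match for the selector, which is what that branch has to catch:

# Assuming Floki: a selector with no matching element returns an
# empty list instead of raising, e.g. on Google's "no results" page.
Floki.find(document, "#result-stats")
#=> []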

But I think I could have commented that part of the code, leaving a note for better understanding.

As it is a private function, I won't document it with a @doc attribute.

One of the things I like about Elixir is its simplicity and cleanness, so I share the idea that the code itself should be the best documentation. If the code speaks for itself, that's a good sign; if you need to comment it, it may be because it's too complex. But there are cases where a comment is relevant for the readers of that code.
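
For instance, a short note like this (hypothetical wording, and the same assumed function name as above) would record the intent without needing a @doc:

# Google omits the #result-stats element entirely when a query has
# zero hits, so a missing match means "no results", not a parse bug.
defp parse_scraping_info(document) do
  # ...
end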

olivierobert commented 2 years ago

Understood on the case the implementation tries to catch; it can indeed happen. From my perspective, beyond the question of whether to document this case, what is more surprising is that the check is attached to total_results, since the same situation could equally arise for total_links.

Maybe something like the following would be more explicit:

with true <- valid_search_result_page(document),
     total_results <- parse_total_results(document),
     total_links <- parse_links(document) do
  # ...
else
  false -> # ...
end

This way the check is more explicit, and there is room for more error handling at each parsing step 💭
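
Fleshing that sketch out, the parsing helpers could return tagged tuples so each failure gets its own else clause. The names and return shapes here are illustrative, not the project's actual API:

with true <- valid_search_result_page(document),
     {:ok, total_results} <- parse_total_results(document),
     {:ok, total_links} <- parse_links(document) do
  {:ok, %{total_results: total_results, total_links: total_links}}
else
  false -> {:error, :not_a_search_result_page}
  {:error, reason} -> {:error, reason}
end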

hudsonbay commented 2 years ago

Oh, yes, you are right!! I didn't see it that way. I think yours is a better approach, because even when there are no results you can still have links. For example:

[Screenshot: a Google results page with no result count shown but links still listed]

So the number of results would be zero, but the total number of links wouldn't.
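
A small change along these lines would cover it. This is a sketch assuming Floki, and extract_count/1 is a hypothetical helper that pulls the number out of the stats text:

# Missing #result-stats now means zero results, and links are
# parsed regardless of the result count.
total_results =
  case Floki.find(document, "#result-stats") do
    [] -> 0
    [stats | _] -> extract_count(stats)
  end

total_links = parse_links(document)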

When I was working on the project I never thought of that. Now I see it. Thank you!