jarun / googler

:mag: Google from the terminal
GNU General Public License v3.0

Does it support returning "Top stories" results from the SERP? #361

Closed. guyfromhongkong closed this issue 4 years ago

guyfromhongkong commented 4 years ago

Would it be possible to return "Top stories" results from the SERP? As far as I can see, the current version supports only organic rankings. Thanks!

zmwangx commented 4 years ago

Sounds like a reasonable request. I’ll see later whether those can be easily extracted. How they should be presented in our linear flow is itself up for debate.

zmwangx commented 4 years ago

Hey, finally took a look at this.

The result isn't satisfactory. The parser would be rather brittle, both because of the multiple "Top stories" layouts I've seen in use at the same time and because the carousel is generally indistinguishable from the video carousel, Twitter carousel, etc. (I only realized it would pick up video carousel results after I had generated the patch below, and now I can't be bothered to develop it further.)

Also, googler can only see the first three results. The rest are rendered by JS.

Apply the patch if you'd like to include this experimental functionality. I doubt we'll add it.

From 06e70e23f8086bb68d98893bbaeaa9df3639ef89 Mon Sep 17 00:00:00 2001
From: Zhiming Wang <i@zhimingwang.org>
Date: Sun, 11 Oct 2020 23:19:12 +0800
Subject: [PATCH] Add experimental support for "Top stories"

---
 googler | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/googler b/googler
index b1479a7..e2865d2 100755
--- a/googler
+++ b/googler
@@ -2343,6 +2343,35 @@ class GoogleParser(object):
         cw = lambda s: re.sub(r'[ \t\n\r]+', ' ', s) if s is not None else s

         index = 0
+
+        # Try to parse "Top stories".
+        #
+        # Detection doesn't work on all pages! E.g. when I search
+        # "covid" the layout for "Top stories" is simply different....
+        carousel = tree.select('g-section-with-header g-scrolling-carousel')
+        if carousel:
+            # Devise a really crappy strategy to tell a "Top stories"
+            # carousel apart from a Twitter carousel, which unfortunately
+            # shares the same structure.
+            section = next(el for el in carousel.ancestors() if el.tag == 'g-section-with-header')
+            if section.first_element_child().select('title-with-lhs-icon'):
+                # This section contains a title-with-lhs-icon (":newspaper
+                # icon: Top stories") which a Twitter carousel doesn't have,
+                # a good sign...
+                for card in carousel.select_all('g-inner-card'):
+                    heading = card.select('[role=heading]')
+                    title = heading.text
+                    a = card.select('a')
+                    url = a.attr('href')
+                    metadata_node = heading.parent.last_element_child()
+                    metadata = metadata_node.text if metadata_node is not heading else ''
+                    result = Result(index + 1, cw(title), url, '',
+                                    metadata=cw(metadata), sitelinks=[], matches=[])
+                    if result not in self.results:
+                        self.results.append(result)
+                        index += 1
+
+        # Regular results.
         for div_g in tree.select_all('div.g'):
             if div_g.select('.hp-xpdbox'):
                 # Skip smart cards.
-- 
2.28.0

Also available as a gist: https://gist.github.com/zmwangx/ce643da063bc6b259e83a46dfd719946.
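
For anyone who wants to experiment with the heuristic outside of googler, here is a minimal standalone sketch of the same idea: find a carousel inside a `g-section-with-header`, accept it only if the section's first child contains a `title-with-lhs-icon`, then read the heading, link, and trailing metadata from each `g-inner-card`. Several things here are assumptions and not part of the patch: it uses the standard library's `xml.etree.ElementTree` instead of googler's own DOM parser, the `SAMPLE` markup is synthetic and well-formed (real result pages are neither), and the names `extract_top_stories` and `collapse_ws` are made up for illustration. Unlike the patch, it also walks every section instead of only the first carousel.

```python
import re
import xml.etree.ElementTree as ET

# Synthetic, well-formed stand-in for a results page (an assumption; real
# Google markup is not valid XML and changes without notice).
SAMPLE = """
<html>
  <g-section-with-header>
    <div><title-with-lhs-icon>Top stories</title-with-lhs-icon></div>
    <g-scrolling-carousel>
      <g-inner-card>
        <a href="https://example.com/a">
          <div><div role="heading">First   story</div><span>Example News, 2 hours ago</span></div>
        </a>
      </g-inner-card>
      <g-inner-card>
        <a href="https://example.com/b">
          <div><div role="heading">Second story</div></div>
        </a>
      </g-inner-card>
    </g-scrolling-carousel>
  </g-section-with-header>
  <g-section-with-header>
    <div><span>Some account on Twitter</span></div>
    <g-scrolling-carousel>
      <g-inner-card>
        <a href="https://example.com/tweet"><div><div role="heading">A tweet</div></div></a>
      </g-inner-card>
    </g-scrolling-carousel>
  </g-section-with-header>
</html>
"""


def collapse_ws(s):
    """Collapse runs of whitespace, like the `cw` lambda in the patch."""
    return re.sub(r'[ \t\n\r]+', ' ', s).strip()


def extract_top_stories(markup):
    """Return a list of dicts (title, url, metadata) for "Top stories" cards."""
    root = ET.fromstring(markup)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    stories = []
    for section in root.iter('g-section-with-header'):
        carousel = section.find('.//g-scrolling-carousel')
        if carousel is None or len(section) == 0:
            continue
        # Same weak signal as the patch: the section's first child holds a
        # title-with-lhs-icon, which a Twitter carousel lacks.
        if section[0].find('.//title-with-lhs-icon') is None:
            continue
        for card in carousel.iter('g-inner-card'):
            heading = card.find(".//*[@role='heading']")
            link = card.find('.//a')
            if heading is None or link is None:
                continue
            # Metadata is the last sibling of the heading, if there is one.
            last = parents[heading][-1]
            metadata = ''.join(last.itertext()) if last is not heading else ''
            story = {
                'title': collapse_ws(''.join(heading.itertext())),
                'url': link.get('href'),
                'metadata': collapse_ws(metadata),
            }
            if story not in stories:  # skip exact duplicates
                stories.append(story)
    return stories


if __name__ == '__main__':
    for story in extract_top_stories(SAMPLE):
        print(story)
```

On the synthetic sample this keeps the two cards from the first section and skips the Twitter-like section. It inherits the limitations described above: a video carousel with the same structure would still be picked up, and alternative "Top stories" layouts would be missed.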