serpapi / google-search-results-python

Google Search Results via SERP API pip Python Package
MIT License
600 stars 97 forks

Provide a more convenient way to paginate via the Python package #19

Closed ilyazub closed 3 years ago

ilyazub commented 3 years ago

Currently, the only way to paginate searches is to read serpapi_pagination.current and increase the offset or start parameter in a loop, just as with plain HTTP requests to serpapi.com/search without an API wrapper.

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "coffee",
    "tbm": "nws",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

print(f"Current page: {results['serpapi_pagination']['current']}")

for news_result in results["news_results"]:
    print(f"Title: {news_result['title']}\nLink: {news_result['link']}\n")

while 'next' in results['serpapi_pagination']:
    search.params_dict["start"] = results['serpapi_pagination']['current'] * 10
    results = search.get_dict()

    print(f"Current page: {results['serpapi_pagination']['current']}")

    for news_result in results["news_results"]:
        print(
            f"Title: {news_result['title']}\nLink: {news_result['link']}\n"
        )

It would be more convenient for the official API wrapper to provide a function like search.paginate(callback: Callable) that calculates the correct offset for the specific search engine and loops through the pages until the end.

import os
from serpapi import GoogleSearch

def print_results(results):
  print(f"Current page: {results['serpapi_pagination']['current']}")

  for news_result in results["news_results"]:
    print(f"Title: {news_result['title']}\nLink: {news_result['link']}\n")

params = {
    "engine": "google",
    "q": "coffee",
    "tbm": "nws",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
search.paginate(print_results)
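
A minimal sketch of how such a callback-based helper could work, written as a standalone function so it runs without an API key. The names paginate, fetch_page, offset_key, page_size, max_pages and fake_fetch are assumptions made for illustration and are not part of the package. The offset rule (current page times 10, sent as "start") is the one from the manual loop above.

```python
from typing import Any, Callable, Dict


def paginate(
    fetch_page: Callable[[Dict[str, Any]], Dict[str, Any]],
    params: Dict[str, Any],
    callback: Callable[[Dict[str, Any]], None],
    offset_key: str = "start",   # assumed offset parameter name (Google uses "start")
    page_size: int = 10,         # assumed number of results per page
    max_pages: int = 100,        # safety limit so the loop always terminates
) -> int:
    """Call callback(results) once per page; return the number of pages fetched."""
    params = dict(params)  # copy so the caller's dict is not mutated
    pages = 0
    while pages < max_pages:
        results = fetch_page(params)
        callback(results)
        pages += 1
        # The pagination block may be missing entirely on the last page.
        pagination = results.get("serpapi_pagination", {})
        if "next" not in pagination:
            break
        params[offset_key] = pagination["current"] * page_size
    return pages


def fake_fetch(params: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for a real search call: serves three fake pages."""
    current = params.get("start", 0) // 10 + 1
    pagination: Dict[str, Any] = {"current": current}
    if current < 3:
        pagination["next"] = f"https://serpapi.com/search?start={current * 10}"
    return {
        "news_results": [{"title": f"title {current}", "link": f"link {current}"}],
        "serpapi_pagination": pagination,
    }


def print_results(results: Dict[str, Any]) -> None:
    print(f"Current page: {results['serpapi_pagination']['current']}")


paginate(fake_fetch, {"engine": "google", "q": "coffee"}, print_results)
```

With a real search, fetch_page would wrap something like the search.get_dict() call shown earlier. Passing the offset key and page size as arguments is one way to cover engines that paginate differently.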

@jvmvik @hartator What do you think?

jvmvik commented 3 years ago

Good idea. We could take advantage of Python generators. The callback extracts the data from each page and returns it. A new paginate method returns a generator that yields the callback's result for each page until all the pages have been returned. Each page will cost one search.

def callback(results):
  return results["news_results"]

for news_result in search.paginate(callback):
  print(news_result)

The data presentation is isolated from the data processing.

jvmvik commented 3 years ago

Actually, we do not need a callback because the SerpApi backend already offers proper pagination.

# initialize the search
search = GoogleSearch({"q": "Coffee", "location": "Austin,Texas"})
# to get 2 pages
start = 0
end = 20
# create a python generator
pages = search.pagination(start, end)
print("display generated")
urls = []
# fetch one page of results per iteration of the for loop
for result in pages:
    urls.append(result['serpapi_pagination']['next'])

# checks taken from the unit test, written as plain asserts
assert len(urls) == 2
assert "start=10" in urls[0]
assert "start=20" in urls[1]

see commit: f9a470a9efd8c8956cb84597b4c33136670e3abb

ilyazub commented 3 years ago

for result in pages:
  urls.append(result['serpapi_pagination']['next'])

Great! Thank you for the idea with the iterator. It looks cleaner than my initial idea.

Confirming that it works. Example.

https://user-images.githubusercontent.com/282605/116860340-195b5b00-ac0a-11eb-818a-2de63c0c813f.mp4

ilyazub commented 3 years ago

f9a470a works, but sometimes it fails with an exception on the last page:

Traceback (most recent call last):
  File "main.py", line 16, in <module>
    for result in pages:
  File "/opt/virtualenvs/python3/src/google-search-results/serpapi/pagination.py", line 19, in __next__
    if not 'next' in result['serpapi_pagination']:
KeyError: 'serpapi_pagination'

kikohs commented 3 years ago

I was looking for this info in the documentation, but it is not written anywhere. Could you please add it to the Python package README and also to the SerpApi website for paying customers? Thank you!

jvmvik commented 3 years ago

Thanks for your feedback!

@ilyazub The code now handles a missing "serpapi_pagination" field on the last page. @kikohs The README now includes information on pagination support.
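
For illustration only, here is one way an iterator could tolerate a last page that has no "serpapi_pagination" field. This is a sketch under assumptions, not the code in the package's pagination.py. PageIterator, fetch_page and fake_fetch are hypothetical names, and the start/end bounds mirror the search.pagination(start, end) call shown earlier.

```python
from typing import Any, Callable, Dict


class PageIterator:
    """Iterate over result pages between start and end, stopping safely on the last page."""

    def __init__(
        self,
        fetch_page: Callable[[Dict[str, Any]], Dict[str, Any]],
        params: Dict[str, Any],
        start: int = 0,
        end: int = 100,
        page_size: int = 10,  # assumed results per page
    ) -> None:
        self.fetch_page = fetch_page
        self.params = dict(params)  # copy so the caller's dict is not mutated
        self.start = start
        self.end = end
        self.page_size = page_size
        self.done = False

    def __iter__(self) -> "PageIterator":
        return self

    def __next__(self) -> Dict[str, Any]:
        if self.done or self.start >= self.end:
            raise StopIteration
        self.params["start"] = self.start
        result = self.fetch_page(self.params)
        # .get() avoids the KeyError when the field is absent on the last page.
        pagination = result.get("serpapi_pagination") or {}
        if "next" not in pagination:
            self.done = True  # still return this page, then stop on the next call
        self.start += self.page_size
        return result


def fake_fetch(params: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical backend: two pages with pagination, then a last page without it."""
    start = params["start"]
    if start >= 20:
        return {"organic_results": []}
    return {
        "organic_results": [f"result at offset {start}"],
        "serpapi_pagination": {
            "current": start // 10 + 1,
            "next": f"https://serpapi.com/search?q=Coffee&start={start + 10}",
        },
    }


pages = list(PageIterator(fake_fetch, {"q": "Coffee"}, start=0, end=100))
print(len(pages))
```

Returning the final page before stopping keeps its results available to the caller, and the end bound still caps how many searches are spent.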

Library released on PyPI.