serpapi / google-search-results-python

Google Search Results via SERP API pip Python Package
MIT License

google scholar pagination skips result 20 #26

Closed samuelhaysom closed 1 year ago

samuelhaysom commented 3 years ago

When retrieving results from Google Scholar using the pagination() method, the first article on the second page of google scholar is always missing.

I think this is caused by the following snippet in the update() method of google-search-results-python/serpapi/pagination.py:

def update(self):
    self.client.params_dict["start"] = self.start
    self.client.params_dict["num"] = self.num
    if self.start > 0:
        self.client.params_dict["start"] += 1

This seems to mean that for every page after the first, paginate() increases start by 1. The first page requests results starting at 0 and ending at 19 (if page_size=20), but the second page requests results starting at 21 and ending at 40, skipping result 20.

If I delete the if statement, the code seems to work as intended and I get result 20 back.
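The off-by-one can be illustrated with a small simulation of which result indices each page requests (`page_ranges` is a hypothetical helper written for this illustration, not part of the library):

```python
def page_ranges(page_size, pages, off_by_one=False):
    """Simulate the result indices each page requests."""
    ranges = []
    for page in range(pages):
        start = page * page_size
        if off_by_one and start > 0:
            start += 1  # the adjustment made by update() in pagination.py
        ranges.append(list(range(start, start + page_size)))
    return ranges

# With the adjustment, page 2 starts at 21, so index 20 is never requested
buggy = page_ranges(20, 2, off_by_one=True)   # [[0..19], [21..40]]
fixed = page_ranges(20, 2)                    # [[0..19], [20..39]]
```

Removing the `if` block makes the second page begin exactly where the first one ended.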

ilyazub commented 3 years ago

@jvmvik Can you take a look?

dimitryzub commented 2 years ago

@samuelhaysom Currently, the best approach is to use serpapi_pagination instead, as you also mentioned in issue #25. Once #30 is merged, the pagination() method will be the preferred one. Sorry for the delayed reply.

if "next" in results.get("serpapi_pagination", {}):
    search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next")).query)))
else:
    break

For example:

# Google Scholar Search API

from serpapi import GoogleSearch
from urllib.parse import (parse_qsl, urlsplit)

params = {
  "api_key": "...",             # SerpApi API key
  "engine": "google_scholar",   # search engine
  "q": "minecraft redstone",    # search query
  "hl": "en"                    # language
}

search = GoogleSearch(params)   # where data extraction happens

# to show page number
page_num = 0

# iterate over all pages
results_is_present = True
while results_is_present:
    results = search.get_dict()  # JSON -> Python dict

    if "error" in results:
        print(results["error"])
        break

    page_num += 1
    print(f"Current page: {page_num}")

    # iterate over organic results and extract the data
    for result in results.get("organic_results", []):
        print(result.get("position"), result.get("title"), sep="\n")

    # check if the next page key is present in the JSON
    # if present -> split URL in parts and update to the next page
    if "next" in results.get("serpapi_pagination", {}):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next")).query)))
    else:
        break
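
To see why the `urlsplit`/`parse_qsl` combination works, here is a standalone sketch against a hypothetical "next" URL of the shape returned in `serpapi_pagination` (the exact URL below is made up for illustration):

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical next-page URL, similar in shape to serpapi_pagination["next"]
next_url = ("https://serpapi.com/search.json"
            "?engine=google_scholar&q=minecraft+redstone&hl=en&start=20")

# urlsplit isolates the query string; parse_qsl decodes it into (key, value)
# pairs, so the dict can be merged into params_dict for the next request
params = dict(parse_qsl(urlsplit(next_url).query))
# params now holds engine, q, hl, and the updated start offset as strings
```

Because the query already contains the correct `start` value, updating `params_dict` this way avoids the off-by-one in `pagination.py` entirely.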