serpapi / serpapi-python

an official Python client library for SerpApi.
https://pypi.org/project/serpapi/
MIT License

How to collect all results using BaiduSearch #7

Closed ehsong closed 9 months ago

ehsong commented 9 months ago

Hello, I am testing SerpApi with the BaiduSearch class. How do I collect all results using the 'rn' and 'pn' parameters? If 'rn' is limited to 50 results, how do I collect the rest? Is there a way to request the other pages separately so that I get every result?

from serpapi import BaiduSearch

params = {
  "api_key": "xxxxxxxxxxxx",
  "engine": "baidu",
  "q": "市民社会",
  "ct": "2",
  "rn": "200",
  "gpc": "stf=1356994860,1696155445|stftype=1",
  "q5": "intitle: '市民社会'",
  "q6": "site:zhihu.com",
  "pn": "20"
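}

# Run the search and read the organic results
search = BaiduSearch(params)
organic_results = search.get_dict()["organic_results"]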

Under the organic results, search.get_dict()['organic_results'], only 10 results were listed, so I don't think the 'rn' parameter is working properly. I am using Python 3.9, OS. I set these parameters because there were 200 results over 20 pages on Baidu, and I wanted to get all the links.

ehsong commented 9 months ago

I figured this out: I had to iterate. I collected 50 results, offset the next request by 50, and kept collecting until I had scraped the rest.
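The loop described above can be sketched roughly as follows. This is only an illustration, not part of the SerpApi client: `collect_all` and `fetch_page` are hypothetical names, and `fetch_page(offset, page_size)` is assumed to return the list of organic results for one request (for example, by running a search with `rn` set to the page size and `pn` set to the offset). A fake fetcher stands in for the real request so the sketch runs without an API key.

```python
from typing import Callable, Dict, List


def collect_all(
    fetch_page: Callable[[int, int], List[Dict]],
    page_size: int = 50,
    max_results: int = 1000,
) -> List[Dict]:
    """Request pages at increasing offsets until an empty page comes back."""
    results: List[Dict] = []
    offset = 0
    # max_results is a safety cap so the loop always terminates
    while offset < max_results:
        page = fetch_page(offset, page_size)
        if not page:
            break
        results.extend(page)
        offset += page_size
    return results


# Hypothetical stand-in for a real request: pretends 120 results exist
def fake_fetch_page(offset: int, page_size: int) -> List[Dict]:
    total = 120
    stop = min(offset + page_size, total)
    return [{"link": f"https://example.com/{i}"} for i in range(offset, stop)]


all_results = collect_all(fake_fetch_page)
print(len(all_results))
```

Stopping on an empty page rather than on a short page is a deliberate choice here, since a page may legitimately contain fewer results than requested.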

ilyazub commented 9 months ago

Hello @ehsong.

Here's the code example to get all Baidu Search Results pages (ref: https://replit.com/@serpapi/baidu-all-pages-serpapi#main.py)

# Python package: https://pypi.org/project/serpapi
from serpapi import Client as SerpApiClient
import os

params = {
    "engine": "baidu",
    "q": "市民社会",
    "ct": "2",
    "rn": "50",
    "gpc": "stf=1356994860,1696155445|stftype=1",
    "q5": "intitle: '市民社会'",
    "q6": "site:zhihu.com",
    "pn": "20",
}

serpapi = SerpApiClient(api_key=os.environ['SERPAPI_API_KEY'])
search = serpapi.search(params)

print(f"Current page: {search.get('serpapi_pagination', {}).get('current')}\n")

for organic_result in search.get("organic_results", []):
    print(f"Title: {organic_result['title']}\nLink: {organic_result['link']}\n")

for result in search.yield_pages():
    print(f"Current page: {result.get('serpapi_pagination', {}).get('current')}\n")

    for organic_result in result.get("organic_results", []):
        print(f"Title: {organic_result['title']}\nLink: {organic_result['link']}\n")
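Since the original question was about gathering every link, the printing loops above could be swapped for a small helper that accumulates links instead. This is a sketch under assumptions: `collect_links` is a hypothetical helper, and it only assumes each page behaves like a dict with an optional `organic_results` list, as the code above does. Duplicates are dropped in case the same result shows up on more than one page.

```python
from typing import Dict, Iterable, List


def collect_links(pages: Iterable[Dict]) -> List[str]:
    """Gather unique result links from an iterable of result pages, in order."""
    seen = set()
    links: List[str] = []
    for page in pages:
        for organic_result in page.get("organic_results", []):
            link = organic_result.get("link")
            if link and link not in seen:
                seen.add(link)
                links.append(link)
    return links


# Sample pages standing in for real responses
sample_pages = [
    {"organic_results": [{"title": "A", "link": "https://example.com/a"}]},
    {"organic_results": [
        {"title": "A", "link": "https://example.com/a"},
        {"title": "B", "link": "https://example.com/b"},
    ]},
    {},  # a page without organic results is skipped
]

links = collect_links(sample_pages)
print(links)
```

With the client code above, the call would presumably be `collect_links(search.yield_pages())`.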

If you want to request the data without the SerpApi client library, you can read the next page URL from the serpapi_pagination.next field of each response.

Example

{
  // Omitted...
  "serpapi_pagination": {
    "next": "https://serpapi.com/search.json?ct=1&device=desktop&engine=baidu&f=8&gpc=stf%3D1356994860%2C1696155445%7Cstftype%3D1&oq=%E5%B8%82%E6%B0%91%E7%A4%BE%E4%BC%9A&pn=100&q=%E5%B8%82%E6%B0%91%E7%A4%BE%E4%BC%9A&rn=50",
    // Omitted...
  }
}
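Following `serpapi_pagination.next` by hand might look like the sketch below. The names `with_api_key`, `iter_pages`, and `fetch_json` are hypothetical. It is assumed that the API key has to be appended to each next URL (the example URL above does not contain one) and that `fetch_json(url)` returns the parsed JSON response; in practice that could be something like `requests.get(url).json()`. A fake fetcher is used here so the sketch runs offline.

```python
from typing import Callable, Dict, Iterator, Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_api_key(url: str, api_key: str) -> str:
    """Return the URL with an api_key query parameter added or replaced."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query["api_key"] = api_key
    return urlunsplit(parts._replace(query=urlencode(query)))


def iter_pages(
    first_url: str,
    api_key: str,
    fetch_json: Callable[[str], Dict],
    max_pages: int = 100,
) -> Iterator[Dict]:
    """Yield each response, following serpapi_pagination.next until it is missing."""
    url: Optional[str] = first_url
    for _ in range(max_pages):  # safety cap on the number of requests
        if not url:
            break
        page = fetch_json(with_api_key(url, api_key))
        yield page
        url = page.get("serpapi_pagination", {}).get("next")


# Hypothetical stand-in for an HTTP request: serves offsets 0, 50 and 100
def fake_fetch_json(url: str) -> Dict:
    query = dict(parse_qsl(urlsplit(url).query))
    pn = int(query.get("pn", 0))
    page: Dict = {
        "organic_results": [{"link": f"https://example.com/{pn}"}],
        "serpapi_pagination": {},
    }
    if pn < 100:
        page["serpapi_pagination"]["next"] = (
            f"https://serpapi.com/search.json?engine=baidu&pn={pn + 50}&rn=50"
        )
    return page


start_url = "https://serpapi.com/search.json?engine=baidu&pn=0&rn=50"
pages = list(iter_pages(start_url, "secret", fake_fetch_json))
print(len(pages))
```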

If you have any questions, feel free to contact our support via email (contact@serpapi.com), the Contact Us form (https://serpapi.com/#contact), or the chat widget in the bottom right corner of the https://serpapi.com website.