serpapi / serpapi-javascript

Scrape and parse search engine results using SerpApi.
https://serpapi.com
MIT License

Better support for pagination #2

Closed sebastianquek closed 9 months ago

sebastianquek commented 1 year ago

Goals

This module should make handling pagination simpler across all engines.

Context

Currently, pagination needs to be handled manually. One approach is the following:

import { config, getJson } from "serpapi";

config.api_key = process.env.API_KEY;

const num = 10; // Number of results per page
let start = 0; // Results offset

const links = [];

while (start < 50) { // Get up to 50 results
  const json = await getJson("google", {
    q: "coffee",
    location: "Austin, Texas",
    start,
    num,
  });
  const pageLinks = json["organic_results"].map((r) => r.link);
  links.push(...pageLinks);
  start += num;
}

console.log(links);

This works for engines that paginate by offset + size, such as google, bing, and baidu.

However, not all engines rely on this offset + size concept. Some take only a page number (e.g. yandex), and others use a next-page token (e.g. google_play).

For these less common approaches, users need to be aware of the differences and update their code accordingly.
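To illustrate how the manual loop changes for a token-based engine, here is a minimal sketch for google_maps_reviews. The `collectReviews` helper is hypothetical, and the response fields `reviews` and `serpapi_pagination.next_page_token` are assumptions about the response shape rather than something stated in this issue. `getJson` is passed in as an argument so the loop can be exercised with a stub.

```javascript
// Hypothetical helper: collect up to `limit` reviews from a token-paginated engine.
// `getJson` is injected (pass serpapi's getJson in real use) so the loop can run offline.
async function collectReviews(getJson, params, limit) {
  const reviews = [];
  let token; // undefined on the first request
  do {
    const json = await getJson("google_maps_reviews", {
      ...params,
      ...(token ? { next_page_token: token } : {}),
    });
    reviews.push(...(json.reviews ?? []));
    // Assumed response shape: the token for the following page, if there is one.
    token = json.serpapi_pagination?.next_page_token;
  } while (token && reviews.length < limit);
  return reviews.slice(0, limit);
}
```

Compared with the offset-based loop above, the caller cannot compute the next request up front; each request depends on a value taken from the previous response.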

Full list of engines that support pagination

There are 7 types:

- Offset only, e.g. google_jobs, yahoo

| Engine | Param | Type | Description |
| --- | --- | --- | --- |
| google_jobs | start | number | Parameter defines the result offset. It skips the given number of results. It’s used for pagination. (e.g., 0 (default) is the first page of results, 10 is the 2nd page of results, 20 is the 3rd page of results, etc.). |
| google_reverse_image | start | number | Parameter defines the result offset. It skips the given number of results. It’s used for pagination. (e.g., 0 (default) is the first page of results, 10 is the 2nd page of results, 20 is the 3rd page of results, etc.). |
| google_maps | start | number | Parameter defines the result offset. It skips the given number of results. It’s used for pagination. (e.g., 0 (default) is the first page of results, 20 is the 2nd page of results, 40 is the 3rd page of results, etc.). |
| google_events | start | number | Parameter defines the result offset. It skips the given number of results. It’s used for pagination. (e.g., 0 (default) is the first page of results, 10 is the 2nd page of results, 20 is the 3rd page of results, etc.). |
| yahoo | b | string | Parameter defines the result offset. It skips the given number of results. It’s used for pagination. (e.g., 1 (default) is the first page of results, 11 is the 2nd page of results, 21 is the 3rd page of results, etc.). |
| yahoo_images | b | string | Parameter defines the result offset. It skips the given number of results. It’s used for pagination. (e.g., 1 (default) starts from the first result, 61 starts from the 61st result, 121 starts from the 121st result, etc.). |
| yahoo_videos | b | string | Parameter defines the result offset. It skips the given number of results. It’s used for pagination. (e.g., 1 (default) starts from the first result, 61 starts from the 61st result, 121 starts from the 121st result, etc.). |
| duckduckgo | start | number | Parameter defines the result offset. It skips the given number of results. It’s used for pagination. When pagination is not being used (initial search request), the number of organic_results can vary between 26 and 30. When pagination is being used (value of the start parameter is greater than 0), organic_results returns 50 results. |
| yelp | start | number | Parameter defines the result offset. It skips the given number of results. It’s used for pagination. (e.g., 0 (default) is the first page of results, 10 is the 2nd page of results, 20 is the 3rd page of results, etc.). |
| yelp_reviews | start | number | Parameter defines the result offset. It skips the given number of results. It’s used for pagination. (e.g., 0 (default) is the first page of results, 10 is the 2nd page of results, 20 is the 3rd page of results, etc.). |
- Page only, e.g. yandex, apple_reviews

| Engine | Param | Type | Description |
| --- | --- | --- | --- |
| google (google images only) | ijn | string | Parameter defines the page number for Google Images. There are 100 images per page. This parameter is equivalent to start (offset) = ijn * 100. This parameter works only for Google Images (set tbm to isch). |
| yandex | p | string | Parameter defines page number. Pagination starts from 0. |
| yandex_images | p | string | Parameter defines the page number. Pagination starts from 0, and it can return up to 30 results. |
| yandex_videos | p | string | Parameter defines the page number. Pagination starts from 0, and it can return up to 30 results. |
| walmart_product_reviews | page | string | Value is used to get the reviews on a specific page. (e.g., 1 (default) is the first page of results, 2 is the 2nd page of results, 3 is the 3rd page of results, etc.). |
| apple_reviews | page | string | Parameter is used to get the items on a specific page. (e.g., 1 (default) is the first page of results, 2 is the 2nd page of results, 3 is the 3rd page of results, etc.). |
- Offset + size, e.g. google, bing, baidu

| Engine | Param | Type | Description |
| --- | --- | --- | --- |
| google | start | number | Parameter defines the result offset. It skips the given number of results. It’s used for pagination. (e.g., 0 (default) is the first page of results, 10 is the 2nd page of results, 20 is the 3rd page of results, etc.). |
| google | num | string | Parameter defines the maximum number of results to return. (e.g., 10 (default) returns 10 results, 40 returns 40 results, and 100 returns 100 results). |
| google_scholar | start | number | Parameter defines the result offset. It skips the given number of results. It’s used for pagination. (e.g., 0 (default) is the first page of results, 10 is the 2nd page of results, 20 is the 3rd page of results, etc.). |
| google_scholar | num | string | Parameter defines the maximum number of results to return, limited to 20. (e.g., 10 (default) returns 10 results, 20 returns 20 results). |
| google_scholar_author | start | number | Parameter defines the result offset. It skips the given number of results. It’s used for pagination. (e.g., 0 (default) is the first page of results, 20 is the 2nd page of results, 40 is the 3rd page of results, etc.). |
| google_scholar_author | num | string | Parameter defines the number of results to return. (e.g., 20 (default) returns 20 results, 40 returns 40 results, etc.). Maximum number of results to return is 100. |
| bing | first | string | Parameter controls the offset of the organic results. This parameter defaults to 1. (e.g., first=10 will move the 10th organic result to the first position). |
| bing | count | string | Parameter controls the number of results per page. Minimum: 1, Maximum: 50. This parameter is only a suggestion and might not reflect actual results returned. |
| bing_news | first | string | Parameter controls the offset of the organic results. This parameter defaults to 1. (e.g., first=10 will move the 10th organic result to the first position). |
| bing_news | count | string | Parameter controls the number of results per page. This parameter is only a suggestion and might not reflect actual results returned. |
| bing_images | first | string | Parameter controls the offset of the organic results. This parameter defaults to 1. (e.g., first=10 will move the 10th organic result to the first position). |
| bing_images | count | string | Parameter controls the number of results per page. This parameter is only a suggestion and might not reflect the returned results. |
| baidu | pn | string | Parameter defines the result offset. It skips the given number of results. It’s used for pagination. (e.g., 0 (default) is the first page of results, 10 is the 2nd page of results, 20 is the 3rd page of results, etc.). |
| baidu | rn | string | Parameter defines the maximum number of results to return, limited to 50. (e.g., 10 (default) returns 10 results, 30 returns 30 results, and 50 returns 50 results). This parameter is only available for desktop and tablet searches. |
| baidu_news | pn | string | Parameter defines the result offset. It skips the given number of results. It’s used for pagination. (e.g., 0 (default) is the first page of results, 10 is the 2nd page of results, 20 is the 3rd page of results, etc.). |
| baidu_news | rn | string | Parameter defines the maximum number of results to return, limited to 50. (e.g., 10 (default) returns 10 results, 30 returns 30 results, and 50 returns 50 results). |
- Offset + page, e.g. google_product

| Engine | Param | Type | Description |
| --- | --- | --- | --- |
| google_product | start | number | Parameter defines the result offset. It skips the given number of results. It’s used for pagination. (e.g., 0 (default) is the first page of results, 10 is the 2nd page of results, 20 is the 3rd page of results, etc.). This parameter works only for Google Online Sellers and Reviews. |
| google_product | page | string | Parameter defines the page number for Google Online Sellers and Reviews. There are 10 results per page. This parameter is equivalent to start (offset) = page * 10. This parameter works only for Google Online Sellers and Reviews. |
- Page + size, e.g. ebay, walmart

| Engine | Param | Type | Description |
| --- | --- | --- | --- |
| ebay | _pgn | string | Parameter defines the page number. It’s used for pagination. (e.g., 1 (default) is the first page of results, 2 is the 2nd page of results, 3 is the 3rd page of results, etc.). |
| ebay | _ipg | string | Parameter defines the maximum number of results to return. There are a total of four options: 25, 50 (default), 100 and 200 results. |
| walmart | page | string | Value is used to get the items on a specific page. (e.g., 1 (default) is the first page of results, 2 is the 2nd page of results, 3 is the 3rd page of results, etc.). Maximum page value is 100. |
| walmart | ps | number | Determines the number of items per page. There are scenarios where Walmart overrides the ps value. By default Walmart returns 40 results. |
| apple_app_store | num | string | Parameter defines the number of results you want to get per page. It defaults to 10. Maximum number of results you can get per page is 200. Any number greater than the maximum will default to 200. |
| apple_app_store | page | string | Parameter is used to get the items on a specific page. (e.g., 0 (default) is the first page of results, 1 is the 2nd page of results, 2 is the 3rd page of results, etc.). |
- Offset + page + size, e.g. yahoo_shopping, home_depot

| Engine | Param | Type | Description |
| --- | --- | --- | --- |
| yahoo_shopping | start | number | Parameter defines the result offset. It skips the given number of results. It’s used for pagination. (e.g., 1 (default) is the first page of results, 60 is the 2nd page of results, 120 is the 3rd page of results, etc.). |
| yahoo_shopping | limit | number | Parameter defines the maximum number of results to return. (e.g., 10 (default) returns 10 results, 40 returns 40 results, and 100 returns 100 results). |
| yahoo_shopping | page | string | The page parameter does the start parameter math for you! Just define the page number you want. Pagination starts from 1. |
| home_depot | nao | string | Defines the offset for product results. A single page contains 24 products. First page offset is 0, second -> 24, third -> 48 and so on. |
| home_depot | page | string | Value is used to get the items on a specific page. (e.g., 1 (default) is the first page of results, 2 is the 2nd page of results, 3 is the 3rd page of results, etc.). |
| home_depot | ps | number | Determines the number of items per page. There are scenarios where Home Depot overrides the ps value. By default Home Depot returns 24 results. |
| naver | start | number | Parameter controls the offset of the organic results. This parameter defaults to 1 (except for the web). The formula for all searches except the web is start = (page number * 10) - 9 (e.g., page number 3: (3 * 10) - 9 = 21). The formula for the web is start = (page number * 15) - 29 (e.g., page number 3: (3 * 15) - 29 = 16). |
| naver | num | string | Parameter defines the maximum number of results to return. 50 (default) returns 50 results. Maximum number of results to return is 100. Parameter can only be used with Naver Images API. |
| naver | page | string | The page parameter does the start parameter math for you! Just define the page number you want. Pagination starts from 1. |
- Token only, e.g. google_scholar_profiles, google_play

| Engine | Parameter | Type | Description |
| --- | --- | --- | --- |
| google_scholar_profiles | after_author | string | Parameter defines the next page token. It is used for retrieving the next page results. The parameter takes precedence over the before_author parameter. |
| google_scholar_profiles | before_author | string | Parameter defines the previous page token. It is used for retrieving the previous page results. |
| google_maps_photos | next_page_token | string | Parameter defines the next page token. It is used for retrieving the next page results. 20 results are returned per page. |
| google_maps_reviews | next_page_token | string | Parameter defines the next page token. It is used for retrieving the next page results. Usage of the start parameter (results offset) has been deprecated by Google. |
| google_play | next_page_token | string | Parameter defines the next page token. It is used for retrieving the next page results. |
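One way to picture how these types could be unified is a small lookup table that records, per engine, which parameters drive pagination, plus a function that derives the next request from the current one. The sketch below covers only a handful of the engines listed above; `PAGINATION` and `nextPageParams` are hypothetical names, and the defaults are taken from the parameter descriptions in the tables. It is an illustration of the idea, not the module's implementation.

```javascript
// Illustrative subset of the engine list above; not a complete or official mapping.
const PAGINATION = {
  google: { offset: "start", size: "num", firstOffset: 0, defaultSize: 10 },
  baidu: { offset: "pn", size: "rn", firstOffset: 0, defaultSize: 10 },
  yandex: { page: "p", firstPage: 0 },
  ebay: { page: "_pgn", firstPage: 1 },
  google_play: { token: "next_page_token" },
};

// Returns the params for the next page, or null when there is no next page.
// `nextToken` is only used by token-based engines and must come from the last response.
function nextPageParams(engine, params, nextToken) {
  const spec = PAGINATION[engine];
  if (!spec) throw new Error(`No pagination spec for engine: ${engine}`);
  if (spec.token) {
    return nextToken ? { ...params, [spec.token]: nextToken } : null;
  }
  if (spec.offset) {
    // Offset-based: advance the offset by the page size (params may arrive as strings).
    const offset = Number(params[spec.offset] ?? spec.firstOffset);
    const size = Number(params[spec.size] ?? spec.defaultSize);
    return { ...params, [spec.offset]: offset + size };
  }
  // Page-based: increment the page number.
  const page = Number(params[spec.page] ?? spec.firstPage);
  return { ...params, [spec.page]: page + 1 };
}
```

For example, `nextPageParams("google", { q: "coffee" })` yields `{ q: "coffee", start: 10 }`, while `nextPageParams("google_play", { q: "coffee" })` yields `null` because no token was supplied.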

Possible approaches

The key question is how we might abstract the pagination logic in a manner that makes using SerpApi simpler and more ergonomic.

Approach 1: New function

Pros

Cons

Approach 2: Next method

Pros

Cons

Approach 3: Magic?

    // calling within a loop works too
    const organicResults = [];
    for await (const page of await getJson("google", { q: "coffee", start: 15 })) {
      organicResults.push(...page.organic_results);
      if (organicResults.length >= 50) break;
    }

Pros

- Works for single calls or when called in a loop.
- Iterating over the function to get multiple page results is nice.
- Not a breaking change to existing implementations that use `getJson`.
- Simpler to understand than using a brand new function.

Cons

- Does not support callbacks.
- The loop contains two `await`s, which might be confusing.
  - Both are required: `getJson` returns a Promise, and awaiting it yields an object that holds the fetched results along with the logic needed to continue the async loop, i.e. an [async iterable object](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/for-await...of).
- The types are a little strange, as they include a `[Symbol.asyncIterator]` key, which the loop requires.
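The mechanism behind the two `await`s can be sketched as follows. This is only an illustration of the idea, not the library's code: `getJsonPaged`, `fetchPage`, and `nextParams` are hypothetical names, with `fetchPage(params)` standing in for the HTTP request and `nextParams(params, page)` returning the next page's params or `null`.

```javascript
// Sketch only: resolve to an object that is both the first page's result
// and an async iterable over the first and all following pages.
async function getJsonPaged(fetchPage, nextParams, params) {
  const first = await fetchPage(params);
  return {
    ...first, // single-call usage: the first page's fields are available directly
    async *[Symbol.asyncIterator]() {
      let page = first;
      let current = params;
      while (page) {
        yield page;
        // Later pages are fetched lazily, only if the consumer keeps iterating.
        current = nextParams(current, page);
        page = current ? await fetchPage(current) : null;
      }
    },
  };
}
```

The outer `await` resolves the Promise to this object, and `for await` then drives the generator. Breaking out of the loop early means no further requests are made.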
sebastianquek commented 1 year ago

Pagination with getJsonBySearchId

getJsonBySearchId effectively returns the same object as getJson. Instead of sending in search params, users provide the search’s ID. The returned object could potentially also expose a .next() method for pagination.

The flow could be something like this:

  1. Get searchId by calling getJson with async=true
  2. Call getJsonBySearchId with searchId
  3. With the response, call .next(), which in turn calls getJson to get the next page synchronously

This might be useful if a user has 2 different services, one that initiates the request and one that processes + fetches additional requests.
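The three steps could look roughly like the sketch below, split into the two services mentioned. `client` is a hypothetical object exposing `getJson` and `getJsonBySearchId` so the flow can be run against a stub; the `.next()` method is the proposal under discussion and does not exist yet, and reading the ID from `search_metadata.id` is an assumption about the response shape.

```javascript
// Service A: submit the search asynchronously and hand off only its ID (step 1).
async function initiate(client, params) {
  const submitted = await client.getJson("google", { ...params, async: true });
  return submitted.search_metadata.id; // assumed location of the search ID
}

// Service B: retrieve the search by ID and walk through further pages (steps 2 and 3).
async function processPages(client, searchId, maxPages) {
  let page = await client.getJsonBySearchId(searchId);
  const links = [];
  for (let i = 0; page && i < maxPages; i++) {
    links.push(...(page.organic_results ?? []).map((r) => r.link));
    // Proposed `.next()`: fetches the following page, if the result exposes one.
    page = typeof page.next === "function" ? await page.next() : null;
  }
  return links;
}
```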

Considerations

alonp123 commented 1 year ago

In addition to the improvements @sebastianquek has described, I've just realized that the type of the num property (in the getJson function) is string instead of number. It would be nice to fix this in the next version.

    /**
     * Result Offset
     * Parameter defines the result offset. It skips the given number of results. It's
     * used for pagination. (e.g., `0` (default) is the first page of results, `10` is
     * the 2nd page of results, `20` is the 3rd page of results, etc.).
     * Google Local Results only accepts multiples of `20`(e.g. `20` for the second
     * page results, `40` for the third page results, etc.) as the start value.
     */
    start?: number;
    /**
     * Number of Results
     * Parameter defines the maximum number of results to return. (e.g., `10` (default)
     * returns 10 results, `40` returns 40 results, and `100` returns 100 results).
     */
    num?: string;