serpapi / public-roadmap

Public Roadmap for SerpApi, LLC (https://serpapi.com)

[Yelp New API] Scrape brands #1875

Open sonika-serpapi opened 2 months ago

sonika-serpapi commented 2 months ago

A high-volume customer reached out asking to scrape reviews pertaining to a brand: https://www.yelp.com/brands/unifirst

Currently, they would have to aggregate reviews across all stores to get this information about a brand. Since Yelp exposes it publicly for each brand, perhaps we can add support for scraping the brand page.

Specific brand page:

Screenshot 2024-08-27 at 1 14 27 PM

All brands:

Screenshot 2024-08-27 at 1 14 43 PM

Intercom

btaunt commented 2 months ago

Does it make sense to set this up under its own "Yelp Brands" API? Similar to how we do Yelp Place or Yelp Reviews?

That way, searches are passed against yelp.com/brands/ directly? Just my initial thought process to keep things cleaner.

sonika-serpapi commented 2 months ago

@btaunt I think that is a good idea. I do think there is more value in setting it up under its own "Yelp Brands" API, similar to Yelp Reviews, as I believe there is no direct search for brands. Instead, there is a list maintained at https://www.yelp.com/brands, and no location is needed to get the brand information page.

A second point I wanted to bring up is, would we need to maintain this brand list on our end?

I'll let others chime in on this as well.

kingmeers commented 2 months ago

> A second point I wanted to bring up is, would we need to maintain this brand list on our end?

I have a snippet for exactly this issue's purpose: getting the Yelp brands JSON. Insanely enough, they don't have a paginated API, but rather a single, enormous JSON payload, as displayed on one page at the /brands route.

It's blocked by CORS, so the request has to be made from the site's own resources, for example by a browser that has loaded the page. Here's a quick way to save the JSON from the /batch GraphQL endpoint, keeping only the response that matches the brands JSON format (the page makes multiple /batch calls for different types of data):

const puppeteer = require('puppeteer');
const fs = require('fs');

const BATCH_URL = 'https://www.yelp.com/gql/batch';

(async () => {
  // Launch a headless browser
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Listen to the responses to capture the payload. Request interception is
  // not needed, since the page's own requests are only observed, not modified.
  page.on('response', async (response) => {
    if (response.url() !== BATCH_URL) return;

    let jsonResponse;
    try {
      jsonResponse = await response.json();
    } catch (err) {
      return; // Skip preflight or non-JSON responses
    }

    // The page makes several /batch calls; keep only the one carrying the brand index
    const first = Array.isArray(jsonResponse) ? jsonResponse[0] : undefined;
    const brands = first && first.data && first.data.brandEntityIndex && first.data.brandEntityIndex.brands;
    if (brands && brands.length > 0 && brands[0].name && brands[0].urlAlias) {
      fs.writeFileSync('output-yelp.json', JSON.stringify(jsonResponse, null, 2));
      console.log('Captured JSON saved to output-yelp.json');
    }
  });

  // Go to the Yelp page that triggers the request
  await page.goto('https://www.yelp.com/brands', { waitUntil: 'networkidle2' });

  // Wait for some time to ensure all requests are made
  // (page.waitForTimeout is not available in newer Puppeteer releases)
  await new Promise((resolve) => setTimeout(resolve, 5000));

  // Close the browser
  await browser.close();
})();
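
On the question of maintaining the brand list on our end, here is a rough sketch of turning the captured payload into a flat list of brand pages. It assumes the saved file has the shape the snippet above checks for (an array whose entry contains data.brandEntityIndex.brands, each with name and urlAlias), and it assumes urlAlias is the path segment used in URLs like https://www.yelp.com/brands/unifirst. The helper name extractBrands is hypothetical, and the sample payload is hand-written, not real Yelp output.

```javascript
const fs = require('fs');

// Hypothetical helper: flattens a captured /gql/batch payload into
// { name, url } records. Returns [] if no brand index is present.
function extractBrands(batchPayload) {
  const entries = Array.isArray(batchPayload) ? batchPayload : [batchPayload];
  const entry = entries.find((e) => e && e.data && e.data.brandEntityIndex);
  if (!entry) return [];

  const brands = entry.data.brandEntityIndex.brands || [];
  return brands
    .filter((b) => b && b.name && b.urlAlias)
    .map((b) => ({
      name: b.name,
      // Assumption: urlAlias is the last path segment of the brand page URL
      url: `https://www.yelp.com/brands/${encodeURIComponent(b.urlAlias)}`,
    }));
}

// Tiny hand-written payload for illustration only
const sample = [
  { data: { brandEntityIndex: { brands: [{ name: 'UniFirst', urlAlias: 'unifirst' }] } } },
];
console.log(extractBrands(sample));

// If the capture script has already run, summarize its output as well
if (fs.existsSync('output-yelp.json')) {
  const saved = JSON.parse(fs.readFileSync('output-yelp.json', 'utf8'));
  console.log(`${extractBrands(saved).length} brands found in output-yelp.json`);
}
```

Re-running the capture and diffing the extracted list against the previous run would be one way to notice added or removed brands without maintaining the list by hand.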

schaferyan commented 2 months ago

Thank you @kingmeers! We will take this into consideration when developing a solution for this.