serpapi / public-roadmap

Public Roadmap for SerpApi, LLC (https://serpapi.com)

[Google Search API] Different Designs for Answer Box not Scraped #242

Closed aliayar closed 4 months ago

aliayar commented 2 years ago

There is a new design for the Answer Box which is shown for certain questions and it is not being scraped by us.

The Playground | The Inspect

new_answer_box

On the other hand, this one is scraped as a knowledge panel but it looks more like an answer box.

The Playground | The Inspect

answer_box_or_kn

The last example is also scraped as a knowledge panel, but it has a clear border separating it from the knowledge panel:

The Playground | The Inspect

new_answer_box2

Let me know if the second and the third examples should have their own issue. I can open a separate issue.

kagermanov27 commented 2 years ago

image

@aliayar The parsing seems correct to me. I had a similar doubt about a year ago, but inspecting the HTML shows the same kind of structure as the knowledge graph. For example, the kc:/... part within data-attrid can be found in many knowledge graph examples. Normally such boxes appear under the knowledge graph, but for some searches the knowledge graph expands itself.

You may find a similar example in the documentation as well: kg_expanded

aliayar commented 2 years ago

Thank you for the explanation @kagermanov27.

Some users have a hard time finding the corresponding element in the JSON file, as its main key is "president_of_the_united_states_2017".

Nonetheless, the first answer box example from the examples above is still not scraped by SerpApi.

dimitryzub commented 2 years ago

@aliayar On my part, the first example you provided is being parsed by us. Interesting 👀

image


Some users have a hard time finding the corresponding element in the JSON file.

I think this is why the Show JSON path option is available in the Playground:

image

aliayar commented 2 years ago

I think this is why the Show JSON path option is available in the Playground:

I think the user meant that, with a list of search terms, it can be hard to know what the key is when you automate this process.

On my part, the first example you provided is being parsed by us.

Now, I see it, too.

I am closing this one.

ofirpress commented 2 years ago

Hi @dimitryzub @kagermanov27 and @aliayar.

I am the user that initially reported these issues. Thanks for fixing the issue with the Lockheed question so quickly!!

I still believe the issue with the other 2 questions has not been solved. Let me explain.

I want to write a function in Python:

def answer_using_google(question):
  ...
  ...
  return answer

When I have an 'answer' in 'answer_box', it's very easy to write this function.

But if the answer is in 'president_of_the_united_states_2017' one time and in 'founders' another time, then there is no automatic way to parse this. I'm not asking you to put these answers into answer_box, as I understand that what Google returns here are not actually answer boxes. I'm just asking for an entry in the returned JSON that tells me where the correct answer is. So for the presidents example it might be "answer_in": "president_of_the_united_states_2017", and for the CNN founders question it might be "answer_in": "founders".

Does that explain it better?

Thanks so much

dimitryzub commented 2 years ago

@aliayar Got it! 🙂

@ofirpress Thank you for your clarifications 🙂

But if the answer is in 'president_of_the_united_states_2017' one time and in 'founders' another time, then there is no automatic way to parse this.

When there is no knowledge graph on the right (which would provide more keys), one way to do it is to iterate over the knowledge_graph keys and use each extracted key dynamically, as in knowledge_graph[dynamic_key]:

for key in results["knowledge_graph"]:
    for result in results["knowledge_graph"][key]:
        print(result["name"])

Full example:

from serpapi import GoogleSearch

params = {
    "api_key": "...",
    "engine": "google",
    "q": "U.S. president in 2017",
    "gl": "us",
    "hl": "en",
    "location": "Austin, Texas, United States"
}

search = GoogleSearch(params)
results = search.get_dict()

for key in results["knowledge_graph"]:
    print(key) # president_of_the_united_states_2017
    for result in results["knowledge_graph"][key]:
        print(result["name"])

Outputs:

president_of_the_united_states_2017
Donald Trump
Barack Obama

However, this approach will not work when knowledge_graph contains more than one key, because the inner loop assumes every value is a list of dicts with a "name" key.
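One way to make the loop above tolerate extra keys is to skip any value that is not a list of dicts with a "name" field. This is only a minimal sketch under that assumption: the helper name names_by_key and the sample data (including its "title" key) are hypothetical stand-ins, not real SerpApi output.

```python
def names_by_key(knowledge_graph):
    """Map each knowledge_graph key to the names found under it.

    Assumption: answer-like entries are lists of dicts with a "name" key,
    as in the examples in this thread. Every other value is skipped.
    """
    found = {}
    for key, value in knowledge_graph.items():
        if not isinstance(value, list):
            continue  # e.g. a plain string, which has no "name" entries
        names = [item["name"] for item in value
                 if isinstance(item, dict) and "name" in item]
        if names:
            found[key] = names
    return found


# Hypothetical sample shaped like the response discussed above.
sample = {
    "title": "President of the United States",
    "president_of_the_united_states_2017": [
        {"name": "Donald Trump"},
        {"name": "Barack Obama"},
    ],
}

for key, names in names_by_key(sample).items():
    print(key, names)
```

With a real response, results.get("knowledge_graph", {}) could be passed in place of sample, which also avoids a KeyError when no knowledge graph is returned.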


What about using an if statement? Since we know the key names, we can check whether each one exists, for example:

if "president_of_the_united_states_2017" in results["knowledge_graph"]:
    for result in results["knowledge_graph"]["president_of_the_united_states_2017"]:
        print(result["name"])

if "founders" in results["knowledge_graph"]:
    for result in results["knowledge_graph"]["founders"]:
        print(result["name"])

Full example:

from serpapi import GoogleSearch

for query in ["U.S. president in 2017", "Founder of CNN"]:
    params = {
        "api_key": "...",
        "engine": "google",
        "q": query,
        "google_domain": "google.com",
        "gl": "us",
        "hl": "en",
        "location": "Austin, Texas, United States"
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    if "president_of_the_united_states_2017" in results["knowledge_graph"]:
        for result in results["knowledge_graph"]["president_of_the_united_states_2017"]:
            print(result["name"])

    if "founders" in results["knowledge_graph"]:
        for result in results["knowledge_graph"]["founders"]:
            print(result["name"])

Outputs:

Donald Trump
Barack Obama

Ted Turner
Reese Schonfeld
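The "answer_in" idea from earlier in the thread can also be approximated on the client side. The sketch below works on an already-fetched results dict. locate_answer is a hypothetical helper, the answer_box["answer"] lookup follows the description given above, and the fallback assumes answers are lists of dicts with a "name" key. The "answer_in" value is computed locally, not read from the response.

```python
def locate_answer(results):
    """Return (answer_in, answers) for a results dict, or (None, [])."""
    # Prefer a direct answer when an answer box carries one.
    answer_box = results.get("answer_box") or {}
    if "answer" in answer_box:
        return "answer_box", [answer_box["answer"]]

    # Otherwise take the first knowledge_graph key whose value looks like
    # a list of named entities (assumed shape, see the note above).
    knowledge_graph = results.get("knowledge_graph") or {}
    for key, value in knowledge_graph.items():
        if isinstance(value, list):
            names = [item["name"] for item in value
                     if isinstance(item, dict) and "name" in item]
            if names:
                return key, names
    return None, []


# Hypothetical results dict shaped like the "Founder of CNN" example.
founders_results = {
    "knowledge_graph": {
        "founders": [{"name": "Ted Turner"}, {"name": "Reese Schonfeld"}],
    }
}

answer_in, answers = locate_answer(founders_results)
print(answer_in, answers)
```

If several keys match, the first one wins, so a richer knowledge graph would likely need a stricter rule than this.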

Let me know if there is anything else I can help you with 🌞

ilyazub commented 4 months ago

SerpApi successfully scrapes these answer boxes. Closing this issue as resolved.

image (ref: https://serpapi.com/playground?q=when+was+lockheed+corporation+founded&location=Austin%2C+Texas%2C+United+States&gl=us&hl=en)

image (ref: https://serpapi.com/playground?q=U.S.+president+in+2017&location=Austin%2C+Texas%2C+United+States&gl=us&hl=en)

image (ref: https://serpapi.com/playground?q=Founder+of+CNN&location=Austin%2C+Texas%2C+United+States&gl=us&hl=en)