tasos-py / Search-Engines-Scraper

Search google, bing, yahoo, and other search engines with python
MIT License
513 stars 137 forks

many queries issue #33

Closed xingos123 closed 2 years ago

xingos123 commented 2 years ago

When I run multiple queries in a loop, the results of each query also contain the results of the previous ones. How can I clear the previous results?

xingos123 commented 2 years ago

My code is:

```python
df = pd.read_csv("politifact_fake.csv")[['id', 'title']]
df = pd.DataFrame(df)

for index, dfrow in df.iterrows():
    id = dfrow['id']
    query = str(dfrow['title'])

    data = [['query', 'domain', 'URL', 'title', 'text']]
    path = id + '.csv'
    for j in eng.search(query, pages=3):
        row = [
            query, j['host'], j['link'], j['title'], j['text']
        ]
        row = [encoder(il) for il in row]
        data.append(row)
    output.write_file(data, path)
    time.sleep(random.randint(2, 7))
```

tasos-py commented 2 years ago

Yes, results are stored in a SearchEngine.results object and every time you call .search() you append more items there. Can't you just create a new eng instance for every iteration of your outer for loop?

for index, dfrow in df.iterrows():
    eng = Google()
    ...
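
To illustrate why a fresh instance helps, here is a minimal, self-contained sketch. `AccumulatingEngine` is a hypothetical stand-in written for this example, not the library's class; it only assumes the behavior described above, namely that every `.search()` call appends to the same results list on the instance.

```python
class AccumulatingEngine:
    """Hypothetical stand-in for an engine object; not the library's class."""

    def __init__(self):
        # One list per instance, shared by every search() call on it.
        self.results = []

    def search(self, query, pages=1):
        # Pretend each page yields exactly one hit and append it to the
        # instance-wide list, as described in the comment above.
        for page in range(1, pages + 1):
            self.results.append({"query": query, "page": page})
        return self.results


def run_shared(queries):
    """Reuse one instance: later queries also return earlier hits."""
    eng = AccumulatingEngine()
    return {q: [r["query"] for r in eng.search(q, pages=2)] for q in queries}


def run_fresh(queries):
    """Create a new instance per query: each query returns only its own hits."""
    out = {}
    for q in queries:
        eng = AccumulatingEngine()
        out[q] = [r["query"] for r in eng.search(q, pages=2)]
    return out


shared = run_shared(["a", "b"])
fresh = run_fresh(["a", "b"])
print(shared["b"])  # hits for "a" leak into the "b" results
print(fresh["b"])   # only hits for "b"
```

With the shared instance, the second query's results include the first query's hits; with a fresh instance per iteration they do not.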

xingos123 commented 2 years ago

🆗, thanks for your answer, it helps a lot.

xingos123 commented 2 years ago

@tasos-py I also found that the Bing engine wasn't extracting titles correctly. After analyzing the page, I made the following change:

engines/bing.py, line 16: 'title': 'a' -> 'title': 'h2'

Hope this helps others.

tasos-py commented 2 years ago

Thanks, much appreciated! However, I don't have any issues getting the title from the a tag. And in the HTML I see that the title text is inside the a tag, which is a child of h2, e.g.

(screenshot: Capture)

So, in this case a and h2 should have the same text.
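
As a rough illustration of that reasoning, the sketch below parses a simplified, made-up snippet (an assumption for this example, not Bing's actual HTML) and compares the text of the h2 element with the text of its a child. It uses the standard library's `xml.etree.ElementTree` as a stand-in parser rather than BS4, so it says nothing about how any particular BS4 version behaves.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up markup (not Bing's real HTML): the title text sits
# directly inside <a>, and <a> is the only child of <h2>.
snippet = (
    '<li class="b_algo">'
    '<h2><a href="https://example.com">Example title</a></h2>'
    '</li>'
)

root = ET.fromstring(snippet)
h2 = root.find("h2")
a = h2.find("a")

# itertext() walks the element and all of its descendants, so for <h2>
# it also picks up the text that lives inside the nested <a>.
h2_text = "".join(h2.itertext()).strip()
a_text = "".join(a.itertext()).strip()

print(h2_text == a_text)
```

When the a tag is the only child of h2 and holds all of the text, selecting either element should give the same title.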

Maybe we're getting different HTML based on our location or maybe it's a BS4 version thing. Could you give me an example of the HTML you see and your BS4 version?
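
One possible way to report the installed version is sketched below. It assumes Python 3.8+ (for `importlib.metadata`) and that BS4 was installed under its usual distribution name, `beautifulsoup4`; `installed_version` is a helper name made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the version string, or None if the package is not installed.
print("beautifulsoup4:", installed_version("beautifulsoup4"))
```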

I updated Bing accordingly, because I see no harm in it, only benefits, but I'd like to know what's causing this issue.

xingos123 commented 2 years ago

@tasos-py BS4 version: 4.9.1, location: China. If I don't make the change, the titles come back as ['', '', '', '', '', '', ''], as shown in the following:

(screenshots: html, testR, testW)

tasos-py commented 2 years ago

Strange. Our HTML is identical and I don't see any reason for a not to have text, since the text content is placed directly in the a tag. Maybe it's because we're using different BS4 versions - I'm using v4.8.1. Either way, I've implemented the changes you suggested. Thanks again!