hupili / python-for-data-and-media-communication-gitbook

An open source book on Python tailored for communication students with zero background

How to scrape all the data in a page with infinite scrolling #127

Closed CathyChang1996 closed 5 years ago

CathyChang1996 commented 5 years ago

Troubleshooting

Describe your environment

Describe your question

When I tried to scrape all the location names on a page, I found that the page uses infinite scrolling, so it has to be scrolled repeatedly before all the information is loaded. How can I make that happen?

The minimum code (snippet) to reproduce the issue


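# Strip the trailing '.html' (5 characters) from each URL.
filtered=[]
for u in url:
    trimmed=u[:-5]  # renamed from 'filter' to avoid shadowing the built-in
    filtered.append(trimmed)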

import time
from selenium import webdriver
review_url=[]
for urls in filtered:
    review= '{}/review.html'.format(urls)
    review_url.append(review)

browser = webdriver.Chrome()
data=[]
for i in review_url:
    browser.get(i)
    time.sleep(2)
    try:
        number= browser.find_elements_by_tag_name('b')[2].text
        place= browser.find_elements_by_css_selector('h3.title')
        place_of_interests=[]
        for p in place:
            aab=p.text
            place_of_interests.append(aab)
        cleaned_place=[]
        for x in place_of_interests:
            clean=x.replace('\n', '/')
            cleaned_place.append(clean)
        data.append([i,number,cleaned_place])
        browser.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    except Exception as e:
        # Error_Raise was undefined; print the actual exception instead
        print(e)

CathyChang1996 commented 5 years ago

url = ['http://www.mafengwo.cn/u/69932798/review.html']

ChicoXYC commented 5 years ago

My suggestion is to add a for loop that sets how many times to scroll, for example 100 times. You can adjust the number based on how many articles you see on the page.

for i in range(100):
    try:
        ...
        browser.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    except Exception as e:
        print(e)
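
One detail worth spelling out: the scrolling has to finish before the elements are read, and the page needs a short pause after each scroll so the next batch can load. Below is a minimal sketch of the fixed-count version. The helper name scroll_n_times, the pause argument, and passing the scroll action in as a callable are all assumptions made for illustration, so that the loop can be tried without opening a browser.

```python
import time


def scroll_n_times(scroll, times=100, pause=1.0):
    """Call scroll() `times` times, sleeping `pause` seconds after each call.

    `scroll` is any zero-argument callable (a hypothetical design choice);
    with Selenium it would wrap browser.execute_script(...).
    Returns the number of scrolls performed.
    """
    done = 0
    for _ in range(times):
        scroll()
        time.sleep(pause)  # give the page time to load the next batch
        done += 1
    return done


# Stand-in for a real browser so the sketch runs on its own.
calls = []
performed = scroll_n_times(lambda: calls.append("scrolled"), times=3, pause=0)
print(performed, len(calls))
```

With Selenium, the scroll argument could be lambda: browser.execute_script('window.scrollTo(0, document.body.scrollHeight);'), and the h3.title elements would be read once, after the loop has finished.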

Another way is to use while True with a stopping condition:

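# Put the stopping condition inside the loop, e.g. break when no new items appear.
while True:
    try:
        ...
        browser.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    except Exception as e:
        print(e)
        break  # stop on errors so the loop cannot run forever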
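
A common stopping condition is to compare the page height before and after each scroll and stop once it no longer grows. The sketch below illustrates that idea under stated assumptions: scroll_until_stable, its parameters, and the FakePage stand-in are hypothetical names, and the max_rounds cap is added as a safety net rather than taken from the thread.

```python
import time


def scroll_until_stable(get_height, scroll, pause=1.0, max_rounds=200):
    """Scroll until the page height stops growing or max_rounds is reached.

    get_height and scroll are zero-argument callables (hypothetical design).
    Returns the number of scrolls performed.
    """
    last_height = get_height()
    rounds = 0
    while rounds < max_rounds:
        scroll()
        time.sleep(pause)  # wait for newly loaded content
        rounds += 1
        new_height = get_height()
        if new_height == last_height:
            break  # nothing new was loaded, so the bottom was reached
        last_height = new_height
    return rounds


class FakePage:
    """Stand-in page that grows on the first three scrolls only."""

    def __init__(self):
        self.height = 1000
        self.loads_left = 3

    def scroll(self):
        if self.loads_left > 0:
            self.height += 1000
            self.loads_left -= 1


page = FakePage()
rounds = scroll_until_stable(lambda: page.height, page.scroll, pause=0)
print(rounds, page.height)
```

With Selenium, get_height could be lambda: browser.execute_script('return document.body.scrollHeight') and scroll the same scrollTo call used above. A page that loads slowly may need a longer pause, otherwise the height check can stop too early.
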
CathyChang1996 commented 5 years ago

Problem solved! Thanks a lot!

hupili commented 5 years ago

Although the problem is solved, I'll leave an alternative solution here: https://github.com/hupili/python-for-data-and-media-communication/blob/master/scraper-examples/mafengwo-xhr.ipynb

Crawling by analyzing the network trace and replaying the XHR requests is also common. However, the analysis differs from case to case, so we only mention it briefly in the notes.
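
The general shape of that approach is to find the request the page sends when it loads more items (visible in the browser's Network tab) and then issue it directly with an increasing page number until nothing more comes back. Here is a minimal sketch of the paging loop only. The fetch_page callable and the list-per-page response shape are hypothetical; the actual endpoint and parameters for this site are not reproduced here and would come from the trace analysis in the linked notebook.

```python
def fetch_all_pages(fetch_page, max_pages=1000):
    """Collect items page by page until a page comes back empty.

    fetch_page(page_number) -> list of items is a hypothetical callable;
    in practice it would send the XHR request found in the network trace
    and parse the response into a list.
    """
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:
            break  # an empty page means there is nothing left to load
        items.extend(batch)
    return items


# Stand-in data so the sketch runs without any network access.
fake_data = {1: ["place A", "place B"], 2: ["place C"]}
result = fetch_all_pages(lambda page: fake_data.get(page, []))
print(result)
```

Keeping the request logic behind a callable makes the loop easy to check offline before pointing it at a real endpoint.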

ChicoXYC commented 5 years ago

Closed. This issue has been merged into the FAQs; please refer to them.