hupili / python-for-data-and-media-communication-gitbook

An open source book on Python tailored for communication students with zero background
117 stars, 62 forks

Can't scrape all data from all pages #126

Closed CathyChang1996 closed 5 years ago

CathyChang1996 commented 5 years ago

Troubleshooting

Describe your environment

Describe your question

I found that when I scrape the pages on mafengwo, the code does not go through all the pages. Here is the page link: http://www.mafengwo.cn/wenda/u/5017124/answer.html

The total number of pages is 85, but the Jupyter notebook finished running when the browser was only on page 44, and the data I scraped contains copies. So the total number of rows seems right, but the actual content is duplicated.

I wonder if it is a kind of anti-crawling mechanism.

Describe the efforts you have spent on this issue


```python
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'http://www.mafengwo.cn/wenda/u/5017124/answer.html'
browser = webdriver.Chrome()
all_location = []
all_dates = []
browser.get(url)
time.sleep(2)
for i in range(85):
    time.sleep(0.5)
    try:
        # Collect the date and location elements on the current page
        date = browser.find_elements(By.CSS_SELECTOR, 'span.meta-item.meta-time')
        location = browser.find_elements(By.CSS_SELECTOR, 'span.label-mdd')
        dates = [d.text for d in date]
        locations = [l.text for l in location]
        all_location.extend(locations)
        all_dates.extend(dates)
        # Scroll down so the "next page" link (下一页) is in view, then click it
        browser.execute_script('window.scrollTo(0, document.body.scrollHeight/1.2);')
        next_page = browser.find_element(By.PARTIAL_LINK_TEXT, '下一页')
        next_page.click()
    except Exception as e:
        print(e)
        print('Error on page %s' % i)
```
ChicoXYC commented 5 years ago

@CathyChang1996

After testing, I think there are two main causes: the scraping runs too fast, and one scroll-and-click per loop iteration may not always be enough to advance to the next page.

Solutions:

  1. Add a longer sleep time in the for loop, like 2 seconds.
  2. Enlarge the for loop range, maybe to more than 100. After dropping duplicates, there are 833 entries of data, which is very close to the right answer. You can refer here: https://github.com/ChicoXYC/exercise/blob/master/mafengwo/mafengwo-pagination.ipynb
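
The two suggestions above can be sketched without a browser. This is only a rough illustration, not the code from the linked notebook: `read_page` and `click_next` are hypothetical stand-ins for the Selenium calls in the question, the three sample rows are made up, and the simulated site ignores every other click to mimic a click that arrives before the page is ready.

```python
import time


def drop_duplicates(rows):
    """Return rows with repeats removed, keeping first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def scrape_pages(read_page, click_next, iterations=100, delay=2.0):
    """Read the current page, then try to advance, `iterations` times.

    read_page and click_next are hypothetical callables standing in
    for the Selenium calls in the question.
    """
    rows = []
    for _ in range(iterations):
        time.sleep(delay)         # give the page time to load
        rows.extend(read_page())  # may re-read a page that did not advance
        click_next()
    return drop_duplicates(rows)


# Simulated site (made-up data): 3 pages, every other click is ignored.
pages = [
    [("2019-01-01", "Beijing")],
    [("2019-01-02", "Shanghai")],
    [("2019-01-03", "Chengdu")],
]
state = {"page": 0, "clicks": 0}


def read_page():
    return pages[state["page"]]


def click_next():
    state["clicks"] += 1
    if state["clicks"] % 2 == 0 and state["page"] < len(pages) - 1:
        state["page"] += 1


# More iterations than pages, so ignored clicks do not cut the run short.
result = scrape_pages(read_page, click_next, iterations=10, delay=0)
print(len(result))
```

One caveat with this approach: dropping duplicates also merges two genuinely different entries that happen to share the same date and location, so the final count can come out slightly lower than the true total.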
CathyChang1996 commented 5 years ago

Thanks a lot!! Hope it works this time!!