starFalll / Spider

Sina Weibo spider and Baidu search results spider
MIT License
191 stars 57 forks

Baidu requests only return the first page; later pages require verification #5

Open kaixindelele opened 4 years ago

kaixindelele commented 4 years ago

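My search parameters are: data = {'wd': "intitle:疫情 intitle:武汉 site:www.caixin.com", 'tn': 'baiduhome_pg', 'ie': 'utf-8', 'bsst': 1, 'pn': str(i - 1) + '0', } The first page comes back fine. But when I take a link from it, such as the next-page link, and fetch it with page=requests.get(next_page_url, headers=headers), I get nothing useful. The printed value of soup is shown below. It looks like a CAPTCHA problem. This is really frustrating; if you have run into it before, I would be grateful for any guidance!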

`soup: <!DOCTYPE html>

<html lang="zh-CN">
<head>
<meta charset="utf-8"/>
<title>百度安全验证</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="yes" name="apple-mobile-web-app-capable"/>
<meta content="black" name="apple-mobile-web-app-status-bar-style"/>
<meta content="width=device-width, user-scalable=no, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0" name="viewport"/>
<meta content="telephone=no, email=no" name="format-detection"/>
<link href="https://www.baidu.com/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<link href="https://www.baidu.com/img/baidu.svg" mask="" rel="icon" sizes="any"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="upgrade-insecure-requests" http-equiv="Content-Security-Policy"/>
<link href="https://wappass.bdimg.com/static/touch/css/api/mkdjump_8befa48.css" rel="stylesheet">
</link></head>
<body>
<div class="timeout hide">
<div class="timeout-img"></div>
<div class="timeout-title">网络不给力，请稍后重试</div>
<button class="timeout-button" type="button">返回首页</button>
</div>
<div class="timeout-feedback hide">
<div class="timeout-feedback-icon"></div>
<p class="timeout-feedback-title">问题反馈</p>
</div>
<script src="https://wappass.baidu.com/static/machine/js/api/mkd.js"></script>
<script src="https://wappass.bdimg.com/static/touch/js/mkdjump_2e06726.js"></script>
</body>
</html>`
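The dump above has two separate problems: the Chinese text is garbled, and the page is Baidu's security-verification page (title 百度安全验证) rather than search results. The following is a minimal sketch of how both could be handled. The helper names `decode_body` and `is_verification_page` are hypothetical, and treating the two marker strings as reliable signals is an assumption based only on this one dump.

```python
# Strings that appear in the dump above. Whether Baidu always serves
# exactly these markers is an assumption.
VERIFICATION_MARKERS = ("百度安全验证", "wappass.baidu.com/static/machine")


def decode_body(raw: bytes) -> str:
    """Decode the raw body as UTF-8, the charset the page declares in <meta>."""
    return raw.decode("utf-8", errors="replace")


def is_verification_page(html: str) -> bool:
    """Return True if the HTML looks like Baidu's security-verification page."""
    return any(marker in html for marker in VERIFICATION_MARKERS)


# Decoding UTF-8 bytes with a single-byte codec produces the kind of
# garbling seen in the dump, which also hides the title marker.
sample = "<title>百度安全验证</title>".encode("utf-8")
garbled = sample.decode("latin-1")
```

With `requests`, the same effect can be had by setting `page.encoding = "utf-8"` before reading `page.text`, or by passing `page.content` to `decode_body`. Checking `is_verification_page` before parsing makes the failure explicit instead of silently yielding an empty result list.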
starFalll commented 4 years ago

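@kaixindelele I threw that Baidu spider together a few years ago; I'll find some time to refactor it. From your description, it needs code that handles the verification automatically. I'd suggest checking whether there is a way around it; after all, unlike Weibo, Baidu lets you see all of its search results without logging in.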
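One way to explore this suggestion is to build each page URL directly from the `pn` offset instead of following the next-page link, pause between requests, and stop as soon as the verification page shows up. The sketch below assumes the search endpoint is `https://www.baidu.com/s` (the thread does not state it) and that each page holds 10 results, as the original `str(i - 1) + '0'` implies. `page_url` and `crawl` are hypothetical names, and nothing here is confirmed to avoid the verification step.

```python
import time
from typing import Callable, List
from urllib.parse import urlencode

BASE_URL = "https://www.baidu.com/s"  # assumed endpoint, not stated in the thread
RESULTS_PER_PAGE = 10                 # implied by the original pn computation


def page_url(query: str, page: int) -> str:
    """Build the URL for a 1-based result page; pn is the result offset."""
    params = {"wd": query, "ie": "utf-8", "pn": (page - 1) * RESULTS_PER_PAGE}
    return f"{BASE_URL}?{urlencode(params)}"


def crawl(query: str, max_pages: int,
          fetch: Callable[[str], str], delay: float = 2.0) -> List[str]:
    """Fetch pages in order and stop at the first verification page."""
    pages: List[str] = []
    for page in range(1, max_pages + 1):
        html = fetch(page_url(query, page))
        # Marker taken from the script URL in the reported dump.
        if "wappass.baidu.com" in html:
            break
        pages.append(html)
        if page < max_pages:
            time.sleep(delay)  # pause between requests
    return pages
```

The `fetch` argument is injected so the loop can be tried offline. In real use it could be something like `lambda url: session.get(url, headers=headers).text` with a single `requests.Session()`, so that cookies set by the first page are sent with later requests.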