iberryful / weixin_sogou

Crawl WeChat official account articles
http://weirss.me
MIT License
751 stars 203 forks

Cannot get direct essay link from parse_list #6

Open einverne opened 8 years ago

einverne commented 8 years ago

After calling parse_list, I only get URLs like this: http://weixin.sogou.com/websearch/art.jsp?sg=CBf80b2xkgZWehj5vWa6p7H14b.... . However, the direct link to most Weixin articles looks like this: http://mp.weixin.qq.com/s?__biz=MjM5NjM4OTAyMA=.... . Requesting the first link returns a 302 redirect. Response headers:

HTTP/1.1 302 Found
Server: nginx
Date: Sun, 08 Nov 2015 07:52:29 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
x_log_ext: suv=008A25C47B7F4D26563EFED6A708B610&openid=oIWsFt3nvJ2jaaxm9UOB_LUos02k&query=%E7%AE%80%E4%B9%A6&rank=1&from=gzhjs&page=1&snuid=76022F344F557452C69017DF5064A727&dec_art=succ
Location: http://mp.weixin.qq.com/s?__biz=MjM5NjM4OTAyMA==&mid=400225618&idx=1&sn=4560366d10e320d2feddf1ce0e00bf0e&3rd=MzA3MDU4NTYzMw==&scene=6#rd
Set-Cookie: black_passportid=1; domain=.sogou.com; path=/; expires=Thu, 01-Dec-1994 16:00:00 GMT
Expires: Sun, 08 Nov 2015 07:52:29 GMT
Cache-Control: max-age=0

The problem is that I cannot get the redirect URL using the requests package:

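import requests

# link, headers and cookies are defined earlier (not shown here)
r = requests.get(link, headers=headers, cookies=cookies)
print(r.headers)
print(r.url)
# r.history holds any intermediate redirect responses that requests followed
for resp in r.history:
    print(resp.status_code, resp.url)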

I tried this code to read the Location header of the response, but I always get a 200 status code instead of 302, along with the error 当前请求已过期,请点击重新加载 ("The current request has expired, please click to reload"). Did I miss something?
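
For reference, requests follows redirects by default, so r.status_code describes the final response and any 302 would only show up in r.history. A minimal sketch of reading the Location header directly, by passing allow_redirects=False, could look like the following. The helper names redirect_target and fetch_direct_link are hypothetical and not part of weixin_sogou, and this does not address the "request has expired" page itself: if Sogou answers with a 200 in the first place, there is no Location header to read.

```python
# Hypothetical helpers, not part of weixin_sogou: a sketch of reading the
# Location header of a redirect without letting requests follow it.

REDIRECT_CODES = {301, 302, 303, 307, 308}


def redirect_target(response):
    """Return the Location header if the response is a redirect, else None.

    Works on any object with `status_code` and `headers` attributes,
    such as a requests.Response.
    """
    if response.status_code in REDIRECT_CODES:
        return response.headers.get("Location")
    return None


def fetch_direct_link(link, headers=None, cookies=None, timeout=10):
    """Request `link` once and return the redirect target, or None."""
    # Imported here so redirect_target can be used without the dependency.
    import requests

    # allow_redirects=False stops requests from following the 302,
    # so the response we get back is the redirect itself.
    r = requests.get(
        link,
        headers=headers,
        cookies=cookies,
        allow_redirects=False,
        timeout=timeout,
    )
    return redirect_target(r)
```

Calling fetch_direct_link(link, headers=headers, cookies=cookies) with the same values as above would return the mp.weixin.qq.com URL if the server sends a redirect, and None if it returns a 200 page instead.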

PegasusWang commented 8 years ago

I've run into the same problem.