hupili / python-for-data-and-media-communication-gitbook

An open source book on Python tailored for communication students with zero background

qidian.com - can't scrape the numbers on the website (special fonts) #85

Open AlexZenghuashan opened 5 years ago

AlexZenghuashan commented 5 years ago

Troubleshooting

Describe your environment

Describe your question

I can't scrape the number showing how many words each novel has. The URL: https://www.qidian.com/all?chanId=2&subCateId=5&size=1&action=0&orderId=&vip=0&month=3&update=1&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0&page=1

The minimum code (snippet) to reproduce the issue

    import json

    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.qidian.com/all?chanId=2&subCateId=5&size=1&action=0&orderId=&vip=0&month=3&update=1&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0&page=1'
    r = requests.get(url)
    # Name the parser explicitly so bs4 does not have to guess one
    mypage = BeautifulSoup(r.text, 'html.parser')
    mypage
    # Note: `a` is not defined in this snippet; it is presumably a list of elements selected from mypage
    json.dumps(a[45].find('span').text)
    json.dumps(a[48].find('span').text)

hupili commented 5 years ago

They use a special font to display the numbers. The characters in the page source are not regular digits; they only look like digits when rendered with that font.


Need to find a way to "decode" numbers.
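To make the "decoding" idea concrete, here is a minimal sketch, not the method used in the linked notebook. It assumes you already have a table mapping each obfuscated code point to the name of the glyph the font draws for it. The code points in `EXAMPLE_CMAP` are made up for illustration, and whether the site's font uses these standard glyph names is also an assumption.

```python
# Standard glyph names for digits and the decimal point.
NAME_TO_CHAR = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "period": ".",
}

# HYPOTHETICAL code point -> glyph name table, made up for illustration.
# For a real page it would have to be read from the font the page loads.
EXAMPLE_CMAP = {
    0x18A01: "one",
    0x18A02: "two",
    0x18A03: "period",
    0x18A04: "five",
}


def decode_number(text, cmap):
    """Replace each obfuscated character with the character its glyph draws.

    Characters that are not in `cmap` are passed through unchanged.
    """
    decoded = []
    for ch in text:
        glyph_name = cmap.get(ord(ch))
        decoded.append(NAME_TO_CHAR.get(glyph_name, ch))
    return "".join(decoded)


obfuscated = "".join(chr(cp) for cp in (0x18A01, 0x18A02, 0x18A03, 0x18A04))
print(decode_number(obfuscated, EXAMPLE_CMAP))  # 12.5
```

For a real font file, a library such as fontTools can produce this kind of table (`TTFont(path).getBestCmap()` returns a dict from code point to glyph name), but the table has to come from the same font the page is using at that moment.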

hupili commented 5 years ago

This is too hard for our students. Here's the quick solution. It is better to work through it together with some other students:

https://github.com/hupili/python-for-data-and-media-communication/blob/master/scraper-examples/Qidian%20wordcount.ipynb

AlexZenghuashan commented 5 years ago

Thank you!

AlexZenghuashan commented 5 years ago

Do I need to install something here?

AlexZenghuashan commented 5 years ago

Could you tell me what special modules you used? Thank you!

hupili commented 5 years ago

wget is not a module. It is a Linux/Unix command-line tool. You need to search for how to install it on your operating system.
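If installing wget is inconvenient, the download step can also be done from Python itself. The sketch below is an assumption about what is needed (saving the resource at a URL to a local file), not the notebook's actual code. The `download` helper is a hypothetical name introduced here, and it uses only the standard library.

```python
import shutil
import urllib.request
from pathlib import Path

# shutil.which returns the path of the command, or None if it is not installed.
wget_path = shutil.which("wget")


def download(url, dest):
    """Save the resource at `url` to the local path `dest`.

    Roughly what `wget -O dest url` does, without needing wget.
    """
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return Path(dest)
```

Calling `download(some_url, "font.woff")` would save the file next to the notebook; both arguments here are placeholders, to be replaced with the URL and filename that the wget line in the notebook uses.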