Open hupili opened 11 years ago
@uxian, Note one powerful tool: http://phantomjs.org/
It's a headless browser. We can use it to manipulate page elements and obtain the results. In this way, we'll be able to reach more platforms. If service providers block their APIs one day, this tool could still be effective. They cannot stop web access anyway!
Note more tools:
PyQuery: Enables jQuery-style DOM tree navigation in Python. Looks good. Can help clean up the code.
python-readability: Widely used to parse the main content of a general webpage.
@fqj1994 @xuanqinanhai You can add things you know.
lxml can also work for HTML parsing.
More about lxml: http://lxml.de/
lxml +1
scrapely https://github.com/scrapy/scrapely
:+1: scrapely is what this thread is pursuing! -- A framework that can extract structured data from any (maybe well-formatted) page. To make it easier to use, we cannot ask ordinary users to specify "selectors" or "regex" in the config. This learning-based approach is good if we can make the accuracy higher.
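test file:
from scrapely import Scraper
s = Scraper()
url1 = 'http://baike.baidu.com/view/10378153.htm'
data = {'editor': '苏珊朗格'}  # known value of the field on the training page
url2 = 'http://baike.baidu.com/view/750254.htm'
s.train(url1, data)  # learn a template from the annotated page
d = s.scrape(url2)  # apply the template to a similar page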
result:
>>> print d[0]['editor'][0].encode('utf-8')
我想你好几天</a></span> <span> , <a title='查看此用户资料' class='usercard' userName='杀广告者' target='_blank' href='http://www.baidu.com/p/%E6%9D%80%E5%B9%BF%E5%91%8A%E8%80%85?from=wk'>杀广告者</a></span> <span> , <a title='查看此用户资料' class='usercard' userName='andynoty' target='_blank' href='http://www.baidu.com/p/andynoty?from=wk'>andynoty</a></span> <span> , <a title='查看此用户资料' class='usercard' userName='ssss9032' target='_blank' href='http://www.baidu.com/p/ssss9032?from=wk'>ssss9032</a></span> <span> , <a title='查看此用户资料' class='usercard' userName='957264812' target='_blank' href='http://www.baidu.com/p/957264812?from=wk'>957264812</a></span> <span> , <a title='查看此用户资料' class='usercard' userName='朱米淇' target='_blank' href='http://www.baidu.com/p/%E6%9C%B1%E7%B1%B3%E6%B7%87?from=wk'>朱米淇</a></span> <span> , <a title='查看此用户资料' class='usercard' userName='hiombi' target='_blank' href='http://www.baidu.com/p/hiombi?from=wk'>hiombi</a></span> <span> , <a title='查看此用户资料' class='usercard' userName='baihdxlove' target='_blank' href='http://www.baidu.com/p/baihdxlove?from=wk'>baihdxlove</a></span> <span> , <a title='查看此用户资料' class='usercard' userName='PaperPas' target='_blank' href='http://www.baidu.com/p/PaperPas?from=wk'>PaperPas
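The scraped value above still carries markup from the page, so the field is not usable as-is. Below is a minimal sketch of one way to clean it up with only the Python 3 standard library. It assumes the names are separated by ASCII commas, as in the output above. `TextOnly` and `extract_names` are hypothetical helpers, not part of scrapely, and the sample is a shortened fragment in the same shape as the real result.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes and drop all tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_names(fragment):
    """Return the comma-separated names left after removing markup."""
    parser = TextOnly()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered at the end
    text = "".join(parser.chunks)
    return [name.strip() for name in text.split(",") if name.strip()]


# Shortened fragment shaped like the scraped value (an assumption for the demo).
sample = (
    "我想你好几天</a></span> <span> , "
    "<a class='usercard' href='http://www.baidu.com/p/andynoty?from=wk'>andynoty</a></span>"
    " <span> , <a class='usercard' href='http://www.baidu.com/p/hiombi?from=wk'>hiombi"
)
names = extract_names(sample)
print(names)
```

This only post-processes the output. It does not address why the learned template captured the whole editor list instead of a single name.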
Glad to see some research work there.
Extracting structured data is a kind of data mining, and scrapely is an awesome library for it.
Maybe pyquery?
pyquery is a selector-based parser for developers, not the "HTML parsing platform" we are targeting here.
The notion is explained in this email:
https://groups.google.com/d/msg/scrapely/uvTPPHMHTqo/vxTVyI8SVbgJ
A general HTML parsing platform. It can be subclassed to support a wider class of platforms, e.g. using the same "home_timeline()" interface to get new posts from Baidu Tieba.
Possible solutions:
htmlparser looks most appropriate for prototyping.
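To make the subclassing idea concrete, here is a rough prototype sketch. It assumes "htmlparser" means the Python standard-library parser (`html.parser` in Python 3). The names `HTMLPlatform`, `LinkTextParser`, `FakeTieba`, `post_class`, and `fetch` are all hypothetical, and the markup is made up, not real Baidu Tieba HTML. Only `home_timeline()` comes from the description above.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect the text of <a> tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.posts = []
        self._buffer = None  # a list while inside a matching <a>, else None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and self.css_class in classes:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            self.posts.append("".join(self._buffer).strip())
            self._buffer = None


class HTMLPlatform:
    """Base class: a subclass only says where posts live on its page."""

    post_class = None  # CSS class of the post links, set per platform

    def fetch(self):
        """Return the page HTML; each platform supplies its own."""
        raise NotImplementedError

    def home_timeline(self):
        parser = LinkTextParser(self.post_class)
        parser.feed(self.fetch())
        parser.close()
        return parser.posts


class FakeTieba(HTMLPlatform):
    """Stand-in platform with canned HTML instead of a network call."""

    post_class = "thread-title"

    def fetch(self):
        return (
            '<ul><li><a class="thread-title" href="/p/1">First post</a></li>'
            '<li><a class="other" href="/x">ignored</a></li>'
            '<li><a class="thread-title hot" href="/p/2">Second post</a></li></ul>'
        )


timeline = FakeTieba().home_timeline()
print(timeline)
```

A real platform would replace `fetch` with an HTTP request or a headless-browser call, and the hand-written `post_class` is exactly the kind of selector a learning-based extractor would aim to remove.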
refs: