hupili / snsapi

Cross platform middleware for Social Networking Services: Twitter, Facebook, SinaWeibo, Renren, RSS, Email, Sqlite, ... (more coming)
http://snsapi.ie.cuhk.edu.hk

HTML Parse Platform #27

Open hupili opened 11 years ago

hupili commented 11 years ago

A general HTML parsing platform. It can be subclassed to support a wider class of platforms, e.g. using the same "home_timeline()" interface to fetch new posts from Baidu Tieba.

Possible solutions:

htmlparser looks most appropriate for prototyping.
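
A minimal sketch of the subclassing idea, assuming Python's standard `html.parser` module as the prototyping tool. The class names, the `post_tag` / `post_class` attributes, and the choice to pass the page source into `home_timeline()` as a string are all hypothetical, chosen for illustration; a real platform would fetch the page itself and return proper message objects.

```python
from html.parser import HTMLParser


class HTMLParsePlatform(HTMLParser):
    """Hypothetical base class: turns one HTML page into a list of post texts."""

    # Subclasses name the tag and CSS class that wrap a single post.
    post_tag = "div"
    post_class = "post"

    def __init__(self):
        super().__init__()
        self._depth = 0    # nesting depth of post_tag inside a post, 0 = outside
        self._chunks = []  # text fragments of the post being read
        self.posts = []    # finished posts, in page order

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Already inside a post: only track nested tags of the same name.
            if tag == self.post_tag:
                self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.post_tag and self.post_class in classes:
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if self._depth and tag == self.post_tag:
            self._depth -= 1
            if self._depth == 0:
                # Collapse whitespace so each post is a single clean line.
                self.posts.append(" ".join("".join(self._chunks).split()))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)

    def home_timeline(self, html, count=20):
        """Parse page source (passed in as a string here) into at most `count` posts."""
        self.reset()
        self.posts = []
        self._depth = 0
        self.feed(html)
        self.close()
        return self.posts[:count]


class TiebaLikePlatform(HTMLParsePlatform):
    """Hypothetical subclass: only the wrapper tag and class differ."""
    post_tag = "li"
    post_class = "thread"


page = """
<ul>
  <li class="thread"><a href="/p/1">First <b>topic</b></a></li>
  <li class="ad">sponsored</li>
  <li class="thread hot"><a href="/p/2">Second topic</a></li>
</ul>
"""
timeline = TiebaLikePlatform().home_timeline(page)
print(timeline)
```

The markup in `page` is invented for the example and does not reflect Baidu Tieba's real page structure. The point is only that a new platform needs a few lines of configuration while `home_timeline()` stays the same.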

refs:

hupili commented 11 years ago

@uxian, Note one powerful tool: http://phantomjs.org/

It's a headless browser. We can use it to manipulate page elements and obtain the results. This way, we'll be able to reach more platforms. If service providers block their APIs one day, this tool could still be effective. They cannot stop web access anyway!

hupili commented 11 years ago

Note more tools:

hupili commented 11 years ago

@fqj1994 @xuanqinanhai You can add things you know

daimajia commented 11 years ago

lxml can also work for HTML parsing.

More about lxml: http://lxml.de/

fqj1994 commented 11 years ago

lxml +1

daimajia commented 11 years ago

scrapely https://github.com/scrapy/scrapely

hupili commented 11 years ago

:+1: scrapely is what this thread is pursuing: a framework that can extract structured data from any (perhaps well-formatted) page. To make it easier to use, we cannot ask ordinary users to specify "selectors" or "regex" in the config. This learning-based approach is good if we can make the accuracy higher.

test file

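from scrapely import Scraper

# Page to learn from, plus the value we want extracted from it.
url1 = 'http://baike.baidu.com/view/10378153.htm'
data = {'editor': '苏珊朗格'}
# A similar page from which the same field should be extracted.
url2 = 'http://baike.baidu.com/view/750254.htm'

s = Scraper()
s.train(url1, data)
d = s.scrape(url2)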

result:

>>> print d[0]['editor'][0].encode('utf-8')
我想你好几天</a></span> <span> ,  <a title='查看此用户资料' class='usercard' userName='杀广告者'  target='_blank' href='http://www.baidu.com/p/%E6%9D%80%E5%B9%BF%E5%91%8A%E8%80%85?from=wk'>杀广告者</a></span> <span> ,  <a title='查看此用户资料' class='usercard' userName='andynoty'  target='_blank' href='http://www.baidu.com/p/andynoty?from=wk'>andynoty</a></span> <span> ,  <a title='查看此用户资料' class='usercard' userName='ssss9032'  target='_blank' href='http://www.baidu.com/p/ssss9032?from=wk'>ssss9032</a></span> <span> ,  <a title='查看此用户资料' class='usercard' userName='957264812'  target='_blank' href='http://www.baidu.com/p/957264812?from=wk'>957264812</a></span> <span> ,  <a title='查看此用户资料' class='usercard' userName='朱米淇'  target='_blank' href='http://www.baidu.com/p/%E6%9C%B1%E7%B1%B3%E6%B7%87?from=wk'>朱米淇</a></span> <span> ,  <a title='查看此用户资料' class='usercard' userName='hiombi'  target='_blank' href='http://www.baidu.com/p/hiombi?from=wk'>hiombi</a></span> <span> ,  <a title='查看此用户资料' class='usercard' userName='baihdxlove'  target='_blank' href='http://www.baidu.com/p/baihdxlove?from=wk'>baihdxlove</a></span> <span> ,  <a title='查看此用户资料' class='usercard' userName='PaperPas'  target='_blank' href='http://www.baidu.com/p/PaperPas?from=wk'>PaperPas

Glad to see some research work there.
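
To make the train-by-example idea concrete, here is a toy sketch that learns left and right delimiters around an example value and reuses them on a similar page. This is a deliberately simplified delimiter-based wrapper, not scrapely's actual algorithm; `ToyScraper`, its `context` parameter, and the two sample pages are hypothetical and exist only for illustration.

```python
class ToyScraper:
    """Learns left/right delimiters around example values (not scrapely's algorithm)."""

    def __init__(self, context=20):
        self.context = context  # characters of markup remembered on each side
        self.templates = {}     # field name -> (left delimiter, right delimiter)

    def train(self, html, data):
        """Record the markup surrounding each example value in the training page."""
        for field, value in data.items():
            start = html.find(value)
            if start < 0:
                raise ValueError("example value for %r not found in page" % field)
            end = start + len(value)
            left = html[max(0, start - self.context):start]
            right = html[end:end + self.context]
            self.templates[field] = (left, right)

    def scrape(self, html):
        """Return whatever sits between the learned delimiters in a new page."""
        result = {}
        for field, (left, right) in self.templates.items():
            start = html.find(left)
            if start < 0:
                continue  # delimiter not present: field is skipped
            start += len(left)
            end = html.find(right, start)
            if end >= 0:
                result[field] = html[start:end]
        return result


page1 = '<h1>Page A</h1><span class="editor">alice</span><p>text</p>'
page2 = '<h1>Page B</h1><span class="editor">bob</span><p>other</p>'

s = ToyScraper(context=10)
s.train(page1, {"editor": "alice"})
scraped = s.scrape(page2)
print(scraped)
```

The user supplies only a page and the value they care about, with no selector or regex. The weakness is also visible: if the right delimiter first appears much later in the target page, everything in between is returned, so accuracy depends on how closely the two pages match.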

daimajia commented 11 years ago

Extracting structured data is a kind of data mining. Scrapely is an awesome library for it.

fqj1994 commented 10 years ago

Maybe pyquery?

http://pythonhosted.org/pyquery/api.html

hupili commented 10 years ago

pyquery is a selector-based parser for developers, not the "HTML parsing platform" we are targeting here.

The notion is explained in this email:

https://groups.google.com/d/msg/scrapely/uvTPPHMHTqo/vxTVyI8SVbgJ