HTML Parse Platform

hupili commented 11 years ago

A general HTML parsing platform. It can be subclassed to support a wider range of platforms, e.g. using the same "home_timeline()" interface to get new posts from Baidu Tieba.
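To make the subclassing idea concrete, here is a minimal sketch. The names HTMLParsePlatform, DemoForum, fetch() and parse() are hypothetical and are not taken from snsapi; only the "home_timeline()" entry point comes from the description above. The demo subclass returns a hard-coded page so the example runs without network access.

```python
import re
from abc import ABC, abstractmethod
from urllib.request import urlopen


class HTMLParsePlatform(ABC):
    """Hypothetical base class: fetch a page, then let a subclass parse it."""

    url = None  # each subclass sets the page it reads

    def fetch(self):
        # Download the raw HTML of the configured page.
        with urlopen(self.url) as resp:
            return resp.read().decode("utf-8", errors="replace")

    @abstractmethod
    def parse(self, html):
        """Return a list of message dicts extracted from the HTML."""

    def home_timeline(self, count=20):
        # Same entry point for every platform, whatever the page layout is.
        return self.parse(self.fetch())[:count]


class DemoForum(HTMLParsePlatform):
    """Toy subclass with a hard-coded page (assumption: one <li> per post)."""

    url = "http://example.invalid/forum"

    def fetch(self):
        # Overridden so the sketch does not touch the network.
        return '<ul><li class="post">first</li><li class="post">second</li></ul>'

    def parse(self, html):
        # Regex is enough for this fixed toy markup; real pages need a parser.
        return [{"text": t} for t in re.findall(r'<li class="post">(.*?)</li>', html)]


print(DemoForum().home_timeline(count=1))
```

A real subclass for a site such as Baidu Tieba would keep the inherited fetch() and only supply url and parse().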

Possible solutions:

htmlparser
BeautifulSoup
Regex
mechanize

htmlparser looks most appropriate for prototyping.
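A rough prototype along these lines, assuming "htmlparser" means the standard library parser (the HTMLParser module in Python 2, html.parser in Python 3, which is used here). LinkTextParser and extract_links are illustrative names only; the sketch collects the target and text of every link on a page.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return parser.links


page = '<p><a href="/p/1">First <b>post</b></a> and <a href="/p/2">second</a></p>'
print(extract_links(page))
```

The event-driven style needs no third-party dependency, which suits prototyping, but tracking state by hand gets tedious on deeply nested pages.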

refs:

hupili commented 11 years ago

@uxian, Note one powerful tool: http://phantomjs.org/

It's a headless browser. We can use it to manipulate page elements and obtain the results. In this way, we'll be able to reach more platforms. If service providers block their APIs one day, this tool could still be effective. They cannot stop web access anyway!

hupili commented 11 years ago

Note more tools:

PyQuery: Enables jQuery-style DOM tree navigation in Python. Looks good. Can help clean up the code.
python-readability: Widely used to extract the main content of a general webpage.

hupili commented 11 years ago

@fqj1994 @xuanqinanhai You can add things you know

daimajia commented 11 years ago

lxml can also be used for HTML parsing.

More about lxml: http://lxml.de/

fqj1994 commented 11 years ago

lxml +1

daimajia commented 11 years ago

scrapely https://github.com/scrapy/scrapely

hupili commented 11 years ago

:+1: scrapely is what this thread is pursuing: a framework that can extract structured data from any (perhaps well-formatted) page. To keep things easy to use, we cannot ask ordinary users to specify "selectors" or "regex" in the config. This learning-based approach is good if we can make the accuracy higher.

test file

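from scrapely import Scraper

url1 = 'http://baike.baidu.com/view/10378153.htm'
data = {'editor': '苏珊朗格'}
url2 = 'http://baike.baidu.com/view/750254.htm'

s = Scraper()
s.train(url1, data)  # learn where the 'editor' value sits on the annotated page
d = s.scrape(url2)   # apply the learned template to a second page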

result:

>>> print d[0]['editor'][0].encode('utf-8')
我想你好几天</a></span> <span> ，  <a title='查看此用户资料' class='usercard' userName='杀广告者'  target='_blank' href='http://www.baidu.com/p/%E6%9D%80%E5%B9%BF%E5%91%8A%E8%80%85?from=wk'>杀广告者</a></span> <span> ，  <a title='查看此用户资料' class='usercard' userName='andynoty'  target='_blank' href='http://www.baidu.com/p/andynoty?from=wk'>andynoty</a></span> <span> ，  <a title='查看此用户资料' class='usercard' userName='ssss9032'  target='_blank' href='http://www.baidu.com/p/ssss9032?from=wk'>ssss9032</a></span> <span> ，  <a title='查看此用户资料' class='usercard' userName='957264812'  target='_blank' href='http://www.baidu.com/p/957264812?from=wk'>957264812</a></span> <span> ，  <a title='查看此用户资料' class='usercard' userName='朱米淇'  target='_blank' href='http://www.baidu.com/p/%E6%9C%B1%E7%B1%B3%E6%B7%87?from=wk'>朱米淇</a></span> <span> ，  <a title='查看此用户资料' class='usercard' userName='hiombi'  target='_blank' href='http://www.baidu.com/p/hiombi?from=wk'>hiombi</a></span> <span> ，  <a title='查看此用户资料' class='usercard' userName='baihdxlove'  target='_blank' href='http://www.baidu.com/p/baihdxlove?from=wk'>baihdxlove</a></span> <span> ，  <a title='查看此用户资料' class='usercard' userName='PaperPas'  target='_blank' href='http://www.baidu.com/p/PaperPas?from=wk'>PaperPas

Glad to see some research work there.
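For intuition about the "train on one page, scrape another" idea, here is a deliberately naive sketch. It is not scrapely's algorithm and makes no claim about how scrapely works internally; learn_context and apply_context are made-up names. It remembers the raw text around an annotated value and reuses that context on another page.

```python
def learn_context(page, value, width=20):
    """Remember up to `width` characters before and after the annotated value."""
    start = page.index(value)
    end = start + len(value)
    return page[max(0, start - width):start], page[end:end + width]


def apply_context(page, context):
    """Return the text between the learned prefix and suffix, or None."""
    prefix, suffix = context
    start = page.find(prefix)
    if start == -1:
        return None
    start += len(prefix)
    end = page.find(suffix, start)
    return page[start:end] if end != -1 else None


train_page = '<div><span class="editor">Alice</span></div>'
context = learn_context(train_page, "Alice")

same_layout = '<div><span class="editor">Bob</span></div>'
other_layout = '<div><span class="editor">Bob</span> <span>x</span></div>'

print(apply_context(same_layout, context))
print(apply_context(other_layout, context))
```

On the page with the same layout this yields the bare value. On the page with an extra element the learned suffix only matches further along, so surrounding markup leaks into the result. That is loosely similar in shape to the over-captured output in the result above, and it shows why accuracy is the hard part of the example-based approach.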

daimajia commented 11 years ago

Extracting structured data is a kind of data mining. Scrapely is an awesome library for it.

fqj1994 commented 10 years ago

Maybe pyquery?

http://pythonhosted.org/pyquery/api.html

hupili commented 10 years ago

pyquery is a selector-based parser for developers, not the "HTML parsing platform" we are targeting here.

The notion is explained in this email:

https://groups.google.com/d/msg/scrapely/uvTPPHMHTqo/vxTVyI8SVbgJ

hupili / snsapi

HTML Parse Platform #27