My reading of this bug is that you would like the ability to plug in a parser
of your own (such as BeautifulSoup) for parsing web pages, instead of being
stuck with the default parser.
Is that correct? HarvestMan does provide the sgmlop-based parser, which can
work around (though perhaps not fix) bad (X)HTML, but this could still be useful, I think.
However, what would this interface look like? Would you specify a module and a
class name from which to load the new parser's code? HarvestMan should be able
to work with the new parser, so it would need to follow HarvestMan's interface
rules, i.e. have a "feed()" method to pass in the data, and provide the parsed
links in a 'links' attribute and images in an 'images' attribute. So most
likely you would have to write a wrapper on top of your original parser code.
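To illustrate, here is a minimal sketch of what such a wrapper interface might look like. This uses Python's standard-library html.parser purely as a stand-in for the real parser; the class name is hypothetical, and only the feed()/links/images contract described above is taken from the discussion:

```python
from html.parser import HTMLParser

class ParserWrapper(HTMLParser):
    """Hypothetical wrapper exposing the interface HarvestMan expects:
    a feed() method (inherited here) plus 'links' and 'images' attributes."""

    def __init__(self):
        super().__init__()
        self.links = []   # collected href targets of <a> tags
        self.images = []  # collected src targets of <img> tags

    def handle_starttag(self, tag, attrs):
        # Record links and images as the underlying parser sees them.
        attrs = dict(attrs)
        if tag == 'a' and 'href' in attrs:
            self.links.append(attrs['href'])
        elif tag == 'img' and 'src' in attrs:
            self.images.append(attrs['src'])

p = ParserWrapper()
p.feed('<html><body><a href="page.html">x</a><img src="pic.png"></body></html>')
print(p.links, p.images)  # ['page.html'] ['pic.png']
```

A real wrapper would delegate to the third-party parser internally, but the outward-facing feed()/links/images shape would stay the same.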
A better option might be to support BeautifulSoup directly in the code and
perhaps even make it the default (X)HTML parser. I think I will go for that
solution, since it is far easier and quicker to implement.
Btw, if the bug means that HarvestMan should post-process the downloaded
content, that is not what the program is meant for, but it can be done easily
by using events or plugins and writing your own handlers. See the apps/samples
folder for sample crawlers that use event handlers to perform specific
functions like this.
Let me know what you think.
Original comment by abpil...@gmail.com
on 17 Jul 2008 at 8:09
First of all, this is more of a feature request than a bug, but that was my
only option available. Secondly, I already added the reconstruction of the
pages (using BeautifulSoup) as a plugin, and it works OK (although with some
memory penalties, which was my original reason for investigating memory
consumption).
I think the plugins are a great way to enhance the functionality of the
crawler, and I totally agree we should not mix them with the main purpose of
the code. On the other hand, this could be a very useful feature that would
save lots of users the time of coding it as a plugin themselves.
If you want, I can send you the code where I used the library's "beautification"
feature as a plugin, to be added perhaps as an example.
Original comment by andrei.p...@gmail.com
on 17 Jul 2008 at 8:22
That would be nice. Please attach it to the bug, and perhaps we can change the
bug title to "Add support for beautifulsoup HTML parser in HarvestMan"?
Thanks
--Anand
Original comment by abpil...@gmail.com
on 18 Jul 2008 at 7:31
#!/usr/bin/env python
"""
beautifulsoup_html_reconstruction.py

Demonstrates writing a custom crawler by subscribing to events.
This is a crawler which fetches only web pages and reconstructs
the HTML tags using the BeautifulSoup library.

Modified by Andrei Pruteanu <andrei dot pruteanu at gmail dot com>

Copyright (C) 2008 Anand B Pillai, Andrei Pruteanu
"""

import sys
import string

import __init__
from apps.harvestmanimp import HarvestMan
from BeautifulSoup import BeautifulSoup

def reconstruct_page(raw_page):
    """ Apply BeautifulSoup's prettify method to the raw page """
    soup = BeautifulSoup(raw_page)
    return soup.prettify()

class HtmlCrawler(HarvestMan):
    """ A crawler which fetches only HTML (web) pages """

    def afterParseLinkCB(self, event, *args, **kwargs):
        document = event.document
        url = event.url
        raw_page = str(document)
        reconstructed_page = reconstruct_page(raw_page)
        # The reconstructed page can now be written back to disk
        # or processed further as needed.

    def include_this_link(self, event, *args, **kwargs):
        url = event.url
        if url.is_webpage():
            # Allow for further processing by rules...
            # otherwise we will end up crawling the entire
            # web, since no other rules will apply if we
            # return True here.
            return None
        else:
            return False

if __name__ == "__main__":
    spider = HtmlCrawler()
    spider.initialize()
    spider.bind_event('includelinks', spider.include_this_link)
    spider.bind_event('afterparse', spider.afterParseLinkCB)
    spider.main()
############################################
I agree with the title. What is important is to analyse the performance of the
parser. A hint for this is:
http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
where lxml is the best-performing parser and BeautifulSoup has average
performance.
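As a rough way to compare parser speed on one's own pages, a minimal timing sketch might look like the following. It uses only the standard-library html.parser and timeit as stand-ins; the sample document and tag-counting parser are made up for illustration:

```python
import timeit
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Trivial parser that just counts start tags, so the timing
    measures parsing overhead rather than any real processing."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        self.count += 1

# Synthetic page: <html>, <body>, then 200 paragraphs each with a link.
SAMPLE = "<html><body>" + "<p><a href='x'>link</a></p>" * 200 + "</body></html>"

def parse_once():
    p = TagCounter()
    p.feed(SAMPLE)
    return p.count

elapsed = timeit.timeit(parse_once, number=50)
print("50 parses took %.3fs; start tags per parse: %d" % (elapsed, parse_once()))
```

Swapping parse_once() out for the equivalent BeautifulSoup or lxml call on the same sample would give a like-for-like comparison on the machine in question.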
Original comment by andrei.p...@gmail.com
on 18 Jul 2008 at 4:59
Will look at this next week...
Original comment by abpil...@gmail.com
on 6 Oct 2008 at 11:36
Original issue reported on code.google.com by
andrei.p...@gmail.com
on 17 Jul 2008 at 7:41