My reading of this bug is that you would like the ability to plug in a parser
of your own (such as BeautifulSoup) for parsing web pages, instead of being
stuck with the default parser.
Is that correct? HarvestMan does provide the sgmlop-based parser, which can
work around (though perhaps not fix) bad (X)HTML, but this could still be useful, I think.
However, what would this interface look like? Would you specify a module and a
class name from which to load the new parser's code? HarvestMan should be able
to work with the new parser, so it would need to follow HarvestMan's interface
rules, i.e. have a "feed()" method to pass in the data, and provide the parsed
links in a 'links' attribute and images in an 'images' attribute. So most
likely you would have to write a wrapper on top of your original parser code.
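To illustrate, here is a minimal sketch of what such a wrapper interface might look like. This uses Python's standard-library html.parser purely as a stand-in for the real parser; the class name is hypothetical, and only the feed()/links/images contract described above is taken from the discussion:

```python
from html.parser import HTMLParser

class ParserWrapper(HTMLParser):
    """Hypothetical wrapper exposing the interface HarvestMan expects:
    a feed() method (inherited here) plus 'links' and 'images' attributes."""

    def __init__(self):
        super().__init__()
        self.links = []   # collected href targets of <a> tags
        self.images = []  # collected src targets of <img> tags

    def handle_starttag(self, tag, attrs):
        # Record links and images as the underlying parser sees them.
        attrs = dict(attrs)
        if tag == 'a' and 'href' in attrs:
            self.links.append(attrs['href'])
        elif tag == 'img' and 'src' in attrs:
            self.images.append(attrs['src'])

p = ParserWrapper()
p.feed('<html><body><a href="page.html">x</a><img src="pic.png"></body></html>')
print(p.links, p.images)  # ['page.html'] ['pic.png']
```

A real wrapper would delegate to the third-party parser internally, but the outward-facing feed()/links/images shape would stay the same.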
A better option might be to support BeautifulSoup directly in the code and
perhaps even make it the default (X)HTML parser. I think I will go for that
solution, since it is far easier and quicker to implement.
Btw, if the bug means that HarvestMan should post-process the downloaded
content, that is not what the program is meant for, but it can be done easily
by using events or plugins and writing your own handlers. See the apps/samples
folder for sample crawlers that use event handlers to perform specific
functions like this.
Let me know what you think.
Original comment by abpil...@gmail.com
on 17 Jul 2008 at 8:09
First of all, this is more of a feature request than a bug, but that was my
only option available. Secondly, I already added the reconstruction of the
pages (using BeautifulSoup) as a plugin, and it works OK (although with some
memory penalties, which was my original reason for investigating memory
consumption).
I think the plugins are a great way to enhance the functionality of the
crawler, and I totally agree we should not mix them with the main purpose of
the code. On the other hand, this could be a very useful feature that would
save lots of users the time of coding it as a plugin themselves.
If you want, I can send you the code where I used the library's "beautification"
feature as a plugin, to be added perhaps as an example.
Original comment by andrei.p...@gmail.com
on 17 Jul 2008 at 8:22
That would be nice. Please attach it to the bug, and perhaps we can change the
bug title to "Add support for beautifulsoup HTML parser in HarvestMan"?
Thanks
--Anand
Original comment by abpil...@gmail.com
on 18 Jul 2008 at 7:31
#!/usr/bin/env python
"""
beautifulsoup_html_reconstruction.py

Demonstrates writing a custom crawler by subscribing to events.
This is a crawler which fetches only web pages and reconstructs
the HTML tags using the BeautifulSoup library.

Modified by Andrei Pruteanu <andrei dot pruteanu at gmail dot com>

Copyright (C) 2008 Anand B Pillai, Andrei Pruteanu
"""

import sys
import string

import __init__
from apps.harvestmanimp import HarvestMan
from BeautifulSoup import BeautifulSoup

def reconstruct_page(raw_page):
    """ Apply BeautifulSoup's prettify method to the raw page """
    soup = BeautifulSoup(raw_page)
    return soup.prettify()

class HtmlCrawler(HarvestMan):
    """ A crawler which fetches only HTML (web) pages """

    def afterParseLinkCB(self, event, *args, **kwargs):
        document = event.document
        url = event.url
        raw_page = str(document)
        reconstructed_page = reconstruct_page(raw_page)
        # The reconstructed page can now be written back to disk
        # or processed further as needed.

    def include_this_link(self, event, *args, **kwargs):
        url = event.url
        if url.is_webpage():
            # Allow for further processing by rules...
            # otherwise we will end up crawling the entire
            # web, since no other rules will apply if we
            # return True here.
            return None
        else:
            return False

if __name__ == "__main__":
    spider = HtmlCrawler()
    spider.initialize()
    spider.bind_event('includelinks', spider.include_this_link)
    spider.bind_event('afterparse', spider.afterParseLinkCB)
    spider.main()
############################################
I agree with the title. What is important is to analyse the performance of the
parser. A hint for this is:
http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
where lxml is the best-performing parser and BeautifulSoup has average
performance.
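As a rough way to compare parser speed on one's own pages, a minimal timing sketch might look like the following. It uses only the standard-library html.parser and timeit as stand-ins; the sample document and tag-counting parser are made up for illustration:

```python
import timeit
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Trivial parser that just counts start tags, so the timing
    measures parsing overhead rather than any real processing."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        self.count += 1

# Synthetic page: <html>, <body>, then 200 paragraphs each with a link.
SAMPLE = "<html><body>" + "<p><a href='x'>link</a></p>" * 200 + "</body></html>"

def parse_once():
    p = TagCounter()
    p.feed(SAMPLE)
    return p.count

elapsed = timeit.timeit(parse_once, number=50)
print("50 parses took %.3fs; start tags per parse: %d" % (elapsed, parse_once()))
```

Swapping parse_once() out for the equivalent BeautifulSoup or lxml call on the same sample would give a like-for-like comparison on the machine in question.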
Original comment by andrei.p...@gmail.com
on 18 Jul 2008 at 4:59
Will look at this next week...
Original comment by abpil...@gmail.com
on 6 Oct 2008 at 11:36
Original issue reported on code.google.com by
andrei.p...@gmail.com
on 17 Jul 2008 at 7:41