spritt82 / harvestman-crawler

Automatically exported from code.google.com/p/harvestman-crawler

Permaloop? #25

Closed. GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
Not sure...

What is the expected output? What do you see instead?

I get a permanent loop in a subclass of HarvestMan that says...
[22:21:18] SGML parse error: unexpected ':' char in declaration
[22:21:18] Error in parsing web-page http://newsbiscuit.com/ 

... is this a problem on your side or mine? Or should def handle(self, event, *args, **kwargs):
perhaps take the erroring URL out of the to-do list?

thanks

tom

Original issue reported on code.google.com by remarkability@gmail.com on 6 Oct 2008 at 9:26

GoogleCodeExporter commented 9 years ago
Can you provide a log and config.xml

Thanks,
Lucas

Original comment by szybal...@gmail.com on 7 Oct 2008 at 2:42

GoogleCodeExporter commented 9 years ago
I'm not using a config.xml. My code kind of goes like this...

I'm attempting to create a crawler that...
a. Doesn't store local files, just calls a callback
b. Roams a bit but not much. Ideally 2 or 3 steps from the given url.

The site that always errors is http://newsbiscuit.com with ...

[09:19:31] SGML parse error: unexpected ':' char in declaration
[09:19:31] Error in parsing web-page http://newsbiscuit.com/ 
[09:19:31] SGML parse error: unexpected ':' char in declaration
[09:19:31] Error in parsing web-page http://newsbiscuit.com/ 

... but I suspect that in my handle() call, unless a URL is removed from the list,
it'll just keep trying and get stuck in a loop. Am I right?
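
The behaviour asked about here (roam only 2 or 3 steps from the start URL, call a callback instead of storing files, and never retry a URL that errored) can be sketched independently of HarvestMan. This is a minimal Python 3 illustration, not HarvestMan's actual scheduler; crawl, fetch_links and on_page are hypothetical names introduced for the example.

```python
from collections import deque


def crawl(start_url, fetch_links, on_page, max_depth=2):
    """Breadth-first crawl that visits each URL at most once.

    fetch_links(url) returns the links found on a page and may raise on
    a download or parse failure; on_page(url) is the user callback.
    Both are hypothetical stand-ins, not HarvestMan APIs.
    """
    seen = {start_url}               # every URL that was ever queued
    failed = []                      # (url, error message) pairs
    queue = deque([(start_url, 0)])  # (url, steps from the start URL)
    while queue:
        url, depth = queue.popleft()
        try:
            links = fetch_links(url)
        except Exception as exc:
            # A failing URL is recorded and dropped; it stays in `seen`,
            # so it can never be queued again and cannot cause a loop.
            failed.append((url, str(exc)))
            continue
        on_page(url)
        if depth >= max_depth:
            continue                 # do not roam any further from here
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return failed
```

Because a URL is added to seen before it is fetched, a page that fails to parse is attempted exactly once, whatever the handler does afterwards.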

cheers

tom

Original comment by remarkability@gmail.com on 7 Oct 2008 at 8:20

Attachments:

GoogleCodeExporter commented 9 years ago
I also notice that I get...

Thread died due to exception =>  'HarvestManUrl' object has no attribute 'get_original_url_directory'

A dir(event.url) reveals...

['TEST', '__class__', '__delattr__', '__dict__', '__doc__', '__getattribute__',
'__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__',
'__repr__', '__setattr__', '__slotnames__', '__str__', '__weakref__', 'absurl',
'anchor', 'anchorcheck', 'baseurl', 'cgi', 'clength', 'compute_dirpaths',
'compute_domain_and_port', 'compute_file_and_dir_paths', 'contentdict', 'defproto',
'dirpath', 'domain', 'fatal', 'filelike', 'filename', 'generation', 'get_anchor',
'get_anchor_url', 'get_base_domain', 'get_base_domain_with_port',
'get_canonical_url', 'get_data_hash', 'get_domain', 'get_domain_hash',
'get_domain_with_port', 'get_download_status', 'get_filename', 'get_full_domain',
'get_full_domain_with_port', 'get_full_filename', 'get_full_url',
'get_full_url_sans_port', 'get_generation', 'get_local_directory',
'get_original_state', 'get_original_url', 'get_parent_url', 'get_port_number',
'get_priority', 'get_relative_depth', 'get_relative_filename', 'get_relative_url',
'get_root_dir', 'get_type', 'get_url', 'get_url_content_info', 'get_url_directory',
'get_url_directory_sans_domain', 'get_url_hash', 'hasextn', 'hashes', 'index',
'is_audio', 'is_cgi', 'is_document', 'is_equal', 'is_filename_url', 'is_image',
'is_multimedia', 'is_relative_path', 'is_relative_to_server', 'is_stylesheet',
'is_video', 'is_webpage', 'isrel', 'isrels', 'lastpath', 'make_document',
'make_valid_filename', 'make_valid_url', 'manage_content_type', 'mindex',
'mirror_url', 'mirrored', 'orig_state', 'origurl', 'pagehash', 'port', 'priority',
'protocol', 'qstatus', 'range', 'rdepth', 'recalc_locations', 'redirected',
'redirected_old', 'reduce_url', 'reresolved', 'reset', 'resolve_protocol',
'resolveurl', 'rootdir', 'rpath', 'rulescheckdone', 'set_directory_url',
'set_url_content_info', 'starturl', 'status', 'trymultipart', 'typ', 'url',
'urlflag', 'validfilename', 'violates_rules', 'violatesrules', 'wrapper_resolveurl']

Original comment by remarkability@gmail.com on 7 Oct 2008 at 8:48

GoogleCodeExporter commented 9 years ago
I added ...

    def get_original_url_directory(self):
        return self.get_url_directory()

to urlparser.py, assuming that this was a minor oversight (and it seems to have fixed
that particular glitch), but I'm still getting SGML errors... at this point in crawler.py

    except (SGMLParseError, IOError), e:
        error('SGML parse error:', str(e))
        error('Error in parsing web-page %s' % self.url)

... and somehow the current URL gets retried, or not flushed, or something...

Original comment by remarkability@gmail.com on 7 Oct 2008 at 9:08

GoogleCodeExporter commented 9 years ago
I would imagine that the SGML parser failing to parse a page will be fairly common...
and in many ways, if it fails, I don't want anything to do with it. Adding a ...

return

...under ...

error('Error in parsing web-page %s' % self.url) 

...appears to stop the major loop from happening. I wonder what nasty
side effects I'm introducing though... seems a bit of a hacky fix.

Original comment by remarkability@gmail.com on 7 Oct 2008 at 9:20

GoogleCodeExporter commented 9 years ago
...this didn't take long...

[10:34:55] Failed to download URL http://www.feedblitz.com/f/f.fbz?AddNewUserDirect
[10:34:55] Not Found =>  http://www.blogger.com/redirect/next_blog.pyra
[10:34:55] Failed to download URL http://www.blogger.com/redirect/next_blog.pyra
[10:35:05] Ending Project mnogo ...
Exception received=> 'NoneType' object has no attribute 'status'
Printing error traceback for debugging...
  File "/Users/everythingability/harvestman-crawler-read-only/HarvestMan/harvestman/apps/harvestman.py", line 478, in run_projects
  File "/Users/everythingability/harvestman-crawler-read-only/HarvestMan/harvestman/apps/harvestman.py", line 518, in run_project
  File "/Users/everythingability/harvestman-crawler-read-only/HarvestMan/harvestman/apps/harvestman.py", line 145, in finish_project
  File "/Users/everythingability/harvestman-crawler-read-only/HarvestMan/harvestman/lib/datamgr.py", line 462, in post_download_setup
HarvestMan session finished. 
[Errno 24] Too many open files: '.bidx26453136'
[Errno 24] Too many open files: '.bidx26454480'
DONE!

..note to self, don't hack code you don't understand :-)

Original comment by remarkability@gmail.com on 7 Oct 2008 at 10:27

GoogleCodeExporter commented 9 years ago
Accepting for this week-end's marathon hack on harvestman.

Original comment by abpil...@gmail.com on 9 Oct 2008 at 8:08

GoogleCodeExporter commented 9 years ago
I checked this issue by directly crawling http://newsbiscuit.com and I don't see any
loop. The parsing loop works like this.

1. The default parser is the Python sgmllib-based parser. The data is parsed with
this first.
2. If a parse error is found (this is where the "Error in parsing..." message comes
from), we try parsing once with the sgmlop-based parser.
3. If this parser works, fine. If an error is still produced, we break out of the loop.

So the while loop in parsing can never produce an infinite (perma) loop.
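
The bounded retry described in the three steps above can be sketched as follows. This is a minimal Python 3 illustration under stated assumptions, not the code in crawler.py: ParseError, parse_with_fallback and the two parser functions are hypothetical stand-ins for the sgmllib-based and sgmlop-based parsers.

```python
class ParseError(Exception):
    """Stand-in for the parse error raised by a real parser."""


def parse_with_fallback(data, parsers):
    """Try each parser exactly once, in order.

    Returns (result, attempts, errors). The number of attempts can never
    exceed len(parsers), so the loop always terminates.
    """
    errors = []
    for count, parser in enumerate(parsers, start=1):
        try:
            return parser(data), count, errors
        except ParseError as exc:
            # Log-worthy failure; fall through to the next parser.
            errors.append(str(exc))
    return None, len(parsers), errors


def strict_parser(data):
    # Hypothetical strict parser that rejects the page.
    raise ParseError("unexpected ':' char in declaration")


def lenient_parser(data):
    # Hypothetical tolerant parser that accepts anything.
    return data.upper()


result, attempts, errors = parse_with_fallback(
    "<html>", [strict_parser, lenient_parser]
)
```

With two parsers in the list, a page is parsed at most twice: one logged failure followed by a silent success is the same pattern as the "SGML parse error" lines reported in this issue.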

I tried with your example (after a few fixes in the code) and it worked out all right.
In fact, in this case the Python parser is failing and the sgmlop parser is working:
you get log lines for the failure, but none telling you that the second attempt by the
sgmlop parser went through. Also, if parsing had not worked, the first page could not
have been parsed and the crawl would not have proceeded anyway. I saw that it was able
to crawl a lot of pages before I manually killed it.

You also have a mistake in your code. cfg.datamode should be 1 for simulation (1 is
in-memory; the default, 0, is temp files), since we don't save any files, including
temp files, during a simulation. So if the crawl has to proceed beyond the first page
in a simulated crawl, the datamode should be 1, i.e. CONNECTOR_DATA_MODE_INMEM. See
ext/simulator.py.

Closing this bug as fixed.

Original comment by abpil...@gmail.com on 11 Oct 2008 at 9:47

GoogleCodeExporter commented 9 years ago
btw, I had got this error earlier.

[03:10:29] Starting project mnogo ...
[03:10:29] Writing Project Files... 
[03:10:29] Starting download of url http://newsbiscuit.com ...
[03:10:30] Downloading file for url http://newsbiscuit.com/
[03:10:43] SGML parse error: unexpected ':' char in declaration
[03:10:43] Error in parsing web-page http://newsbiscuit.com/ 
TITLE: NewsBiscuit
URL: http://newsbiscuit.com/
LAST MODIFIED: 
KEYWORDS: newsbiscuit
DESCRIPTION: global name 'descape' is not defined

...
I see "descape" in the code, but where is it defined?

I removed this piece to test the code.

Here is the full log of my run with example.py, before I killed the crawler explicitly.

--------------------------------------------------------------------------

anand@anand-laptop:~/projects/harvestman/HarvestMan-trunk$ python example.py http://newsbiscuit.com
usage: python example.py http://newsbiscuit.com
['example.py', 'http://newsbiscuit.com']
running http://newsbiscuit.com
[]
[03:13:02] *** Log Started ***

[03:13:02] Starting project mnogo ...
[03:13:02] Writing Project Files... 
[03:13:02] Starting download of url http://newsbiscuit.com ...
[03:13:02] Downloading file for url http://newsbiscuit.com/
[03:13:12] SGML parse error: unexpected ':' char in declaration
[03:13:12] Error in parsing web-page http://newsbiscuit.com/ 
TITLE: NewsBiscuit
URL: http://newsbiscuit.com/
LAST MODIFIED: 
KEYWORDS: newsbiscuit
DESCRIPTION: 
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 57

[03:13:12] Fetching links for url http://newsbiscuit.com/
[03:13:12] Not Found =>  http://newsbiscuit.com/robots.txt
[03:13:13] Downloading file for url http://newsbiscuit.com/rss
[03:13:13] Downloading file for url http://newsbiscuit.com/article/feature-people-pre-judge-me-because-i-look-like-hitler-218
[03:13:13] Downloading file for url http://newsbiscuit.com/category/arts-and-entertainment
[03:13:13] Not Found =>  http://fusion.google.com/robots.txt
[03:13:14] Downloading file for url http://newsbiscuit.com/category/health
[03:13:15] SGML parse error: expected name token at '<!::debug:: archive '
[03:13:15] Error in parsing web-page http://newsbiscuit.com/article/feature-people-pre-judge-me-because-i-look-like-hitler-218

TITLE: NewsBiscuit: ‘People pre-judge me because I look like Hitler’
URL: http://newsbiscuit.com/article/feature-people-pre-judge-me-because-i-look-like-hitler-218
LAST MODIFIED: 
KEYWORDS: newsbiscuit
DESCRIPTION: 
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 63

TITLE: NewsBiscuit: Arts/Entertainment
URL: http://newsbiscuit.com/category/arts-and-entertainment
LAST MODIFIED: 
KEYWORDS: newsbiscuit
DESCRIPTION: 
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 60

[03:13:15] Fetching links for url http://newsbiscuit.com/article/feature-people-pre-judge-me-because-i-look-like-hitler-218
[03:13:16] SGML parse error: expected name token at '<!::debug:: category'
[03:13:16] Error in parsing web-page http://newsbiscuit.com/category/health 
TITLE: NewsBiscuit: Health
URL: http://newsbiscuit.com/category/health
LAST MODIFIED: 
KEYWORDS: newsbiscuit
DESCRIPTION: 
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 60

[03:13:16] Downloading file for url http://newsbiscuit.com/category/features
[03:13:16] Downloading file for url http://newsbiscuit.com/about/you-write-the-news
[03:13:16] Fetching links for url http://newsbiscuit.com/category/health
[03:13:17] Downloading file for url http://newsbiscuit.com/category/sport
[03:13:17] Fetching links for url http://newsbiscuit.com/category/arts-and-entertainment
TITLE: NewsBiscuit: Features
URL: http://newsbiscuit.com/category/features
LAST MODIFIED: 
KEYWORDS: newsbiscuit
DESCRIPTION: 
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 60

[03:13:18] Not Found =>  http://clkuk.tradedoubler.com/robots.txt
TITLE: NewsBiscuit: You write the news...
URL: http://newsbiscuit.com/about/you-write-the-news
LAST MODIFIED: 
KEYWORDS: newsbiscuit
DESCRIPTION: 
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 50

[03:13:18] Downloading file for url http://newsbiscuit.com/category/education
[03:13:18] Fetching links for url http://newsbiscuit.com/about/you-write-the-news
TITLE: NewsBiscuit: Sport
URL: http://newsbiscuit.com/category/sport
LAST MODIFIED: 
KEYWORDS: newsbiscuit
DESCRIPTION: 
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 60

[03:13:19] Downloading file for url http://newsbiscuit.com/category/world-news
[03:13:19] Fetching links for url http://newsbiscuit.com/category/sport
[03:13:19] Downloading file for url http://newsbiscuit.com/category/politics
[03:13:20] Fetching links for url http://newsbiscuit.com/category/features
TITLE: NewsBiscuit: Education
URL: http://newsbiscuit.com/category/education
LAST MODIFIED: 
KEYWORDS: newsbiscuit
DESCRIPTION: 
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 60

[03:13:20] Fetching links for url http://newsbiscuit.com/category/education
[03:13:20] Downloading file for url http://newsbiscuit.com/about/advertise-on-newsbiscuit
TITLE: NewsBiscuit: World News
URL: http://newsbiscuit.com/category/world-news
LAST MODIFIED: 
KEYWORDS: newsbiscuit
DESCRIPTION: 
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 60

[03:13:20] Fetching links for url http://newsbiscuit.com/category/world-news
TITLE: NewsBiscuit: Politics
URL: http://newsbiscuit.com/category/politics
LAST MODIFIED: 
KEYWORDS: newsbiscuit
DESCRIPTION: 
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 60

[03:13:21] Downloading file for url http://newsbiscuit.com/about/login
[03:13:21] Downloading file for url http://newsbiscuit.com/about/faq
[03:13:21] Fetching links for url http://newsbiscuit.com/category/politics
TITLE: NewsBiscuit: Advertise on NewsBiscuit
URL: http://newsbiscuit.com/about/advertise-on-newsbiscuit
LAST MODIFIED: 
KEYWORDS: newsbiscuit
DESCRIPTION: 
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 49

[03:13:22] Downloading file for url http://astore.amazon.co.uk/newsbiscuit-21
[03:13:22] Fetching links for url http://newsbiscuit.com/about/advertise-on-newsbiscuit
TITLE: NewsBiscuit: Frequently Asked Questions
URL: http://newsbiscuit.com/about/faq
LAST MODIFIED: 
KEYWORDS: newsbiscuit
DESCRIPTION: 
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 66

[03:13:23] Fetching links for url http://newsbiscuit.com/about/faq
TITLE: NewsBiscuit Store - shopping here helps pay for NewsBiscuit - Home Page
URL: http://astore.amazon.co.uk/newsbiscuit-21
LAST MODIFIED: 
keywords error list index out of range
KEYWORDS: 
DESCRIPTION: 
CONTENT_TYPE: text/html; charset=UTF-8
NUM OF LINKS: 20

[03:13:23] Downloading file for url http://newsbiscuit.com/category/celebrity
[03:13:23] Downloading file for url http://newsbiscuit.com/article/pools-panel-declare-war-on-terror-away-win-381
[03:13:24] Not Found =>  http://www.newsbiscuit.com/robots.txt
TITLE: NewsBiscuit: Login
URL: http://newsbiscuit.com/about/login
LAST MODIFIED: 
KEYWORDS: newsbiscuit
DESCRIPTION: 
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 49

[03:13:24] Downloading file for url http://newsbiscuit.com/article/poland-overwhelmed-by-influx-of-british-investment-bankers-379
TITLE: NewsBiscuit: Celebrity
URL: http://newsbiscuit.com/category/celebrity
LAST MODIFIED: 
KEYWORDS: newsbiscuit
DESCRIPTION: 
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 60

[03:13:25] Forbidden =>  http://en.wikipedia.org/robots.txt
[03:13:26] Downloading file for url http://newsbiscuit.com/article/global-financial-meltdown-averted-as-drunk-points-out-its-all-just-made-up-numbers-innit
TITLE: NewsBiscuit: Pools panel declare War on Terror ‘Away Win’
URL: http://newsbiscuit.com/article/pools-panel-declare-war-on-terror-away-win-381
LAST MODIFIED: 
KEYWORDS: newsbiscuit
DESCRIPTION: 
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 67

TITLE: NewsBiscuit: Poland overwhelmed by influx of British investment bankers
URL: http://newsbiscuit.com/article/poland-overwhelmed-by-influx-of-british-investment-bankers-379
LAST MODIFIED: 
KEYWORDS: newsbiscuit
DESCRIPTION: 
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 66

[03:13:26] Downloading file for url http://newsbiscuit.com/article/sat-navs-starting-to-chat-make-racist-comments-378
[03:13:26] Downloading file for url http://newsbiscuit.com/article/government-steps-in-to-avoid-bankruptcy-in-family-game-of-monopoly-383
TITLE: NewsBiscuit: Global financial meltdown averted as drunk points out ‘It's all just made up numbers innit?’
URL: http://newsbiscuit.com/article/global-financial-meltdown-averted-as-drunk-points-out-its-all-just-made-up-numbers-innit
LAST MODIFIED: 
KEYWORDS: newsbiscuit
DESCRIPTION: 
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 67

[03:13:29] Downloading file for url http://newsbiscuit.com/about/terms-and-conditions
TITLE: NewsBiscuit: Government steps in to avoid bankruptcy in family game of Monopoly
URL: http://newsbiscuit.com/article/government-steps-in-to-avoid-bankruptcy-in-family-game-of-monopoly-383
LAST MODIFIED: 
KEYWORDS: newsbiscuit
DESCRIPTION: 
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 67

[03:13:29] Downloading file for url http://newsbiscuit.com/rss/
[03:13:30] Downloading file for url http://www.del.co.uk/
TITLE: NewsBiscuit: Sat Navs starting to chat, make racist comments
URL: http://newsbiscuit.com/article/sat-navs-starting-to-chat-make-racist-comments-378
LAST MODIFIED: 
KEYWORDS: newsbiscuit
DESCRIPTION: 
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 66

TITLE: NewsBiscuit: Terms and Conditions
URL: http://newsbiscuit.com/about/terms-and-conditions
LAST MODIFIED: 
KEYWORDS: newsbiscuit
DESCRIPTION: 
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 49

[03:13:32] Downloading file for url http://newsbiscuit.com/category/science-and-technology
[03:13:32] Downloading file for url http://newsbiscuit.com/category/uk-news
TITLE: NewsBiscuit: Science
URL: http://newsbiscuit.com/category/science-and-technology
LAST MODIFIED: 
KEYWORDS: newsbiscuit
DESCRIPTION: 
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 60

[03:13:34] Downloading file for url http://www.humorfeed.com/
[03:13:34] Fetching links for url http://newsbiscuit.com/article/pools-panel-declare-war-on-terror-away-win-381
TITLE: Deluxe Corporation
URL: http://del.co.uk/welcome-to-deluxe-corporation.html
LAST MODIFIED: Thu, 03 Jul 2008 15:26:13 GMT
KEYWORDS: deluxe corporation
DESCRIPTION: Deluxe Corporation - Digital Creativity. Ultimate design and rock-solid reliability in TV, DVD, video, streaming services, second life and multimedia.
CONTENT_TYPE: text/html; charset=UTF-8
NUM OF LINKS: 14

[03:13:34] Downloading file for url http://newsbiscuit.com/about/about-newsbiscuit
TITLE: NewsBiscuit: UK News
URL: http://newsbiscuit.com/category/uk-news
LAST MODIFIED: 
KEYWORDS: newsbiscuit
DESCRIPTION: 
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 60

[03:13:36] Downloading file for url http://newsbiscuit.com/category/business
TITLE: NewsBiscuit: About NewsBiscuit
URL: http://newsbiscuit.com/about/about-newsbiscuit
LAST MODIFIED: 
KEYWORDS: newsbiscuit
DESCRIPTION: 
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 53

[03:13:36] Fetching links for url http://astore.amazon.co.uk/newsbiscuit-21
[03:13:36] Fetching links for url http://newsbiscuit.com/about/about-newsbiscuit
[03:13:37] Downloading file for url http://newsbiscuit.com/category/environment
TITLE: NewsBiscuit: Business
URL: http://newsbiscuit.com/category/business
LAST MODIFIED: 
KEYWORDS: newsbiscuit
DESCRIPTION: 
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 60

[03:13:38] Ending Project mnogo ...
TITLE: Humorfeed - Your Satire News Source
URL: http://www.humorfeed.com/
LAST MODIFIED: 
keywords error list index out of range
KEYWORDS: 
DESCRIPTION: 
CONTENT_TYPE: text/html
NUM OF LINKS: 56

TITLE: NewsBiscuit: Environment
URL: http://newsbiscuit.com/category/environment
LAST MODIFIED: 
KEYWORDS: newsbiscuit
DESCRIPTION: 
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 60

[03:13:41]   
[03:13:41]   
[03:13:41] HarvestMan crawl simulation of mnogo completed in 38.92 seconds.
[03:13:41] 266 links scanned in 12 servers .
[03:13:41] No file written. 
[03:13:41] 603581  bytes received at the rate of 15.14 KB/sec .
[03:13:41] *** Log Completed ***

HarvestMan session finished. 
DONE!
--------------------------------------------------------------------------

Original comment by abpil...@gmail.com on 11 Oct 2008 at 9:49

GoogleCodeExporter commented 9 years ago
OK. I added a log message for when a page is re-parsed successfully a second time using
the sgmlop parser. To see it, you need to run the logger at a verbosity level of
EXTRAINFO.

cfg.add(url, 'mnogo', '/tmp', verbosity="extrainfo")

Now you see:

[03:35:54] Starting download of url http://newsbiscuit.com ...
[03:35:54] Downloading file for url http://newsbiscuit.com/
[03:35:57] Html filter prevents download of url => http://newsbiscuit.com/
[03:35:57] Parsing web page http://newsbiscuit.com/
Parse count => 1
[03:35:57] SGML parse error: unexpected ':' char in declaration
[03:35:57] Error in parsing web-page http://newsbiscuit.com/ 
Parse count => 2
[03:35:57] Parsed web page successfully in second attempt http://newsbiscuit.com/
...

So, I guess that completes the "fix" and makes it clear what is happening :)

Original comment by abpil...@gmail.com on 11 Oct 2008 at 10:09