Closed GoogleCodeExporter closed 9 years ago
Can you provide a log and config.xml?
Thanks,
Lucas
Original comment by szybal...@gmail.com
on 7 Oct 2008 at 2:42
I'm not using a config.xml. My code kind of goes like this...
I'm attempting to create a crawler that...
a. Doesn't store local files, just calls a callback
b. Roams a bit but not much. Ideally 2 or 3 steps from the given url.
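A minimal sketch of the behaviour described in (a) and (b), independent of HarvestMan: a breadth-first crawl that hands each page to a callback, writes no files, and stops a fixed number of steps from the start URL. The names crawl, fetch_links and on_page are hypothetical, and a small dictionary stands in for the network so the example runs offline.

```python
from collections import deque

def crawl(start_url, fetch_links, on_page, max_depth=2):
    """Breadth-first crawl limited to max_depth steps from start_url.

    fetch_links(url) returns the URLs found on that page (hypothetical,
    supplied by the caller). on_page(url, depth) is the callback.
    """
    seen = {start_url}              # every URL is queued at most once
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        on_page(url, depth)         # no local files, just the callback
        if depth == max_depth:
            continue                # do not roam any further from here
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen

# Stand-in link graph instead of real HTTP requests (assumption).
GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": ["e"],
    "e": ["a"],
}

visited = []
crawl("a",
      lambda u: GRAPH.get(u, []),
      lambda u, d: visited.append((u, d)),
      max_depth=2)
print(visited)
```

Because each URL enters the seen set only once, no page is visited twice even when pages link back to each other.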
The site that always errors is http://newsbiscuit.com with ...
[09:19:31] SGML parse error: unexpected ':' char in declaration
[09:19:31] Error in parsing web-page http://newsbiscuit.com/
[09:19:31] SGML parse error: unexpected ':' char in declaration
[09:19:31] Error in parsing web-page http://newsbiscuit.com/
... but I suspect that in my handle() call, unless a URL is removed from the list,
it'll just keep trying and get stuck in a loop. Am I right?
cheers
tom
Original comment by remarkability@gmail.com
on 7 Oct 2008 at 8:20
I also notice that I get...
Thread died due to exception => 'HarvestManUrl' object has no attribute 'get_original_url_directory'
a dir(event.url) reveals...
['TEST', '__class__', '__delattr__', '__dict__', '__doc__', '__getattribute__',
'__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__',
'__repr__', '__setattr__', '__slotnames__', '__str__', '__weakref__', 'absurl',
'anchor', 'anchorcheck', 'baseurl', 'cgi', 'clength', 'compute_dirpaths',
'compute_domain_and_port', 'compute_file_and_dir_paths', 'contentdict',
'defproto', 'dirpath', 'domain', 'fatal', 'filelike', 'filename', 'generation',
'get_anchor', 'get_anchor_url', 'get_base_domain', 'get_base_domain_with_port',
'get_canonical_url', 'get_data_hash', 'get_domain', 'get_domain_hash',
'get_domain_with_port', 'get_download_status', 'get_filename',
'get_full_domain', 'get_full_domain_with_port', 'get_full_filename',
'get_full_url', 'get_full_url_sans_port', 'get_generation', 'get_local_directory',
'get_original_state', 'get_original_url', 'get_parent_url', 'get_port_number',
'get_priority', 'get_relative_depth', 'get_relative_filename',
'get_relative_url', 'get_root_dir', 'get_type', 'get_url',
'get_url_content_info', 'get_url_directory', 'get_url_directory_sans_domain',
'get_url_hash', 'hasextn', 'hashes', 'index',
'is_audio', 'is_cgi', 'is_document', 'is_equal', 'is_filename_url', 'is_image',
'is_multimedia', 'is_relative_path', 'is_relative_to_server', 'is_stylesheet',
'is_video', 'is_webpage', 'isrel', 'isrels', 'lastpath', 'make_document',
'make_valid_filename', 'make_valid_url', 'manage_content_type', 'mindex',
'mirror_url', 'mirrored', 'orig_state', 'origurl', 'pagehash', 'port', 'priority',
'protocol', 'qstatus', 'range', 'rdepth', 'recalc_locations', 'redirected',
'redirected_old', 'reduce_url', 'reresolved', 'reset', 'resolve_protocol',
'resolveurl', 'rootdir', 'rpath', 'rulescheckdone', 'set_directory_url',
'set_url_content_info', 'starturl', 'status', 'trymultipart', 'typ', 'url',
'urlflag', 'validfilename', 'violates_rules', 'violatesrules',
'wrapper_resolveurl']
Original comment by remarkability@gmail.com
on 7 Oct 2008 at 8:48
I added ...
    def get_original_url_directory(self):
        return self.get_url_directory()
to urlparser.py assuming that this was a minor oversight (and it seems to have fixed
that particular glitch), but I'm still getting SGML errors... at this point in
crawler.py
    except (SGMLParseError, IOError), e:
        error('SGML parse error:',str(e))
        error('Error in parsing web-page %s' % self.url)
... and somehow the current url gets re-tried or not flushed or something...
Original comment by remarkability@gmail.com
on 7 Oct 2008 at 9:08
I would imagine that the SGML parser failing to parse a page will be fairly common...
and in many ways, if it fails, I don't want anything to do with it. Adding a ...
    return
...under ...
    error('Error in parsing web-page %s' % self.url)
...seems to stop a major loop from happening. I wonder what nasty
side-effects I'm introducing though... seems a bit of a hacky fix.
Original comment by remarkability@gmail.com
on 7 Oct 2008 at 9:20
...this didn't take long...
[10:34:55] Failed to download URL http://www.feedblitz.com/f/f.fbz?AddNewUserDirect
[10:34:55] Not Found => http://www.blogger.com/redirect/next_blog.pyra
[10:34:55] Failed to download URL http://www.blogger.com/redirect/next_blog.pyra
[10:35:05] Ending Project mnogo ...
Exception received=> 'NoneType' object has no attribute 'status'
Printing error traceback for debugging...
File "/Users/everythingability/harvestman-crawler-read-only/HarvestMan/harvestman/apps/harvestman.py", line 478, in run_projects
File "/Users/everythingability/harvestman-crawler-read-only/HarvestMan/harvestman/apps/harvestman.py", line 518, in run_project
File "/Users/everythingability/harvestman-crawler-read-only/HarvestMan/harvestman/apps/harvestman.py", line 145, in finish_project
File "/Users/everythingability/harvestman-crawler-read-only/HarvestMan/harvestman/lib/datamgr.py", line 462, in post_download_setup
HarvestMan session finished.
[Errno 24] Too many open files: '.bidx26453136'
[Errno 24] Too many open files: '.bidx26454480'
DONE!
..note to self, don't hack code you don't understand :-)
Original comment by remarkability@gmail.com
on 7 Oct 2008 at 10:27
Accepting for this week-end's marathon hack on harvestman.
Original comment by abpil...@gmail.com
on 9 Oct 2008 at 8:08
I checked this issue by directly crawling http://newsbiscuit.com and I don't see any
loop. The parsing loop works like this.
1. The default parser is the Python sgmllib based parser. The data is first parsed
with this.
2. If a parse error is found (that is where the "Error in parsing..." message comes
from), then we try parsing once with the sgmlop based parser.
3. If this parser works, fine. If an error is still produced, we break out of
the loop.
So the while loop in parsing can never produce an infinite (perma) loop.
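The retry logic in steps 1 to 3 can be sketched as below. This is an illustration only, not HarvestMan's actual code: parse_with_fallback, ParseError and the two stand-in parser functions are hypothetical names, with strict_parser playing the role of the sgmllib based parser and lenient_parser that of the sgmlop based one.

```python
class ParseError(Exception):
    """Stand-in for SGMLParseError (hypothetical)."""

def parse_with_fallback(data, parsers):
    """Try each parser once, in order; return (result, attempts).

    The loop runs at most len(parsers) times, so it cannot spin forever.
    Returns (None, attempts) if every parser fails.
    """
    attempts = 0
    for parser in parsers:
        attempts += 1
        try:
            return parser(data), attempts
        except ParseError as e:
            print("parse error on attempt %d: %s" % (attempts, e))
    return None, attempts

def strict_parser(data):
    # Rejects a malformed declaration such as '<!::debug::'.
    if "<!::" in data:
        raise ParseError("unexpected ':' char in declaration")
    return data.split()

def lenient_parser(data):
    # Tolerates malformed declarations by skipping them.
    return [tok for tok in data.split() if not tok.startswith("<!::")]

result, attempts = parse_with_fallback(
    "<!::debug:: a b", [strict_parser, lenient_parser])
print(result, attempts)
```

With two parsers in the list, a page gets at most two attempts: an error is logged for the first failure, and the second parser either succeeds or the page is given up on.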
I tried with your example (after a few fixes in the code) and it worked out all right.
In fact, in this case the Python parser is failing and the sgmlop parser is working:
you get logs for the failure but no logs telling you that the second attempt by the
sgmlop parser has gone through. If parsing had not worked, you would not be able to
parse the first page and the crawl would not proceed anyway. I see it was able to
crawl a lot of pages before I manually killed it.
You also have a mistake in your code. cfg.datamode should be 1 for simulation (1 is
for in-memory, the default is temp files, i.e. 0), since we don't save any files,
including temp files, for simulation. So if the crawl has to proceed beyond the first
page for a simulated crawl, the datamode should be 1, i.e. CONNECTOR_DATA_MODE_INMEM.
See ext/simulator.py.
Closing this bug as fixed.
Original comment by abpil...@gmail.com
on 11 Oct 2008 at 9:47
btw, I had got this error earlier.
[03:10:29] Starting project mnogo ...
[03:10:29] Writing Project Files...
[03:10:29] Starting download of url http://newsbiscuit.com ...
[03:10:30] Downloading file for url http://newsbiscuit.com/
[03:10:43] SGML parse error: unexpected ':' char in declaration
[03:10:43] Error in parsing web-page http://newsbiscuit.com/
TITLE: NewsBiscuit
URL: http://newsbiscuit.com/
LAST MODIFIED:
KEYWORDS: newsbiscuit
DESCRIPTION: global name 'descape' is not defined
...
I see "descape" in the code, but where is this defined?
I removed this piece to test the code.
Here is the full log of my run with example.py, before I killed the crawler
explicitly.
--------------------------------------------------------------------------
anand@anand-laptop:~/projects/harvestman/HarvestMan-trunk$ python example.py http://newsbiscuit.com
usage: python example.py http://newsbiscuit.com
['example.py', 'http://newsbiscuit.com']
running http://newsbiscuit.com
[]
[03:13:02] *** Log Started ***
[03:13:02] Starting project mnogo ...
[03:13:02] Writing Project Files...
[03:13:02] Starting download of url http://newsbiscuit.com ...
[03:13:02] Downloading file for url http://newsbiscuit.com/
[03:13:12] SGML parse error: unexpected ':' char in declaration
[03:13:12] Error in parsing web-page http://newsbiscuit.com/
TITLE: NewsBiscuit
URL: http://newsbiscuit.com/
LAST MODIFIED:
KEYWORDS: newsbiscuit
DESCRIPTION:
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 57
[03:13:12] Fetching links for url http://newsbiscuit.com/
[03:13:12] Not Found => http://newsbiscuit.com/robots.txt
[03:13:13] Downloading file for url http://newsbiscuit.com/rss
[03:13:13] Downloading file for url http://newsbiscuit.com/article/feature-people-pre-judge-me-because-i-look-like-hitler-218
[03:13:13] Downloading file for url http://newsbiscuit.com/category/arts-and-entertainment
[03:13:13] Not Found => http://fusion.google.com/robots.txt
[03:13:14] Downloading file for url http://newsbiscuit.com/category/health
[03:13:15] SGML parse error: expected name token at '<!::debug:: archive '
[03:13:15] Error in parsing web-page http://newsbiscuit.com/article/feature-people-pre-judge-me-because-i-look-like-hitler-218
TITLE: NewsBiscuit: ‘People pre-judge me because I look like Hitler’
URL: http://newsbiscuit.com/article/feature-people-pre-judge-me-because-i-look-like-hitler-218
LAST MODIFIED:
KEYWORDS: newsbiscuit
DESCRIPTION:
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 63
TITLE: NewsBiscuit: Arts/Entertainment
URL: http://newsbiscuit.com/category/arts-and-entertainment
LAST MODIFIED:
KEYWORDS: newsbiscuit
DESCRIPTION:
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 60
[03:13:15] Fetching links for url http://newsbiscuit.com/article/feature-people-pre-judge-me-because-i-look-like-hitler-218
[03:13:16] SGML parse error: expected name token at '<!::debug:: category'
[03:13:16] Error in parsing web-page http://newsbiscuit.com/category/health
TITLE: NewsBiscuit: Health
URL: http://newsbiscuit.com/category/health
LAST MODIFIED:
KEYWORDS: newsbiscuit
DESCRIPTION:
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 60
[03:13:16] Downloading file for url http://newsbiscuit.com/category/features
[03:13:16] Downloading file for url http://newsbiscuit.com/about/you-write-the-news
[03:13:16] Fetching links for url http://newsbiscuit.com/category/health
[03:13:17] Downloading file for url http://newsbiscuit.com/category/sport
[03:13:17] Fetching links for url http://newsbiscuit.com/category/arts-and-entertainment
TITLE: NewsBiscuit: Features
URL: http://newsbiscuit.com/category/features
LAST MODIFIED:
KEYWORDS: newsbiscuit
DESCRIPTION:
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 60
[03:13:18] Not Found => http://clkuk.tradedoubler.com/robots.txt
TITLE: NewsBiscuit: You write the news...
URL: http://newsbiscuit.com/about/you-write-the-news
LAST MODIFIED:
KEYWORDS: newsbiscuit
DESCRIPTION:
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 50
[03:13:18] Downloading file for url http://newsbiscuit.com/category/education
[03:13:18] Fetching links for url http://newsbiscuit.com/about/you-write-the-news
TITLE: NewsBiscuit: Sport
URL: http://newsbiscuit.com/category/sport
LAST MODIFIED:
KEYWORDS: newsbiscuit
DESCRIPTION:
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 60
[03:13:19] Downloading file for url http://newsbiscuit.com/category/world-news
[03:13:19] Fetching links for url http://newsbiscuit.com/category/sport
[03:13:19] Downloading file for url http://newsbiscuit.com/category/politics
[03:13:20] Fetching links for url http://newsbiscuit.com/category/features
TITLE: NewsBiscuit: Education
URL: http://newsbiscuit.com/category/education
LAST MODIFIED:
KEYWORDS: newsbiscuit
DESCRIPTION:
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 60
[03:13:20] Fetching links for url http://newsbiscuit.com/category/education
[03:13:20] Downloading file for url http://newsbiscuit.com/about/advertise-on-newsbiscuit
TITLE: NewsBiscuit: World News
URL: http://newsbiscuit.com/category/world-news
LAST MODIFIED:
KEYWORDS: newsbiscuit
DESCRIPTION:
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 60
[03:13:20] Fetching links for url http://newsbiscuit.com/category/world-news
TITLE: NewsBiscuit: Politics
URL: http://newsbiscuit.com/category/politics
LAST MODIFIED:
KEYWORDS: newsbiscuit
DESCRIPTION:
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 60
[03:13:21] Downloading file for url http://newsbiscuit.com/about/login
[03:13:21] Downloading file for url http://newsbiscuit.com/about/faq
[03:13:21] Fetching links for url http://newsbiscuit.com/category/politics
TITLE: NewsBiscuit: Advertise on NewsBiscuit
URL: http://newsbiscuit.com/about/advertise-on-newsbiscuit
LAST MODIFIED:
KEYWORDS: newsbiscuit
DESCRIPTION:
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 49
[03:13:22] Downloading file for url http://astore.amazon.co.uk/newsbiscuit-21
[03:13:22] Fetching links for url http://newsbiscuit.com/about/advertise-on-newsbiscuit
TITLE: NewsBiscuit: Frequently Asked Questions
URL: http://newsbiscuit.com/about/faq
LAST MODIFIED:
KEYWORDS: newsbiscuit
DESCRIPTION:
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 66
[03:13:23] Fetching links for url http://newsbiscuit.com/about/faq
TITLE: NewsBiscuit Store - shopping here helps pay for NewsBiscuit - Home Page
URL: http://astore.amazon.co.uk/newsbiscuit-21
LAST MODIFIED:
keywords error list index out of range
KEYWORDS:
DESCRIPTION:
CONTENT_TYPE: text/html; charset=UTF-8
NUM OF LINKS: 20
[03:13:23] Downloading file for url http://newsbiscuit.com/category/celebrity
[03:13:23] Downloading file for url http://newsbiscuit.com/article/pools-panel-declare-war-on-terror-away-win-381
[03:13:24] Not Found => http://www.newsbiscuit.com/robots.txt
TITLE: NewsBiscuit: Login
URL: http://newsbiscuit.com/about/login
LAST MODIFIED:
KEYWORDS: newsbiscuit
DESCRIPTION:
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 49
[03:13:24] Downloading file for url http://newsbiscuit.com/article/poland-overwhelmed-by-influx-of-british-investment-bankers-379
TITLE: NewsBiscuit: Celebrity
URL: http://newsbiscuit.com/category/celebrity
LAST MODIFIED:
KEYWORDS: newsbiscuit
DESCRIPTION:
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 60
[03:13:25] Forbidden => http://en.wikipedia.org/robots.txt
[03:13:26] Downloading file for url http://newsbiscuit.com/article/global-financial-meltdown-averted-as-drunk-points-out-its-all-just-made-up-numbers-innit
TITLE: NewsBiscuit: Pools panel declare War on Terror ‘Away Win’
URL: http://newsbiscuit.com/article/pools-panel-declare-war-on-terror-away-win-381
LAST MODIFIED:
KEYWORDS: newsbiscuit
DESCRIPTION:
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 67
TITLE: NewsBiscuit: Poland overwhelmed by influx of British investment bankers
URL: http://newsbiscuit.com/article/poland-overwhelmed-by-influx-of-british-investment-bankers-379
LAST MODIFIED:
KEYWORDS: newsbiscuit
DESCRIPTION:
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 66
[03:13:26] Downloading file for url http://newsbiscuit.com/article/sat-navs-starting-to-chat-make-racist-comments-378
[03:13:26] Downloading file for url http://newsbiscuit.com/article/government-steps-in-to-avoid-bankruptcy-in-family-game-of-monopoly-383
TITLE: NewsBiscuit: Global financial meltdown averted as drunk points out ‘It's all just made up numbers innit?’
URL: http://newsbiscuit.com/article/global-financial-meltdown-averted-as-drunk-points-out-its-all-just-made-up-numbers-innit
LAST MODIFIED:
KEYWORDS: newsbiscuit
DESCRIPTION:
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 67
[03:13:29] Downloading file for url http://newsbiscuit.com/about/terms-and-conditions
TITLE: NewsBiscuit: Government steps in to avoid bankruptcy in family game of Monopoly
URL: http://newsbiscuit.com/article/government-steps-in-to-avoid-bankruptcy-in-family-game-of-monopoly-383
LAST MODIFIED:
KEYWORDS: newsbiscuit
DESCRIPTION:
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 67
[03:13:29] Downloading file for url http://newsbiscuit.com/rss/
[03:13:30] Downloading file for url http://www.del.co.uk/
TITLE: NewsBiscuit: Sat Navs starting to chat, make racist comments
URL: http://newsbiscuit.com/article/sat-navs-starting-to-chat-make-racist-comments-378
LAST MODIFIED:
KEYWORDS: newsbiscuit
DESCRIPTION:
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 66
TITLE: NewsBiscuit: Terms and Conditions
URL: http://newsbiscuit.com/about/terms-and-conditions
LAST MODIFIED:
KEYWORDS: newsbiscuit
DESCRIPTION:
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 49
[03:13:32] Downloading file for url http://newsbiscuit.com/category/science-and-technology
[03:13:32] Downloading file for url http://newsbiscuit.com/category/uk-news
TITLE: NewsBiscuit: Science
URL: http://newsbiscuit.com/category/science-and-technology
LAST MODIFIED:
KEYWORDS: newsbiscuit
DESCRIPTION:
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 60
[03:13:34] Downloading file for url http://www.humorfeed.com/
[03:13:34] Fetching links for url http://newsbiscuit.com/article/pools-panel-declare-war-on-terror-away-win-381
TITLE: Deluxe Corporation
URL: http://del.co.uk/welcome-to-deluxe-corporation.html
LAST MODIFIED: Thu, 03 Jul 2008 15:26:13 GMT
KEYWORDS: deluxe corporation
DESCRIPTION: Deluxe Corporation - Digital Creativity. Ultimate design and rock-solid reliability in TV, DVD, video, streaming services, second life and multimedia.
CONTENT_TYPE: text/html; charset=UTF-8
NUM OF LINKS: 14
[03:13:34] Downloading file for url http://newsbiscuit.com/about/about-newsbiscuit
TITLE: NewsBiscuit: UK News
URL: http://newsbiscuit.com/category/uk-news
LAST MODIFIED:
KEYWORDS: newsbiscuit
DESCRIPTION:
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 60
[03:13:36] Downloading file for url http://newsbiscuit.com/category/business
TITLE: NewsBiscuit: About NewsBiscuit
URL: http://newsbiscuit.com/about/about-newsbiscuit
LAST MODIFIED:
KEYWORDS: newsbiscuit
DESCRIPTION:
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 53
[03:13:36] Fetching links for url http://astore.amazon.co.uk/newsbiscuit-21
[03:13:36] Fetching links for url http://newsbiscuit.com/about/about-newsbiscuit
[03:13:37] Downloading file for url http://newsbiscuit.com/category/environment
TITLE: NewsBiscuit: Business
URL: http://newsbiscuit.com/category/business
LAST MODIFIED:
KEYWORDS: newsbiscuit
DESCRIPTION:
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 60
[03:13:38] Ending Project mnogo ...
TITLE: Humorfeed - Your Satire News Source
URL: http://www.humorfeed.com/
LAST MODIFIED:
keywords error list index out of range
KEYWORDS:
DESCRIPTION:
CONTENT_TYPE: text/html
NUM OF LINKS: 56
TITLE: NewsBiscuit: Environment
URL: http://newsbiscuit.com/category/environment
LAST MODIFIED:
KEYWORDS: newsbiscuit
DESCRIPTION:
CONTENT_TYPE: text/html; charset=utf-8
NUM OF LINKS: 60
[03:13:41]
[03:13:41]
[03:13:41] HarvestMan crawl simulation of mnogo completed in 38.92 seconds.
[03:13:41] 266 links scanned in 12 servers .
[03:13:41] No file written.
[03:13:41] 603581 bytes received at the rate of 15.14 KB/sec .
[03:13:41] *** Log Completed ***
HarvestMan session finished.
DONE!
--------------------------------------------------------------------------
Original comment by abpil...@gmail.com
on 11 Oct 2008 at 9:49
OK. I added a log message for when a page is re-parsed successfully a second time using
the sgmlop parser. To see it you need to run the logger at a verbosity level of
EXTRAINFO.
    cfg.add(url, 'mnogo', '/tmp', verbosity="extrainfo")
Now you see:
[03:35:54] Starting download of url http://newsbiscuit.com ...
[03:35:54] Downloading file for url http://newsbiscuit.com/
[03:35:57] Html filter prevents download of url => http://newsbiscuit.com/
[03:35:57] Parsing web page http://newsbiscuit.com/
Parse count => 1
[03:35:57] SGML parse error: unexpected ':' char in declaration
[03:35:57] Error in parsing web-page http://newsbiscuit.com/
Parse count => 2
[03:35:57] Parsed web page successfully in second attempt http://newsbiscuit.com/
...
So, I guess that completes the "fix" and makes it clear what is happening :)
Original comment by abpil...@gmail.com
on 11 Oct 2008 at 10:09
Original issue reported on code.google.com by
remarkability@gmail.com
on 6 Oct 2008 at 9:26