Please see my comment on your report in issue #4. Can you verify this still persists after the fix for #6?
Original comment by abpil...@gmail.com
on 17 Jul 2008 at 5:16
The crawler still hangs:
1. ps aux:
8696 46.0 31.1 898852 631948 pts/1 Sl+ 21:46 13:51 python /usr/lib/python2.5/site-packages/harvestman/apps/harvestman.py -C config-sample.xml
(about 30% of 2 GB of memory, after 30 minutes)
2. version number: svn up reports "At revision 79."
3. xml config file: contains <connections type="flush" />
4. number of tests: 2
5. approximate time from start to hang: 30 minutes
Original comment by andrei.p...@gmail.com
on 17 Jul 2008 at 7:19
Saw your comment on #6. Thanks for reporting this!
Please update the bug with the config.xml you used for this crawl. I will test it out and work out a fix.
Please attach the file to the bug, do not copy/paste it.
Thanks!
Original comment by abpil...@gmail.com
on 17 Jul 2008 at 7:42
Just to point out, this does not look like an issue with memory, but a logic flaw in the state monitor. The state monitor is responsible for deciding when to exit the program. I think it is not flawless: it has no bail-out logic for the case where it finds that the crawler cannot finish the crawl... this needs to be fixed.
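Roughly, the missing bail-out might look like the sketch below. This is only an illustration of the idea; the class and attribute names are hypothetical, not HarvestMan's actual state-monitor API.

```python
# Illustrative sketch only: names are hypothetical, not HarvestMan's real API.
import time

class StateMonitorSketch(object):
    def __init__(self, max_stall_secs=300, max_mem_fraction=0.8):
        self.max_stall_secs = max_stall_secs      # no progress for this long => bail out
        self.max_mem_fraction = max_mem_fraction  # memory ceiling, as a fraction of total RAM
        self.last_progress = time.time()

    def note_progress(self):
        # Called whenever a URL is successfully fetched/parsed.
        self.last_progress = time.time()

    def should_bail_out(self, mem_fraction_used):
        # Exit instead of hanging if either the crawl has stalled
        # or memory has grown past the configured ceiling.
        stalled = (time.time() - self.last_progress) > self.max_stall_secs
        over_memory = mem_fraction_used > self.max_mem_fraction
        return stalled or over_memory
```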
Original comment by abpil...@gmail.com
on 17 Jul 2008 at 7:56
I would split the problem into two parts:
1. memory consumption increases linearly on large websites (with the patched version as well);
2. the program does not exit when it runs out of memory, and hangs.
Both of them seem important to me.
I tested again, this time after a system reboot. The available memory was larger this time, so the crawler ate about 65% of it and hung after about an hour. Please check the attached configuration file.
Original comment by andrei.p...@gmail.com
on 17 Jul 2008 at 9:27
Thanks, Andrei. I did not get an update from Google for this comment; I wonder why...
Original comment by abpil...@gmail.com
on 18 Jul 2008 at 7:29
I found a way to limit the effects of the memory consumption by restricting the crawl time to 30 minutes (the XML config file has an option for this). If anyone else is having the same problem, this could be useful.
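For reference, the underlying idea is just a deadline check in the crawl loop. A minimal sketch of that idea (not HarvestMan's actual code, and the real option name in the XML config may differ):

```python
# Minimal sketch of a time-limited crawl loop; HarvestMan reads the actual
# limit from its XML config, this only shows the idea.
import time

def crawl(frontier, fetch, max_minutes=30):
    deadline = time.time() + max_minutes * 60
    while frontier:
        if time.time() > deadline:
            print("Time limit reached, stopping crawl")
            break
        fetch(frontier.pop())
```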
Original comment by andrei.p...@gmail.com
on 24 Jul 2008 at 4:26
I think a good way to solve this is an architectural change, i.e. splitting the crawler into two processes. A client/server paradigm replacing the current single process is the first step. This is addressed in (updated) issue #18...
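A minimal sketch of that split, using two processes and a shared queue (an illustration of the paradigm only, not the actual design proposed in issue #18):

```python
# Two-process split: a "server" owning the URL queue and crawl state,
# and a "client" doing the fetching. Illustration only.
from multiprocessing import Process, Queue

def server(todo, done, seeds):
    for url in seeds:
        todo.put(url)
    for _ in seeds:
        print("crawled: %s" % done.get())
    todo.put(None)            # tell the client to stop

def client(todo, done):
    while True:
        url = todo.get()
        if url is None:
            break
        # fetching/parsing would happen here, in its own process
        done.put(url)

if __name__ == "__main__":
    todo, done = Queue(), Queue()
    p = Process(target=client, args=(todo, done))
    p.start()
    server(todo, done, ["http://www.example.com/"])
    p.join()
```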
Original comment by abpil...@gmail.com
on 6 Oct 2008 at 11:23
I did a test with a crawler I wrote myself (multi-threaded, etc.) and did not encounter the same memory consumption problem, so I think it might be an architectural problem. But if the implementation is to be changed radically, we should skip this bug (although it is quite important, in my opinion).
Original comment by andrei.p...@gmail.com
on 13 Oct 2008 at 8:00
I checked in a fix for the bst module the other day. A lot of the leak comes from this module's use of dictcache to store/load URL objects to/from disk using cPickle. I replaced this with bsddb btree records, so memory usage should improve a bit. This is still a work in progress...
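For reference, the general shape of such a disk-backed store, using a bsddb btree plus cPickle, looks roughly like this (a sketch of the idea, not the actual bst module code):

```python
# Sketch of a disk-backed URL-object store: bsddb btree records on disk
# instead of an in-memory dict cache. Not the actual bst module code.
import bsddb      # Python 2.x standard library
import cPickle

class UrlStore(object):
    def __init__(self, filename='urls.db'):
        # 'c' creates the btree database file if it does not exist
        self.db = bsddb.btopen(filename, 'c')

    def put(self, key, url_obj):
        # Pickle the URL object and write it to disk rather than
        # holding it in memory.
        self.db[key] = cPickle.dumps(url_obj, cPickle.HIGHEST_PROTOCOL)
        self.db.sync()

    def get(self, key):
        return cPickle.loads(self.db[key])

    def close(self):
        self.db.close()
```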
Original comment by abpil...@gmail.com
on 12 Jan 2009 at 8:22
This needs testing. I will be running large crawl tests for a couple of days and will close the bug if memory usage is fine at the end of it.
Original comment by abpil...@gmail.com
on 11 Feb 2010 at 7:09
Original issue reported on code.google.com by
andrei.p...@gmail.com
on 16 Jul 2008 at 6:11