spritt82 / harvestman-crawler

Automatically exported from code.google.com/p/harvestman-crawler

Memory consumption optimization #13

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Start crawling a large website.
2. Analyze memory consumption with a tool like "top" and watch it grow
continuously and linearly.
3. Observe how the crawler hangs after some point (probably due to ...)

What is the expected output? What do you see instead?
Memory consumption increases linearly over time, causing the crawler to hang
after a while even though the system has plenty of memory available (1 GB+).
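The growth described above can be confirmed outside of "top" as well. A minimal sketch (Linux-only, reading /proc, matching the Ubuntu setup reported in this issue; the function names are my own) that samples a process's resident set size to spot monotonic growth:

```python
import os
import time

def rss_kb(pid):
    """Read the resident set size (in kB) of a process from /proc (Linux)."""
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0

def watch(pid, interval=60, samples=5):
    """Sample RSS every `interval` seconds; steadily increasing
    readings indicate the linear growth described in this report."""
    readings = []
    for _ in range(samples):
        readings.append(rss_kb(pid))
        print("RSS: %d kB" % readings[-1])
        time.sleep(interval)
    return readings
```

Plotting or diffing successive readings makes a leak easier to see than eyeballing "top".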

What version of the product are you using? On what operating system?
2.0 alpha, Ubuntu 8.04 - x86_64, Python 2.5.2

Original issue reported on code.google.com by andrei.p...@gmail.com on 16 Jul 2008 at 6:11

GoogleCodeExporter commented 9 years ago
Please see my comment on your report in issue #4. Can you verify this still
persists after the fix for #6?

Original comment by abpil...@gmail.com on 17 Jul 2008 at 5:16

GoogleCodeExporter commented 9 years ago
The crawler still hangs:

1. ps aux:
8696 46.0 31.1 898852 631948 pts/1   Sl+  21:46  13:51 python
/usr/lib/python2.5/site-packages/harvestman/apps/harvestman.py -C 
config-sample.xml

at about 30% of 2GB memory, after 30 minutes.

2. version number:
svn up:
At revision 79.

3. xml file:
xml config file contained <connections type="flush" />

4. number of tests: 2

5. ~ time from start to hanging: 30 minutes

Original comment by andrei.p...@gmail.com on 17 Jul 2008 at 7:19

GoogleCodeExporter commented 9 years ago
Saw your comment on #6. Thanks for reporting this!

Please update the bug with the config.xml you used for this crawl. I will test
it out and work out a fix.

Please attach the file to the bug, do not copy/paste it.

Thanks!

Original comment by abpil...@gmail.com on 17 Jul 2008 at 7:42

GoogleCodeExporter commented 9 years ago
Just to point out, this does not look like a memory issue but a logic flaw in
the state monitor. The state monitor is responsible for deciding when to exit
the program. It is not flawless: it has no bail-out logic for the case where
it finds the crawler cannot finish the crawl. This needs to be fixed.
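The missing bail-out logic could look roughly like this. This is a hypothetical sketch, not HarvestMan's actual StateMonitor API: the class and method names are my own, and "progress" is simplified to a count of completed URLs.

```python
import time

class StateMonitor(object):
    """Sketch of a state monitor with bail-out logic: if no crawl
    progress is observed within `stall_timeout` seconds, it decides
    the crawl cannot finish and signals that the program should exit."""

    def __init__(self, stall_timeout=300):
        self.stall_timeout = stall_timeout
        self.last_progress = time.time()
        self.urls_done = 0

    def record_progress(self, urls_done):
        # Called by the crawler whenever another URL completes.
        if urls_done > self.urls_done:
            self.urls_done = urls_done
            self.last_progress = time.time()

    def should_bail_out(self):
        # True when the crawler has made no progress for too long,
        # e.g. because it is thrashing or stuck on a full queue.
        return (time.time() - self.last_progress) > self.stall_timeout
```

The main loop would check should_bail_out() periodically and shut down cleanly instead of hanging forever.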

Original comment by abpil...@gmail.com on 17 Jul 2008 at 7:56

GoogleCodeExporter commented 9 years ago
I would split the problem in two:

1. Memory consumption increases linearly on large websites (with the patched
version as well).
2. The program does not exit when it runs out of memory; it hangs.

Both of them seem important to me.

I tested again, this time after a system reboot. Now the available memory was
larger, so the crawler ate about 65% of it and hung after ~1 hour. Please check
the attached configuration file.
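For the second half of the problem (hanging instead of exiting), one stopgap is to cap the process's address space so a runaway allocation raises MemoryError rather than grinding the machine down. A minimal sketch using the standard resource module (Unix-only; the function name is my own):

```python
import resource

def cap_memory(max_bytes):
    """Lower the soft address-space limit so that allocating beyond
    `max_bytes` raises MemoryError, letting the crawler fail fast and
    exit instead of hanging once memory is exhausted."""
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, hard))
```

The crawler could catch MemoryError at the top level, log the state, and shut down cleanly.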

Original comment by andrei.p...@gmail.com on 17 Jul 2008 at 9:27


GoogleCodeExporter commented 9 years ago
Thanks Andrei. I did not get an update from Google for this comment, I wonder
why...

Original comment by abpil...@gmail.com on 18 Jul 2008 at 7:29

GoogleCodeExporter commented 9 years ago

Original comment by abpil...@gmail.com on 18 Jul 2008 at 7:30

GoogleCodeExporter commented 9 years ago
I found a way to limit the effects of the memory consumption by restricting
the crawl time to 30 minutes (the XML config file has such an option).

If anyone is having the same problem, this could be useful.
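The time-limit workaround amounts to checking a deadline inside the crawl loop. A self-contained sketch (the `fetch` callable and function signature are hypothetical, not HarvestMan's API):

```python
import time

def crawl(start_urls, fetch, max_seconds=1800):
    """Breadth-first crawl that stops once `max_seconds` (default 30 min)
    have elapsed, so memory growth stays bounded even if the site is
    effectively unbounded. `fetch(url)` returns (content, new_links)."""
    deadline = time.time() + max_seconds
    queue = list(start_urls)
    done = []
    while queue and time.time() < deadline:
        url = queue.pop(0)
        content, links = fetch(url)
        done.append(url)
        queue.extend(links)
    return done
```

This bounds the damage but does not fix the leak itself: memory still grows linearly until the deadline hits.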

Original comment by andrei.p...@gmail.com on 24 Jul 2008 at 4:26

GoogleCodeExporter commented 9 years ago
I think a good way to solve this is an architectural change, i.e. splitting the
crawler into two processes. A client/server design over the current single
process is the first step. This is addressed in (updated) issue #18...
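The two-process split could be sketched with the standard multiprocessing module. This is an illustration of the client/server idea only, not the design in issue #18; the fetcher here is a stub that echoes instead of doing network I/O:

```python
import multiprocessing as mp

def fetcher(task_q, result_q):
    """Server process: fetches URLs handed to it over a queue, so its
    buffers and caches live (and die) outside the crawler process."""
    while True:
        url = task_q.get()
        if url is None:          # sentinel: shut down
            break
        result_q.put((url, "<html>stub for %s</html>" % url))

def crawl(urls):
    """Client process: queues work, collects results, stops the server.
    Restarting the server periodically would reclaim any leaked memory."""
    task_q, result_q = mp.Queue(), mp.Queue()
    server = mp.Process(target=fetcher, args=(task_q, result_q))
    server.start()
    for u in urls:
        task_q.put(u)
    pages = [result_q.get() for _ in urls]
    task_q.put(None)
    server.join()
    return pages
```

A practical benefit of the split: even if the fetcher leaks, the client can kill and respawn it without losing crawl state.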

Original comment by abpil...@gmail.com on 6 Oct 2008 at 11:23

GoogleCodeExporter commented 9 years ago
I did a test with a crawler made by me (multi-threaded etc) and did not 
encounter the
same memory consumption problem. I think it might be an architectural problem. 
But if
the implementation is to be changed radically, we should skip this bug 
(although is
quite important in my opinion).

Original comment by andrei.p...@gmail.com on 13 Oct 2008 at 8:00

GoogleCodeExporter commented 9 years ago
I checked in a fix for the bst module the other day. A lot of the leak comes
from the use of dictcache in this module to store/load URL objects to/from
disk using cPickle. I replaced this with bsddb B-tree records, so memory usage
should improve a bit. Still, this is a work in progress...
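The idea behind the fix is to keep URL records on disk rather than in an in-memory dict. Since bsddb has been removed from modern Python, here is a rough stand-in using shelve (pickled values over a dbm file) to illustrate the disk-backed approach; the class is my own sketch, not HarvestMan's dictcache:

```python
import shelve

class DiskURLStore(object):
    """Disk-backed URL store: records are pickled into a dbm file
    instead of held in a Python dict, so the crawler's resident
    memory stays roughly flat as the URL frontier grows."""

    def __init__(self, path):
        self.db = shelve.open(path)

    def put(self, url, record):
        self.db[url] = record      # written to disk, not kept in RAM

    def get(self, url):
        return self.db[url]       # unpickled from disk on demand

    def close(self):
        self.db.close()
```

The trade-off is the classic one: each lookup costs a disk read, but the working set no longer scales with the size of the crawl.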

Original comment by abpil...@gmail.com on 12 Jan 2009 at 8:22

GoogleCodeExporter commented 9 years ago
This needs testing. I will be running large crawl tests for a couple of days
and will close the bug if memory is fine at the end of it.

Original comment by abpil...@gmail.com on 11 Feb 2010 at 7:09