pythonhacker / harvestman-crawler

Automatically exported from code.google.com/p/harvestman-crawler

Data flushing for connector file objects #6

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Currently, the connector file object class (HarvestManFileObject) keeps all
downloaded data in memory for HarvestMan.

This makes the program use a lot of memory when downloading several huge
files at once. We can improve this by flushing data to a temporary file for
each connection and finally copying/renaming that file to the final file.

The file object already has logic for flushing data to temp files and for
copying/renaming them, since this was implemented for Hget. We need to
integrate this with the connect(...) function and the save_url function to
implement it for HarvestMan.
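
For illustration, the flush-then-rename approach described above could look roughly like the sketch below. This is not the HarvestManFileObject code; the names FlushingWriter and save_stream are hypothetical, and the sketch only assumes a readable stream with a read(size) method.

```python
import shutil
import tempfile


class FlushingWriter:
    """Hypothetical helper: writes chunks to a temp file, then moves it into place."""

    def __init__(self, final_path, tmp_dir=None):
        self.final_path = final_path
        # delete=False so the temp file survives close() and can be moved.
        self._tmp = tempfile.NamedTemporaryFile(dir=tmp_dir, delete=False)
        self.bytes_written = 0

    def write(self, chunk):
        # Each chunk goes straight to disk instead of accumulating in memory.
        self._tmp.write(chunk)
        self._tmp.flush()
        self.bytes_written += len(chunk)

    def close(self):
        # Finish the temp file, then copy/rename it to the final location.
        self._tmp.close()
        shutil.move(self._tmp.name, self.final_path)


def save_stream(stream, final_path, block_size=8192, tmp_dir=None):
    """Read the stream block by block so only one block is held in memory."""
    writer = FlushingWriter(final_path, tmp_dir)
    while True:
        chunk = stream.read(block_size)
        if not chunk:
            break
        writer.write(chunk)
    writer.close()
    return writer.bytes_written
```

With this shape, memory use per connection is bounded by block_size rather than by the size of the downloaded file.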

Original issue reported on code.google.com by abpil...@gmail.com on 25 Jun 2008 at 12:14

GoogleCodeExporter commented 9 years ago
Fixed it. Connector objects now flush data to temporary files by default
instead of keeping data in memory.

Added a control var for this in <system> named "connections".

By default this is,

<connections type="flush" />

This is equivalent to,

<connections type="0" />

This means data keeps being flushed to temporary files, which reduces the
memory usage of the program. To switch back to keeping data in memory,

<connections type="mem" />

This is the same as,

<connections type="1" />

Since we had an earlier element named "connections", I renamed it to
"maxconnections".
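
As a rough sketch of how the two spellings of each setting could be resolved to one mode, assuming the element sits somewhere under <system> (the function name connection_mode and the constants below are hypothetical, not HarvestMan's actual config parser):

```python
import xml.etree.ElementTree as ET

# Numeric values follow the equivalences given above: flush == 0, mem == 1.
FLUSH, MEM = 0, 1
_ALIASES = {"flush": FLUSH, "0": FLUSH, "mem": MEM, "1": MEM}


def connection_mode(system_xml):
    """Return FLUSH or MEM for a <system> fragment (hypothetical helper)."""
    root = ET.fromstring(system_xml)
    if root.tag == "connections":
        elem = root
    else:
        elem = root.find(".//connections")
    if elem is None:
        # No element present: fall back to the stated default, flushing.
        return FLUSH
    value = elem.get("type", "flush").strip().lower()
    if value not in _ALIASES:
        raise ValueError("unknown connections type: %r" % value)
    return _ALIASES[value]
```

A "maxconnections" element is ignored by this lookup, since only an element named exactly "connections" is matched.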

By default, temporary files are saved in the folder ".tmp" inside the project
folder of the crawl project. For the time being I am not removing this folder
at the end of the crawl (for debugging), but this will be done later.

Original comment by abpil...@gmail.com on 7 Jul 2008 at 8:05

GoogleCodeExporter commented 9 years ago
The crawler still hangs:

1. ps aux:
8696 46.0 31.1 898852 631948 pts/1   Sl+  21:46  13:51 python /usr/lib/python2.5/site-packages/harvestman/apps/harvestman.py -C config-sample.xml

at about 30% of 2GB memory, after 30 minutes.

2. version number:
svn up:
At revision 79.

3. xml file:
xml config file contained <connections type="flush" />

4. number of tests: 2

5. ~ time from start to hanging: 30 minutes

Original comment by andrei.p...@gmail.com on 17 Jul 2008 at 7:19