pythonhacker / harvestman-crawler

Automatically exported from code.google.com/p/harvestman-crawler

Add a "maxbyte" param as a control variable #5

GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Currently there is no option to limit a crawl by the cumulative number of
bytes downloaded. The existing file limit only caps the size of a single
URL.

Implement a "maxbytes" param which tracks the total number of bytes
downloaded and ends the crawl when that total reaches the limit.

Specifications
--------------
1. Param part of control:limits section.
2. Should accept plain numbers and also KB, MB and GB. For example:

    <maxbytes value="5000" /> - End crawl at 5000 bytes.
    <maxbytes value="10kb" /> - End crawl at 10 KB.
    <maxbytes value="50MB" /> - End crawl at 50 MB.
    <maxbytes value="1GB" /> - End crawl at 1 GB.

The value should accept strings like "5 kb", "5kb", "5.0KB",
"5 KB" etc. In other words, spaces and case are ignored.
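
A minimal sketch of how such a value could be parsed with a regular expression. This is an illustration and not HarvestMan's actual code: the function name parse_maxbytes is hypothetical, and the choice of 1 KB = 1024 bytes is an assumption, since the specification above does not say whether units are decimal or binary.

```python
import re

# Assumption: binary multiples (1 KB = 1024 bytes). The specification does
# not say whether 1000 or 1024 is intended.
_UNITS = {"": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}

# An integer or decimal number, optional spaces, then an optional unit.
_SIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(kb|mb|gb)?\s*$", re.IGNORECASE)


def parse_maxbytes(value):
    """Return the byte count for strings such as '5000', '10kb' or '5.0 KB'.

    Hypothetical helper; raises ValueError for anything it cannot parse.
    """
    match = _SIZE_RE.match(value)
    if match is None:
        raise ValueError("invalid maxbytes value: %r" % (value,))
    number, unit = match.groups()
    # A missing unit means the number is already in bytes.
    return int(float(number) * _UNITS[(unit or "").upper()])
```

Under these assumptions "5 kb", "5kb", "5.0KB" and "5 KB" all parse to the same byte count, and a value with an unknown suffix such as "5 tb" is rejected instead of being silently read as bytes.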

Original issue reported on code.google.com by abpil...@gmail.com on 23 Jun 2008 at 2:37

GoogleCodeExporter commented 9 years ago
Hint: Use a regular expression for parsing this param's value. Let me know if
you need help.

Original comment by abpil...@gmail.com on 23 Jun 2008 at 2:40

GoogleCodeExporter commented 9 years ago
95% done. Still need to implement logic to stop threads from saving data to
disk after the controller kills them.
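
One possible way to approach the remaining piece, sketched under assumptions: worker threads share a counter and must reserve bytes from it before writing, so nothing is saved once the limit has been hit. The names ByteBudget, try_consume and save_if_allowed are hypothetical and do not come from the HarvestMan code base.

```python
import threading


class ByteBudget:
    """Hypothetical cumulative byte counter shared by all worker threads."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.total = 0
        self._lock = threading.Lock()
        # Set once the limit is hit; a controller could wait on this event.
        self.exhausted = threading.Event()

    def try_consume(self, nbytes):
        """Reserve nbytes; return False if the budget cannot cover them."""
        with self._lock:
            if self.exhausted.is_set() or self.total + nbytes > self.max_bytes:
                self.exhausted.set()
                return False
            self.total += nbytes
            return True


def save_if_allowed(budget, data, write):
    """Call write(data) only if the budget still has room for it."""
    if not budget.try_consume(len(data)):
        return False
    write(data)
    return True
```

In this sketch the budget stays closed after the first rejected write, so a thread that is still running after the controller has decided to stop cannot save anything further, even if its payload would have fit.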

Original comment by abpil...@gmail.com on 24 Jun 2008 at 6:55

GoogleCodeExporter commented 9 years ago
Completed, please close the bug.

Original comment by abpil...@gmail.com on 25 Jun 2008 at 12:17

GoogleCodeExporter commented 9 years ago

Original comment by szybal...@gmail.com on 27 Jun 2008 at 4:47

GoogleCodeExporter commented 9 years ago
Verified. 
 <maxbytes value="300kb"/>
 <maxbandwidth value="30"/>
Final
 HarvestMan mirror foo completed in 18.46 seconds.
[00:01:00] 169 links scanned in 1 server .
[00:01:00] 11 files written.
[00:01:00] 409728  bytes received at the rate of 21.67 KB/sec .
[00:01:00] 292887  bytes were written to disk.
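
A quick sanity check of the figures in this log, assuming that "kb" and "KB" both mean 1024 bytes (the thread does not state this). The variable names are illustrative only.

```python
# Figures copied from the log above.
bytes_received = 409728
bytes_written = 292887
elapsed_seconds = 18.46

# Assumption: "kb" and "KB" both mean 1024 bytes.
KB = 1024
limit_bytes = 300 * KB

# Average receive rate over the whole run, in KB per second.
rate_kb_per_sec = bytes_received / elapsed_seconds / KB
# Whether the amount written to disk stayed under the configured limit.
within_limit = bytes_written <= limit_bytes

print("limit: %d bytes" % limit_bytes)
print("written within limit: %s" % within_limit)
print("rate: %.1f KB/sec" % rate_kb_per_sec)
```

Under that assumption the bytes written to disk stay below the 300kb limit, and the computed rate is close to the 21.67 KB/sec reported in the log. The bytes received figure is larger than the limit under either a 1000-byte or a 1024-byte reading.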

Thanks. That worked. It is time to download some websites at 5kb per second for
a few hours. The config settings are working.

Original comment by szybal...@gmail.com on 28 Jun 2008 at 5:03