pythonhacker / harvestman-crawler

Automatically exported from code.google.com/p/harvestman-crawler

Implement download throttling to constrain download speeds #1

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
HarvestMan currently does not provide a way to limit download speeds by
"throttling". The only option to control download speed is to limit the number
of simultaneous open HTTP connections, but this does not give direct
control over the download speed.

This enhancement will implement a download throttling algorithm in the
HarvestMan connector, which will allow specifying an upper limit on the
program's download speed in kb/sec.

For example,

<throttle value="10" />

would mean that the net download speed across all download threads
should not exceed 10 kb/sec at any given moment.
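
As a rough illustration of how such an element could be turned into a numeric limit, here is a minimal sketch. `parse_limit` is a hypothetical helper, not HarvestMan's actual configuration parser, and it assumes that 1 kb means 1024 bytes, which this issue does not state.

```python
import xml.etree.ElementTree as ET


def parse_limit(fragment, tag="throttle"):
    """Hypothetical helper: return the limit in bytes/sec from an XML fragment."""
    elem = ET.fromstring(fragment)
    if elem.tag != tag:
        raise ValueError("expected <%s>, got <%s>" % (tag, elem.tag))
    value = elem.get("value")
    if value is None:
        raise ValueError("missing 'value' attribute")
    kb_per_sec = float(value)
    if kb_per_sec <= 0:
        raise ValueError("limit must be positive")
    # Assumption: 1 kb = 1024 bytes.
    return kb_per_sec * 1024


limit = parse_limit('<throttle value="10" />')
print(limit)
```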

Original issue reported on code.google.com by abpil...@gmail.com on 16 Jun 2008 at 8:41

GoogleCodeExporter commented 9 years ago
Currently designing the algorithm for this.
Should have the initial implementation in 2 days.

Original comment by abpil...@gmail.com on 16 Jun 2008 at 1:27

GoogleCodeExporter commented 9 years ago
Fixed the issue. Changes are in connector.py.

Original comment by abpil...@gmail.com on 25 Jun 2008 at 8:58

GoogleCodeExporter commented 9 years ago
Apparently, this needs rework. Quoting from a mail by Lucas:

> I just tested the speed and for some reason I get:
>
> <maxbandwidth value="5"/>
>
> 4120 links scanned in 1 server .
> [01:12:54] 69 files written.
> [01:12:54] 1450624  bytes received at the rate of 8.12 KB/sec .
> [01:12:54] 9498335  bytes were written to disk.
>
>
> try crawling:
> http://www.automotive.com/used-cars/index.html
>
> to test this.
> Thanks.
> Lucas

Re-opening the bug for a better solution.
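
For scale, the figures in the quoted log can be checked with a little arithmetic. This sketch assumes that 1 KB = 1024 bytes and that the `maxbandwidth` value uses the same KB/sec unit as the log line; neither is confirmed in this thread.

```python
# Figures taken from the quoted log; the unit interpretation is an assumption.
LIMIT_KB_PER_SEC = 5.0        # <maxbandwidth value="5"/>
REPORTED_KB_PER_SEC = 8.12    # "at the rate of 8.12 KB/sec"
BYTES_RECEIVED = 1450624      # "1450624 bytes received"

# How far the reported rate exceeds the configured limit.
overshoot = REPORTED_KB_PER_SEC / LIMIT_KB_PER_SEC

# Shortest time the same transfer could take if the limit were honoured.
min_seconds_at_limit = BYTES_RECEIVED / (LIMIT_KB_PER_SEC * 1024)

print("reported rate is %.2fx the limit" % overshoot)
print("at the limit, the transfer needs at least %.1f s" % min_seconds_at_limit)
```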

Original comment by abpil...@gmail.com on 28 Jun 2008 at 2:24

GoogleCodeExporter commented 9 years ago
Hope this one helps too.
http://www.ibm.com/developerworks/aix/library/au-threadingpython/
or
http://www.velocityreviews.com/forums/t583705-urllib2-rate-limiting.html

Lucas

Original comment by szybal...@gmail.com on 29 Jun 2008 at 4:45

GoogleCodeExporter commented 9 years ago
Fixed this by implementing the logic on the controller thread, based on all
data downloaded so far, instead of on a per-connector basis.
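
A minimal sketch of the general idea, not the actual HarvestMan code: a single shared object (standing in for the controller) counts the bytes reported by every connector and tells the caller how long to wait so that the overall average stays under the limit. The class name `GlobalThrottle`, its methods, and the 1 kb = 1024 bytes conversion are all assumptions made for illustration. Note that this caps the average rate since start, which is weaker than a cap "at any given moment".

```python
import threading
import time


class GlobalThrottle:
    """Hypothetical helper: caps the combined average rate of all threads."""

    def __init__(self, limit_kb_per_sec, clock=time.monotonic, sleep=time.sleep):
        self.limit = limit_kb_per_sec * 1024.0  # assumes 1 kb = 1024 bytes
        self.clock = clock
        self.sleep = sleep
        self.lock = threading.Lock()
        self.start = clock()
        self.total = 0  # bytes reported by all connectors so far

    def required_delay(self, nbytes):
        """Record nbytes and return the seconds to wait to stay under the limit."""
        with self.lock:
            self.total += nbytes
            elapsed = self.clock() - self.start
            # self.total bytes may not have arrived before total / limit seconds.
            return max(0.0, self.total / self.limit - elapsed)

    def consume(self, nbytes):
        """Called by a connector after it reads a chunk of nbytes."""
        delay = self.required_delay(nbytes)
        if delay > 0:
            self.sleep(delay)
        return delay
```

Each connector thread would call `consume(len(chunk))` after every read. Because the counter is shared, adding more threads does not raise the combined rate, whereas a limit enforced separately inside each connector could.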

Lucas, please test this. Thanks!

Original comment by abpil...@gmail.com on 1 Jul 2008 at 9:36