hsci-r / finnish-media-scrapers

Scrapers for extracting articles from major Finnish journalistic media outlets
MIT License

Issue with `randrange` in delay configuration #21

Closed anttiope closed 1 month ago

anttiope commented 1 month ago


Description: The README states the intention to be a good netizen by defaulting to a one-second delay between each web request to media websites to avoid undue load on their servers. This delay is configurable using command line parameters.

By default, the delay is 1.0 seconds (a float). However, `random.randrange` accepts only integer arguments, so passing a float raises an error (a `TypeError` on Python 3.12 and later). This occurs, for instance, in `query_yle.py` on line 59: `sleep(random.randrange(args.delay*2))`.

The code can be made to run by changing the call to `sleep(random.randrange(int(args.delay*2)))`. With the default of 1.0, however, `randrange(2)` returns either 0 or 1, so the script sleeps for 1 second only 50% of the time and not at all otherwise. The average delay is then 0.5 seconds rather than the configured 1.0.
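To illustrate the behavior of the `int()` workaround, here is a minimal sketch that samples the same `randrange` expression many times instead of sleeping. The variable names (`delay`, `samples`, `counts`, `mean_sleep`) and the fixed seed are assumptions for the illustration and are not taken from `query_yle.py`.

```python
import random
from collections import Counter

delay = 1.0  # default delay in seconds, as described in this issue

# int(delay * 2) == 2, so randrange(2) can only return 0 or 1.
rng = random.Random(0)  # seeded only to make the sketch repeatable
samples = [rng.randrange(int(delay * 2)) for _ in range(10_000)]

counts = Counter(samples)
mean_sleep = sum(samples) / len(samples)

print(counts)      # only the keys 0 and 1 appear
print(mean_sleep)  # close to 0.5, not the configured 1.0
```

Roughly half of the draws are 0, which means about half of the requests would be sent with no delay at all.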

Steps to Reproduce:

  1. Set the delay to 1.0 seconds.
  2. Run the script query_yle.py.
  3. Observe the error with randrange.

Expected Behavior: The script should introduce a delay without errors.

Actual Behavior: The script throws an error due to randrange requiring integer arguments.

Proposed Solution: If randomness is desired, something like `sleep(random.uniform(0, args.delay*2))` could be used. With the default delay of 1.0, this gives a floating-point delay between 0.0 and 2.0 seconds.
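A minimal sketch of the proposed approach, assuming a hypothetical helper named `jittered_sleep_seconds` (it does not exist in the repository). It only draws the delay values and does not call `sleep`, so the range and the average can be checked quickly.

```python
import random


def jittered_sleep_seconds(delay: float, rng: random.Random) -> float:
    """Return a float drawn uniformly between 0 and 2 * delay (hypothetical helper)."""
    return rng.uniform(0, delay * 2)


rng = random.Random(0)  # seeded only to make the sketch repeatable
draws = [jittered_sleep_seconds(1.0, rng) for _ in range(10_000)]
mean_draw = sum(draws) / len(draws)

# All draws fall between 0.0 and 2.0, and the average is close to 1.0.
print(min(draws), max(draws), mean_draw)

# In a scraper loop, the drawn value would be passed to sleep, for example:
# sleep(jittered_sleep_seconds(args.delay, random.Random()))
```

Because the draw is uniform over a range centered on the configured delay, the average sleep matches the configured value while individual delays still vary. Float delays such as 0.5 or 1.5 work without any integer conversion.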

Affected Files:

Environment:

jiemakel commented 1 month ago

The random sleep was probably to lessen the bot-likeness of the crawler, with the intent that over time, the mean sleep would end up being 1 second (or whatever is specified in the arguments).

Not sure the anti-bot-likeness is even needed, but anyway it is clear that at least some modification of the code is warranted. Pull requests welcome.

jiemakel commented 1 month ago

Thanks for the fix!