modernmt / DataCollection

Data collection, alignment and TAUS repository
Apache License 2.0

Add rate-limiting for index server queries to locate_candidates_cc_index_api.py #15

Open achimr opened 7 years ago

achimr commented 7 years ago

locate_candidates_cc_index_api.py doesn't rate-limit its queries to the CommonCrawl index server (http://index.commoncrawl.org). The server is reported to be under heavy load frequently (see https://groups.google.com/forum/#!topic/common-crawl/o_MuZViu0O0). We should be nice and rate-limit our queries.

Workaround: run our own index server (the mailing list thread describes how to set one up).
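
A minimal sketch of what the rate limiting could look like, assuming a simple minimum delay between successive queries is enough. `RateLimiter`, `query_all`, and the `fetch` callable are hypothetical names for illustration and are not taken from locate_candidates_cc_index_api.py; the one-second interval in the usage note is likewise an assumption, not a limit documented by the index server.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between successive calls to wait().

    Hypothetical helper; clock and sleep are injectable so it can be tested
    without real delays.
    """

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between queries
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous query; None before the first

    def wait(self):
        """Block until min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def query_all(keys, fetch, limiter):
    """Call fetch(key) for each key, pausing via limiter before every call.

    fetch stands in for whatever function issues the index server request.
    """
    results = []
    for key in keys:
        limiter.wait()
        results.append(fetch(key))
    return results
```

With something like this in place, the script could create one `RateLimiter(min_interval=1.0)` and call `limiter.wait()` immediately before each request to the index server. A fixed delay is the simplest option; backing off on error responses would be a natural extension if the server signals overload.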