huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

Frequent S3 Slowdown Error #302

theyorubayesian closed this issue 1 week ago

theyorubayesian commented 1 week ago

When processing CommonCrawl, I frequently get SlowDown errors: `{'Error': {'Code': 'SlowDown', 'Message': 'Please reduce your request rate.'}}`. Is this common? Are there any recommended strategies for alleviating this issue?

hynky1999 commented 1 week ago

Unfortunately, the issue here is on CommonCrawl's side. The S3 bucket is a shared resource, so demand on it is high. You can monitor current usage at https://status.commoncrawl.org/.

To mitigate the issue, you can try the following (a sketch combining all three appears after this list):

  1. Reduce the number of concurrent downloads (reduce the number of workers).
  2. Use the random sleep feature so that the workers don't all start fetching at the same time.
  3. Increase the number of retries for AWS fetching (it should be an environment variable).
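For illustration, a minimal sketch of how those three knobs could be combined on a local run. `AWS_MAX_ATTEMPTS` and `AWS_RETRY_MODE` are standard botocore retry settings; whether datatrove's S3 access honors them, the `randomize_start_duration` parameter name, and the pipeline contents are assumptions that may differ between versions:

```python
import os

# Assumption: datatrove's S3 access goes through botocore, which honors these
# standard retry env vars. "adaptive" adds client-side rate limiting when the
# bucket returns throttling errors such as SlowDown.
os.environ["AWS_MAX_ATTEMPTS"] = "10"
os.environ["AWS_RETRY_MODE"] = "adaptive"

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import WarcReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

executor = LocalPipelineExecutor(
    pipeline=[
        WarcReader("s3://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/"),
        JsonlWriter("output/"),  # hypothetical local output path
    ],
    tasks=64,
    workers=4,  # fewer concurrent downloads -> fewer simultaneous S3 requests
    # Assumption: this parameter staggers each task's start by a random delay
    # (up to 3 minutes here); the exact name may vary between datatrove versions.
    randomize_start_duration=3 * 60,
)

if __name__ == "__main__":
    executor.run()
```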
theyorubayesian commented 1 week ago

Thanks for the response, @hynky1999. I'm sorry for raising an issue; I couldn't find a Discussions section where I could ask instead.

Also, I'm writing output to Azure via adlfs as I process, but nothing has been written so far. I can see from my logs that documents have passed through my filters successfully, yet nothing has appeared in adlfs. Is there a minimum size before pushes happen?
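One plausible explanation, assuming datatrove writes through fsspec: adlfs buffers written bytes in memory and only uploads a chunk once the buffer reaches the block size, with the remainder flushed on close. A minimal sketch of that behavior; the account name, container, path, and block size below are all hypothetical:

```python
import fsspec

# Assumption: fsspec buffered files (including adlfs) keep written bytes in
# memory, uploading a chunk only once the buffer reaches `block_size`; the
# remainder is flushed when the file is closed at the end of a task. Until
# one of those happens, nothing is visible in the container.
fs = fsspec.filesystem("abfs", account_name="myaccount")  # hypothetical account; credentials from env

with fs.open(
    "mycontainer/output/00000.jsonl",  # hypothetical container/path
    "wb",
    block_size=4 * 2**20,  # smaller blocks -> earlier, more frequent uploads
) as f:
    f.write(b'{"text": "example document"}\n')
# any remaining buffered bytes are uploaded when the file closes here
```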