Closed: theyorubayesian closed this issue 1 week ago
Unfortunately, the issue here is on the CommonCrawl side. The S3 bucket is a shared resource, so demand on it is high. You can monitor current usage at https://status.commoncrawl.org/.
To mitigate the issue, you can try:
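One generic way to act on this advice (not part of this library; `with_backoff` is a hypothetical helper) is to wrap each S3 read in exponential backoff with jitter, retrying only on throttling codes such as SlowDown:

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=1.0,
                 retryable=("SlowDown", "ServiceUnavailable")):
    """Call fn(), retrying on throttling errors with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            # boto3's ClientError exposes the code at exc.response["Error"]["Code"];
            # other exception types simply fall through and are re-raised.
            code = getattr(exc, "response", {}).get("Error", {}).get("Code", "")
            if code not in retryable or attempt == max_retries:
                raise
            # Wait base * 2^attempt, plus jitter so workers don't retry in lockstep.
            time.sleep(base_delay * (2 ** attempt + random.random()))
```

Note that boto3 can also do this internally via its `Config(retries=...)` setting, so tuning the client's retry mode may be enough without a wrapper like this.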
Thanks for the response, @hynky1999. I'm sorry for raising an issue; I couldn't find a discussions section where I could ask instead.
Also, I'm writing output to Azure adlfs while processing, but nothing has been written so far. I can see from my logs that documents have passed through my filters successfully, yet nothing has appeared in adlfs. Is there a minimum size before pushes happen?
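For context on the buffering question above: fsspec-based remote filesystems (adlfs included) generally buffer writes in memory and only upload once the buffer reaches the file's block size or the file is flushed/closed, so output written mid-run may not be visible yet. The same behaviour can be sketched with a local file and an oversized buffer (the 1 MiB size and path here are illustrative, not adlfs defaults):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "out.jsonl")
# A 1 MiB buffer stands in for a remote block size: small writes stay in memory.
f = open(path, "w", buffering=1 << 20)
f.write('{"doc": 1}\n')
size_before_flush = os.path.getsize(path)  # still 0: nothing pushed to disk yet
f.flush()
size_after_flush = os.path.getsize(path)   # data becomes visible only after flush
f.close()
```

If that is what's happening here, the output should appear once the writer flushes or closes the file at the end of the run.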
When processing CommonCrawl, I frequently get SlowDown errors:

```
{'Error': {'Code': 'SlowDown', 'Message': 'Please reduce your request rate.'}}
```

Is this common? Are there any recommended strategies for alleviating this issue?
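One strategy that generally helps with S3 throttling (independent of any particular library) is capping how many requests are in flight at once, since SlowDown is S3 asking for a lower request rate. A minimal sketch, where `fetch`, the path names, and the cap of 4 are all illustrative:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 4  # illustrative cap; lower it if SlowDown errors persist
gate = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def fetch(path):
    # Stand-in for the real S3 read (e.g. a boto3 get_object call).
    with gate:  # at most MAX_IN_FLIGHT requests run concurrently
        return f"contents of {path}"

paths = [f"crawl-data/part-{i}" for i in range(16)]
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(fetch, paths))
```

Combined with retries on the throttled calls, this usually smooths out bursty access to a shared bucket.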