joanna-yoo opened 1 year ago
Hmm, @franklsf95, have you seen any similar issues with shuffle workloads? Do you use S3 for spilling?
@rkooo567 Yes, I think we saw this when we were stress-testing shuffle. In the end we opted to use local SSDs instead of S3/GCS for shuffle because of this. For example, we couldn't get a 1TB shuffle to work reliably on S3 due to all sorts of failures, and this was one of the failures we saw.
Although — I thought we were deprecating smart_open in favor of the PyArrow filesystems? I remember this problem went away with the Arrow implementation. Does Arrow support GCS yet?
I believe it does (Datasets already supports GCS, IIRC). I think this feature still needs more productionization (it is also not well tested in CI). There's a chance this can be fixed sooner or later.
@joanna-yoo we recommend just using disk-based spilling, and if you need high throughput, consider mounting a faster local SSD (e.g., the NVMe drives that come with i3 instances).
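For reference, a disk-based spilling config along the lines recommended above looks roughly like this. This is a sketch, not the reporter's setup: the mount point `/mnt/nvme/ray_spill` is an assumed path for a locally mounted NVMe SSD, and passing the config via `_system_config` uses an internal Ray setting.

```python
import json

# Hypothetical disk-based object spilling config. The directory path is an
# assumed NVMe mount point; substitute whatever your instance provides.
object_spilling_config = json.dumps(
    {
        "type": "filesystem",
        "params": {"directory_path": "/mnt/nvme/ray_spill"},
    }
)

# This would then be passed to Ray at startup, e.g.:
#   import ray
#   ray.init(_system_config={"object_spilling_config": object_spilling_config})
print(object_spilling_config)
```

Because the config is serialized JSON, it can equally be supplied through `ray start --system-config` when launching nodes on Kubernetes.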
What happened + What you expected to happen
I'm running Ray Serve on Kubernetes (k8s), and I set up the object spilling config as follows:
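(The exact config was not captured in this issue. For context, a typical smart_open-based S3 spilling config looks roughly like the sketch below; the bucket name and prefix are placeholders, not taken from the original report.)

```python
import json

# Illustrative S3 spilling config using the smart_open backend.
# "my-bucket/ray-spill" is a placeholder URI, not the reporter's actual one.
object_spilling_config = json.dumps(
    {
        "type": "smart_open",
        "params": {"uri": "s3://my-bucket/ray-spill"},
    }
)

# Supplied at cluster startup, e.g.:
#   import ray
#   ray.init(_system_config={"object_spilling_config": object_spilling_config})
print(object_spilling_config)
```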
But this is the error I got for all of the writes:
Under which circumstances would this be false?
Versions / Dependencies
Docker base image:
rayproject/ray:2.2.0-py310-gpu
ray[serve]==2.2.0
Reproduction script
Unfortunately, I cannot reproduce it in a dev setup. :( This works fine locally.
Issue Severity
Medium: It is a significant difficulty but I can work around it.