skyplane-project / skyplane

πŸ”₯ Blazing fast bulk data transfers between any cloud πŸ”₯
https://skyplane.org
Apache License 2.0
1.06k stars 60 forks source link

ValueError while running 1M file transfer #541

Open antonzabreyko opened 2 years ago

antonzabreyko commented 2 years ago

When running a transfer job consisting of one million files, the following error occurs:

βœ“ Initializing cloud keys (5/5) in 4.47s
βœ“ Provisioning gateway instances (2/2) in 35.42s
βœ“ Installing gateway package (2/2) in 13.49s
πŸš€ 0.00GB transfer job launched
⠏ Transfer progress (67655 of 1000000 chunks) ━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 66.1/976.6 KiB ? -:--:--
Compression saved -2300.00% of egress fees
βœ“ Deprovisioning instances (2/2) in 2.70s

❌ AWSServer(region_tag=aws:us-west-2, instance_id=i-0a2071ce582135eff) encountered error:
Traceback (most recent call last):
  File "/pkg/skyplane/gateway/gateway_sender.py", line 95, in worker_loop
    self.send_chunks([next_chunk_id], dest_ip)
  File "/pkg/skyplane/gateway/gateway_sender.py", line 184, in send_chunks
    self.chunk_store.state_start_upload(chunk_id, f"sender:{self.worker_id}")
  File "/pkg/skyplane/gateway/chunk_store.py", line 93, in state_start_upload
    raise ValueError(f"Invalid transition start_upload from {state} (id={chunk_id})")
ValueError: Invalid transition start_upload from ChunkState.downloaded (id=327334)

It appears this error non-deterministically happens, as previously it occurred at roughly 300 KiB transferred.

parasj commented 2 years ago

To update state on this PR, @antonzabreyko and I discussed two mitigations:

  1. Testing the gateway with a less than check on chunk status instead of an equality check. Ideally, we print an error on state mismatches to count how many chunks encounter a status inconsistency.
  2. A more intensive mitigation involving replacing the ChunkStore status log with per-thread state that is not shared (i.e. store state in queues). The ChunkStore would then retain a single dict that contains the set of chunks that have completed rather than a granular log.
sarahwooders commented 1 year ago

Is this still an issue?

parasj commented 1 year ago

Yeah, this is still a pretty major issue. We can close this when we migrate to the ChunkStore-free gateway but that hasn't been integrated yet. The issue is that the manager cannot keep up with high QPS with many chunks.

antonzabreyko commented 1 year ago

I can issue a quick patch that slackens the state check if it's needed before the ChunkStore is removed. Just let me know.

parasj commented 1 year ago

Let's wait until the ChunkStore free gateway code is merged (Sarah is working on this) and then @antonzabreyko should retest that code to see if the issue persists.