ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
32.82k stars 5.57k forks source link

[Data] Retry on `OSError: AWS Error NETWORK_CONNECTION during GetObject operation: curlCode: 28, Timeout was reached` #43803

Closed jennifgcrl closed 1 month ago

jennifgcrl commented 5 months ago

What happened + What you expected to happen

ray.data.read_parquet_bulk crashes on parquet_base_datasource's pq.read_table with

OSError: AWS Error NETWORK_CONNECTION during GetObject operation: curlCode: 28, Timeout was reached

This error should be retried

Versions / Dependencies

ray==2.9.2

Reproduction script

large_dataset = ... # (e.g., s3 directory with 100k parquet files of 1GiB each of arbitrary data)
data = ray.data.read_parquet_bulk(large_dataset)
data.write_datasink(...) # write to s3

Issue Severity

High: It blocks me from completing my task.

bveeramani commented 5 months ago

Hey @jennifgcrl, could you share a full traceback? Also, what does your cluster look like (type and number of nodes), and how did you choose between read_parquet and read_parquet_bulk?

meltzerpete commented 5 months ago

Hi there,

I am seeing the same issue and it is a major blocker.

I am using read_parquet for 512 or more parquet chunks at a time and read_binary_files for millions of files. In each case all files are in a common root prefix on S3. I am using about 1000 CPUs, but sometimes more. More CPUs seems to cause the error more often.

Sometimes its error getting information for key, sometimes it's libcurl was given bad argument, sometimes it's timeout like above, however, it's always caused by transient network errors that need to be retried.

I see some fixes for similar problems have been applied in the past, but there are still many gaps.

Some previous related works:

The closest fix is this one: https://github.com/ray-project/ray/pull/42027 while it says it fixed for reads, inspecting the codebase in version 2.9.3 it looks like it's only actually fixed for writes.

The error I am seeing is AWS Error NETWORK_CONNECTION which is transient and must be retried. It is on the actual read tasks, as well as metadata tasks. Some gaps are here:

A bit of custom hackery into the codebase has improved the reliability greatly for me, although my solutions are too messy to PR. These code references should hopefully be helpful enough to indicate gaps in the current retry strategies though.

meltzerpete commented 5 months ago

Here's an example stacktrace with some stuff ***'d out:

Read progress 0: 100%|█████████▉| 3711/3712 [00:39<00:00, 49.63it/s]Traceback (most recent call last):
  File "/tmp/ray/session_2024-03-05_11-09-55_961367_273/runtime_resources/working_dir_files/_ray_pkg_ddef4fad01df9951/src/my_file.py", line 277, in <module>
    ***
  File "/tmp/ray/session_2024-03-05_11-09-55_961367_273/runtime_resources/working_dir_files/_ray_pkg_ddef4fad01df9951/src/my_file.py", line 273, in ***
  File "/tmp/ray/session_2024-03-05_11-09-55_961367_273/runtime_resources/working_dir_files/_ray_pkg_ddef4fad01df9951/src/my_file.py", line 152, in ***
    print('size of prev_batches:', prev_batches.count())
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/dataset.py", line 2606, in count
    [get_num_rows.remote(block) for block in self.get_internal_block_refs()]
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/dataset.py", line 4779, in get_internal_block_refs
    blocks = self._plan.execute().get_blocks()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/lazy_block_list.py", line 293, in get_blocks
    blocks, _ = self._get_blocks_with_metadata()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/lazy_block_list.py", line 327, in _get_blocks_with_metadata
    meta = ray.get(refs_list.pop(-1))
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OSError): ray::_execute_read_task_split() (pid=25144, ip=10.83.226.216)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/lazy_block_list.py", line 637, in _execute_read_task_split
    for block in blocks:
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/datasource.py", line 237, in __call__
    yield from result
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 430, in __call__
    for block in blocks:
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 371, in __call__
    for data in iter:
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/parquet_datasource.py", line 508, in _read_fragments
    for batch in batches:
  File "pyarrow/_dataset.pyx", line 3414, in _iterator
  File "pyarrow/_dataset.pyx", line 3032, in pyarrow._dataset.TaggedRecordBatchIterator.__next__
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: When reading information for key '***_000000.parquet' in bucket 'MY_BUCKET': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
Sri-nidhi commented 5 months ago

@bveeramani May I know when we can expect a fix on this, Im facing the same issue when im trying to read around 42k parquet files from s3 directory

ronyw7 commented 4 months ago

We are seeing similar issues when reading ~1M images from a S3 bucket with read_images, which causes e2e training to end abruptly.

- ReadImage->Map(wnid_to_index): 4 active, 360 queued, [cpu: 4.0, objects: 673.3MB]:  67%|██████▋   | 1463/2188 [50:34<23:33,  1.95s/it]

- ReadImage->Map(wnid_to_index): 4 active, 360 queued, [cpu: 4.0, objects: 673.3MB]: 100%|██████████| 1463/1463 [50:34<00:00,  1.95s/it]

Traceback (most recent call last):
ray.exceptions.RayTaskError(OSError): ray::ReadImage->Map(wnid_to_index)() (pid=257328, ip=10.0.34.58)
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/home/ubuntu/miniconda3/envs/ray-gpu/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 392, in __call__
    for data in iter:
  File "/home/ubuntu/miniconda3/envs/ray-gpu/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 134, in _udf_timed_iter
    output = next(input)
  File "/home/ubuntu/miniconda3/envs/ray-gpu/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 216, in __call__
    yield from self._row_fn(input, ctx)
  File "/home/ubuntu/miniconda3/envs/ray-gpu/lib/python3.10/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 239, in transform_fn
    for row in rows:
  File "/home/ubuntu/miniconda3/envs/ray-gpu/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 269, in __call__
    for block in blocks:
  File "/home/ubuntu/miniconda3/envs/ray-gpu/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 253, in __call__
    yield from self._block_fn(input, ctx)
  File "/home/ubuntu/miniconda3/envs/ray-gpu/lib/python3.10/site-packages/ray/data/_internal/planner/plan_read_op.py", line 91, in do_read
    yield from call_with_retry(
  File "/home/ubuntu/miniconda3/envs/ray-gpu/lib/python3.10/site-packages/ray/data/datasource/datasource.py", line 164, in __call__
    yield from result
  File "/home/ubuntu/miniconda3/envs/ray-gpu/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 258, in read_task_fn
    yield from make_async_gen(
  File "/home/ubuntu/miniconda3/envs/ray-gpu/lib/python3.10/site-packages/ray/data/_internal/util.py", line 941, in make_async_gen
    raise next_item
  File "/home/ubuntu/miniconda3/envs/ray-gpu/lib/python3.10/site-packages/ray/data/_internal/util.py", line 918, in execute_computation
    for item in fn(thread_safe_generator):
  File "/home/ubuntu/miniconda3/envs/ray-gpu/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 232, in read_files
    with _open_file_with_retry(
  File "/home/ubuntu/miniconda3/envs/ray-gpu/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 522, in _open_file_with_retry
    return call_with_retry(
  File "/home/ubuntu/miniconda3/envs/ray-gpu/lib/python3.10/site-packages/ray/data/_internal/util.py", line 995, in call_with_retry
    raise e from None
  File "/home/ubuntu/miniconda3/envs/ray-gpu/lib/python3.10/site-packages/ray/data/_internal/util.py", line 982, in call_with_retry
    return f()
  File "/home/ubuntu/miniconda3/envs/ray-gpu/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 234, in <lambda>
    lambda: open_input_source(fs, read_path, **open_stream_args),
  File "/home/ubuntu/miniconda3/envs/ray-gpu/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 344, in _open_input_source
    file = filesystem.open_input_stream(path, buffer_size=buffer_size, **open_args)
  File "pyarrow/_fs.pyx", line 822, in pyarrow._fs.FileSystem.open_input_stream
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: When reading information for key 'IMAGE_KEY' in bucket 'MY_BUCKET': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
murthyn commented 4 months ago

Is there any update on fixing this? Also running into this and it is of high severity for me as well

anyscalesam commented 4 months ago

@murthyn - this is high priority for us but we're a little swamped; it's funded as part of our planning through May. Balaji will provide more details as he has them.

anyscalesam commented 3 months ago

re-reading this - @meltzerpete do you think you'd be game to contribute a PR; we can pair with someone to help shepherd through on the Anyscale side.

Your breakdown of the problem and whereabouts to solve is quite spectacular :) cc @c21

raulchen commented 3 months ago

hi @jennifgcrl , looks like this is a transient error. For ray 2.10+, we now support automatic retry of these kind of errors. Could you try again with the latest Ray?

anyscalesam commented 3 months ago

tag @murthyn @ronyw7 @Sri-nidhi @meltzerpete as well see @raulchen above^

meltzerpete commented 3 months ago

tag @murthyn @ronyw7 @Sri-nidhi @meltzerpete as well see @raulchen above^

Thanks for the updates. Hopefully this has been solved. I'm unavailable until Tuesday, but will test this then. Failing that happy to try and support with PR if there's someone to pair with.

meltzerpete commented 3 months ago

@raulchen @anyscalesam

I'm pretty sure I tried ray 2.10 when it was first released and found the issue persisted, however, I have tested today with ray 20.0.0 and can confirm the issue appears to be solved.

I am seeing no transient errors at all, thanks so much!

meltzerpete commented 3 months ago

Ah in fact perhaps I spoke too soon 😬

I am seeing this:

failed to write s3://*****: ray::ReadParquet() (pid=1061231, ip=10.83.226.188)
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 392, in __call__
    for data in iter:
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 253, in __call__
    yield from self._block_fn(input, ctx)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/planner/plan_read_op.py", line 91, in do_read
    yield from call_with_retry(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/datasource.py", line 163, in __call__
    yield from result
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/parquet_datasource.py", line 549, in _read_fragments
    for batch in batches:
  File "pyarrow/_dataset.pyx", line 3769, in _iterator
  File "pyarrow/_dataset.pyx", line 3387, in pyarrow._dataset.TaggedRecordBatchIterator.__next__
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: When reading information for key '*****' in bucket '*****': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached

I'm seeing this several times, but this looks like the only place it's erroring. Before there were many.

DimitarSirakov commented 3 months ago

Is there any update ? I'm facing the same issue...

anyscalesam commented 2 months ago

Hi Ray Community - TPM @ Anyscale here. We had a longgg discussion about this today and are currently thinking of the following to help mitigate this broad problem:

Next steps is we want to chat with y'all to make sure we get the requirements right for fixing this; can folks reach out to me on Ray Slack (if you're not on you can join on ray.io)

My username is "Sam (Ray Team)"

BitPhinix commented 4 weeks ago

Still seems to happening on latest with a webdataset datasource:

Traceback (most recent call last):
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/_private/worker.py", line 2659, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/_private/worker.py", line 871, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OSError): ray::_Inner.train() (pid=444169, ip=10.233.84.84, actor_id=dbde2580c35ca6be6e1eceb904000000, repr=TorchTrainer)
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 331, in train
    raise skipped from exception_cause(skipped)
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 53, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(OSError): ray::_RayTrainWorker__execute.get_next() (pid=445843, ip=10.233.84.84, actor_id=4b94fe3916cc6aa7126f7bba04000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f5acc472950>)
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/train/_internal/worker_group.py", line 33, in __execute
    raise skipped from exception_cause(skipped)
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 169, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/tmp/ray/session_2024-07-31_04-25-04_745822_8880/runtime_resources/working_dir_files/_ray_pkg_6a7ed0f237d82b05/examples/brushnet/train_brushnet_sdxl_ray.py", line 297, in train_func_per_worker
    trainer.fit(model, train_dataloaders=train_dataloader, val_dataloaders=val_dataloader)
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 543, in fit
    call._call_and_handle_interrupt(
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 579, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 986, in _run
    results = self._run_stage()
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1030, in _run_stage
    self.fit_loop.run()
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
    self.advance()
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run
    self.advance(data_fetcher)
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 212, in advance
    batch, _, __ = next(data_fetcher)
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/lightning/pytorch/loops/fetchers.py", line 127, in __next__
    self.batches.append(super().__next__())
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/lightning/pytorch/loops/fetchers.py", line 60, in __next__
    batch = next(self.iterator)
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/lightning/pytorch/utilities/combined_loader.py", line 341, in __next__
    out = next(self._iterator)
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/lightning/pytorch/utilities/combined_loader.py", line 78, in __next__
    out[i] = next(self.iterators[i])
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/iterator.py", line 178, in _create_iterator
    for batch in iterator:
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/block_batching/iter_batches.py", line 178, in iter_batches
    next_batch = next(async_batch_iter)
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/util.py", line 932, in make_async_gen
    raise next_item
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/util.py", line 909, in execute_computation
    for item in fn(thread_safe_generator):
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/block_batching/iter_batches.py", line 167, in _async_iter_batches
    yield from extract_data_from_batch(batch_iter)
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/block_batching/util.py", line 211, in extract_data_from_batch
    for batch in batch_iter:
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/block_batching/iter_batches.py", line 313, in restore_original_order
    for batch in batch_iter:
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/block_batching/util.py", line 204, in finalize_batches
    for batch in batch_iter:
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/util.py", line 932, in make_async_gen
    raise next_item
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/util.py", line 909, in execute_computation
    for item in fn(thread_safe_generator):
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/block_batching/iter_batches.py", line 220, in threadpool_computations_format_collate
    yield from formatted_batch_iter
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/block_batching/util.py", line 180, in collate
    for batch in batch_iter:
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/block_batching/util.py", line 159, in format_batches
    for batch in block_iter:
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/util.py", line 889, in __next__
    return next(self.it)
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/block_batching/util.py", line 118, in blocks_to_batches
    for block in block_iter:
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/block_batching/util.py", line 55, in resolve_block_refs
    for block_ref in block_ref_iter:
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/block_batching/iter_batches.py", line 288, in prefetch_batches_locally
    next_ref_bundle = next(ref_bundles)
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/util.py", line 889, in __next__
    return next(self.it)
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 84, in gen_blocks
    ] = ray.get(future)
ray.exceptions.RayTaskError(OSError): ray::SplitCoordinator.get() (pid=446810, ip=10.233.84.84, actor_id=77cf38306952dea44e4bae7004000000, repr=<ray.data._internal.iterator.stream_split_iterator.SplitCoordinator object at 0x7fbc25dbe5c0>)
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 224, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/execution/legacy_compat.py", line 76, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/execution/streaming_executor.py", line 145, in get_next
    item = self._outer._output_node.get_output_blocking(
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 284, in get_output_blocking
    raise self._exception
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/execution/streaming_executor.py", line 145, in get_next
    item = self._outer._output_node.get_output_blocking(
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 284, in get_output_blocking
    raise self._exception
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/execution/streaming_executor.py", line 145, in get_next
    item = self._outer._output_node.get_output_blocking(
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 284, in get_output_blocking
    raise self._exception
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/execution/streaming_executor.py", line 222, in run
    continue_sched = self._scheduling_loop_step(self._topology)
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/execution/streaming_executor.py", line 277, in _scheduling_loop_step
    num_errored_blocks = process_completed_tasks(
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 457, in process_completed_tasks
    raise e from None
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 424, in process_completed_tasks
    bytes_read = task.on_data_ready(
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 105, in on_data_ready
    raise ex from None
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 101, in on_data_ready
    ray.get(block_ref)
ray.exceptions.RayTaskError(OSError): ray::ReadWebDataset->SplitBlocks(2)() (pid=447163, ip=10.233.84.84)
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 451, in __call__
    for block in blocks:
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 392, in __call__
    for data in iter:
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 253, in __call__
    yield from self._block_fn(input, ctx)
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/planner/plan_read_op.py", line 101, in do_read
    yield from call_with_retry(
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/datasource/datasource.py", line 197, in __call__
    yield from result
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 253, in read_task_fn
    yield from read_files(read_paths)
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 219, in read_files
    for block in read_stream(f, read_path):
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/datasource/webdataset_datasource.py", line 352, in _read_stream
    for sample in samples:
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/datasource/webdataset_datasource.py", line 148, in _group_by_keys
    for filesample in data:
  File "/home/ubuntu/Brushnet/.venv/lib/python3.10/site-packages/ray/data/_internal/datasource/webdataset_datasource.py", line 121, in _tar_file_iterator
    data = stream.extractfile(tarinfo).read()
  File "/usr/lib/python3.10/tarfile.py", line 689, in read
    b = self.fileobj.read(length)
  File "/usr/lib/python3.10/tarfile.py", line 526, in read
    buf = self._read(size)
  File "/usr/lib/python3.10/tarfile.py", line 534, in _read
    return self.__read(size)
  File "/usr/lib/python3.10/tarfile.py", line 564, in __read
    buf = self.fileobj.read(self.bufsize)
  File "pyarrow/io.pxi", line 422, in pyarrow.lib.NativeFile.read
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: AWS Error NETWORK_CONNECTION during GetObject operation: curlCode: 28, Timeout was reached
anyscalesam commented 3 weeks ago

TY for the PR @BitPhinix; let's follow up there.