vilsonrodrigues / face-recognition

A scalable face recognition system
MIT License

FileNotFoundError: [Errno 2] Failed to open local file 'dataset/lfw_multifaces-ingestion/Albert_Costa_0001.jpg' #1

Open vbsantos opened 2 months ago

vbsantos commented 2 months ago

Issue: FileNotFoundError: [Errno 2] Failed to open local file 'dataset/lfw_multifaces-ingestion/Albert_Costa_0001.jpg'

Description

I am encountering a FileNotFoundError when running the job_lfw job with Ray on a Kubernetes cluster. The error occurs after the dataset is downloaded: the data is processed once, and when it is processed a second time Ray tries to open a local file that apparently does not exist. I am new to the Python and Kubernetes ecosystems, so I apologize if this is a basic mistake.
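For context, a check like the following (a rough sketch on my part, not from the project; the path is the one from the error message) would confirm whether the unzipped dataset directory is present and populated on the pod where the driver script runs:

# Hypothetical check on the pod where the job script runs: is the unzipped
# dataset directory present and non-empty?
import os

path = "dataset/lfw_multifaces-ingestion"
print(os.path.isdir(path))
print(len(os.listdir(path)) if os.path.isdir(path) else "directory missing")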

Details

Error Details

The complete error is as follows:

2024-07-06 17:40:19,695 INFO cli.py:36 -- Job submission server address: http://rayjob-lfw-raycluster-2cv5x-head-svc.default.svc.cluster.local:8265
2024-07-06 17:40:20,438 SUCC cli.py:60 -- ---------------------------------------------
2024-07-06 17:40:20,438 SUCC cli.py:61 -- Job 'rayjob-lfw-t4mxs' submitted successfully
2024-07-06 17:40:20,438 SUCC cli.py:62 -- ---------------------------------------------
2024-07-06 17:40:20,438 INFO cli.py:274 -- Next steps
2024-07-06 17:40:20,438 INFO cli.py:275 -- Query the logs of the job:
2024-07-06 17:40:20,438 INFO cli.py:277 -- ray job logs rayjob-lfw-t4mxs
2024-07-06 17:40:20,438 INFO cli.py:279 -- Query the status of the job:
2024-07-06 17:40:20,438 INFO cli.py:281 -- ray job status rayjob-lfw-t4mxs
2024-07-06 17:40:20,438 INFO cli.py:283 -- Request the job to be stopped:
2024-07-06 17:40:20,439 INFO cli.py:285 -- ray job stop rayjob-lfw-t4mxs
2024-07-06 17:40:20,444 INFO cli.py:292 -- Tailing logs until the job exits (disable with --no-wait):
Downlaod dataset vilsonrodrigues/lfw/lfw_multifaces-ingestion.zip

lfw_multifaces-ingestion.zip:   0%|          | 0.00/69.1M [00:00<?, ?B/s]
lfw_multifaces-ingestion.zip:  15%|█▌        | 10.5M/69.1M [00:00<00:00, 89.1MB/s]
lfw_multifaces-ingestion.zip:  46%|████▌     | 31.5M/69.1M [00:00<00:00, 138MB/s] 
lfw_multifaces-ingestion.zip:  76%|███████▌  | 52.4M/69.1M [00:00<00:00, 158MB/s]
lfw_multifaces-ingestion.zip: 100%|██████████| 69.1M/69.1M [00:00<00:00, 160MB/s]
lfw_multifaces-ingestion.zip: 100%|██████████| 69.1M/69.1M [00:00<00:00, 149MB/s]
Unzip dataset
Load images with Ray Data
2024-07-06 17:40:24,611 INFO worker.py:1329 -- Using address 10.2.1.35:6379 set in the environment variable RAY_ADDRESS
2024-07-06 17:40:24,611 INFO worker.py:1458 -- Connecting to existing Ray cluster at address: 10.2.1.35:6379...
2024-07-06 17:40:24,618 INFO worker.py:1633 -- Connected to Ray cluster. View the dashboard at 10.2.1.35:8265 
Start map batch processing
Batch map process finish
2024-07-06 17:40:25,241 INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> ActorPoolMapOperator[ReadImage->Map(parse_filename)->MapBatches(<lambda>)->MapBatches(UltraLightORTBatchPredictor)] -> ActorPoolMapOperator[MapBatches(BatchFacePostProcessing)] -> ActorPoolMapOperator[MapBatches(MobileFaceNetORTBatchPredictor)]
2024-07-06 17:40:25,242 INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-07-06 17:40:25,242 INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
2024-07-06 17:40:25,596 INFO actor_pool_map_operator.py:106 -- ReadImage->Map(parse_filename)->MapBatches(<lambda>)->MapBatches(UltraLightORTBatchPredictor): Waiting for 1 pool actors to start...
2024-07-06 17:40:28,497 INFO actor_pool_map_operator.py:106 -- MapBatches(BatchFacePostProcessing): Waiting for 1 pool actors to start...
2024-07-06 17:40:29,404 INFO actor_pool_map_operator.py:106 -- MapBatches(MobileFaceNetORTBatchPredictor): Waiting for 1 pool actors to start...

Running 0:   0%|          | 0/65 [00:00<?, ?it/s]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 0.0 MiB/1.15 GiB object_store_memory:   0%|          | 0/65 [00:00<?, ?it/s]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 4.1 MiB/1.15 GiB object_store_memory:   0%|          | 0/65 [00:00<?, ?it/s]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 16.53 MiB/1.15 GiB object_store_memory:   0%|          | 0/65 [00:04<?, ?it/s]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 17.5 MiB/1.15 GiB object_store_memory:   0%|          | 0/65 [00:04<?, ?it/s] 
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 7.73 MiB/1.15 GiB object_store_memory:   0%|          | 0/65 [00:04<?, ?it/s]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 4.17 MiB/1.15 GiB object_store_memory:   0%|          | 0/65 [00:05<?, ?it/s]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 4.17 MiB/1.15 GiB object_store_memory:   2%|▏         | 1/65 [00:05<05:27,  5.12s/it]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 4.09 MiB/1.15 GiB object_store_memory:   2%|▏         | 1/65 [00:05<05:27,  5.12s/it]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 16.46 MiB/1.15 GiB object_store_memory:   2%|▏         | 1/65 [00:07<05:27,  5.12s/it]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 17.49 MiB/1.15 GiB object_store_memory:   2%|▏         | 1/65 [00:07<05:27,  5.12s/it]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 7.9 MiB/1.15 GiB object_store_memory:   2%|▏         | 1/65 [00:07<05:27,  5.12s/it]  
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 4.16 MiB/1.15 GiB object_store_memory:   2%|▏         | 1/65 [00:08<05:27,  5.12s/it]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 4.16 MiB/1.15 GiB object_store_memory:   3%|▎         | 2/65 [00:08<04:15,  4.05s/it]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 4.08 MiB/1.15 GiB object_store_memory:   3%|▎         | 2/65 [00:08<04:15,  4.05s/it]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 16.47 MiB/1.15 GiB object_store_memory:   3%|▎         | 2/65 [00:10<04:15,  4.05s/it]

(omitted)

Running: 1.3/5.0 CPU, 0.0/0.0 GPU, 16.62 MiB/1.15 GiB object_store_memory:  98%|█████████▊| 64/65 [03:28<00:03,  3.17s/it]
Running: 1.0/5.0 CPU, 0.0/0.0 GPU, 3.38 MiB/1.15 GiB object_store_memory:  98%|█████████▊| 64/65 [03:28<00:03,  3.17s/it] 
Running: 1.0/5.0 CPU, 0.0/0.0 GPU, 3.46 MiB/1.15 GiB object_store_memory:  98%|█████████▊| 64/65 [03:29<00:03,  3.17s/it]
Running: 1.0/5.0 CPU, 0.0/0.0 GPU, 3.46 MiB/1.15 GiB object_store_memory: 100%|██████████| 65/65 [03:29<00:00,  3.26s/it]
Running: 0.0/5.0 CPU, 0.0/0.0 GPU, 0.0 MiB/1.15 GiB object_store_memory: 100%|██████████| 65/65 [03:29<00:00,  3.26s/it] 

2024-07-06 17:44:00,933 WARNING plan.py:567 -- Warning: The Ray cluster currently does not have any available CPUs. The Dataset job will hang unless more CPUs are freed up. A common reason is that cluster resources are used by Actors or Tune trials; see the following link for more details: https://docs.ray.io/en/master/data/dataset-internals.html#datasets-and-tune
2024-07-06 17:44:02,479 INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadImage->Map(parse_filename)->MapBatches(<lambda>)]
2024-07-06 17:44:02,480 INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-07-06 17:44:02,480 INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`

Running 0:   0%|          | 0/65 [00:00<?, ?it/s]
Running: 0.0/5.0 CPU, 0.0/0.0 GPU, 0.0 MiB/1.15 GiB object_store_memory:   0%|          | 0/65 [00:00<?, ?it/s]
Running: 5.0/5.0 CPU, 0.0/0.0 GPU, 5.07 MiB/1.15 GiB object_store_memory:   0%|          | 0/65 [00:00<?, ?it/s]
Running: 4.0/5.0 CPU, 0.0/0.0 GPU, 17.51 MiB/1.15 GiB object_store_memory:   0%|          | 0/65 [00:00<?, ?it/s]
Running: 4.0/5.0 CPU, 0.0/0.0 GPU, 17.51 MiB/1.15 GiB object_store_memory:   2%|▏         | 1/65 [00:00<00:26,  2.39it/s]
Running: 5.0/5.0 CPU, 0.0/0.0 GPU, 5.13 MiB/1.15 GiB object_store_memory:   2%|▏         | 1/65 [00:00<00:26,  2.39it/s] 
Running: 4.0/5.0 CPU, 0.0/0.0 GPU, 17.51 MiB/1.15 GiB object_store_memory:   2%|▏         | 1/65 [00:00<00:26,  2.39it/s]
Running: 4.0/5.0 CPU, 0.0/0.0 GPU, 17.51 MiB/1.15 GiB object_store_memory:   3%|▎         | 2/65 [00:00<00:25,  2.43it/s]
Running: 5.0/5.0 CPU, 0.0/0.0 GPU, 5.03 MiB/1.15 GiB object_store_memory:   3%|▎         | 2/65 [00:00<00:25,  2.43it/s] 
Running: 4.0/5.0 CPU, 0.0/0.0 GPU, 17.51 MiB/1.15 GiB object_store_memory:   3%|▎         | 2/65 [00:01<00:25,  2.43it/s]
Running: 4.0/5.0 CPU, 0.0/0.0 GPU, 17.51 MiB/1.15 GiB object_store_memory:   5%|▍         | 3/65 [00:01<00:22,  2.76it/s]
Running: 5.0/5.0 CPU, 0.0/0.0 GPU, 5.12 MiB/1.15 GiB object_store_memory:   5%|▍         | 3/65 [00:01<00:22,  2.76it/s] 
Running: 4.0/5.0 CPU, 0.0/0.0 GPU, 17.48 MiB/1.15 GiB object_store_memory:   5%|▍         | 3/65 [00:01<00:22,  2.76it/s]
Running: 4.0/5.0 CPU, 0.0/0.0 GPU, 17.48 MiB/1.15 GiB object_store_memory:   6%|▌         | 4/65 [00:01<00:32,  1.86it/s]
Running: 5.0/5.0 CPU, 0.0/0.0 GPU, 5.13 MiB/1.15 GiB object_store_memory:   6%|▌         | 4/65 [00:02<00:32,  1.86it/s] 
Running: 4.0/5.0 CPU, 0.0/0.0 GPU, 17.48 MiB/1.15 GiB object_store_memory:   6%|▌         | 4/65 [00:02<00:32,  1.86it/s]
Running: 4.0/5.0 CPU, 0.0/0.0 GPU, 17.48 MiB/1.15 GiB object_store_memory:   8%|▊         | 5/65 [00:02<00:27,  2.19it/s]
Running: 5.0/5.0 CPU, 0.0/0.0 GPU, 5.1 MiB/1.15 GiB object_store_memory:   8%|▊         | 5/65 [00:02<00:27,  2.19it/s]  
Running: 5.0/5.0 CPU, 0.0/0.0 GPU, 5.1 MiB/1.15 GiB object_store_memory: 100%|██████████| 5/5 [00:02<00:00,  2.19it/s] 

Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 345, in ray._raylet.StreamingObjectRefGenerator._next_sync
  File "python/ray/_raylet.pyx", line 4533, in ray._raylet.CoreWorker.try_read_next_object_ref_stream
  File "python/ray/_raylet.pyx", line 443, in ray._raylet.check_status
ray.exceptions.ObjectRefStreamEndOfStreamError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 80, in on_waitable_ready
    meta = ray.get(next(self._streaming_gen))
  File "python/ray/_raylet.pyx", line 300, in ray._raylet.StreamingObjectRefGenerator.__next__
  File "python/ray/_raylet.pyx", line 351, in ray._raylet.StreamingObjectRefGenerator._next_sync
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ray/job_hf.py", line 153, in <module>
    df = ds.to_pandas()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/dataset.py", line 4242, in to_pandas
    count = self.count()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/dataset.py", line 2498, in count
    [get_num_rows.remote(block) for block in self.get_internal_block_refs()]
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/dataset.py", line 4799, in get_internal_block_refs
    blocks = self._plan.execute().get_blocks()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/plan.py", line 591, in execute
    blocks = execute_to_legacy_block_list(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/legacy_compat.py", line 119, in execute_to_legacy_block_list
    block_list = _bundles_to_block_list(bundles)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/legacy_compat.py", line 357, in _bundles_to_block_list
    for ref_bundle in bundles:
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/interfaces/executor.py", line 37, in __next__
    return self.get_next()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/streaming_executor.py", line 129, in get_next
    raise item
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/streaming_executor.py", line 187, in run
    while self._scheduling_loop_step(self._topology) and not self._shutdown:
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/streaming_executor.py", line 235, in _scheduling_loop_step
    process_completed_tasks(topology)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 333, in process_completed_tasks
    active_tasks[ref].on_waitable_ready()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 88, in on_waitable_ready
    ex = ray.get(block_ref)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 2547, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(FileNotFoundError): ray::ReadImage->Map(parse_filename)->MapBatches(<lambda>)() (pid=233, ip=10.2.0.153)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 405, in _map_task
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 122, in apply_transform
    iter = transform_fn(iter, ctx)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 263, in __call__
    first = next(block_iter, None)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 345, in __call__
    for data in iter:
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 171, in __call__
    yield from self._row_fn(input, ctx)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 232, in transform_fn
    for row in rows:
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 223, in __call__
    for block in blocks:
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 207, in __call__
    yield from self._block_fn(input, ctx)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/planner/plan_read_op.py", line 122, in do_read
    yield from read_task()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/datasource.py", line 225, in __call__
    yield from result
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 610, in read_task_fn
    yield from make_async_gen(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/util.py", line 769, in make_async_gen
    raise next_item
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/util.py", line 746, in execute_computation
    for item in fn(thread_safe_generator):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 589, in read_files
    with _open_file_with_retry(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 1000, in _open_file_with_retry
    raise e from None
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 982, in _open_file_with_retry
    return open_file()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 591, in <lambda>
    lambda: open_input_source(fs, read_path, **open_stream_args),
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 253, in _open_input_source
    return filesystem.open_input_stream(path, buffer_size=buffer_size, **open_args)
  File "pyarrow/_fs.pyx", line 812, in pyarrow._fs.FileSystem.open_input_stream
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 113, in pyarrow.lib.check_status
FileNotFoundError: [Errno 2] Failed to open local file 'dataset/lfw_multifaces-ingestion/Albert_Costa_0001.jpg'. Detail: [errno 2] No such file or directory
2024-07-06 17:44:06,427 ERR cli.py:68 -- -----------------------------
2024-07-06 17:44:06,427 ERR cli.py:69 -- Job 'rayjob-lfw-t4mxs' failed
2024-07-06 17:44:06,427 ERR cli.py:70 -- -----------------------------
2024-07-06 17:44:06,427 INFO cli.py:83 -- Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 982, in _open_file_with_retry
    return open_file()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 591, in <lambda>
    lambda: open_input_source(fs, read_path, **open_stream_args),
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 253, in _open_input_source
    return filesystem.open_input_stream(path, buffer_size=buffer_size, **open_args)
  File "pyarrow/_fs.pyx", line 812, in pyarrow._fs.FileSystem.open_input_stream
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 113, in pyarrow.lib.check_status
FileNotFoundError: [Errno 2] Failed to open local file 'dataset/lfw_multifaces-ingestion/Albert_Costa_0001.jpg'. Detail: [errno 2] No such file or directory
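
The call that fails at the bottom of the traceback is a plain pyarrow local-file open, so it can be reproduced outside of Ray. A minimal standalone check (a sketch; the path is copied from the error message, and it is most telling when run on the node reported in the error, ip=10.2.0.153) would be:

# Standalone reproduction of the failing open, independent of Ray Data:
# pyarrow's LocalFileSystem is what raises the FileNotFoundError in the traceback.
import pyarrow.fs as pafs

fs = pafs.LocalFileSystem()
path = "dataset/lfw_multifaces-ingestion/Albert_Costa_0001.jpg"
try:
    with fs.open_input_stream(path) as stream:
        print(f"opened {path}: {len(stream.read())} bytes")
except FileNotFoundError as exc:
    print(f"missing: {exc}")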

Could you please help me understand why this error is occurring and how to resolve it? Any guidance or suggestions would be greatly appreciated.

Thank you in advance for your assistance and support.

vilsonrodrigues commented 2 months ago

Hi Vinícius, can you run the job_hf.py script locally in Python (outside of Kubernetes)?
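
A minimal local run could look like the sketch below. It assumes job_hf.py only needs a local Ray runtime and the unzipped dataset directory; read_images with include_paths is an assumption based on the ReadImage->Map(parse_filename) stage in your log, not the script's actual code:

# Hypothetical local smoke test, outside Kubernetes: a single-node Ray runtime
# reading the same unzipped dataset directory the job uses.
import ray

ray.init()  # starts a local, in-process Ray instead of the Kubernetes cluster
ds = ray.data.read_images(
    "dataset/lfw_multifaces-ingestion",
    include_paths=True,  # assumed: the job parses filenames downstream
)
print(ds.count())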

It appears to be a hardware resource limitation:

2024-07-06 17:44:00,933 WARNING plan.py:567 -- Warning: The Ray cluster currently does not have any available CPUs. The Dataset job will hang unless more CPUs are freed up. A common reason is that cluster resources are used by Actors or Tune trials; see the following link for more details: https://docs.ray.io/en/master/data/dataset-internals.html#datasets-and-tune

In the Kubernetes Job you can increase the hardware resources that Ray is allowed to use; remember to also raise the environment variables that limit Ray's use of that hardware accordingly.
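
After increasing those values, you can confirm that Ray actually sees the additional CPUs with something like this (a sketch, not project code; your current log shows 5.0 CPU in total):

# Hypothetical verification, run from the job script or any pod in the cluster:
# check how many CPUs Ray advertises after the resource increase.
import ray

ray.init(address="auto")           # attach to the existing Ray cluster
print(ray.cluster_resources())     # total resources, e.g. {'CPU': 5.0, ...} today
print(ray.available_resources())   # what is currently free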