@RobbeSneyders closed this issue 7 months ago.
Thanks for the report @vpvsankar.
Two questions to help us narrow this down: did you make any changes to the pipeline, and are you running it with a GPU?
I didn't make any changes to the pipeline. I am using a GPU for running the pipeline.
And are you using the local runner, @vpvsankar?
I submitted a fix to fondant at https://github.com/ml6team/fondant/pull/904.
You can already test it by installing the commit and using the --build-arg
flag with the local runner:
pip uninstall fondant
pip install fondant@git+https://github.com/ml6team/fondant@807304c
fondant run local pipeline.py --build-arg FONDANT_VERSION=807304c
I'm still getting an error:
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:22,611 | distributed.nanny | INFO] Closing Nanny at 'tcp://127.0.0.1:43941'. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:22,611 | distributed.nanny | INFO] Nanny asking worker to close. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:22,612 | distributed.worker | INFO] Stopping worker at tcp://127.0.0.1:40815. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:22,613 | distributed.worker | ERROR] Failed to communicate with scheduler during heartbeat.
retrieve_from_faiss_by_prompt-1 | Traceback (most recent call last):
retrieve_from_faiss_by_prompt-1 | File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 225, in read
retrieve_from_faiss_by_prompt-1 | frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
retrieve_from_faiss_by_prompt-1 | tornado.iostream.StreamClosedError: Stream is closed
retrieve_from_faiss_by_prompt-1 |
retrieve_from_faiss_by_prompt-1 | The above exception was the direct cause of the following exception:
retrieve_from_faiss_by_prompt-1 |
retrieve_from_faiss_by_prompt-1 | Traceback (most recent call last):
retrieve_from_faiss_by_prompt-1 | File "/opt/conda/lib/python3.10/site-packages/distributed/worker.py", line 1252, in heartbeat
retrieve_from_faiss_by_prompt-1 | response = await retry_operation(
retrieve_from_faiss_by_prompt-1 | File "/opt/conda/lib/python3.10/site-packages/distributed/utils_comm.py", line 455, in retry_operation
retrieve_from_faiss_by_prompt-1 | return await retry(
retrieve_from_faiss_by_prompt-1 | File "/opt/conda/lib/python3.10/site-packages/distributed/utils_comm.py", line 434, in retry
retrieve_from_faiss_by_prompt-1 | return await coro()
retrieve_from_faiss_by_prompt-1 | File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1395, in send_recv_from_rpc
retrieve_from_faiss_by_prompt-1 | return await send_recv(comm=comm, op=key, **kwargs)
retrieve_from_faiss_by_prompt-1 | File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1154, in send_recv
retrieve_from_faiss_by_prompt-1 | response = await comm.read(deserializers=deserializers)
retrieve_from_faiss_by_prompt-1 | File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 236, in read
retrieve_from_faiss_by_prompt-1 | convert_stream_closed_error(self, e)
retrieve_from_faiss_by_prompt-1 | File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error
retrieve_from_faiss_by_prompt-1 | raise CommClosedError(f"in {obj}: {exc}") from exc
retrieve_from_faiss_by_prompt-1 | distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.heartbeat_worker local=tcp://127.0.0.1:46964 remote=tcp://127.0.0.1:37215>: Stream is closed
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:22,615 | distributed.core | INFO] Connection to tcp://127.0.0.1:37215 has been closed.
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:22,618 | distributed.core | INFO] Received 'close-stream' from tcp://127.0.0.1:46952; closing.
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:22,618 | distributed.scheduler | INFO] Remove worker <WorkerState 'tcp://127.0.0.1:40815', name: 8, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1710182962.6187515')
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:23,485 | faiss.loader | INFO] Loading faiss with AVX2 support.
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:23,502 | faiss.loader | INFO] Successfully loaded faiss with AVX2 support.
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:26,977 | distributed.nanny.memory | WARNING] Worker tcp://127.0.0.1:38329 (pid=98) exceeded 95% memory budget. Restarting...
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,084 | distributed.core | INFO] Connection to tcp://127.0.0.1:38152 has been closed.
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,084 | distributed.scheduler | INFO] Remove worker <WorkerState 'tcp://127.0.0.1:38329', name: 14, status: running, memory: 9, processing: 9> (stimulus_id='handle-worker-cleanup-1710182967.0843165')
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,084 | distributed.scheduler | INFO] Task ('repartition-merge-3b7708bcfe130f1d349a06e5d936529f', 4) marked as failed because 4 workers died while trying to run it
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,084 | distributed.scheduler | INFO] Task ('repartition-merge-3b7708bcfe130f1d349a06e5d936529f', 7) marked as failed because 4 workers died while trying to run it
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,085 | distributed.scheduler | INFO] Task ('getitem-29877aa25428d6f816aee94f3db313ba', 5) marked as failed because 4 workers died while trying to run it
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,085 | distributed.scheduler | INFO] Task ('getitem-29877aa25428d6f816aee94f3db313ba', 2) marked as failed because 4 workers died while trying to run it
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,085 | distributed.scheduler | INFO] Task ('getitem-29877aa25428d6f816aee94f3db313ba', 8) marked as failed because 4 workers died while trying to run it
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,085 | distributed.scheduler | INFO] Task ('repartition-merge-3b7708bcfe130f1d349a06e5d936529f', 0) marked as failed because 4 workers died while trying to run it
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,085 | distributed.scheduler | INFO] Task ('repartition-merge-3b7708bcfe130f1d349a06e5d936529f', 3) marked as failed because 4 workers died while trying to run it
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,086 | distributed.scheduler | INFO] Task ('getitem-29877aa25428d6f816aee94f3db313ba', 1) marked as failed because 4 workers died while trying to run it
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,086 | distributed.scheduler | INFO] Task ('repartition-merge-3b7708bcfe130f1d349a06e5d936529f', 6) marked as failed because 4 workers died while trying to run it
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,087 | distributed.nanny | INFO] Worker process 98 was killed by signal 15
retrieve_from_faiss_by_prompt-1 | Traceback (most recent call last):
retrieve_from_faiss_by_prompt-1 | File "/opt/conda/bin/fondant", line 8, in <module>
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,096 | distributed.nanny | WARNING] Restarting worker
retrieve_from_faiss_by_prompt-1 | sys.exit(entrypoint())
retrieve_from_faiss_by_prompt-1 | File "/opt/conda/lib/python3.10/site-packages/fondant/cli.py", line 89, in entrypoint
retrieve_from_faiss_by_prompt-1 | args.func(args)
retrieve_from_faiss_by_prompt-1 | File "/opt/conda/lib/python3.10/site-packages/fondant/cli.py", line 711, in execute
retrieve_from_faiss_by_prompt-1 | executor.execute(component)
retrieve_from_faiss_by_prompt-1 | File "/opt/conda/lib/python3.10/site-packages/fondant/component/executor.py", line 335, in execute
retrieve_from_faiss_by_prompt-1 | output_manifest = self._run_execution(component_cls, input_manifest)
retrieve_from_faiss_by_prompt-1 | File "/opt/conda/lib/python3.10/site-packages/fondant/component/executor.py", line 301, in _run_execution
retrieve_from_faiss_by_prompt-1 | self._write_data(dataframe=output_df, manifest=output_manifest)
retrieve_from_faiss_by_prompt-1 | File "/opt/conda/lib/python3.10/site-packages/fondant/component/executor.py", line 219, in _write_data
retrieve_from_faiss_by_prompt-1 | data_writer.write_dataframe(dataframe)
retrieve_from_faiss_by_prompt-1 | File "/opt/conda/lib/python3.10/site-packages/fondant/component/data_io.py", line 178, in write_dataframe
retrieve_from_faiss_by_prompt-1 | self._write_dataframe(dataframe)
retrieve_from_faiss_by_prompt-1 | File "/opt/conda/lib/python3.10/site-packages/fondant/component/data_io.py", line 251, in _write_dataframe
retrieve_from_faiss_by_prompt-1 | future.result()
retrieve_from_faiss_by_prompt-1 | File "/opt/conda/lib/python3.10/site-packages/distributed/client.py", line 322, in result
retrieve_from_faiss_by_prompt-1 | return self.client.sync(self._result, callback_timeout=timeout)
retrieve_from_faiss_by_prompt-1 | File "/opt/conda/lib/python3.10/site-packages/distributed/client.py", line 330, in _result
retrieve_from_faiss_by_prompt-1 | raise exc.with_traceback(tb)
retrieve_from_faiss_by_prompt-1 | distributed.scheduler.KilledWorker: Attempted to run task ('repartition-merge-3b7708bcfe130f1d349a06e5d936529f', 6) on 4 different workers, but all those workers died while running it. The last worker that attempt to run the task was tcp://127.0.0.1:38329. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see https://distributed.dask.org/en/stable/killed.html.
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,103 | distributed.nanny | INFO] Closing Nanny at 'tcp://127.0.0.1:32847'. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,104 | distributed.nanny | INFO] Nanny asking worker to close. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,104 | distributed.nanny | INFO] Closing Nanny at 'tcp://127.0.0.1:38327'. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,104 | distributed.nanny | INFO] Nanny asking worker to close. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,104 | distributed.nanny | INFO] Closing Nanny at 'tcp://127.0.0.1:33507'. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,104 | distributed.worker | INFO] Stopping worker at tcp://127.0.0.1:45611. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,105 | distributed.nanny | INFO] Nanny asking worker to close. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,105 | distributed.nanny | INFO] Closing Nanny at 'tcp://127.0.0.1:36299'. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,105 | distributed.worker | INFO] Stopping worker at tcp://127.0.0.1:44987. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,105 | distributed.nanny | INFO] Nanny asking worker to close. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,105 | distributed.nanny | INFO] Closing Nanny at 'tcp://127.0.0.1:41793'. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,106 | distributed.nanny | INFO] Nanny asking worker to close. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,106 | distributed.core | INFO] Connection to tcp://127.0.0.1:37215 has been closed.
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,106 | distributed.worker | INFO] Stopping worker at tcp://127.0.0.1:40333. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,106 | distributed.nanny | INFO] Closing Nanny at 'tcp://127.0.0.1:36021'. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,106 | distributed.worker | INFO] Stopping worker at tcp://127.0.0.1:34629. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,106 | distributed.core | INFO] Connection to tcp://127.0.0.1:37215 has been closed.
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,107 | distributed.nanny | INFO] Nanny asking worker to close. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,107 | distributed.nanny | INFO] Closing Nanny at 'tcp://127.0.0.1:40435'. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,107 | distributed.worker | INFO] Stopping worker at tcp://127.0.0.1:37287. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,107 | distributed.nanny | INFO] Nanny asking worker to close. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,107 | distributed.nanny | INFO] Closing Nanny at 'tcp://127.0.0.1:38883'. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,107 | distributed.worker | INFO] Stopping worker at tcp://127.0.0.1:45889. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,108 | distributed.nanny | INFO] Nanny asking worker to close. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,108 | distributed.nanny | INFO] Closing Nanny at 'tcp://127.0.0.1:44053'. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,108 | distributed.core | INFO] Connection to tcp://127.0.0.1:37215 has been closed.
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,108 | distributed.nanny | INFO] Nanny asking worker to close. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,108 | distributed.core | INFO] Connection to tcp://127.0.0.1:37215 has been closed.
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,108 | distributed.nanny | INFO] Closing Nanny at 'tcp://127.0.0.1:39293'. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,109 | distributed.core | INFO] Connection to tcp://127.0.0.1:37215 has been closed.
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,109 | distributed.nanny | INFO] Nanny asking worker to close. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,109 | distributed.worker | INFO] Stopping worker at tcp://127.0.0.1:44693. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,109 | distributed.nanny | INFO] Closing Nanny at 'tcp://127.0.0.1:33757'. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,109 | distributed.worker | INFO] Stopping worker at tcp://127.0.0.1:37181. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,109 | distributed.core | INFO] Connection to tcp://127.0.0.1:37215 has been closed.
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,110 | distributed.worker | INFO] Stopping worker at tcp://127.0.0.1:40641. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,110 | distributed.worker | INFO] Stopping worker at tcp://127.0.0.1:34415. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,110 | distributed.nanny | INFO] Nanny asking worker to close. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,110 | distributed.nanny | INFO] Closing Nanny at 'tcp://127.0.0.1:34999'. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,110 | distributed.nanny | INFO] Closing Nanny at 'tcp://127.0.0.1:44801'. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,111 | distributed.nanny | INFO] Nanny asking worker to close. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,111 | distributed.core | INFO] Connection to tcp://127.0.0.1:37215 has been closed.
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,111 | distributed.core | INFO] Connection to tcp://127.0.0.1:37215 has been closed.
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,111 | distributed.worker | INFO] Stopping worker at tcp://127.0.0.1:45743. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,112 | distributed.core | INFO] Connection to tcp://127.0.0.1:37215 has been closed.
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,112 | distributed.core | INFO] Connection to tcp://127.0.0.1:37215 has been closed.
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,114 | distributed.worker | INFO] Stopping worker at tcp://127.0.0.1:44837. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,112 | distributed.worker | ERROR] Failed to communicate with scheduler during heartbeat.
retrieve_from_faiss_by_prompt-1 | Traceback (most recent call last):
retrieve_from_faiss_by_prompt-1 | File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 225, in read
retrieve_from_faiss_by_prompt-1 | frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
retrieve_from_faiss_by_prompt-1 | tornado.iostream.StreamClosedError: Stream is closed
retrieve_from_faiss_by_prompt-1 |
retrieve_from_faiss_by_prompt-1 | The above exception was the direct cause of the following exception:
retrieve_from_faiss_by_prompt-1 |
retrieve_from_faiss_by_prompt-1 | Traceback (most recent call last):
retrieve_from_faiss_by_prompt-1 | File "/opt/conda/lib/python3.10/site-packages/distributed/worker.py", line 1252, in heartbeat
retrieve_from_faiss_by_prompt-1 | response = await retry_operation(
retrieve_from_faiss_by_prompt-1 | File "/opt/conda/lib/python3.10/site-packages/distributed/utils_comm.py", line 455, in retry_operation
retrieve_from_faiss_by_prompt-1 | return await retry(
retrieve_from_faiss_by_prompt-1 | File "/opt/conda/lib/python3.10/site-packages/distributed/utils_comm.py", line 434, in retry
retrieve_from_faiss_by_prompt-1 | return await coro()
retrieve_from_faiss_by_prompt-1 | File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1395, in send_recv_from_rpc
retrieve_from_faiss_by_prompt-1 | return await send_recv(comm=comm, op=key, **kwargs)
retrieve_from_faiss_by_prompt-1 | File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1154, in send_recv
retrieve_from_faiss_by_prompt-1 | response = await comm.read(deserializers=deserializers)
retrieve_from_faiss_by_prompt-1 | File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 236, in read
retrieve_from_faiss_by_prompt-1 | convert_stream_closed_error(self, e)
retrieve_from_faiss_by_prompt-1 | File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error
retrieve_from_faiss_by_prompt-1 | raise CommClosedError(f"in {obj}: {exc}") from exc
retrieve_from_faiss_by_prompt-1 | distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.heartbeat_worker local=tcp://127.0.0.1:38400 remote=tcp://127.0.0.1:37215>: Stream is closed
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,115 | distributed.core | INFO] Connection to tcp://127.0.0.1:37215 has been closed.
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,116 | distributed.core | INFO] Connection to tcp://127.0.0.1:37215 has been closed.
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,119 | distributed.core | INFO] Received 'close-stream' from tcp://127.0.0.1:38208; closing.
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,119 | distributed.core | INFO] Received 'close-stream' from tcp://127.0.0.1:38092; closing.
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,120 | distributed.core | INFO] Received 'close-stream' from tcp://127.0.0.1:38134; closing.
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,120 | distributed.core | INFO] Received 'close-stream' from tcp://127.0.0.1:38124; closing.
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,120 | distributed.core | INFO] Received 'close-stream' from tcp://127.0.0.1:38100; closing.
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,120 | distributed.core | INFO] Received 'close-stream' from tcp://127.0.0.1:38110; closing.
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,120 | distributed.core | INFO] Received 'close-stream' from tcp://127.0.0.1:38104; closing.
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,121 | distributed.core | INFO] Received 'close-stream' from tcp://127.0.0.1:38200; closing.
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,121 | distributed.core | INFO] Received 'close-stream' from tcp://127.0.0.1:38184; closing.
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,121 | distributed.core | INFO] Received 'close-stream' from tcp://127.0.0.1:38106; closing.
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,121 | distributed.core | INFO] Received 'close-stream' from tcp://127.0.0.1:38224; closing.
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,122 | distributed.scheduler | INFO] Remove worker <WorkerState 'tcp://127.0.0.1:45611', name: 0, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1710182967.122518')
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,123 | distributed.scheduler | INFO] Remove worker <WorkerState 'tcp://127.0.0.1:44987', name: 2, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1710182967.1232057')
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,123 | distributed.scheduler | INFO] Remove worker <WorkerState 'tcp://127.0.0.1:40333', name: 4, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1710182967.1237094')
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,124 | distributed.scheduler | INFO] Remove worker <WorkerState 'tcp://127.0.0.1:34629', name: 3, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1710182967.1241848')
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,124 | distributed.scheduler | INFO] Remove worker <WorkerState 'tcp://127.0.0.1:37287', name: 6, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1710182967.1246436')
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,125 | distributed.scheduler | INFO] Remove worker <WorkerState 'tcp://127.0.0.1:45889', name: 7, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1710182967.1250832')
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,125 | distributed.scheduler | INFO] Remove worker <WorkerState 'tcp://127.0.0.1:44693', name: 10, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1710182967.1255293')
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,126 | distributed.scheduler | INFO] Remove worker <WorkerState 'tcp://127.0.0.1:37181', name: 9, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1710182967.1259623')
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,126 | distributed.scheduler | INFO] Remove worker <WorkerState 'tcp://127.0.0.1:40641', name: 11, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1710182967.1264026')
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,126 | distributed.scheduler | INFO] Remove worker <WorkerState 'tcp://127.0.0.1:34415', name: 12, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1710182967.1268284')
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,127 | distributed.scheduler | INFO] Remove worker <WorkerState 'tcp://127.0.0.1:45743', name: 13, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1710182967.1271574')
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,127 | distributed.core | INFO] Received 'close-stream' from tcp://127.0.0.1:38142; closing.
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,128 | distributed.scheduler | INFO] Remove worker <WorkerState 'tcp://127.0.0.1:44837', name: 15, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1710182967.1286283')
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,128 | distributed.scheduler | INFO] Lost all workers
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:27,131 | distributed.batched | INFO] Batched Comm Closed <TCP (closed) Scheduler connection to worker local=tcp://127.0.0.1:37215 remote=tcp://127.0.0.1:38142>
retrieve_from_faiss_by_prompt-1 | Traceback (most recent call last):
retrieve_from_faiss_by_prompt-1 | File "/opt/conda/lib/python3.10/site-packages/distributed/batched.py", line 115, in _background_send
retrieve_from_faiss_by_prompt-1 | nbytes = yield coro
retrieve_from_faiss_by_prompt-1 | File "/opt/conda/lib/python3.10/site-packages/tornado/gen.py", line 767, in run
retrieve_from_faiss_by_prompt-1 | value = future.result()
retrieve_from_faiss_by_prompt-1 | File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 262, in write
retrieve_from_faiss_by_prompt-1 | raise CommClosedError()
retrieve_from_faiss_by_prompt-1 | distributed.comm.core.CommClosedError
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:28,387 | distributed.worker | INFO] Start worker at: tcp://127.0.0.1:43757
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:28,387 | distributed.worker | INFO] Listening to: tcp://127.0.0.1:43757
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:28,387 | distributed.worker | INFO] Worker name: 14
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:28,387 | distributed.worker | INFO] dashboard at: 127.0.0.1:41379
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:28,387 | distributed.worker | INFO] Waiting to connect to: tcp://127.0.0.1:37215
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:28,387 | distributed.worker | INFO] -------------------------------------------------
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:28,387 | distributed.worker | INFO] Threads: 1
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:28,387 | distributed.worker | INFO] Memory: 3.93 GiB
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:28,387 | distributed.worker | INFO] Local Directory: /tmp/dask-scratch-space/worker-m3dlcs05
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:28,387 | distributed.worker | INFO] -------------------------------------------------
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:28,392 | distributed.scheduler | INFO] Register worker <WorkerState 'tcp://127.0.0.1:43757', name: 14, status: init, memory: 0, processing: 0>
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:28,392 | distributed.scheduler | INFO] Starting worker compute stream, tcp://127.0.0.1:43757
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:28,392 | distributed.core | INFO] Starting established connection to tcp://127.0.0.1:54684
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:28,393 | distributed.worker | INFO] Starting Worker plugin shuffle
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:28,393 | distributed.worker | INFO] Registered to: tcp://127.0.0.1:37215
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:28,393 | distributed.worker | INFO] -------------------------------------------------
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:28,394 | distributed.core | INFO] Starting established connection to tcp://127.0.0.1:37215
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:28,418 | distributed.nanny | INFO] Nanny asking worker to close. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:28,419 | distributed.worker | INFO] Stopping worker at tcp://127.0.0.1:43757. Reason: nanny-close
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:28,420 | distributed.core | INFO] Connection to tcp://127.0.0.1:37215 has been closed.
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:28,420 | distributed.core | INFO] Received 'close-stream' from tcp://127.0.0.1:54684; closing.
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:28,420 | distributed.scheduler | INFO] Remove worker <WorkerState 'tcp://127.0.0.1:43757', name: 14, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1710182968.4206657')
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:28,420 | distributed.scheduler | INFO] Lost all workers
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:28,557 | distributed.scheduler | INFO] Scheduler closing due to unknown reason...
retrieve_from_faiss_by_prompt-1 | [2024-03-11 18:49:28,557 | distributed.scheduler | INFO] Scheduler closing all comms
retrieve_from_faiss_by_prompt-1 exited with code 1
Gracefully stopping... (press Ctrl+C again to force)
service "retrieve_from_faiss_by_prompt" didn't complete successfully: exit 1
Finished pipeline run.
I am using Vertex AI Workbench.
It looks like your machine might be too small. @mrchtr can you provide the minimum spec needed to run the component?
That would be great, so that I can create a larger machine.
Hi @vpvsankar, I revisited the components and the pipeline.
Firstly, there was a bug in the pipeline: we didn't set Resources properly, which means the pipeline wasn't leveraging the GPU even if one is available. I've opened a PR (#15) for this. Please take a look at it and try again; this should solve your issue if you are using a GPU.
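For reference, requesting the GPU per component would look roughly like this. This is a minimal sketch assuming the fondant Pipeline/Resources API from around that release; the pipeline name, component names, and arguments are placeholders, not the actual pipeline definition:

from fondant.pipeline import Pipeline, Resources

# Placeholder pipeline wiring for illustration only.
pipeline = Pipeline(name="retrieval_pipeline", base_path="./data")

dataset = pipeline.read(
    "load_from_parquet",
    arguments={"dataset_uri": "path/to/input.parquet"},
)

# Without `resources`, the component container falls back to CPU even
# when a GPU is present. "GPU" is the accelerator name the local runner
# expects; Vertex AI uses names like "NVIDIA_TESLA_T4".
dataset = dataset.apply(
    "caption_images",
    resources=Resources(accelerator_number=1, accelerator_name="GPU"),
)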
I've tested the pipeline in a Vertex AI workbench using an n1-standard-8 machine with 1x NVIDIA T4.
Due to this bug, your instance used the CPU instead. Depending on the size of your machine, too little memory was then assigned to a single Dask worker: the current faiss index and CLIP model need approximately 7 GB of RAM. I've updated the component code as well to avoid running into this when executing the component on a CPU.
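To illustrate the memory math, here is a generic dask.distributed sketch (plain Dask, not fondant's internal executor, which may not expose these knobs directly):

from dask.distributed import Client, LocalCluster

# By default Dask starts one worker per core and splits host memory
# evenly between them, which is where a small per-worker limit comes
# from on a many-core machine. Fewer workers means a larger share each,
# enough to hold the ~7 GB faiss index + CLIP model.
cluster = LocalCluster(
    n_workers=4,            # fewer workers -> more memory per worker
    threads_per_worker=1,
    memory_limit="16GiB",   # explicit per-worker budget
)
client = Client(cluster)
print(client.scheduler_info()["workers"])  # inspect the resulting workers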
I used an n1-highmem-16 machine with a T4 GPU. Currently I have 6.39 GiB per worker; is there a way to restrict the number of workers? Do I need to update the fondant package?
…see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 4.47 GiB -- Worker memory limit: 6.39 GiB
caption_images-1 | [2024-03-12 12:36:03,771 | distributed.worker.memory | WARNING] Worker is at 80% memory usage. Pausing worker. Process memory: 5.14 GiB -- Worker memory limit: 6.39 GiB
caption_images-1 | [2024-03-12 12:36:03,841 | distributed.worker.memory | WARNING] Worker is at 80% memory usage. Pausing worker. Process memory: 5.14 GiB -- Worker memory limit: 6.39 GiB
caption_images-1 | [2024-03-12 12:36:03,869 | distributed.worker.memory | WARNING] Worker is at 80% memory usage. Pausing worker. Process memory: 5.15 GiB -- Worker memory limit: 6.39 GiB
caption_images-1 | [2024-03-12 12:36:03,871 | distributed.worker.memory | WARNING] Worker is at 80% memory usage. Pausing worker. Process memory: 5.12 GiB -- Worker memory limit: 6.39 GiB
caption_images-1 | [2024-03-12 12:36:03,959 | distributed.worker.memory | WARNING] Worker is at 80% memory usage. Pausing worker. Process memory: 5.11 GiB -- Worker memory limit: 6.39 GiB
caption_images-1 | [2024-03-12 12:36:04,064 | distributed.worker.memory | WARNING] Worker is at 80% memory usage. Pausing worker. Process memory: 5.13 GiB -- Worker memory limit: 6.39 GiB
caption_images-1 | [2024-03-12 12:36:04,072 | distributed.worker.memory | WARNING] Worker is at 80% memory usage. Pausing worker. Process memory: 5.12 GiB -- Worker memory limit: 6.39 GiB
caption_images-1 | [2024-03-12 12:36:04,088 | distributed.worker.memory | WARNING] Worker is at 80% memory usage. Pausing worker. Process memory: 5.14 GiB -- Worker memory limit: 6.39 GiB
caption_images-1 | [2024-03-12 12:36:04,090 | distributed.worker.memory | WARNING] Worker is at 80% memory usage. Pausing worker. Process memory: 5.14 GiB -- Worker memory limit: 6.39 GiB
caption_images-1 | [2024-03-12 12:36:04,193 | distributed.worker.memory | WARNING] Worker is at 80% memory usage. Pausing worker. Process memory: 5.15 GiB -- Worker memory limit: 6.39 GiB
caption_images-1 | [2024-03-12 12:36:04,227 | distributed.worker.memory | WARNING] Worker is at 80% memory usage. Pausing worker. Process memory: 5.14 GiB -- Worker memory limit: 6.39 GiB
caption_images-1 | [2024-03-12 12:36:04,561 | distributed.worker.memory | WARNING] Worker is at 81% memory usage. Pausing worker. Process memory: 5.17 GiB -- Worker memory limit: 6.39 GiB
caption_images-1 | [2024-03-12 12:36:04,651 | distributed.worker.memory | WARNING] Worker is at 80% memory usage. Pausing worker. Process memory: 5.17 GiB -- Worker memory limit: 6.39 GiB
caption_images-1 | [2024-03-12 12:36:04,772 | distributed.worker.memory | WARNING] Worker is at 80% memory usage. Pausing worker. Process memory: 5.12 GiB -- Worker memory limit: 6.39 GiB
caption_images-1 | [2024-03-12 12:36:04,788 | distributed.worker.memory | WARNING] Worker is at 35% memory usage. Resuming worker. Process memory: 2.30 GiB -- Worker memory limit: 6.39 GiB
caption_images-1 | [2024-03-12 12:36:04,854 | distributed.worker.memory | WARNING] Worker is at 36% memory usage. Resuming worker. Process memory: 2.30 GiB -- Worker memory limit: 6.39 GiB
caption_images-1 | [2024-03-12 12:36:04,909 | distributed.worker.memory | WARNING] Worker is at 81% memory usage. Pausing worker. Process memory: 5.17 GiB -- Worker memory limit: 6.39 GiB
caption_images-1 | [2024-03-12 12:36:04,959 | distributed.worker.memory | WARNING] Worker is at 35% memory usage. Resuming worker. Process memory: 2.30 GiB -- Worker memory limit: 6.39 GiB
caption_images-1 | /opt/conda/lib/python3.10/site-packages/numpy/core/fromnumeric.py:57: FutureWarning: 'Series.swapaxes' is deprecated and will be removed in a future version. Please use 'Series.transpose' instead.
caption_images-1 | return bound(*args, **kwds)
caption_images-1 | /opt/conda/lib/python3.10/site-packages/numpy/core/fromnumeric.py:57: FutureWarning: 'Series.swapaxes' is deprecated and will be removed in a future version. Please use 'Series.transpose' instead.
caption_images-1 | return bound(*args, **kwds)
caption_images-1 | [2024-03-12 12:36:05,048 | distributed.worker.memory | WARNING] Worker is at 35% memory usage. Resuming worker. Process memory: 2.30 GiB -- Worker memory limit: 6.39 GiB
caption_images-1 | /opt/conda/lib/python3.10/site-packages/numpy/core/fromnumeric.py:57: FutureWarning: 'Series.swapaxes' is deprecated and will be removed in a future version. Please use 'Series.transpose' instead.
caption_images-1 | return bound(*args, **kwds)
caption_images-1 | [2024-03-12 12:36:05,128 | distributed.worker.memory | WARNING] Worker is at 35% memory usage. Resuming worker. Process memory: 2.30 GiB -- Worker memory limit: 6.39 GiB
caption_images-1 | /opt/conda/lib/python3.10/site-packages/numpy/core/fromnumeric.py:57: FutureWarning: 'Series.swapaxes' is deprecated and will be removed in a future version. Please use 'Series.transpose' instead.
caption_images-1 | return bound(*args, **kwds)
caption_images-1 | /opt/conda/lib/python3.10/site-packages/numpy/core/fromnumeric.py:57: FutureWarning: 'Series.swapaxes' is deprecated and will be removed in a future version. Please use 'Series.transpose' instead.
caption_images-1 | return bound(*args, **kwds)
caption_images-1 | [2024-03-12 12:36:05,178 | distributed.worker.memory | WARNING] Worker is at 35% memory usage. Resuming worker. Process memory: 2.30 GiB -- Worker memory limit: 6.39 GiB
caption_images-1 | [2024-03-12 12:36:05,181 | distributed.worker.memory | WARNING] Worker is at 35% memory usage. Resuming worker. Process memory: 2.30 GiB -- Worker memory limit: 6.39 GiB
caption_images-1 | [2024-03-12 12:36:05,184 | distributed.worker.memory | WARNING] Worker is at 80% memory usage. Pausing worker. Process memory: 5.15 GiB -- Worker memory limit: 6.39 GiB
caption_images-1 | /opt/conda/lib/python3.10/site-packages/numpy/core/fromnumeric.py:57: FutureWarning: 'Series.swapaxes' is deprecated and will be removed in a future version. Please use 'Series.transpose' instead.
caption_images-1 | return bound(*args, **kwds)
caption_images-1 | /opt/conda/lib/python3.10/site-packages/numpy/core/fromnumeric.py:57: FutureWarning: 'Series.swapaxes' is deprecated and will be removed in a future version. Please use 'Series.transpose' instead.
caption_images-1 | return bound(*args, **kwds)
caption_images-1 | [2024-03-12 12:36:05,242 | distributed.worker.memory | WARNING] Worker is at 36% memory usage. Resuming worker. Process memory: 2.30 GiB -- Worker memory limit: 6.39 GiB
caption_images-1 | [2024-03-12 12:36:05,245 | distributed.worker.memory | WARNING] Worker is at 36% memory usage. Resuming worker. Process memory: 2.30 GiB -- Worker memory limit: 6.39 GiB
caption_images-1 | [2024-03-12 12:36:05,248 | distributed.worker.memory | WARNING] Worker is at 35% memory usage. Resuming worker. Process memory: 2.30 GiB -- Worker memory limit: 6.39 GiB
caption_images-1 | [2024-03-12 12:36:05,255 | distributed.worker.memory | WARNING] Worker is at 35% memory usage. Resuming worker. Process memory: 2.30 GiB -- Worker memory limit: 6.39 GiB
caption_images-1 | /opt/conda/lib/python3.10/site-packages/numpy/core/fromnumeric.py:57: FutureWarning: 'Series.swapaxes' is deprecated and will be removed in a future version. Please use 'Series.transpose' instead.
caption_images-1 | return bound(*args, **kwds)
caption_images-1 | /opt/conda/lib/python3.10/site-packages/numpy/core/fromnumeric.py:57: FutureWarning: 'Series.swapaxes' is deprecated and will be removed in a future version. Please use 'Series.transpose' instead.
caption_images-1 | return bound(*args, **kwds)
caption_images-1 | /opt/conda/lib/python3.10/site-packages/numpy/core/fromnumeric.py:57: FutureWarning: 'Series.swapaxes' is deprecated and will be removed in a future version. Please use 'Series.transpose' instead.
caption_images-1 | return bound(*args, **kwds)
caption_images-1 | /opt/conda/lib/python3.10/site-packages/numpy/core/fromnumeric.py:57: FutureWarning: 'Series.swapaxes' is deprecated and will be removed in a future version. Please use 'Series.transpose' instead.
caption_images-1 | return bound(*args, **kwds)
caption_images-1 | [2024-03-12 12:36:05,527 | distributed.worker | WARNING] Compute Failed
caption_images-1 | Key: ('readparquetfsspec-fused-assign_index-20df642511c6103526f6ab5b6d061f21', 17)
caption_images-1 | Function: execute_task
caption_images-1 | args: ((<function Fused._execute_task at 0x7f4a9ff4e8c0>, {'readparquetfsspec-fused-assign_index-20df642511c6103526f6ab5b6d061f21': ('assign_index-4ed796d4a93d97309656b8eba966256e', 17), ('assign_index-4ed796d4a93d97309656b8eba966256e', 17): (<function assign_index at 0x7f4aa020bd90>, ('wrapped_transform-5ce0a07e59ff550775a00255566a16e3', 17), ('operation-a20a9f3b215adf1885beefc41c69cffe', 17)), ('operation-a20a9f3b215adf1885beefc41c69cffe', 17): (<function RenameSeries.operation at 0x7f4aa01d44c0>, ('getattr-60ed039df315a704545f01430f8fefd9', 17), 'id', False), ('getattr-60ed039df315a704545f01430f8fefd9', 17): (<built-in function getattr>, ('wrapped_transform-5ce0a07e59ff550775a00255566a16e3', 17), 'index'), ('wrapped_transform-5ce0a07e59ff550775a00255566a16e3', 17): (<function apply at 0x7f4ab41cb250>, <function apply_and_enforce at 0x7f4aa02411b0>, [('operation-e9daee3a6e53e1ba6cc58af15f8c2881', 17)], {'_func': <function PandasTransformExecutor.wrap_transform.<locals>.wrapped_transform at
caption_images-1 | kwargs: {}
caption_images-1 | Exception: "ValueError('No objects to concatenate')"
caption_images-1 |
caption_images-1 | [2024-03-12 12:36:05,527 | distributed.worker | WARNING] Compute Failed
caption_images-1 | Key: ('readparquetfsspec-fused-assign_index-20df642511c6103526f6ab5b6d061f21', 27)
caption_images-1 | Function: execute_task
caption_images-1 | args: ((<function Fused._execute_task at 0x7f3303ae28c0>, {'readparquetfsspec-fused-assign_index-20df642511c6103526f6ab5b6d061f21': ('assign_index-4ed796d4a93d97309656b8eba966256e', 27), ('assign_index-4ed796d4a93d97309656b8eba966256e', 27): (<function assign_index at 0x7f3303dbbd90>, ('wrapped_transform-5ce0a07e59ff550775a00255566a16e3', 27), ('operation-a20a9f3b215adf1885beefc41c69cffe', 27)), ('operation-a20a9f3b215adf1885beefc41c69cffe', 27): (<function RenameSeries.operation at 0x7f3303abc4c0>, ('getattr-60ed039df315a704545f01430f8fefd9', 27), 'id', False), ('getattr-60ed039df315a704545f01430f8fefd9', 27): (<built-in function getattr>, ('wrapped_transform-5ce0a07e59ff550775a00255566a16e3', 27), 'index'), ('wrapped_transform-5ce0a07e59ff550775a00255566a16e3', 27): (<function apply at 0x7f3306f87250>, <function apply_and_enforce at 0x7f3303df11b0>, [('operation-e9daee3a6e53e1ba6cc58af15f8c2881', 27)], {'_func': <function PandasTransformExecutor.wrap_transform.<locals>.wrapped_transform at
caption_images-1 | kwargs: {}
caption_images-1 | Exception: "ValueError('No objects to concatenate')"
caption_images-1 |
caption_images-1 | [2024-03-12 12:36:05,533 | distributed.worker | WARNING] Compute Failed
caption_images-1 | Key: ('readparquetfsspec-fused-assign_index-20df642511c6103526f6ab5b6d061f21', 9)
caption_images-1 | Function: execute_task
caption_images-1 | args: ((<function Fused._execute_task at 0x7f0e3277a8c0>, {'readparquetfsspec-fused-assign_index-20df642511c6103526f6ab5b6d061f21': ('assign_index-4ed796d4a93d97309656b8eba966256e', 9), ('assign_index-4ed796d4a93d97309656b8eba966256e', 9): (<function assign_index at 0x7f0e32a53d90>, ('wrapped_transform-5ce0a07e59ff550775a00255566a16e3', 9), ('operation-a20a9f3b215adf1885beefc41c69cffe', 9)), ('operation-a20a9f3b215adf1885beefc41c69cffe', 9): (<function RenameSeries.operation at 0x7f0e327544c0>, ('getattr-60ed039df315a704545f01430f8fefd9', 9), 'id', False), ('getattr-60ed039df315a704545f01430f8fefd9', 9): (<built-in function getattr>, ('wrapped_transform-5ce0a07e59ff550775a00255566a16e3', 9), 'index'), ('wrapped_transform-5ce0a07e59ff550775a00255566a16e3', 9): (<function apply at 0x7f0e448d3250>, <function apply_and_enforce at 0x7f0e32a891b0>, [('operation-e9daee3a6e53e1ba6cc58af15f8c2881', 9)], {'_func': <function PandasTransformExecutor.wrap_transform.<locals>.wrapped_transform at 0x7f0e237
caption_images-1 | kwargs: {}
caption_images-1 | Exception: "ValueError('No objects to concatenate')"
caption_images-1 |
caption_images-1 | Traceback (most recent call last):
caption_images-1 | File "/opt/conda/bin/fondant", line 8, in <module>
caption_images-1 | /opt/conda/lib/python3.10/site-packages/numpy/core/fromnumeric.py:57: FutureWarning: 'Series.swapaxes' is deprecated and will be removed in a future version. Please use 'Series.transpose' instead.
caption_images-1 | return bound(*args, **kwds)
caption_images-1 | [2024-03-12 12:36:05,545 | distributed.worker.memory | WARNING] Worker is at 35% memory usage. Resuming worker. Process memory: 2.30 GiB -- Worker memory limit: 6.39 GiB
caption_images-1 | sys.exit(entrypoint())
caption_images-1 | File "/opt/conda/lib/python3.10/site-packages/fondant/cli.py", line 89, in entrypoint
caption_images-1 | [2024-03-12 12:36:05,548 | distributed.worker.memory | WARNING] Worker is at 36% memory usage. Resuming worker. Process memory: 2.30 GiB -- Worker memory limit: 6.39 GiB
caption_images-1 | args.func(args)
caption_images-1 | File "/opt/conda/lib/python3.10/site-packages/fondant/cli.py", line 711, in execute
caption_images-1 | executor.execute(component)
caption_images-1 | File "/opt/conda/lib/python3.10/site-packages/fondant/component/executor.py", line 335, in execute
caption_images-1 | output_manifest = self._run_execution(component_cls, input_manifest)
caption_images-1 | File "/opt/conda/lib/python3.10/site-packages/fondant/component/executor.py", line 301, in _run_execution
caption_images-1 | [2024-03-12 12:36:05,551 | distributed.worker | WARNING] Compute Failed
caption_images-1 | Key: ('readparquetfsspec-fused-assign_index-20df642511c6103526f6ab5b6d061f21', 29)
caption_images-1 | Function: execute_task
caption_images-1 | args: ((<function Fused._execute_task at 0x7f3303ae28c0>, {'readparquetfsspec-fused-assign_index-20df642511c6103526f6ab5b6d061f21': ('assign_index-4ed796d4a93d97309656b8eba966256e', 29), ('assign_index-4ed796d4a93d97309656b8eba966256e', 29): (<function assign_index at 0x7f3303dbbd90>, ('wrapped_transform-5ce0a07e59ff550775a00255566a16e3', 29), ('operation-a20a9f3b215adf1885beefc41c69cffe', 29)), ('operation-a20a9f3b215adf1885beefc41c69cffe', 29): (<function RenameSeries.operation at 0x7f3303abc4c0>, ('getattr-60ed039df315a704545f01430f8fefd9', 29), 'id', False), ('getattr-60ed039df315a704545f01430f8fefd9', 29): (<built-in function getattr>, ('wrapped_transform-5ce0a07e59ff550775a00255566a16e3', 29), 'index'), ('wrapped_transform-5ce0a07e59ff550775a00255566a16e3', 29): (<function apply at 0x7f3306f87250>, <function apply_and_enforce at 0x7f3303df11b0>, [('operation-e9daee3a6e53e1ba6cc58af15f8c2881', 29)], {'_func': <function PandasTransformExecutor.wrap_transform.<locals>.wrapped_transform at
caption_images-1 | kwargs: {}
caption_images-1 | Exception: "ValueError('No objects to concatenate')"
caption_images-1 |
caption_images-1 | self._write_data(dataframe=output_df, manifest=output_manifest)
caption_images-1 | File "/opt/conda/lib/python3.10/site-packages/fondant/component/executor.py", line 219, in _write_data
caption_images-1 | /opt/conda/lib/python3.10/site-packages/numpy/core/fromnumeric.py:57: FutureWarning: 'Series.swapaxes' is deprecated and will be removed in a future version. Please use 'Series.transpose' instead.
caption_images-1 | return bound(*args, **kwds)
caption_images-1 | data_writer.write_dataframe(dataframe)
caption_images-1 | File "/opt/conda/lib/python3.10/site-packages/fondant/component/data_io.py", line 178, in write_dataframe
caption_images-1 | /opt/conda/lib/python3.10/site-packages/numpy/core/fromnumeric.py:57: FutureWarning: 'Series.swapaxes' is deprecated and will be removed in a future version. Please use 'Series.transpose' instead.
caption_images-1 | return bound(*args, **kwds)
caption_images-1 | [2024-03-12 12:36:05,561 | distributed.worker | WARNING] Compute Failed
caption_images-1 | Key: ('readparquetfsspec-fused-assign_index-20df642511c6103526f6ab5b6d061f21', 16)
caption_images-1 | Function: execute_task
caption_images-1 | args: ((<function Fused._execute_task at 0x7f0e3277a8c0>, {'readparquetfsspec-fused-assign_index-20df642511c6103526f6ab5b6d061f21': ('assign_index-4ed796d4a93d97309656b8eba966256e', 16), ('assign_index-4ed796d4a93d97309656b8eba966256e', 16): (<function assign_index at 0x7f0e32a53d90>, ('wrapped_transform-5ce0a07e59ff550775a00255566a16e3', 16), ('operation-a20a9f3b215adf1885beefc41c69cffe', 16)), ('operation-a20a9f3b215adf1885beefc41c69cffe', 16): (<function RenameSeries.operation at 0x7f0e327544c0>, ('getattr-60ed039df315a704545f01430f8fefd9', 16), 'id', False), ('getattr-60ed039df315a704545f01430f8fefd9', 16): (<built-in function getattr>, ('wrapped_transform-5ce0a07e59ff550775a00255566a16e3', 16), 'index'), ('wrapped_transform-5ce0a07e59ff550775a00255566a16e3', 16): (<function apply at 0x7f0e448d3250>, <function apply_and_enforce at 0x7f0e32a891b0>, [('operation-e9daee3a6e53e1ba6cc58af15f8c2881', 16)], {'_func': <function PandasTransformExecutor.wrap_transform.<locals>.wrapped_transform at
caption_images-1 | kwargs: {}
caption_images-1 | Exception: "ValueError('No objects to concatenate')"
caption_images-1 |
caption_images-1 | /opt/conda/lib/python3.10/site-packages/numpy/core/fromnumeric.py:57: FutureWarning: 'Series.swapaxes' is deprecated and will be removed in a future version. Please use 'Series.transpose' instead.
caption_images-1 | return bound(*args, **kwds)
caption_images-1 | /opt/conda/lib/python3.10/site-packages/numpy/core/fromnumeric.py:57: FutureWarning: 'Series.swapaxes' is deprecated and will be removed in a future version. Please use 'Series.transpose' instead.
caption_images-1 | return bound(*args, **kwds)
caption_images-1 | [2024-03-12 12:36:05,585 | distributed.worker | WARNING] Compute Failed
caption_images-1 | Key: ('readparquetfsspec-fused-assign_index-20df642511c6103526f6ab5b6d061f21', 19)
caption_images-1 | Function: execute_task
caption_images-1 | args: ((<function Fused._execute_task at 0x7f52788d68c0>, {'readparquetfsspec-fused-assign_index-20df642511c6103526f6ab5b6d061f21': ('assign_index-4ed796d4a93d97309656b8eba966256e', 19), ('assign_index-4ed796d4a93d97309656b8eba966256e', 19): (<function assign_index at 0x7f5278babd90>, ('wrapped_transform-5ce0a07e59ff550775a00255566a16e3', 19), ('operation-a20a9f3b215adf1885beefc41c69cffe', 19)), ('operation-a20a9f3b215adf1885beefc41c69cffe', 19): (<function RenameSeries.operation at 0x7f52788b04c0>, ('getattr-60ed039df315a704545f01430f8fefd9', 19), 'id', False), ('getattr-60ed039df315a704545f01430f8fefd9', 19): (<built-in function getattr>, ('wrapped_transform-5ce0a07e59ff550775a00255566a16e3', 19), 'index'), ('wrapped_transform-5ce0a07e59ff550775a00255566a16e3', 19): (<function apply at 0x7f527bd6b250>, <function apply_and_enforce at 0x7f5278bdd1b0>, [('operation-e9daee3a6e53e1ba6cc58af15f8c2881', 19)], {'_func': <function PandasTransformExecutor.wrap_transform.<locals>.wrapped_transform at
caption_images-1 | kwargs: {}
caption_images-1 | Exception: "ValueError('No objects to concatenate')"
caption_images-1 |
caption_images-1 | self._write_dataframe(dataframe)
caption_images-1 | File "/opt/conda/lib/python3.10/site-packages/fondant/component/data_io.py", line 251, in _write_dataframe
caption_images-1 | future.result()
caption_images-1 | File "/opt/conda/lib/python3.10/site-packages/distributed/client.py", line 328, in result
caption_images-1 | return self.client.sync(self._result, callback_timeout=timeout)
caption_images-1 | /opt/conda/lib/python3.10/site-packages/numpy/core/fromnumeric.py:57: FutureWarning: 'Series.swapaxes' is deprecated and will be removed in a future version. Please use 'Series.transpose' instead.
caption_images-1 | return bound(*args, **kwds)
caption_images-1 | [2024-03-12 12:36:05,608 | distributed.worker | WARNING] Compute Failed
caption_images-1 | Key: ('readparquetfsspec-fused-assign_index-20df642511c6103526f6ab5b6d061f21', 7)
caption_images-1 | Function: execute_task
caption_images-1 | args: ((<function Fused._execute_task at 0x7f3c63f8a8c0>, {'readparquetfsspec-fused-assign_index-20df642511c6103526f6ab5b6d061f21': ('assign_index-4ed796d4a93d97309656b8eba966256e', 7), ('assign_index-4ed796d4a93d97309656b8eba966256e', 7): (<function assign_index at 0x7f3c64263d90>, ('wrapped_transform-5ce0a07e59ff550775a00255566a16e3', 7), ('operation-a20a9f3b215adf1885beefc41c69cffe', 7)), ('operation-a20a9f3b215adf1885beefc41c69cffe', 7): (<function RenameSeries.operation at 0x7f3c63f644c0>, ('getattr-60ed039df315a704545f01430f8fefd9', 7), 'id', False), ('getattr-60ed039df315a704545f01430f8fefd9', 7): (<built-in function getattr>, ('wrapped_transform-5ce0a07e59ff550775a00255566a16e3', 7), 'index'), ('wrapped_transform-5ce0a07e59ff550775a00255566a16e3', 7): (<function apply at 0x7f3c781c7250>, <function apply_and_enforce at 0x7f3c642951b0>, [('operation-e9daee3a6e53e1ba6cc58af15f8c2881', 7)], {'_func': <function PandasTransformExecutor.wrap_transform.<locals>.wrapped_transform at 0x7f3c60f
caption_images-1 | kwargs: {}
caption_images-1 | Exception: "ValueError('No objects to concatenate')"
caption_images-1 |
caption_images-1 | File "/opt/conda/lib/python3.10/site-packages/dask_expr/_expr.py", line 3563, in _execute_task
caption_images-1 | [2024-03-12 12:36:05,610 | distributed.worker | WARNING] Compute Failed
caption_images-1 | Key: ('readparquetfsspec-fused-assign_index-20df642511c6103526f6ab5b6d061f21', 24)
caption_images-1 | Function: execute_task
caption_images-1 | args: ((<function Fused._execute_task at 0x7f52788d68c0>, {'readparquetfsspec-fused-assign_index-20df642511c6103526f6ab5b6d061f21': ('assign_index-4ed796d4a93d97309656b8eba966256e', 24), ('assign_index-4ed796d4a93d97309656b8eba966256e', 24): (<function assign_index at 0x7f5278babd90>, ('wrapped_transform-5ce0a07e59ff550775a00255566a16e3', 24), ('operation-a20a9f3b215adf1885beefc41c69cffe', 24)), ('operation-a20a9f3b215adf1885beefc41c69cffe', 24): (<function RenameSeries.operation at 0x7f52788b04c0>, ('getattr-60ed039df315a704545f01430f8fefd9', 24), 'id', False), ('getattr-60ed039df315a704545f01430f8fefd9', 24): (<built-in function getattr>, ('wrapped_transform-5ce0a07e59ff550775a00255566a16e3', 24), 'index'), ('wrapped_transform-5ce0a07e59ff550775a00255566a16e3', 24): (<function apply at 0x7f527bd6b250>, <function apply_and_enforce at 0x7f5278bdd1b0>, [('operation-e9daee3a6e53e1ba6cc58af15f8c2881', 24)], {'_func': <function PandasTransformExecutor.wrap_transform.<locals>.wrapped_transform at
caption_images-1 | kwargs: {}
caption_images-1 | Exception: "ValueError('No objects to concatenate')"
caption_images-1 |
caption_images-1 | return dask.core.get(graph, name)
caption_images-1 | File "/opt/conda/lib/python3.10/site-packages/fondant/component/executor.py", line 495, in wrapped_transform
caption_images-1 | dataframe = transform(dataframe)
caption_images-1 | File "/component/src/./main.py", line 116, in transform
caption_images-1 | return pd.concat(results).to_frame(name="caption")
caption_images-1 | File "/opt/conda/lib/python3.10/site-packages/pandas/core/reshape/concat.py", line 507, in _clean_keys_and_objs
caption_images-1 | raise ValueError("No objects to concatenate")
caption_images-1 | ValueError: No objects to concatenate
caption_images-1 | /opt/conda/lib/python3.10/site-packages/numpy/core/fromnumeric.py:57: FutureWarning: 'Series.swapaxes' is deprecated and will be removed in a future version. Please use 'Series.transpose' instead.
caption_images-1 | return bound(*args, **kwds)
caption_images-1 | [2024-03-12 12:36:05,629 | distributed.scheduler | INFO] Retire worker addresses (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
caption_images-1 | [2024-03-12 12:36:05,631 | distributed.nanny | INFO] Closing Nanny at 'tcp://127.0.0.1:33031'. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,631 | distributed.nanny | INFO] Nanny asking worker to close. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,631 | distributed.nanny | INFO] Closing Nanny at 'tcp://127.0.0.1:38841'. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,632 | distributed.nanny | INFO] Nanny asking worker to close. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,632 | distributed.nanny | INFO] Closing Nanny at 'tcp://127.0.0.1:46379'. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,632 | distributed.nanny | INFO] Nanny asking worker to close. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,633 | distributed.nanny | INFO] Closing Nanny at 'tcp://127.0.0.1:39491'. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,633 | distributed.nanny | INFO] Nanny asking worker to close. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,634 | distributed.worker | INFO] Stopping worker at tcp://127.0.0.1:43751. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,634 | distributed.nanny | INFO] Closing Nanny at 'tcp://127.0.0.1:42667'. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,635 | distributed.worker | INFO] Stopping worker at tcp://127.0.0.1:36089. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,635 | distributed.nanny | INFO] Nanny asking worker to close. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,635 | distributed.nanny | INFO] Closing Nanny at 'tcp://127.0.0.1:38653'. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,636 | distributed.worker.state_machine | WARNING] Async instruction for <Task cancelled name="execute(('readparquetfsspec-fused-assign_index-20df642511c6103526f6ab5b6d061f21', 23))" coro=<Worker.execute() done, defined at /opt/conda/lib/python3.10/site-packages/distributed/worker_state_machine.py:3615>> ended with CancelledError
caption_images-1 | [2024-03-12 12:36:05,636 | distributed.worker.state_machine | WARNING] Async instruction for <Task cancelled name="execute(('readparquetfsspec-fused-assign_index-20df642511c6103526f6ab5b6d061f21', 4))" coro=<Worker.execute() done, defined at /opt/conda/lib/python3.10/site-packages/distributed/worker_state_machine.py:3615>> ended with CancelledError
caption_images-1 | [2024-03-12 12:36:05,638 | distributed.nanny | INFO] Nanny asking worker to close. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,638 | distributed.nanny | INFO] Closing Nanny at 'tcp://127.0.0.1:40643'. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,638 | distributed.nanny | INFO] Nanny asking worker to close. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,638 | distributed.core | INFO] Connection to tcp://127.0.0.1:41119 has been closed.
caption_images-1 | [2024-03-12 12:36:05,639 | distributed.nanny | INFO] Closing Nanny at 'tcp://127.0.0.1:40353'. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,639 | distributed.core | INFO] Connection to tcp://127.0.0.1:41119 has been closed.
caption_images-1 | [2024-03-12 12:36:05,639 | distributed.nanny | INFO] Nanny asking worker to close. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,640 | distributed.nanny | INFO] Closing Nanny at 'tcp://127.0.0.1:36743'. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,640 | distributed.nanny | INFO] Nanny asking worker to close. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,640 | distributed.worker | INFO] Stopping worker at tcp://127.0.0.1:37565. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,640 | distributed.worker | INFO] Stopping worker at tcp://127.0.0.1:36013. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,640 | distributed.nanny | INFO] Closing Nanny at 'tcp://127.0.0.1:35193'. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,641 | distributed.nanny | INFO] Nanny asking worker to close. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,641 | distributed.nanny | INFO] Closing Nanny at 'tcp://127.0.0.1:42315'. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,641 | distributed.worker | INFO] Stopping worker at tcp://127.0.0.1:36305. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,641 | distributed.worker.state_machine | WARNING] Async instruction for <Task cancelled name="execute(('readparquetfsspec-fused-assign_index-20df642511c6103526f6ab5b6d061f21', 12))" coro=<Worker.execute() done, defined at /opt/conda/lib/python3.10/site-packages/distributed/worker_state_machine.py:3615>> ended with CancelledError
caption_images-1 | [2024-03-12 12:36:05,642 | distributed.nanny | INFO] Nanny asking worker to close. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,642 | distributed.nanny | INFO] Closing Nanny at 'tcp://127.0.0.1:34473'. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,642 | distributed.worker.state_machine | WARNING] Async instruction for <Task cancelled name="execute(('readparquetfsspec-fused-assign_index-20df642511c6103526f6ab5b6d061f21', 1))" coro=<Worker.execute() done, defined at /opt/conda/lib/python3.10/site-packages/distributed/worker_state_machine.py:3615>> ended with CancelledError
caption_images-1 | [2024-03-12 12:36:05,642 | distributed.worker | INFO] Stopping worker at tcp://127.0.0.1:39941. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,642 | distributed.nanny | INFO] Nanny asking worker to close. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,642 | distributed.nanny | INFO] Closing Nanny at 'tcp://127.0.0.1:33559'. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,643 | distributed.worker | INFO] Stopping worker at tcp://127.0.0.1:44079. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,643 | distributed.nanny | INFO] Nanny asking worker to close. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,643 | distributed.nanny | INFO] Closing Nanny at 'tcp://127.0.0.1:46615'. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,644 | distributed.worker.state_machine | WARNING] Async instruction for <Task cancelled name="execute(('readparquetfsspec-fused-assign_index-20df642511c6103526f6ab5b6d061f21', 30))" coro=<Worker.execute() done, defined at /opt/conda/lib/python3.10/site-packages/distributed/worker_state_machine.py:3615>> ended with CancelledError
caption_images-1 | [2024-03-12 12:36:05,644 | distributed.worker.state_machine | WARNING] Async instruction for <Task cancelled name="execute(('readparquetfsspec-fused-assign_index-20df642511c6103526f6ab5b6d061f21', 11))" coro=<Worker.execute() done, defined at /opt/conda/lib/python3.10/site-packages/distributed/worker_state_machine.py:3615>> ended with CancelledError
caption_images-1 | [2024-03-12 12:36:05,644 | distributed.nanny | INFO] Nanny asking worker to close. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,644 | distributed.nanny | INFO] Closing Nanny at 'tcp://127.0.0.1:40001'. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,644 | distributed.nanny | INFO] Nanny asking worker to close. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,645 | distributed.nanny | INFO] Closing Nanny at 'tcp://127.0.0.1:35487'. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,645 | distributed.worker | INFO] Stopping worker at tcp://127.0.0.1:33881. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,645 | distributed.nanny | INFO] Nanny asking worker to close. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,646 | distributed.worker | INFO] Stopping worker at tcp://127.0.0.1:46067. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,646 | distributed.worker | INFO] Stopping worker at tcp://127.0.0.1:34039. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,645 | distributed.core | INFO] Connection to tcp://127.0.0.1:41119 has been closed.
caption_images-1 | [2024-03-12 12:36:05,646 | distributed.worker | INFO] Stopping worker at tcp://127.0.0.1:34907. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,646 | distributed.core | INFO] Connection to tcp://127.0.0.1:41119 has been closed.
caption_images-1 | [2024-03-12 12:36:05,646 | distributed.worker | ERROR] Failed to communicate with scheduler during heartbeat.
caption_images-1 | Traceback (most recent call last):
caption_images-1 | File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 225, in read
caption_images-1 | frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
caption_images-1 | tornado.iostream.StreamClosedError: Stream is closed
caption_images-1 |
caption_images-1 | The above exception was the direct cause of the following exception:
caption_images-1 |
caption_images-1 | Traceback (most recent call last):
caption_images-1 | File "/opt/conda/lib/python3.10/site-packages/distributed/worker.py", line 1252, in heartbeat
caption_images-1 | response = await retry_operation(
caption_images-1 | File "/opt/conda/lib/python3.10/site-packages/distributed/utils_comm.py", line 455, in retry_operation
caption_images-1 | return await retry(
caption_images-1 | File "/opt/conda/lib/python3.10/site-packages/distributed/utils_comm.py", line 434, in retry
caption_images-1 | return await coro()
caption_images-1 | File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1395, in send_recv_from_rpc
caption_images-1 | return await send_recv(comm=comm, op=key, **kwargs)
caption_images-1 | File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1154, in send_recv
caption_images-1 | response = await comm.read(deserializers=deserializers)
caption_images-1 | File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 236, in read
caption_images-1 | convert_stream_closed_error(self, e)
caption_images-1 | File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error
caption_images-1 | raise CommClosedError(f"in {obj}: {exc}") from exc
caption_images-1 | distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.heartbeat_worker local=tcp://127.0.0.1:51398 remote=tcp://127.0.0.1:41119>: Stream is closed
caption_images-1 | [2024-03-12 12:36:05,647 | distributed.worker.state_machine | WARNING] Async instruction for <Task cancelled name="execute(('readparquetfsspec-fused-assign_index-20df642511c6103526f6ab5b6d061f21', 6))" coro=<Worker.execute() done, defined at /opt/conda/lib/python3.10/site-packages/distributed/worker_state_machine.py:3615>> ended with CancelledError
caption_images-1 | [2024-03-12 12:36:05,647 | distributed.worker.state_machine | WARNING] Async instruction for <Task cancelled name="execute(('readparquetfsspec-fused-assign_index-20df642511c6103526f6ab5b6d061f21', 21))" coro=<Worker.execute() done, defined at /opt/conda/lib/python3.10/site-packages/distributed/worker_state_machine.py:3615>> ended with CancelledError
caption_images-1 | [2024-03-12 12:36:05,648 | distributed.core | INFO] Connection to tcp://127.0.0.1:41119 has been closed.
caption_images-1 | [2024-03-12 12:36:05,648 | distributed.core | INFO] Connection to tcp://127.0.0.1:41119 has been closed.
caption_images-1 | [2024-03-12 12:36:05,649 | distributed.core | INFO] Connection to tcp://127.0.0.1:41119 has been closed.
caption_images-1 | [2024-03-12 12:36:05,650 | distributed.core | INFO] Connection to tcp://127.0.0.1:41119 has been closed.
caption_images-1 | [2024-03-12 12:36:05,650 | distributed.worker | INFO] Stopping worker at tcp://127.0.0.1:32823. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,650 | distributed.core | INFO] Connection to tcp://127.0.0.1:41119 has been closed.
caption_images-1 | [2024-03-12 12:36:05,652 | distributed.core | INFO] Received 'close-stream' from tcp://127.0.0.1:51084; closing.
caption_images-1 | [2024-03-12 12:36:05,652 | distributed.core | INFO] Received 'close-stream' from tcp://127.0.0.1:51092; closing.
caption_images-1 | [2024-03-12 12:36:05,651 | distributed.core | INFO] Connection to tcp://127.0.0.1:41119 has been closed.
caption_images-1 | [2024-03-12 12:36:05,652 | distributed.core | INFO] Received 'close-stream' from tcp://127.0.0.1:51188; closing.
caption_images-1 | [2024-03-12 12:36:05,652 | distributed.core | INFO] Received 'close-stream' from tcp://127.0.0.1:51112; closing.
caption_images-1 | [2024-03-12 12:36:05,652 | distributed.worker | INFO] Stopping worker at tcp://127.0.0.1:35499. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,653 | distributed.core | INFO] Received 'close-stream' from tcp://127.0.0.1:51224; closing.
caption_images-1 | [2024-03-12 12:36:05,652 | distributed.core | INFO] Connection to tcp://127.0.0.1:41119 has been closed.
caption_images-1 | [2024-03-12 12:36:05,653 | distributed.core | INFO] Received 'close-stream' from tcp://127.0.0.1:51120; closing.
caption_images-1 | [2024-03-12 12:36:05,653 | distributed.core | INFO] Received 'close-stream' from tcp://127.0.0.1:51162; closing.
caption_images-1 | [2024-03-12 12:36:05,653 | distributed.worker.state_machine | WARNING] Async instruction for <Task cancelled name="execute(('readparquetfsspec-fused-assign_index-20df642511c6103526f6ab5b6d061f21', 28))" coro=<Worker.execute() done, defined at /opt/conda/lib/python3.10/site-packages/distributed/worker_state_machine.py:3615>> ended with CancelledError
caption_images-1 | [2024-03-12 12:36:05,654 | distributed.worker | INFO] Stopping worker at tcp://127.0.0.1:40637. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,655 | distributed.scheduler | INFO] Remove worker <WorkerState 'tcp://127.0.0.1:43751', name: 1, status: closing, memory: 0, processing: 2> (stimulus_id='handle-worker-cleanup-1710246965.6555321')
caption_images-1 | [2024-03-12 12:36:05,656 | distributed.scheduler | INFO] Remove worker <WorkerState 'tcp://127.0.0.1:36089', name: 3, status: closing, memory: 0, processing: 2> (stimulus_id='handle-worker-cleanup-1710246965.6567404')
caption_images-1 | [2024-03-12 12:36:05,656 | distributed.core | INFO] Connection to tcp://127.0.0.1:41119 has been closed.
caption_images-1 | [2024-03-12 12:36:05,657 | distributed.scheduler | INFO] Remove worker <WorkerState 'tcp://127.0.0.1:37565', name: 2, status: closing, memory: 0, processing: 2> (stimulus_id='handle-worker-cleanup-1710246965.6575563')
caption_images-1 | [2024-03-12 12:36:05,658 | distributed.scheduler | INFO] Remove worker <WorkerState 'tcp://127.0.0.1:36013', name: 6, status: closing, memory: 0, processing: 2> (stimulus_id='handle-worker-cleanup-1710246965.6582437')
caption_images-1 | [2024-03-12 12:36:05,658 | distributed.scheduler | INFO] Remove worker <WorkerState 'tcp://127.0.0.1:36305', name: 8, status: closing, memory: 0, processing: 2> (stimulus_id='handle-worker-cleanup-1710246965.6589174')
caption_images-1 | [2024-03-12 12:36:05,659 | distributed.scheduler | INFO] Remove worker <WorkerState 'tcp://127.0.0.1:44079', name: 10, status: closing, memory: 0, processing: 2> (stimulus_id='handle-worker-cleanup-1710246965.659599')
caption_images-1 | [2024-03-12 12:36:05,660 | distributed.scheduler | INFO] Remove worker <WorkerState 'tcp://127.0.0.1:39941', name: 7, status: closing, memory: 0, processing: 1> (stimulus_id='handle-worker-cleanup-1710246965.660282')
caption_images-1 | [2024-03-12 12:36:05,660 | distributed.core | INFO] Received 'close-stream' from tcp://127.0.0.1:51166; closing.
caption_images-1 | [2024-03-12 12:36:05,661 | distributed.core | INFO] Received 'close-stream' from tcp://127.0.0.1:51198; closing.
caption_images-1 | [2024-03-12 12:36:05,661 | distributed.core | INFO] Received 'close-stream' from tcp://127.0.0.1:51140; closing.
caption_images-1 | [2024-03-12 12:36:05,661 | distributed.core | INFO] Received 'close-stream' from tcp://127.0.0.1:51126; closing.
caption_images-1 | [2024-03-12 12:36:05,663 | distributed.scheduler | INFO] Remove worker <WorkerState 'tcp://127.0.0.1:34907', name: 11, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1710246965.663433')
caption_images-1 | [2024-03-12 12:36:05,664 | distributed.scheduler | INFO] Remove worker <WorkerState 'tcp://127.0.0.1:33881', name: 13, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1710246965.6639524')
caption_images-1 | [2024-03-12 12:36:05,664 | distributed.scheduler | INFO] Remove worker <WorkerState 'tcp://127.0.0.1:34039', name: 12, status: closing, memory: 0, processing: 1> (stimulus_id='handle-worker-cleanup-1710246965.6644382')
caption_images-1 | [2024-03-12 12:36:05,665 | distributed.scheduler | INFO] Remove worker <WorkerState 'tcp://127.0.0.1:46067', name: 14, status: closing, memory: 0, processing: 2> (stimulus_id='handle-worker-cleanup-1710246965.6650739')
caption_images-1 | [2024-03-12 12:36:05,665 | distributed.core | INFO] Received 'close-stream' from tcp://127.0.0.1:51194; closing.
caption_images-1 | [2024-03-12 12:36:05,687 | distributed.worker | INFO] Stopping worker at tcp://127.0.0.1:33603. Reason: nanny-close
caption_images-1 | [2024-03-12 12:36:05,688 | distributed.worker.state_machine | WARNING] Async instruction for <Task cancelled name="execute(('readparquetfsspec-fused-assign_index-20df642511c6103526f6ab5b6d061f21', 5))" coro=<Worker.execute() done, defined at /opt/conda/lib/python3.10/site-packages/distributed/worker_state_machine.py:3615>> ended with CancelledError
caption_images-1 | [2024-03-12 12:36:05,690 | distributed.core | INFO] Connection to tcp://127.0.0.1:41119 has been closed.
caption_images-1 | [2024-03-12 12:36:05,707 | distributed.worker | INFO] Stopping worker at tcp://127.0.0.1:46513. Reason: nanny-close
caption_images-1 | terminate called without an active exception
caption_images-1 | terminate called without an active exception
caption_images-1 | terminate called without an active exception
caption_images-1 | terminate called without an active exception
caption_images-1 | terminate called without an active exception
caption_images-1 | terminate called without an active exception
caption_images-1 | [2024-03-12 12:36:09,286 | distributed.scheduler | INFO] Remove worker <WorkerState 'tcp://127.0.0.1:32823', name: 5, status: closing, memory: 0, processing: 2> (stimulus_id='handle-worker-cleanup-1710246969.2867634')
caption_images-1 | [2024-03-12 12:36:09,287 | distributed.core | INFO] Event loop was unresponsive in Nanny for 3.64s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
caption_images-1 | [2024-03-12 12:36:09,288 | distributed.core | INFO] Event loop was unresponsive in Nanny for 3.64s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
caption_images-1 | [2024-03-12 12:36:09,288 | distributed.core | INFO] Event loop was unresponsive in Nanny for 3.64s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
caption_images-1 | [2024-03-12 12:36:09,288 | distributed.core | INFO] Event loop was unresponsive in Nanny for 3.64s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
caption_images-1 | [2024-03-12 12:36:09,288 | distributed.nanny | INFO] Worker process 69 was killed by signal 6
Sorry for the back and forth. Now a different component in the pipeline is failing due to resource limitations. Currently it isn't possible to restrict the number of workers or modify the internal Dask configuration of a component, but this is indeed a good point. I'm going to open an issue for it.
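For context, this is roughly the kind of knob we'd want to expose. A minimal sketch using plain dask.distributed follows; these arguments are standard Dask, but none of this is currently wired into Fondant components, so treat it as illustrative only:
from dask.distributed import Client, LocalCluster

# Standard Dask knobs for bounding resource usage; Fondant does not
# expose these per component today, so this is a sketch of the idea.
cluster = LocalCluster(
    n_workers=2,            # cap the number of worker processes
    threads_per_worker=1,   # bound per-worker parallelism
    memory_limit="4GB",     # per-worker memory ceiling
)
client = Client(cluster)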
I was able to reproduce your issue. The other components of the pipeline weren't using the GPU. I've updated the component code in #905.
I've also enabled GPU usage in #15.
Could you try running the latest pipeline from #15 with fondant run local pipeline.py --build-arg FONDANT_VERSION=52e01191?
With the latest state, you shouldn't run into any more issues when using a GPU.
@mrchtr thank you for your support. I am now getting this error in the segment_images module:
segment_images-1 | [2024-03-13 08:30:22,999 | root | WARNING] Failed to infer dtype of index column, falling back to `string`. Specify the dtype explicitly to prevent this.
segment_images-1 | /opt/conda/lib/python3.10/site-packages/distributed/client.py:3162: UserWarning: Sending large graph of size 312.63 MiB.
segment_images-1 | This may cause some slowdown.
segment_images-1 | Consider scattering data ahead of time and using futures.
segment_images-1 | warnings.warn(
segment_images-1 | [2024-03-13 08:30:24,909 | distributed.shuffle._scheduler_plugin | WARNING] Shuffle d41693a2eb024581634321c17f2d38d1 initialized by task ('hash-join-transfer-d41693a2eb024581634321c17f2d38d1', 9) executed on worker tcp://127.0.0.1:35085
segment_images-1 | [2024-03-13 08:30:25,121 | distributed.shuffle._scheduler_plugin | WARNING] Shuffle 8a511a91ff5080f9f10e7b7420ef26e9 initialized by task ('hash-join-transfer-8a511a91ff5080f9f10e7b7420ef26e9', 7) executed on worker tcp://127.0.0.1:35085
segment_images-1 | [2024-03-13 08:30:40,322 | distributed.worker | WARNING] Compute Failed
segment_images-1 | Key: ('getitem-4b131e62a714949ca6adf6f2a3823d7a', 14)
segment_images-1 | Function: subgraph_callable-207c446e-5da8-4c76-952e-e2478d06
segment_images-1 | args: (['segmentation_map'], Index([], dtype='object', name='id'), 'index', 'rename-117349837e705f868d051a0e2d670a42', Empty DataFrame
segment_images-1 | Columns: [image]
segment_images-1 | Index: [], None, 'hash-join-788d62c17e79820a173091c4011a953a')
segment_images-1 | kwargs: {}
segment_images-1 | Exception: "ValueError('No objects to concatenate')"
segment_images-1 |
segment_images-1 | Traceback (most recent call last):
segment_images-1 | File "/opt/conda/bin/fondant", line 8, in <module>
segment_images-1 | sys.exit(entrypoint())
segment_images-1 | File "/opt/conda/lib/python3.10/site-packages/fondant/cli.py", line 89, in entrypoint
segment_images-1 | args.func(args)
segment_images-1 | File "/opt/conda/lib/python3.10/site-packages/fondant/cli.py", line 711, in execute
segment_images-1 | executor.execute(component)
segment_images-1 | File "/opt/conda/lib/python3.10/site-packages/fondant/component/executor.py", line 339, in execute
segment_images-1 | output_manifest = self._run_execution(component_cls, input_manifest)
segment_images-1 | File "/opt/conda/lib/python3.10/site-packages/fondant/component/executor.py", line 301, in _run_execution
segment_images-1 | self._write_data(dataframe=output_df, manifest=output_manifest)
segment_images-1 | File "/opt/conda/lib/python3.10/site-packages/fondant/component/executor.py", line 219, in _write_data
segment_images-1 | data_writer.write_dataframe(dataframe)
segment_images-1 | File "/opt/conda/lib/python3.10/site-packages/fondant/component/data_io.py", line 178, in write_dataframe
segment_images-1 | self._write_dataframe(dataframe)
segment_images-1 | File "/opt/conda/lib/python3.10/site-packages/fondant/component/data_io.py", line 251, in _write_dataframe
segment_images-1 | future.result()
segment_images-1 | File "/opt/conda/lib/python3.10/site-packages/distributed/client.py", line 322, in result
segment_images-1 | return self.client.sync(self._result, callback_timeout=timeout)
segment_images-1 | File "/opt/conda/lib/python3.10/site-packages/fondant/component/executor.py", line 495, in wrapped_transform
segment_images-1 | dataframe = transform(dataframe)
segment_images-1 | File "/component/src/./main.py", line 163, in transform
segment_images-1 | return pd.concat(results).to_frame(name="segmentation_map")
segment_images-1 | File "/opt/conda/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
segment_images-1 | return func(*args, **kwargs)
segment_images-1 | File "/opt/conda/lib/python3.10/site-packages/pandas/core/reshape/concat.py", line 425, in __init__
segment_images-1 | raise ValueError("No objects to concatenate")
segment_images-1 | ValueError: No objects to concatenate
segment_images-1 | [2024-03-13 08:30:40,339 | distributed.nanny | INFO] Closing Nanny at 'tcp://127.0.0.1:41419'. Reason: nanny-close
segment_images-1 | [2024-03-13 08:30:40,339 | distributed.nanny | INFO] Nanny asking worker to close. Reason: nanny-close
segment_images-1 | [2024-03-13 08:30:40,340 | distributed.nanny | INFO] Closing Nanny at 'tcp://127.0.0.1:40729'. Reason: nanny-close
segment_images-1 | [2024-03-13 08:30:40,340 | distributed.nanny | INFO] Nanny asking worker to close. Reason: nanny-close
segment_images-1 | [2024-03-13 08:30:40,342 | distributed.worker | INFO] Stopping worker at tcp://127.0.0.1:43031. Reason: nanny-close
segment_images-1 | [2024-03-13 08:30:40,342 | distributed.worker.state_machine | WARNING] Async instruction for <Task cancelled name="execute(('hash-join-788d62c17e79820a173091c4011a953a', 8))" coro=<Worker.execute() done, defined at /opt/conda/lib/python3.10/site-packages/distributed/worker_state_machine.py:3615>> ended with CancelledError
segment_images-1 | [2024-03-13 08:30:40,349 | distributed.core | INFO] Connection to tcp://127.0.0.1:33445 has been closed.
segment_images-1 | [2024-03-13 08:30:40,476 | distributed.worker | INFO] Stopping worker at tcp://127.0.0.1:35085. Reason: nanny-close
segment_images-1 | [2024-03-13 08:30:40,490 | distributed.worker.state_machine | WARNING] Async instruction for <Task cancelled name="execute(('getitem-4b131e62a714949ca6adf6f2a3823d7a', 4))" coro=<Worker.execute() done, defined at /opt/conda/lib/python3.10/site-packages/distributed/worker_state_machine.py:3615>> ended with CancelledError
segment_images-1 | [2024-03-13 08:30:41,098 | distributed.core | INFO] Connection to tcp://127.0.0.1:60934 has been closed.
segment_images-1 | [2024-03-13 08:30:41,098 | distributed.scheduler | INFO] Remove worker <WorkerState 'tcp://127.0.0.1:43031', name: 0, status: running, memory: 6, processing: 7> (stimulus_id='handle-worker-cleanup-1710318641.0984302')
segment_images-1 | [2024-03-13 08:30:41,099 | distributed.shuffle._scheduler_plugin | WARNING] Shuffle d41693a2eb024581634321c17f2d38d1 deactivated due to stimulus 'handle-worker-cleanup-1710318641.0984302'
segment_images-1 | [2024-03-13 08:30:41,099 | distributed.shuffle._scheduler_plugin | WARNING] Shuffle 8a511a91ff5080f9f10e7b7420ef26e9 deactivated due to stimulus 'handle-worker-cleanup-1710318641.0984302'
segment_images-1 | [2024-03-13 08:30:41,101 | distributed.shuffle._scheduler_plugin | WARNING] Shuffle d41693a2eb024581634321c17f2d38d1 restarted due to stimulus 'handle-worker-cleanup-1710318641.0984302
segment_images-1 | [2024-03-13 08:30:41,102 | distributed.shuffle._scheduler_plugin | WARNING] Shuffle 8a511a91ff5080f9f10e7b7420ef26e9 restarted due to stimulus 'handle-worker-cleanup-1710318641.0984302
segment_images-1 | [2024-03-13 08:30:41,111 | distributed.core | INFO] Received 'close-stream' from tcp://127.0.0.1:60922; closing.
segment_images-1 | [2024-03-13 08:30:41,111 | distributed.scheduler | INFO] Remove worker <WorkerState 'tcp://127.0.0.1:35085', name: 1, status: closing, memory: 0, processing: 2> (stimulus_id='handle-worker-cleanup-1710318641.111734')
segment_images-1 | [2024-03-13 08:30:41,112 | distributed.scheduler | INFO] Lost all workers
segment_images-1 | [2024-03-13 08:30:41,297 | distributed.core | INFO] Connection to tcp://127.0.0.1:33445 has been closed.
segment_images-1 | [2024-03-13 08:30:42,513 | distributed.scheduler | INFO] Scheduler closing due to unknown reason...
segment_images-1 | [2024-03-13 08:30:42,514 | distributed.scheduler | INFO] Scheduler closing all comms
segment_images-1 exited with code 1
Finished pipeline run.
@vpvsankar I've taken another look at it. It seems that one of your partitions is empty.
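That matches the traceback above: pd.concat raises exactly this error when it receives nothing to concatenate, which is what happens when a component's transform is handed an empty partition. A minimal, pipeline-independent reproduction:
import pandas as pd

# pd.concat with an empty sequence fails the same way as in the log above.
pd.concat([])  # ValueError: No objects to concatenate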
Could you try setting the n_rows_to_load in pipeline.py to a higher value, e.g. 100, and re-run your pipeline? It should look like this:
prompts = pipeline.read(
    GeneratePromptsComponent,
    arguments={
        "n_rows_to_load": 100,
    },
)
Using the machine configuration you mentioned (n1-highmem-16 with 1x NVIDIA T4) and a larger dataset to operate on, I didn't run into any issues.
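As a side note, if you want to check for empty partitions yourself, here is a quick illustrative way with plain Dask (the toy data is just for demonstration, not from the pipeline):
import pandas as pd
import dask.dataframe as dd

# Filtering is a common way to end up with empty partitions.
df = pd.DataFrame({"x": range(6)})
ddf = dd.from_pandas(df, npartitions=3)
filtered = ddf[ddf.x >= 4]

# One length per partition; any 0 signals an empty partition.
print(filtered.map_partitions(len).compute())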
Originally posted by @vpvsankar in https://github.com/ml6team/fondant-usecase-controlnet/issues/10#issuecomment-1988434245