Thanks for your great work! Pseudo-labelling worked fine with the previous implementation (about a month ago), but after updating the codebase to the latest main branch and following the README, I ran into the issue below:
04/01/2024 09:12:23 - INFO - __main__ - ***** Running Labelling *****
04/01/2024 09:12:23 - INFO - __main__ - Instantaneous batch size per device = 8
04/01/2024 09:12:23 - INFO - __main__ - Total eval batch size (w. parallel & distributed) = 16
04/01/2024 09:12:23 - INFO - __main__ - Predict labels with timestamps = True
Evaluating train...: 0%| | 0/52 [00:02<?, ?it/s]
Traceback (most recent call last):
File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/utils/operations.py", line 155, in send_to_device
return tensor.to(device, non_blocking=non_blocking)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/myenv/distil_whisper/lib/python3.11/site-packages/transformers/feature_extraction_utils.py", line 229, in to
if torch.is_floating_point(v):
^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: is_floating_point(): argument 'input' (position 1) must be Tensor, not list
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/alghome/craig.hsin/framework/distil-whisper/training/run_pseudo_labelling.py", line 1027, in <module>
main()
File "/alghome/craig.hsin/framework/distil-whisper/training/run_pseudo_labelling.py", line 1012, in main
eval_step_with_save(split=split)
File "/alghome/craig.hsin/framework/distil-whisper/training/run_pseudo_labelling.py", line 900, in eval_step_with_save
for step, batch in enumerate(batches):
File "/myenv/distil_whisper/lib/python3.11/site-packages/tqdm/std.py", line 1169, in __iter__
for obj in iterable:
File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/data_loader.py", line 461, in __iter__
current_batch = send_to_device(current_batch, self.device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/utils/operations.py", line 157, in send_to_device
return tensor.to(device)
^^^^^^^^^^^^^^^^^
File "/myenv/distil_whisper/lib/python3.11/site-packages/transformers/feature_extraction_utils.py", line 229, in to
if torch.is_floating_point(v):
^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: is_floating_point(): argument 'input' (position 1) must be Tensor, not list
Traceback (most recent call last):
File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/utils/operations.py", line 155, in send_to_device
return tensor.to(device, non_blocking=non_blocking)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/myenv/distil_whisper/lib/python3.11/site-packages/transformers/feature_extraction_utils.py", line 229, in to
if torch.is_floating_point(v):
^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: is_floating_point(): argument 'input' (position 1) must be Tensor, not list
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/alghome/craig.hsin/framework/distil-whisper/training/run_pseudo_labelling.py", line 1027, in <module>
main()
File "/alghome/craig.hsin/framework/distil-whisper/training/run_pseudo_labelling.py", line 1012, in main
eval_step_with_save(split=split)
File "/alghome/craig.hsin/framework/distil-whisper/training/run_pseudo_labelling.py", line 900, in eval_step_with_save
for step, batch in enumerate(batches):
File "/myenv/distil_whisper/lib/python3.11/site-packages/tqdm/std.py", line 1181, in __iter__
for obj in iterable:
File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/data_loader.py", line 461, in __iter__
current_batch = send_to_device(current_batch, self.device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/utils/operations.py", line 157, in send_to_device
return tensor.to(device)
^^^^^^^^^^^^^^^^^
File "/myenv/distil_whisper/lib/python3.11/site-packages/transformers/feature_extraction_utils.py", line 229, in to
if torch.is_floating_point(v):
^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: is_floating_point(): argument 'input' (position 1) must be Tensor, not list
Exception in thread Thread-3 (_pin_memory_loop):
Traceback (most recent call last):
File "/myenv/distil_whisper/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
self.run()
File "/myenv/distil_whisper/lib/python3.11/threading.py", line 982, in run
self._target(*self._args, **self._kwargs)
File "/myenv/distil_whisper/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 54, in _pin_memory_loop
do_one_step()
File "/myenv/distil_whisper/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 31, in do_one_step
r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/myenv/distil_whisper/lib/python3.11/multiprocessing/queues.py", line 122, in get
return _ForkingPickler.loads(res)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/myenv/distil_whisper/lib/python3.11/site-packages/torch/multiprocessing/reductions.py", line 355, in rebuild_storage_fd
fd = df.detach()
^^^^^^^^^^^
File "/myenv/distil_whisper/lib/python3.11/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
^^^^^^^^^^^^^^^^^^^wandb: You can sync this run to the cloud by running:
wandb: wandb sync /alghome/craig.hsin/framework/distil-whisper/training/wandb/offline-run-20240401_091200-0oe1zyh2
wandb: Find logs at: ./wandb/offline-run-20240401_091200-0oe1zyh2/logs
[2024-04-01 09:12:31,572] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2703427) of binary: /myenv/distil_whisper/bin/python
Traceback (most recent call last):
File "/myenv/distil_whisper/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1048, in launch_command
multi_gpu_launcher(args)
File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/commands/launch.py", line 702, in multi_gpu_launcher
distrib_run.run(args)
File "/myenv/distil_whisper/lib/python3.11/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/myenv/distil_whisper/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/myenv/distil_whisper/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_pseudo_labelling.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-04-01_09:12:31
host : alg4
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2703428)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-04-01_09:12:31
host : alg4
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2703427)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
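From the trace, the failure seems to happen when accelerate's send_to_device calls BatchFeature.to() on a batch that still contains a plain Python list instead of a tensor (line 229 of feature_extraction_utils.py calls torch.is_floating_point(v) on every value). A minimal snippet that reproduces the same TypeError, assuming one of the collated fields (the labels column here is just a made-up example) stays a list:

import torch
from transformers.feature_extraction_utils import BatchFeature

# Hypothetical batch: one proper tensor field plus one field that is still a Python list.
batch = BatchFeature({
    "input_features": torch.zeros(2, 80, 3000),
    "labels": [[50258, 50259], [50258, 50260]],  # placeholder values, not my real data
})

# BatchFeature.to() calls torch.is_floating_point(v) on every value, so the list raises:
# TypeError: is_floating_point(): argument 'input' (position 1) must be Tensor, not list
batch.to("cpu")

So my guess is that one of the columns produced by the updated data collator is no longer converted to a tensor, but I haven't been able to pin down which one.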
Some of my environment info:

Name          Version       Build        Channel
python        3.11.8        h955ad1f_0
torch         2.1.1+cu118   pypi_0       pypi
transformers  4.39.1        pypi_0       pypi
Could you provide some suggestions on how I could proceed with the investigation? Thanks.
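For what it's worth, the only workaround I've found so far (I'm not sure it is the intended fix) is to keep any non-tensor fields out of the dict that accelerate moves to the device, along these lines (the helper and field handling below are just a sketch, not the script's actual code):

import torch

def split_non_tensor_fields(batch):
    """Separate non-tensor fields from tensor fields so that
    BatchFeature.to(device) / send_to_device only ever sees tensors."""
    tensors = {k: v for k, v in batch.items() if isinstance(v, torch.Tensor)}
    extras = {k: v for k, v in batch.items() if not isinstance(v, torch.Tensor)}
    return tensors, extras

But I'd rather understand whether the collator is now expected to return only tensors, or whether something in my setup is off.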