hlt-mt / FBK-fairseq

Repository containing the open source code of works published at the FBK MT unit.

TypeError: cannot unpack non-iterable NoneType object #6

Closed: kaahan closed this issue 9 months ago

kaahan commented 11 months ago

πŸ› Bug

Hey! I tried following the instructions here to run the AlignATT agent on the en->es model. I cloned and installed (in editable mode) this repo and SimulEval, downloaded the checkpoint and all the associated metadata files to /workspace/FBK-fairseq/checkpoint/, and ran the following command:

!simuleval \
    --agent examples/speech_to_text/simultaneous_translation/agents/v1_0/simul_offline_alignatt.py \
    --source /workspace/source.txt \
    --target /workspace/target.txt \
    --config config_simul.yaml \
    --model-path /workspace/FBK-fairseq/checkpoint/checkpoint_avg7.pt \
    --extract-attn-from-layer 3 \
    --frame-num 4 \
    --speech-segment-factor 10 \
    --output /content/ \
    --port 8000 \
    --gpu \
    --scores

and got the following error:

Traceback (most recent call last):
  File "/usr/local/bin/simuleval", line 33, in <module>
    sys.exit(load_entry_point('simuleval', 'console_scripts', 'simuleval')())
  File "/workspace/SimulEval/simuleval/cli.py", line 165, in main
    _main(args.client_only)
  File "/workspace/SimulEval/simuleval/cli.py", line 180, in _main
    _, agent_cls = find_agent_cls(args)
  File "/workspace/SimulEval/simuleval/utils/agent_finder.py", line 64, in find_agent_cls
    spec.loader.exec_module(agent_modules)
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/workspace/FBK-fairseq/examples/speech_to_text/simultaneous_translation/agents/v1_0/simul_offline_alignatt.py", line 17, in <module>
    from examples.speech_to_text.simultaneous_translation.agents.v1_0.simul_offline_edatt import EDAttSTAgent
  File "/workspace/FBK-fairseq/examples/speech_to_text/__init__.py", line 6, in <module>
    from . import tasks, criterions, models, modules  # noqa
  File "/workspace/FBK-fairseq/examples/speech_to_text/tasks/__init__.py", line 7, in <module>
    importlib.import_module('examples.speech_to_text.tasks.' + task_name)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/workspace/FBK-fairseq/examples/speech_to_text/tasks/speech_translation_dualdecoding.py", line 17, in <module>
    from examples.speech_to_text.inference.twophase_sequence_generator import TwoPhaseSequenceGenerator
  File "/workspace/FBK-fairseq/examples/speech_to_text/inference/twophase_sequence_generator.py", line 21, in <module>
    from examples.speech_to_text.models.base_triangle_with_prev_tags import BaseTrianglePreviousTags
  File "/workspace/FBK-fairseq/examples/speech_to_text/models/__init__.py", line 7, in <module>
    importlib.import_module('examples.speech_to_text.models.' + model_name)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/workspace/FBK-fairseq/examples/speech_to_text/models/speechformer_triangle.py", line 14, in <module>
    from examples.speech_to_text.models.base_triangle import BaseTriangle
  File "/workspace/FBK-fairseq/examples/speech_to_text/models/base_triangle.py", line 20, in <module>
    from examples.speech_to_text.modules.triangle_transformer_layer import TriangleTransformerDecoderLayer
  File "/workspace/FBK-fairseq/examples/speech_to_text/modules/__init__.py", line 7, in <module>
    importlib.import_module('examples.speech_to_text.modules.' + module_name)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/workspace/FBK-fairseq/examples/speech_to_text/modules/transformer_layer_penalty.py", line 10, in <module>
    from examples.speech_to_text.modules.local_attention import LocalAttention
  File "/workspace/FBK-fairseq/examples/speech_to_text/modules/local_attention.py", line 11, in <module>
    from fairseq import utils
  File "/workspace/FBK-fairseq/fairseq/__init__.py", line 33, in <module>
    import fairseq.optim  # noqa
  File "/workspace/FBK-fairseq/fairseq/optim/__init__.py", line 27, in <module>
    (
TypeError: cannot unpack non-iterable NoneType object


sarapapi commented 11 months ago

Hi, can you please install Python 3.8 and rerun the code? Thanks for your interest in our work.
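
For anyone following along, a minimal setup sketch (conda and the environment name are assumptions on my part; any way of getting a Python 3.8 environment works, and the paths match those used above):

conda create -n fbk-fairseq python=3.8
conda activate fbk-fairseq
# editable installs of both repositories, as in the original setup
pip install -e /workspace/FBK-fairseq
pip install -e /workspace/SimulEval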

kaahan commented 11 months ago

Downgrading to Python 3.8 worked (though I had to install praat-parselmouth and torchaudio). However, on running the model, I'm getting poor results. For the attached audio file and this command:

!simuleval \
    --agent examples/speech_to_text/simultaneous_translation/agents/v1_0/simul_offline_alignatt.py \
    --source /workspace/source.txt \
    --target /workspace/target.txt \
    --config config_simul.yaml \
    --model-path /workspace/FBK-fairseq/checkpoint/checkpoint_avg7.pt \
    --extract-attn-from-layer 3 \
    --frame-num 4 \
    --speech-segment-factor 10 \
    --output /content/ \
    --port 8000 \
    --gpu \
    --scores

instances.log has this to say:

{"index": 0, "prediction": "\u266b So bu le : O o h , o o h , o o h , o o h . \u266b O o h , o o h , o o h , o o h . \u266b </s>", "delays": [800.0, 1200.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 3600.0, 4800.0, 4800.0, 4800.0, 4800.0, 10800.0, 10800.0, 10800.0, 10800.0, 10800.0, 10800.0, 11956.825396825398, 11956.825396825398, 11956.825396825398, 11956.825396825398, 11956.825396825398, 11956.825396825398, 11956.825396825398, 11956.825396825398, 11956.825396825398], "elapsed": [3245.3799724578857, 3832.2

https://github.com/hlt-mt/FBK-fairseq/assets/106453090/8c6a585e-b049-41ca-8b53-5485e73e58da

195529937744, 4439.988946914673, 4442.562913894653, 4444.907760620117, 4448.003625869751, 4450.927114486694, 4453.659391403198, 4456.332778930664, 4459.060287475586, 4461.990690231323, 4464.865064620972, 4467.783546447754, 4470.6488609313965, 4473.50058555603, 4476.436710357666, 4479.426717758179, 4482.651567459106, 4485.540246963501, 4488.520240783691, 7730.323886871338, 9779.305267333984, 9782.306957244873, 9785.19778251648, 9788.330364227295, 20424.693155288696, 20427.18677520752, 20429.49514389038, 20431.778955459595, 20434.066343307495, 20435.98656654358, 22892.254254628744, 22894.257447530355, 22896.099946309652, 22897.93910722884, 22899.78947382125, 22901.618144322958, 22903.44562273177, 22905.49459200057, 22907.07912187728], "prediction_length": 40, "reference": "Esta es una prueba de localizaci\u00f3n de video. Probablemente tengan ese gui\u00f3n ah\u00ed, as\u00ed que intentemos algo m\u00e1s. Estamos en 1 Culver, debajo de WeWork en la oficina de Apple.", "source": ["/workspace/one_culver_audio.wav", "samplerate: 44100 Hz", "channels: 1", "duration: 11.957 s", "format: WAV (Microsoft) [WAV]", "subtype: Signed 16 bit PCM [PCM_16]"], "source_length": 11956.825396825398, "reference_length": 30, "metric": {"sentence_bleu": 1.205256842736819, "latency": {"AL": -1892.7086181640625, "AP": 0.47255739569664, "DAL": 1866.387451171875}, "latency_ca": {"AL": 1129.2596435546875, "AP": 0.9680058360099792, "DAL": 7387.2373046875}}}

The prediction seems quite nonsensical :-/

(attachment: https://github.com/hlt-mt/FBK-fairseq/assets/106453090/965e414b-d307-43e9-b1a3-d70355541258)

sarapapi commented 11 months ago

Hi, that's strange. Can you please show me the log file (the stdout SimulEval produces)? Also, SimulEval and our models work with wav files with 1 channel and a 16 kHz sampling rate (the standard conversion). Can you please try to convert the audio, which is in mp4, using these settings, and rerun the script? Thanks
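
A minimal conversion sketch with ffmpeg (the tool choice and filenames here are assumptions, not part of the thread; any converter producing 16 kHz mono 16-bit PCM WAV works):

# -ac 1 forces a single channel, -ar 16000 resamples to 16 kHz
ffmpeg -i one_culver_audio.mp4 -ac 1 -ar 16000 one_culver_audio_16khz.wav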

kaahan commented 11 months ago

Updated instances.log after converting to 16 kHz, 1 channel (I was already using a wav file; GitHub would only let me upload mp4 🙃):

{"index": 0, "prediction": "\u266b en la tierra , en el campo , en el cielo , en el cielo , en la tierra . \u266b \u266b \u266b \u266b \u266b \u266b \u266b \u266b \u266b \u266b \u266b \u266b \u266b en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo ,en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , \u266b </s>", "delays": [2000.0, 2000.0, 2000.0, 2000.0, 2000.0, 2000.0, 2000.0, 2400.0, 2400.0, 3600.0, 3600.0, 3600.0, 3600.0, 3600.0, 5200.0, 5200.0, 5200.0, 5200.0, 5200.0, 5200.0, 5200.0, 10000.0, 10000.0, 10000.0, 10000.0, 10400.0, 10400.0, 10400.0, 10400.0, 10400.0, 10400.0, 10400.0, 10400.0, 10400.0, 10400.0, 10400.0, 10400.0, 11200.0, 11200.0, 11200.0, 11200.0, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645,11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645], "elapsed": [4843.686819076538, 4845.778942108154, 4848.869800567627, 4851.950168609619, 4854.800462722778, 4857.587575912476, 4860.589981079102, 5412.688636779785, 5415.485048294067, 7128.985023498535, 7131.087636947632, 7132.9320430755615, 7135.399436950684, 7138.24520111084, 9567.100238800049, 9569.137287139893, 9571.778964996338, 9574.652862548828, 9577.43592262268, 9580.456686019897, 9583.590698242188, 17199.337005615234, 17201.39741897583, 17203.248262405396, 17205.052614212036, 18548.866176605225, 18551.240348815918, 18553.430461883545, 18556.083583831787, 18558.77676010132, 18561.42511367798, 18564.08109664917, 18566.757345199585, 18569.596672058105, 18572.455549240112, 18575.34852027893, 18578.041458129883, 21721.597385406494, 21724.094104766846, 21726.03816986084, 21729.12187576294, 24012.092241145067, 24014.087804652147, 24015.938171244554, 24017.74323830693, 24019.595273829393, 24022.0617140302, 24025.21432290166, 24028.136381007127, 24030.958064890794, 24033.733257151536, 
24036.878713465623, 24039.005884028367, 24040.777334071092, 24042.553790903978, 24044.304021693162, 24046.05115304082, 24049.00205979436, 24051.829704142503, 24054.84212288945, 24907.369502879075, 24909.839996195726, 24912.148126460008, 24914.335616923265, 24917.053350306443, 24919.854768610887, 24922.658332682542, 24925.372489787034, 24928.38824639409, 24931.31292710393, 24934.21090493291, 24936.883338786058, 24939.5204866895, 24942.180761195115, 24945.013650752, 24947.85035500615, 24950.693973399095, 24953.515418864183, 24956.343540049485, 24959.166416026048, 24961.803087092332, 24964.39588914006, 24967.00776467412, 24969.601281977586, 24972.40246186345, 24975.23964295476, 24978.075393534593, 24980.874427653245, 24983.71232400029, 24986.548789835862, 24989.34925446599, 24992.193111277513, 24995.02266297429, 24997.637876368455, 25000.252136088304, 25002.96033272832, 25005.57888398259, 25008.380302287034, 25011.21819863408, 25014.04178986638, 25016.85250649541, 25019.682058192186, 25022.505887843065, 25025.308498240403, 25027.948268748216, 25030.544170237474, 25033.13029656499, 25035.96151719182, 25038.79512200444, 25041.599401331834, 25044.409164286546, 25047.05918679326, 25049.724229670457, 25052.331336832933, 25055.132993555955, 25057.92654404729, 25060.739883280687, 25063.449272013597, 25066.04612717717, 25068.717607356004, 25071.35928521245, 25074.152120448045, 25077.24011788457, 25079.786189890794], "prediction_length": 124, "reference": "Esta es una prueba de localizaci\u00f3n de video. Probablemente tengan ese gui\u00f3n ah\u00ed, as\u00ed que intentemos algo m\u00e1s. Estamos en 1 Culver, debajo de WeWork en la oficina de Apple.", "source": ["/workspace/one_culver_audio_16khz.wav", "samplerate: 16100 Hz", "channels: 1", "duration: 11.957 s", "format: WAV (Microsoft) [WAV]", "subtype: Signed 16 bit PCM [PCM_16]"], "source_length": 11956.832298136645, "reference_length": 30, "metric": {"sentence_bleu": 0.7415472433597086, "latency": {"AL": -282.8108215332031, "AP": 0.8600947260856628, "DAL": 7117.89599609375}, "latency_ca": {"AL": 3435.63134765625, "AP": 1.7415765523910522, "DAL": 17131.087890625}}}

Here's the stdout from SimulEval:

(workspace-3.8) root@05ee56face0f:/workspace/FBK-fairseq# simuleval     --agent examples/speech_to_text/simultaneous_translation/agents/v1_0/simul_offline_alignatt.py     --source /workspace/source.txt     --target /workspace/target.txt --data-bin /workspace/FBK-fairseq/checkpoint/     --config config_simul.yaml     --model-path /workspace/FBK-fairseq/checkpoint/checkpoint_avg7.pt     --extract-attn-from-layer 3     --frame-num 4     --speech-segment-factor 10     --output /content/     --port 8000     --gpu     --scores
2023-10-26 22:55:28 | INFO     | simuleval.scorer | Evaluating on speech
2023-10-26 22:55:28 | INFO     | simuleval.scorer | Source: /workspace/source.txt
2023-10-26 22:55:28 | INFO     | simuleval.scorer | Target: /workspace/target.txt
2023-10-26 22:55:28 | INFO     | simuleval.scorer | Number of sentences: 1
2023-10-26 22:55:28 | INFO     | simuleval.server | Evaluation Server Started (process id 3964). Listening to port 8000
2023-10-26 22:55:31 | WARNING  | simuleval.scorer | Resetting scorer
2023-10-26 22:55:31 | INFO     | simuleval.cli    | Output dir: /content/
2023-10-26 22:55:31 | INFO     | simuleval.cli    | Start data writer (process id 3970)
2023-10-26 22:55:31 | INFO     | simuleval.cli    | Evaluating AlignAttSTAgent (process id 3902) on instances from 0 to 0
2023-10-26 22:55:37 | INFO     | examples.speech_to_text.tasks.speech_to_text_ctc | target dictionary size (/workspace/FBK-fairseq/checkpoint/spm_unigram8000_st_target.txt): 8,000
2023-10-26 22:55:37 | INFO     | examples.speech_to_text.tasks.speech_to_text_ctc | source dictionary size (/workspace/FBK-fairseq/checkpoint/spm_unigram.en.txt): 5,002
2023-10-26 22:55:54 | INFO     | simuleval.cli    | Evaluation results:
{
    "Quality": {
        "BLEU": 0.7659623558516302
    },
    "Latency": {
        "AL": -282.8108215332031,
        "AL_CA": 3435.63134765625,
        "AP": 0.8600947260856628,
        "AP_CA": 1.7415765523910522,
        "DAL": 7117.89599609375,
        "DAL_CA": 17131.087890625
    }
}
2023-10-26 22:55:54 | INFO     | simuleval.cli    | Evaluation finished
2023-10-26 22:55:54 | INFO     | simuleval.cli    | Close data writer
2023-10-26 22:55:54 | INFO     | simuleval.cli    | Shutdown server

kaahan commented 11 months ago

Here is my configuration if that's helpful:

bpe_tokenizer:
  bpe: sentencepiece
  sentencepiece_model: /workspace/FBK-fairseq/checkpoint/spm_unigram8000_st_target.model
bpe_tokenizer_src:
  bpe: sentencepiece
  sentencepiece_model: /workspace/FBK-fairseq/checkpoint/spm_unigram.en.model
global_cmvn:
  stats_npz_path: /workspace/FBK-fairseq/checkpoint/gcmvn.npz
input_channels: 1
input_feat_per_channel: 80
sampling_alpha: 1.0
specaugment:
  freq_mask_F: 27
  freq_mask_N: 1
  time_mask_N: 1
  time_mask_T: 100
  time_mask_p: 1.0
  time_wrap_W: 0
transforms:
  '*':
  - global_cmvn
  _train:
  - global_cmvn
  - specaugment
vocab_filename: /workspace/FBK-fairseq/checkpoint/spm_unigram8000_st_target.txt
vocab_filename_src: /workspace/FBK-fairseq/checkpoint/spm_unigram.en.txt

sarapapi commented 11 months ago

Hi, I noticed an error in the README (the --speech-segment-factor has to be 25) and in the scripts, which were written for the "old" version of SimulEval. I'm working on fixing them; thanks for pointing it out. By the way, we have a new version of the code which works with the new SimulEval; you can find it here. I tried it on your audio file and our model works as expected, but the performance is poor, mostly because it has been trained only on MuST-C and is not intended to be robust out of domain.
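
Concretely, the earlier v1.0 invocation with the corrected value would read (a sketch: the command run above with only --speech-segment-factor changed):

simuleval \
    --agent examples/speech_to_text/simultaneous_translation/agents/v1_0/simul_offline_alignatt.py \
    --source /workspace/source.txt \
    --target /workspace/target.txt \
    --config config_simul.yaml \
    --model-path /workspace/FBK-fairseq/checkpoint/checkpoint_avg7.pt \
    --extract-attn-from-layer 3 \
    --frame-num 4 \
    --speech-segment-factor 25 \
    --output /content/ \
    --port 8000 \
    --gpu \
    --scores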

kaahan commented 11 months ago

Hm, when trying to run the instructions on the new version of SimulEval, I'm running into the following error:

/root/.local/share/pdm/venvs/workspace-6rDWGpm2-fairseq/lib/python3.8/site-packages/pydub/utils.py:170: RuntimeWarning: Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work
  warn("Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work", RuntimeWarning)
Traceback (most recent call last):
  File "/root/.local/share/pdm/venvs/workspace-6rDWGpm2-fairseq/bin/simuleval", line 33, in <module>
    sys.exit(load_entry_point('simuleval', 'console_scripts', 'simuleval')())
  File "/workspace/SimulEval/simuleval/cli.py", line 47, in main
    system, args = build_system_args()
  File "/workspace/SimulEval/simuleval/utils/agent.py", line 138, in build_system_args
    system_class.add_args(parser)
  File "/workspace/FBK-fairseq/examples/speech_to_text/simultaneous_translation/agents/v1_1/simul_offline_edatt.py", line 51, in add_args
    BaseSimulSTAgent.add_args(parser)
  File "/workspace/FBK-fairseq/examples/speech_to_text/simultaneous_translation/agents/base_simulst_agent.py", line 84, in add_args
    parser.add_argument("--user-dir", type=str, default="examples/simultaneous_translation",
  File "/usr/lib/python3.8/argparse.py", line 1398, in add_argument
    return self._add_action(action)
  File "/usr/lib/python3.8/argparse.py", line 1761, in _add_action
    self._optionals._add_action(action)
  File "/usr/lib/python3.8/argparse.py", line 1602, in _add_action
    action = super(_ArgumentGroup, self)._add_action(action)
  File "/usr/lib/python3.8/argparse.py", line 1412, in _add_action
    self._check_conflict(action)
  File "/usr/lib/python3.8/argparse.py", line 1551, in _check_conflict
    conflict_handler(action, confl_optionals)
  File "/usr/lib/python3.8/argparse.py", line 1560, in _handle_conflict_error
    raise ArgumentError(action, message % conflict_string)
argparse.ArgumentError: argument --user-dir: conflicting option string: --user-dir

This is with the following run command:

simuleval \
    --agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.simul_offline_alignatt.AlignAttSTAgent \
    --source /workspace/source.txt \
    --target /workspace/target.txt \
    --data-bin /workspace/FBK-fairseq/checkpoint/ \
    --config config_simul.yaml \
    --model-path /workspace/FBK-fairseq/checkpoint/checkpoint_avg7.pt --prefix-size 1 --prefix-token "nomt" \
    --extract-attn-from-layer 3 --frame-num 4 \
    --source-segment-size 1000 \
    --device cuda:0 \
    --quality-metrics BLEU --latency-metrics LAAL AL ATD --computation-aware \
    --output /content/

"the performance is poor, mostly because it has been trained only on MuST-C and is not intended to be robust out of domain."

I'm a bit confused by this -- isn't MuST-C a TED-based dataset? It should have reverb, some crowd noise, etc., which would appear to make it harder than the audio I've sent.

kaahan commented 11 months ago

Removing the --user-dir argument from base_simulst_agent.py fixed this (though that seems a little suspect). I'm now getting the following result:

{"index": 0, "prediction": "fuerte como el video de un ni\u00f1o, probablemente nunca lo hemos escrito tan pronto como sea posible, estamos fuera de un solo mundo, aunque no est\u00e1bamos en ninguno de nosotros.", "delays": [2000.0, 2000.0, 2000.0, 3000.0, 3000.0, 4000.0, 4000.0, 4000.0, 4000.0, 4000.0, 4000.0, 6000.0, 6000.0, 7000.0, 7000.0, 7000.0, 9000.0, 9000.0, 10000.0, 10000.0, 10000.0, 10000.0, 10000.0, 10000.0, 11000.0, 11000.0, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645], "elapsed": [4337.5279903411865, 4337.5279903411865, 4337.5279903411865, 5465.068101882935, 5465.068101882935, 6640.028238296509, 6640.028238296509, 6640.028238296509, 6640.028238296509, 6640.028238296509, 6640.028238296509, 9043.581485748291, 9043.581485748291, 10249.344110488892, 10249.344110488892, 10249.344110488892, 12713.966131210327, 12713.966131210327, 14004.980564117432, 14004.980564117432, 14004.980564117432, 14004.980564117432, 14004.980564117432, 14004.980564117432, 15320.341110229492, 15320.341110229492, 16591.46321663945, 16591.46321663945, 16591.46321663945, 16591.46321663945], "prediction_length": 30, "reference": "Esta es una prueba de localizaci\u00f3n de video. Probablemente tengan ese gui\u00f3n ah\u00ed, as\u00ed que intentemos algo m\u00e1s. Estamos en 1 Culver, debajo de WeWork en la oficina de Apple.", "source": ["/workspace/one_culver_audio_16khz.wav", "samplerate: 16100 Hz", "channels: 1", "duration: 11.957 s", "format: WAV (Microsoft) [WAV]", "subtype: Signed 16 bit PCM [PCM_16]"], "source_length": 11956.832298136645}

which appears a little more reasonable, but still quite poor.

Running it on cleaned audio (i.e., with background noise removed) gives better results, though the model does seem to struggle with proper nouns :-)

{"index": 0, "prediction": "fuerte como el video de un ni\u00f1o, probablemente nunca lo hemos escrito tan pronto como sea posible, estamos fuera de un solo mundo, aunque no est\u00e1bamos en ninguno de nosotros.", "delays": [2000.0, 2000.0, 2000.0, 3000.0, 3000.0, 4000.0, 4000.0, 4000.0, 4000.0, 4000.0, 4000.0, 6000.0, 6000.0, 7000.0, 7000.0, 7000.0, 9000.0, 9000.0, 10000.0, 10000.0, 10000.0, 10000.0, 10000.0, 10000.0, 11000.0, 11000.0, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645], "elapsed": [4260.651111602783, 4260.651111602783, 4260.651111602783, 5385.631322860718, 5385.631322860718, 6557.546615600586, 6557.546615600586, 6557.546615600586, 6557.546615600586, 6557.546615600586, 6557.546615600586, 8954.18643951416, 8954.18643951416, 10157.024145126343, 10157.024145126343, 10157.024145126343, 12613.478422164917, 12613.478422164917, 13899.338483810425, 13899.338483810425, 13899.338483810425, 13899.338483810425, 13899.338483810425, 13899.338483810425, 15209.887981414795, 15209.887981414795, 16476.026662684373, 16476.026662684373, 16476.026662684373, 16476.026662684373], "prediction_length": 30, "reference": "Esta es una prueba de localizaci\u00f3n de video. Probablemente tengan ese gui\u00f3n ah\u00ed, as\u00ed que intentemos algo m\u00e1s. Estamos en 1 Culver, debajo de WeWork en la oficina de Apple.", "source": ["/workspace/one_culver_audio_16khz.wav", "samplerate: 16100 Hz", "channels: 1", "duration: 11.957 s", "format: WAV (Microsoft) [WAV]", "subtype: Signed 16 bit PCM [PCM_16]"], "source_length": 11956.832298136645}
{"index": 1, "prediction": "fuerte: Esta es una prueba de la globalizaci\u00f3n de video, probablemente tiene ese gui\u00f3n ah\u00ed, as\u00ed que vamos a probar otra cosa. Estamos en un octubre. \u00bfPor qu\u00e9 trabajamos en la oficina de Apple?", "delays": [2000.0, 2000.0, 2000.0, 2000.0, 2000.0, 2000.0, 4000.0, 4000.0, 4000.0, 4000.0, 4000.0, 4000.0, 5000.0, 5000.0, 5000.0, 5000.0, 6000.0, 6000.0, 6000.0, 7000.0, 7000.0, 8000.0, 9000.0, 9000.0, 9000.0, 10000.0, 10000.0, 10000.0, 11000.0, 11000.0, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645], "elapsed": [2247.722625732422, 2247.722625732422, 2247.722625732422, 2247.722625732422, 2247.722625732422, 2247.722625732422, 4582.084178924561, 4582.084178924561, 4582.084178924561, 4582.084178924561, 4582.084178924561, 4582.084178924561, 5820.873022079468, 5820.873022079468, 5820.873022079468, 5820.873022079468, 7076.664209365845, 7076.664209365845, 7076.664209365845, 8353.463888168335, 8353.463888168335, 9646.378993988037, 10938.165664672852, 10938.165664672852, 10938.165664672852, 12264.750719070435, 12264.750719070435, 12264.750719070435, 13640.005826950073, 13640.005826950073, 14967.794545985156, 14967.794545985156, 14967.794545985156, 14967.794545985156], "prediction_length": 34, "reference": "Esta es una prueba de localizaci\u00f3n de video. Probablemente tengan ese gui\u00f3n ah\u00ed, as\u00ed que intentemos algo m\u00e1s. Estamos en 1 Culver, debajo de WeWork en la oficina de Apple.", "source": ["/workspace/one_culver_audio_cleaned_16khz.wav", "samplerate: 16100 Hz", "channels: 1", "duration: 11.957 s", "format: WAV (Microsoft) [WAV]", "subtype: Signed 16 bit PCM [PCM_16]"], "source_length": 11956.832298136645}

It does have this odd property of adding "fuerte:" in front of the translations -- is this an artifact of MuST-C?

sarapapi commented 11 months ago

Hi, you should remove --prefix-size 1 --prefix-token "nomt" if you are not using the IWSLT 2023 models (which were trained with the language id prepended as the first token). Please remove these flags and rerun the code.
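
For reference, the v1.1 command from above without the prefix flags would be (a sketch: the command posted earlier with --prefix-size and --prefix-token dropped, everything else unchanged):

simuleval \
    --agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.simul_offline_alignatt.AlignAttSTAgent \
    --source /workspace/source.txt \
    --target /workspace/target.txt \
    --data-bin /workspace/FBK-fairseq/checkpoint/ \
    --config config_simul.yaml \
    --model-path /workspace/FBK-fairseq/checkpoint/checkpoint_avg7.pt \
    --extract-attn-from-layer 3 --frame-num 4 \
    --source-segment-size 1000 \
    --device cuda:0 \
    --quality-metrics BLEU --latency-metrics LAAL AL ATD --computation-aware \
    --output /content/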

Regarding your issue with --user-dir, I am not able to replicate it locally at the moment. Can you please share your environment?

Regarding our models, they have not been developed to be competitive with production systems: building strong models requires training on thousands of hours of audio, while MuST-C comprises only 200-300 hours of high-quality, clean audio (with no background noise).

kaahan commented 11 months ago

Removing the prefix-related arguments removed the fuertes, thanks!

"Regarding your issue with --user-dir, I am not able to replicate it locally at the moment. Can you please share your environment?"

I'm not sure which parts of my environment you'd like to see, but my pip freeze looks like:

antlr4-python3-runtime==4.8
bitarray==2.6.0
Brotli==1.1.0
certifi==2023.7.22
cffi==1.16.0
charset-normalizer==3.3.1
colorama==0.4.6
coverage==7.3.2
ctc-segmentation==1.7.4
Cython==3.0.4
exceptiongroup==1.1.3
fairseq==1.0.0a0+4b7966b
filelock==3.12.4
flake8==6.1.0
fsspec==2023.10.0
hydra-core==1.0.7
idna==3.4
importlib-resources==6.1.0
iniconfig==2.0.0
Jinja2==3.1.2
lxml==4.9.3
MarkupSafe==2.1.3
mccabe==0.7.0
mpmath==1.3.0
mutagen==1.47.0
networkx==3.1
numpy==1.24.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.3.52
nvidia-nvtx-cu12==12.1.105
omegaconf==2.0.6
packaging==23.2
pandas==2.0.3
pluggy==1.3.0
portalocker==2.0.0
praat-parselmouth==0.4.3
pycodestyle==2.11.1
pycparser==2.21
pycryptodomex==3.19.0
pydub==0.25.1
pyflakes==3.1.0
pytest==7.4.3
pytest-cov==4.1.0
pytest-flake8==1.1.1
python-dateutil==2.8.2
pytz==2023.3.post1
PyYAML==6.0.1
regex==2023.10.3
requests==2.31.0
sacrebleu==2.3.1
-e git+https://github.com/facebookresearch/SimulEval.git@411a73d60d0626d8519f58d02a284fb53a263cad#egg=simuleval
six==1.16.0
soundfile==0.12.1
srt==3.5.3
sympy==1.12
tabulate==0.9.0
TextGrid==1.5
tomli==2.0.1
torch==2.1.0
torchaudio==2.1.0
tornado==6.3.3
tqdm==4.64.1
triton==2.1.0
typing_extensions==4.8.0
tzdata==2023.3
urllib3==2.0.7
websockets==12.0
yt-dlp==2023.10.13
zipp==3.17.0

"Regarding our models, they have not been developed to be competitive with production systems: building strong models requires training on thousands of hours of audio, while MuST-C comprises only 200-300 hours of high-quality, clean audio (with no background noise)."

And totally reasonable regarding not being competitive with production systems -- do you feel the model architecture, as is, would scale well to thousands of hours of audio?

sarapapi commented 10 months ago

Hi, I believe the error is related to the version of SimulEval. If you install the tool from the commit referenced in the guide, you should be able to solve the --user-dir issue.
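
For example (a sketch; <commit-from-guide> is a placeholder, since the exact hash pinned in the guide is not quoted in this thread):

cd /workspace/SimulEval
git fetch
# check out the commit pinned in the FBK-fairseq guide, then reinstall
git checkout <commit-from-guide>
pip install -e .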

"And totally reasonable regarding not being competitive with production systems -- do you feel the model architecture, as is, would scale well to thousands of hours of audio?"

I think that models trained with thousands of hours of data, such as Whisper, are not much different from our model architecture. Whisper Small has 12 encoder layers, just like our model, even though we use a Conformer instead of a Transformer. Of course, if you want to scale to much more data, bigger models are generally better.

sarapapi commented 9 months ago

I am closing this as it has been stale for a while. Feel free to reopen if anything else is needed.