Closed: kaahan closed this issue 9 months ago.
Hi, can you please install Python 3.8 and rerun the code? Thanks for your interest in our work.
Downgrading to Python 3.8 worked (though I had to install praat-parselmouth and torchaudio). However, on running the model, I'm getting poor results. For the attached audio file and this command:
!simuleval \
--agent examples/speech_to_text/simultaneous_translation/agents/v1_0/simul_offline_alignatt.py \
--source /workspace/source.txt \
--target /workspace/target.txt \
--config config_simul.yaml \
--model-path /workspace/FBK-fairseq/checkpoint/checkpoint_avg7.pt \
--extract-attn-from-layer 3 \
--frame-num 4 \
--speech-segment-factor 10 \
--output /content/ \
--port 8000 \
--gpu \
--scores
instances.log has this to say:
{"index": 0, "prediction": "\u266b So bu le : O o h , o o h , o o h , o o h . \u266b O o h , o o h , o o h , o o h . \u266b </s>", "delays": [800.0, 1200.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 1600.0, 3600.0, 4800.0, 4800.0, 4800.0, 4800.0, 10800.0, 10800.0, 10800.0, 10800.0, 10800.0, 10800.0, 11956.825396825398, 11956.825396825398, 11956.825396825398, 11956.825396825398, 11956.825396825398, 11956.825396825398, 11956.825396825398, 11956.825396825398, 11956.825396825398], "elapsed": [3245.3799724578857, 3832.2
195529937744, 4439.988946914673, 4442.562913894653, 4444.907760620117, 4448.003625869751, 4450.927114486694, 4453.659391403198, 4456.332778930664, 4459.060287475586, 4461.990690231323, 4464.865064620972, 4467.783546447754, 4470.6488609313965, 4473.50058555603, 4476.436710357666, 4479.426717758179, 4482.651567459106, 4485.540246963501, 4488.520240783691, 7730.323886871338, 9779.305267333984, 9782.306957244873, 9785.19778251648, 9788.330364227295, 20424.693155288696, 20427.18677520752, 20429.49514389038, 20431.778955459595, 20434.066343307495, 20435.98656654358, 22892.254254628744, 22894.257447530355, 22896.099946309652, 22897.93910722884, 22899.78947382125, 22901.618144322958, 22903.44562273177, 22905.49459200057, 22907.07912187728], "prediction_length": 40, "reference": "Esta es una prueba de localizaci\u00f3n de video. Probablemente tengan ese gui\u00f3n ah\u00ed, as\u00ed que intentemos algo m\u00e1s. Estamos en 1 Culver, debajo de WeWork en la oficina de Apple.", "source": ["/workspace/one_culver_audio.wav", "samplerate: 44100 Hz", "channels: 1", "duration: 11.957 s", "format: WAV (Microsoft) [WAV]", "subtype: Signed 16 bit PCM [PCM_16]"], "source_length": 11956.825396825398, "reference_length": 30, "metric": {"sentence_bleu": 1.205256842736819, "latency": {"AL": -1892.7086181640625, "AP": 0.47255739569664, "DAL": 1866.387451171875}, "latency_ca": {"AL": 1129.2596435546875, "AP": 0.9680058360099792, "DAL": 7387.2373046875}}}
https://github.com/hlt-mt/FBK-fairseq/assets/106453090/8c6a585e-b049-41ca-8b53-5485e73e58da
The prediction seems quite nonsensical :-/
https://github.com/hlt-mt/FBK-fairseq/assets/106453090/965e414b-d307-43e9-b1a3-d70355541258
Hi, that's strange. Can you please show me the log file (the stdout SimulEval produces)? Also, SimulEval and our models work with wav files with 1 channel and a 16kHz sampling rate (standard conversion). Can you please try to convert the audio, which is in mp4, using these settings and rerun the script? Thanks
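For reference, something along these lines should produce a compatible file (a minimal sketch assuming torchaudio is available; an equivalent ffmpeg call with -ac 1 -ar 16000 works just as well, and the paths are only placeholders taken from this thread):

# Minimal conversion sketch: downmix to mono and resample to 16 kHz, 16-bit PCM WAV.
import torchaudio

waveform, sample_rate = torchaudio.load("/workspace/one_culver_audio.wav")
waveform = waveform.mean(dim=0, keepdim=True)  # collapse all channels to a single one
waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=16000)
torchaudio.save(
    "/workspace/one_culver_audio_16khz.wav",
    waveform,
    16000,
    encoding="PCM_S",
    bits_per_sample=16,
)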
Updated instance.log after changing to 16kHz, 1 channel (I was already using a wav file, GitHub would only let me upload mp4):
{"index": 0, "prediction": "\u266b en la tierra , en el campo , en el cielo , en el cielo , en la tierra . \u266b \u266b \u266b \u266b \u266b \u266b \u266b \u266b \u266b \u266b \u266b \u266b \u266b en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo ,en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , en el cielo , \u266b </s>", "delays": [2000.0, 2000.0, 2000.0, 2000.0, 2000.0, 2000.0, 2000.0, 2400.0, 2400.0, 3600.0, 3600.0, 3600.0, 3600.0, 3600.0, 5200.0, 5200.0, 5200.0, 5200.0, 5200.0, 5200.0, 5200.0, 10000.0, 10000.0, 10000.0, 10000.0, 10400.0, 10400.0, 10400.0, 10400.0, 10400.0, 10400.0, 10400.0, 10400.0, 10400.0, 10400.0, 10400.0, 10400.0, 11200.0, 11200.0, 11200.0, 11200.0, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645,11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645], "elapsed": [4843.686819076538, 4845.778942108154, 4848.869800567627, 4851.950168609619, 4854.800462722778, 4857.587575912476, 4860.589981079102, 5412.688636779785, 5415.485048294067, 7128.985023498535, 7131.087636947632, 7132.9320430755615, 7135.399436950684, 7138.24520111084, 9567.100238800049, 9569.137287139893, 9571.778964996338, 9574.652862548828, 9577.43592262268, 9580.456686019897, 9583.590698242188, 17199.337005615234, 17201.39741897583, 17203.248262405396, 17205.052614212036, 18548.866176605225, 18551.240348815918, 18553.430461883545, 18556.083583831787, 18558.77676010132, 18561.42511367798, 18564.08109664917, 18566.757345199585, 18569.596672058105, 18572.455549240112, 18575.34852027893, 18578.041458129883, 21721.597385406494, 21724.094104766846, 21726.03816986084, 21729.12187576294, 24012.092241145067, 24014.087804652147, 24015.938171244554, 24017.74323830693, 24019.595273829393, 24022.0617140302, 24025.21432290166, 24028.136381007127, 24030.958064890794, 24033.733257151536, 
24036.878713465623, 24039.005884028367, 24040.777334071092, 24042.553790903978, 24044.304021693162, 24046.05115304082, 24049.00205979436, 24051.829704142503, 24054.84212288945, 24907.369502879075, 24909.839996195726, 24912.148126460008, 24914.335616923265, 24917.053350306443, 24919.854768610887, 24922.658332682542, 24925.372489787034, 24928.38824639409, 24931.31292710393, 24934.21090493291, 24936.883338786058, 24939.5204866895, 24942.180761195115, 24945.013650752, 24947.85035500615, 24950.693973399095, 24953.515418864183, 24956.343540049485, 24959.166416026048, 24961.803087092332, 24964.39588914006, 24967.00776467412, 24969.601281977586, 24972.40246186345, 24975.23964295476, 24978.075393534593, 24980.874427653245, 24983.71232400029, 24986.548789835862, 24989.34925446599, 24992.193111277513, 24995.02266297429, 24997.637876368455, 25000.252136088304, 25002.96033272832, 25005.57888398259, 25008.380302287034, 25011.21819863408, 25014.04178986638, 25016.85250649541, 25019.682058192186, 25022.505887843065, 25025.308498240403, 25027.948268748216, 25030.544170237474, 25033.13029656499, 25035.96151719182, 25038.79512200444, 25041.599401331834, 25044.409164286546, 25047.05918679326, 25049.724229670457, 25052.331336832933, 25055.132993555955, 25057.92654404729, 25060.739883280687, 25063.449272013597, 25066.04612717717, 25068.717607356004, 25071.35928521245, 25074.152120448045, 25077.24011788457, 25079.786189890794], "prediction_length": 124, "reference": "Esta es una prueba de localizaci\u00f3n de video. Probablemente tengan ese gui\u00f3n ah\u00ed, as\u00ed que intentemos algo m\u00e1s. Estamos en 1 Culver, debajo de WeWork en la oficina de Apple.", "source": ["/workspace/one_culver_audio_16khz.wav", "samplerate: 16100 Hz", "channels: 1", "duration: 11.957 s", "format: WAV (Microsoft) [WAV]", "subtype: Signed 16 bit PCM [PCM_16]"], "source_length": 11956.832298136645, "reference_length": 30, "metric": {"sentence_bleu": 0.7415472433597086, "latency": {"AL": -282.8108215332031, "AP": 0.8600947260856628, "DAL": 7117.89599609375}, "latency_ca": {"AL": 3435.63134765625, "AP": 1.7415765523910522, "DAL": 17131.087890625}}}
Here's the stdout from SimulEval:
(workspace-3.8) root@05ee56face0f:/workspace/FBK-fairseq# simuleval --agent examples/speech_to_text/simultaneous_translation/agents/v1_0/simul_offline_alignatt.py --source /workspace/source.txt --target /workspace/target.txt --data-bin /workspace/FBK-fairseq/checkpoint/ --config config_simul.yaml --model-path /workspace/FBK-fairseq/checkpoint/checkpoint_avg7.pt --extract-attn-from-layer 3 --frame-num 4 --speech-segment-factor 10 --output /content/ --port 8000 --gpu --scores
2023-10-26 22:55:28 | INFO | simuleval.scorer | Evaluating on speech
2023-10-26 22:55:28 | INFO | simuleval.scorer | Source: /workspace/source.txt
2023-10-26 22:55:28 | INFO | simuleval.scorer | Target: /workspace/target.txt
2023-10-26 22:55:28 | INFO | simuleval.scorer | Number of sentences: 1
2023-10-26 22:55:28 | INFO | simuleval.server | Evaluation Server Started (process id 3964). Listening to port 8000
2023-10-26 22:55:31 | WARNING | simuleval.scorer | Resetting scorer
2023-10-26 22:55:31 | INFO | simuleval.cli | Output dir: /content/
2023-10-26 22:55:31 | INFO | simuleval.cli | Start data writer (process id 3970)
2023-10-26 22:55:31 | INFO | simuleval.cli | Evaluating AlignAttSTAgent (process id 3902) on instances from 0 to 0
2023-10-26 22:55:37 | INFO | examples.speech_to_text.tasks.speech_to_text_ctc | target dictionary size (/workspace/FBK-fairseq/checkpoint/spm_unigram8000_st_target.txt): 8,000
2023-10-26 22:55:37 | INFO | examples.speech_to_text.tasks.speech_to_text_ctc | source dictionary size (/workspace/FBK-fairseq/checkpoint/spm_unigram.en.txt): 5,002
2023-10-26 22:55:54 | INFO | simuleval.cli | Evaluation results:
{
    "Quality": {
        "BLEU": 0.7659623558516302
    },
    "Latency": {
        "AL": -282.8108215332031,
        "AL_CA": 3435.63134765625,
        "AP": 0.8600947260856628,
        "AP_CA": 1.7415765523910522,
        "DAL": 7117.89599609375,
        "DAL_CA": 17131.087890625
    }
}
2023-10-26 22:55:54 | INFO | simuleval.cli | Evaluation finished
2023-10-26 22:55:54 | INFO | simuleval.cli | Close data writer
2023-10-26 22:55:54 | INFO | simuleval.cli | Shutdown server
Here is my configuration if that's helpful:
bpe_tokenizer:
  bpe: sentencepiece
  sentencepiece_model: /workspace/FBK-fairseq/checkpoint/spm_unigram8000_st_target.model
bpe_tokenizer_src:
  bpe: sentencepiece
  sentencepiece_model: /workspace/FBK-fairseq/checkpoint/spm_unigram.en.model
global_cmvn:
  stats_npz_path: /workspace/FBK-fairseq/checkpoint/gcmvn.npz
input_channels: 1
input_feat_per_channel: 80
sampling_alpha: 1.0
specaugment:
  freq_mask_F: 27
  freq_mask_N: 1
  time_mask_N: 1
  time_mask_T: 100
  time_mask_p: 1.0
  time_wrap_W: 0
transforms:
  '*':
  - global_cmvn
  _train:
  - global_cmvn
  - specaugment
vocab_filename: /workspace/FBK-fairseq/checkpoint/spm_unigram8000_st_target.txt
vocab_filename_src: /workspace/FBK-fairseq/checkpoint/spm_unigram.en.txt
Hi, I noticed an error in the README (the --speech-segment-factor has to be 25) and in the scripts working with the "old" version of SimulEval. I'm working on fixing them, thanks for pointing it out.
By the way, we have a new version of the code that works with the new SimulEval; you can find it here. I tried it on your audio file and our model works as expected, but the performance is poor, mostly because it has been trained only on MuST-C and it is not intended to be robust out of domain.
Hm, when trying to follow the instructions for the new version of SimulEval, I'm running into the following error:
/root/.local/share/pdm/venvs/workspace-6rDWGpm2-fairseq/lib/python3.8/site-packages/pydub/utils.py:170: RuntimeWarning: Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work
warn("Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work", RuntimeWarning)
Traceback (most recent call last):
File "/root/.local/share/pdm/venvs/workspace-6rDWGpm2-fairseq/bin/simuleval", line 33, in <module>
sys.exit(load_entry_point('simuleval', 'console_scripts', 'simuleval')())
File "/workspace/SimulEval/simuleval/cli.py", line 47, in main
system, args = build_system_args()
File "/workspace/SimulEval/simuleval/utils/agent.py", line 138, in build_system_args
system_class.add_args(parser)
File "/workspace/FBK-fairseq/examples/speech_to_text/simultaneous_translation/agents/v1_1/simul_offline_edatt.py", line 51, in add_args
BaseSimulSTAgent.add_args(parser)
File "/workspace/FBK-fairseq/examples/speech_to_text/simultaneous_translation/agents/base_simulst_agent.py", line 84, in add_args
parser.add_argument("--user-dir", type=str, default="examples/simultaneous_translation",
File "/usr/lib/python3.8/argparse.py", line 1398, in add_argument
return self._add_action(action)
File "/usr/lib/python3.8/argparse.py", line 1761, in _add_action
self._optionals._add_action(action)
File "/usr/lib/python3.8/argparse.py", line 1602, in _add_action
action = super(_ArgumentGroup, self)._add_action(action)
File "/usr/lib/python3.8/argparse.py", line 1412, in _add_action
self._check_conflict(action)
File "/usr/lib/python3.8/argparse.py", line 1551, in _check_conflict
conflict_handler(action, confl_optionals)
File "/usr/lib/python3.8/argparse.py", line 1560, in _handle_conflict_error
raise ArgumentError(action, message % conflict_string)
argparse.ArgumentError: argument --user-dir: conflicting option string: --user-dir
This is with the following run command:
simuleval \
--agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.simul_offline_alignatt.AlignAttSTAgent \
--source /workspace/source.txt \
--target /workspace/target.txt \
--data-bin /workspace/FBK-fairseq/checkpoint/ \
--config config_simul.yaml \
--model-path /workspace/FBK-fairseq/checkpoint/checkpoint_avg7.pt --prefix-size 1 --prefix-token "nomt" \
--extract-attn-from-layer 3 --frame-num 4 \
--source-segment-size 1000 \
--device cuda:0 \
--quality-metrics BLEU --latency-metrics LAAL AL ATD --computation-aware \
--output /content/
performance is poor, mostly because it has been trained only on MuST-C and it is not intended to be robust out of domain.
I'm a bit confused by this -- isn't MuST-C a TED-based dataset? It should have reverb, some crowd noise, etc., which would appear to be harder than the audio I've sent.
Removing the --user-dir argument from base_simulst_agent.py fixed this (though that seems a little suspect). I'm getting the following result:
{"index": 0, "prediction": "fuerte como el video de un ni\u00f1o, probablemente nunca lo hemos escrito tan pronto como sea posible, estamos fuera de un solo mundo, aunque no est\u00e1bamos en ninguno de nosotros.", "delays": [2000.0, 2000.0, 2000.0, 3000.0, 3000.0, 4000.0, 4000.0, 4000.0, 4000.0, 4000.0, 4000.0, 6000.0, 6000.0, 7000.0, 7000.0, 7000.0, 9000.0, 9000.0, 10000.0, 10000.0, 10000.0, 10000.0, 10000.0, 10000.0, 11000.0, 11000.0, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645], "elapsed": [4337.5279903411865, 4337.5279903411865, 4337.5279903411865, 5465.068101882935, 5465.068101882935, 6640.028238296509, 6640.028238296509, 6640.028238296509, 6640.028238296509, 6640.028238296509, 6640.028238296509, 9043.581485748291, 9043.581485748291, 10249.344110488892, 10249.344110488892, 10249.344110488892, 12713.966131210327, 12713.966131210327, 14004.980564117432, 14004.980564117432, 14004.980564117432, 14004.980564117432, 14004.980564117432, 14004.980564117432, 15320.341110229492, 15320.341110229492, 16591.46321663945, 16591.46321663945, 16591.46321663945, 16591.46321663945], "prediction_length": 30, "reference": "Esta es una prueba de localizaci\u00f3n de video. Probablemente tengan ese gui\u00f3n ah\u00ed, as\u00ed que intentemos algo m\u00e1s. Estamos en 1 Culver, debajo de WeWork en la oficina de Apple.", "source": ["/workspace/one_culver_audio_16khz.wav", "samplerate: 16100 Hz", "channels: 1", "duration: 11.957 s", "format: WAV (Microsoft) [WAV]", "subtype: Signed 16 bit PCM [PCM_16]"], "source_length": 11956.832298136645}
which appears a little more reasonable, but is still quite poor.
Running it on cleaned audio (i.e., background removed) gives better results, though it does seem to struggle with proper nouns :-)
{"index": 0, "prediction": "fuerte como el video de un ni\u00f1o, probablemente nunca lo hemos escrito tan pronto como sea posible, estamos fuera de un solo mundo, aunque no est\u00e1bamos en ninguno de nosotros.", "delays": [2000.0, 2000.0, 2000.0, 3000.0, 3000.0, 4000.0, 4000.0, 4000.0, 4000.0, 4000.0, 4000.0, 6000.0, 6000.0, 7000.0, 7000.0, 7000.0, 9000.0, 9000.0, 10000.0, 10000.0, 10000.0, 10000.0, 10000.0, 10000.0, 11000.0, 11000.0, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645], "elapsed": [4260.651111602783, 4260.651111602783, 4260.651111602783, 5385.631322860718, 5385.631322860718, 6557.546615600586, 6557.546615600586, 6557.546615600586, 6557.546615600586, 6557.546615600586, 6557.546615600586, 8954.18643951416, 8954.18643951416, 10157.024145126343, 10157.024145126343, 10157.024145126343, 12613.478422164917, 12613.478422164917, 13899.338483810425, 13899.338483810425, 13899.338483810425, 13899.338483810425, 13899.338483810425, 13899.338483810425, 15209.887981414795, 15209.887981414795, 16476.026662684373, 16476.026662684373, 16476.026662684373, 16476.026662684373], "prediction_length": 30, "reference": "Esta es una prueba de localizaci\u00f3n de video. Probablemente tengan ese gui\u00f3n ah\u00ed, as\u00ed que intentemos algo m\u00e1s. Estamos en 1 Culver, debajo de WeWork en la oficina de Apple.", "source": ["/workspace/one_culver_audio_16khz.wav", "samplerate: 16100 Hz", "channels: 1", "duration: 11.957 s", "format: WAV (Microsoft) [WAV]", "subtype: Signed 16 bit PCM [PCM_16]"], "source_length": 11956.832298136645}
{"index": 1, "prediction": "fuerte: Esta es una prueba de la globalizaci\u00f3n de video, probablemente tiene ese gui\u00f3n ah\u00ed, as\u00ed que vamos a probar otra cosa. Estamos en un octubre. \u00bfPor qu\u00e9 trabajamos en la oficina de Apple?", "delays": [2000.0, 2000.0, 2000.0, 2000.0, 2000.0, 2000.0, 4000.0, 4000.0, 4000.0, 4000.0, 4000.0, 4000.0, 5000.0, 5000.0, 5000.0, 5000.0, 6000.0, 6000.0, 6000.0, 7000.0, 7000.0, 8000.0, 9000.0, 9000.0, 9000.0, 10000.0, 10000.0, 10000.0, 11000.0, 11000.0, 11956.832298136645, 11956.832298136645, 11956.832298136645, 11956.832298136645], "elapsed": [2247.722625732422, 2247.722625732422, 2247.722625732422, 2247.722625732422, 2247.722625732422, 2247.722625732422, 4582.084178924561, 4582.084178924561, 4582.084178924561, 4582.084178924561, 4582.084178924561, 4582.084178924561, 5820.873022079468, 5820.873022079468, 5820.873022079468, 5820.873022079468, 7076.664209365845, 7076.664209365845, 7076.664209365845, 8353.463888168335, 8353.463888168335, 9646.378993988037, 10938.165664672852, 10938.165664672852, 10938.165664672852, 12264.750719070435, 12264.750719070435, 12264.750719070435, 13640.005826950073, 13640.005826950073, 14967.794545985156, 14967.794545985156, 14967.794545985156, 14967.794545985156], "prediction_length": 34, "reference": "Esta es una prueba de localizaci\u00f3n de video. Probablemente tengan ese gui\u00f3n ah\u00ed, as\u00ed que intentemos algo m\u00e1s. Estamos en 1 Culver, debajo de WeWork en la oficina de Apple.", "source": ["/workspace/one_culver_audio_cleaned_16khz.wav", "samplerate: 16100 Hz", "channels: 1", "duration: 11.957 s", "format: WAV (Microsoft) [WAV]", "subtype: Signed 16 bit PCM [PCM_16]"], "source_length": 11956.832298136645}
It does have this odd property of adding fuerte: in front of the translations -- is this an artifact of MuST-C?
Hi, you should remove --prefix-size 1 --prefix-token "nomt" if you are not using the IWSLT 2023 models (which were trained with the language id prepended as the first token). Please remove these and rerun the code.
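To illustrate what those two flags do (a schematic sketch of the assumed behaviour, not this repository's decoder code; decode_with_prefix and step are hypothetical names):

# Schematic sketch: with a forced prefix, decoding starts from fixed tokens instead of
# predicting them, so a model never trained with that token is conditioned on spurious context.
from typing import Callable, List

def decode_with_prefix(step: Callable[[List[str]], str],
                       prefix: List[str],
                       max_len: int = 128) -> List[str]:
    # e.g. prefix = ["nomt"] with --prefix-size 1 --prefix-token "nomt"
    tokens = list(prefix)
    while len(tokens) < max_len:
        next_token = step(tokens)  # the model predicts the next token given the current context
        if next_token == "</s>":
            break
        tokens.append(next_token)
    return tokens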
Regarding your issue with --user-dir, I am not able to replicate it locally at the moment. Can you please share your environment?
Regarding our models, they have not been developed to be competitive with production systems: building strong models requires training on thousands of hours of audio, while MuST-C consists of only 200/300 hours of high-quality, clean audio (with no background noise).
Removing the prefix-related arguments removed the fuertes, thanks!
Regarding your issue with --user-dir, I am not able to replicate it locally at the moment. Can you please share your environment?
I'm not sure which parts of my environment you'd like replicated, but my pip freeze looks like:
antlr4-python3-runtime==4.8
bitarray==2.6.0
Brotli==1.1.0
certifi==2023.7.22
cffi==1.16.0
charset-normalizer==3.3.1
colorama==0.4.6
coverage==7.3.2
ctc-segmentation==1.7.4
Cython==3.0.4
exceptiongroup==1.1.3
fairseq==1.0.0a0+4b7966b
filelock==3.12.4
flake8==6.1.0
fsspec==2023.10.0
hydra-core==1.0.7
idna==3.4
importlib-resources==6.1.0
iniconfig==2.0.0
Jinja2==3.1.2
lxml==4.9.3
MarkupSafe==2.1.3
mccabe==0.7.0
mpmath==1.3.0
mutagen==1.47.0
networkx==3.1
numpy==1.24.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.3.52
nvidia-nvtx-cu12==12.1.105
omegaconf==2.0.6
packaging==23.2
pandas==2.0.3
pluggy==1.3.0
portalocker==2.0.0
praat-parselmouth==0.4.3
pycodestyle==2.11.1
pycparser==2.21
pycryptodomex==3.19.0
pydub==0.25.1
pyflakes==3.1.0
pytest==7.4.3
pytest-cov==4.1.0
pytest-flake8==1.1.1
python-dateutil==2.8.2
pytz==2023.3.post1
PyYAML==6.0.1
regex==2023.10.3
requests==2.31.0
sacrebleu==2.3.1
-e git+https://github.com/facebookresearch/SimulEval.git@411a73d60d0626d8519f58d02a284fb53a263cad#egg=simuleval
six==1.16.0
soundfile==0.12.1
srt==3.5.3
sympy==1.12
tabulate==0.9.0
TextGrid==1.5
tomli==2.0.1
torch==2.1.0
torchaudio==2.1.0
tornado==6.3.3
tqdm==4.64.1
triton==2.1.0
typing_extensions==4.8.0
tzdata==2023.3
urllib3==2.0.7
websockets==12.0
yt-dlp==2023.10.13
zipp==3.17.0
Regarding our models, they have not been developed to be competitive with production systems: building strong models requires training on thousands of hours of audio, while MuST-C consists of only 200/300 hours of high-quality, clean audio (with no background noise).
And totally reasonable on the competitive-with-production-systems side -- do you feel like the model architecture, as is, would scale well to thousands of hours of audio?
Hi, I believe the error is related to the version of SimulEval. If you install the tool from the commit indicated in the guide, you should be able to solve the --user-dir issue.
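For what it's worth, the traceback itself is a plain argparse conflict: the same option string gets registered twice on one parser, presumably once by the SimulEval CLI and once by the agent's add_args in the mismatched versions. A minimal standalone reproduction (not the actual SimulEval code):

# Standalone sketch of the failure mode: adding the same option twice raises
# argparse.ArgumentError unless the parser is configured to resolve conflicts.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--user-dir", type=str, default="examples/simultaneous_translation")
try:
    parser.add_argument("--user-dir", type=str)  # second registration of the same flag
except argparse.ArgumentError as err:
    print(err)  # argument --user-dir: conflicting option string: --user-dir

# With conflict_handler="resolve", the later definition replaces the earlier one instead of failing.
resolving = argparse.ArgumentParser(conflict_handler="resolve")
resolving.add_argument("--user-dir", type=str, default="examples/simultaneous_translation")
resolving.add_argument("--user-dir", type=str)  # no error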
And totally reasonable on the competitive-with-production-systems side -- do you feel like the model architecture, as is, would scale well to thousands of hours of audio?
I think that models trained on thousands of hours of data, such as Whisper, are not much different from our model architecture. Whisper Small has 12 encoder layers, just like our model, even if we have a Conformer instead of a Transformer. Of course, if you want to scale to much more data, bigger models are generally better.
I am closing this as it has been stale for a while. Feel free to reopen if anything else is needed.
🐛 Bug
Hey! I tried following the instructions here to run the AlignATT agent on the en->es direction model. I git cloned and (editable) installed this repo and SimulEval, downloaded the checkpoint and all the associated metadata files to /workspace/FBK-fairseq/checkpoint/, and ran the following command:
and got the following error:
Environment
fairseq Version: master
How you installed fairseq (pip, source):