speechmatics / speechmatics-python

Python library and CLI for Speechmatics
https://speechmatics.github.io/speechmatics-python/
MIT License

English-only Language Detection in Transcription for English/Spanish Audio #80

Closed petiatil closed 9 months ago

petiatil commented 9 months ago

Current behaviour

Only English is detected when transcribing a 12+ minute English/Spanish video with Batch transcription: "en" is the value of "language" for every word in the transcription results.

Steps to Reproduce

Download audio from the YouTube link or GoogleDrive link

Update the audio_file (path to audio file) and speechmaticsAPIkey variables


import ssl

import certifi
from speechmatics.batch_client import BatchClient
from speechmatics.models import (
    BatchLanguageIdentificationConfig,
    BatchTranscriptionConfig,
    ConnectionSettings,
)

audio_file = "PATH_TO_AUDIO_FILE"
speechmaticsAPIkey = "YOUR_API_KEY"

# Verify TLS certificates against the certifi CA bundle.
ssl_context = ssl.create_default_context()
ssl_context.load_verify_locations(certifi.where())

settings = ConnectionSettings(
    url="https://asr.api.speechmatics.com/v2",
    auth_token=speechmaticsAPIkey,
    ssl_context=ssl_context,
)

operatingPoint = "enhanced"
# Restrict automatic language identification to English and Spanish.
expectedLanguages = BatchLanguageIdentificationConfig(
    expected_languages=["en", "es"]
)

LANGUAGE = "auto"

conf = BatchTranscriptionConfig(
    language=LANGUAGE,
    operating_point=operatingPoint,
    language_identification_config=expectedLanguages,
)

# Submit the job and block until the json-v2 transcript is ready.
with BatchClient(settings) as client:
    job_id = client.submit_job(audio=audio_file, transcription_config=conf)
    transcript = client.wait_for_completion(job_id, transcription_format="json-v2")

Expected Behaviour

The 'language' values in transcript['results'] that correspond to Spanish words are expected to be "es".

Environment

Mac, Ventura 13.6.2, Python 3.10, standard Python venv.

Other Info

Diarization works for this file, but language detection is still English-only.

petiatil commented 9 months ago

I initially considered the issue resolved, putting it down to too little audio per speaker (which may be the case for the 12+ minute file). However, when I tested the same code on a 30-minute English/Spanish file with the same inputs (expectedLanguages set to "en" and "es"), every word was labeled "es", whether English or Spanish.

My next test will be to see whether enabling diarization resolves this (it wasn't used in that test).

petiatil commented 9 months ago

Update: Testing with diarization didn't resolve the issue (every word's language was still labeled "es").

nickgerig commented 9 months ago

Hi @petiatil - thanks for the detailed bug description. This is the correct/expected behaviour, as we only support a single language per file at the moment. The auto language functionality takes samples to determine what it thinks the predominant language is (in this case, either en or es), and all results are labelled with that language.

We will start to support a bilingual Spanish/English pack soon; however, that will not label results with the specific language either.
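
To see the single-language labelling in a returned transcript, one option is to tally the language labels across all words. The sketch below is illustrative only: it assumes the json-v2 layout in which each entry of `results` carries its label on the first item of `alternatives`, and both `count_word_languages` and the sample data are hypothetical, not part of the speechmatics library.

```python
from collections import Counter


def count_word_languages(transcript: dict) -> Counter:
    """Tally the language label attached to each word in a json-v2 style transcript."""
    counts = Counter()
    for item in transcript.get("results", []):
        if item.get("type") != "word":
            continue  # skip punctuation and any other non-word entries
        alternatives = item.get("alternatives") or []
        if alternatives:
            # Assumption: the label of interest sits on the first alternative.
            counts[alternatives[0].get("language", "unknown")] += 1
    return counts


# Hand-written sample shaped like a json-v2 response (not real API output).
sample = {
    "results": [
        {"type": "word", "alternatives": [{"content": "Hello", "language": "es"}]},
        {"type": "word", "alternatives": [{"content": "mundo", "language": "es"}]},
        {"type": "punctuation", "alternatives": [{"content": ".", "language": "es"}]},
        {"type": "word", "alternatives": [{"content": "amigos", "language": "es"}]},
    ]
}

language_counts = count_word_languages(sample)
print(language_counts)
```

With a transcript produced under the behaviour described above, the tally would be expected to contain a single key: the language that identification settled on for the whole file.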

petiatil commented 9 months ago

@nickgerig I see, thank you.

If applicable:

I'm finalizing Speechmatics integration in an app with Real-time and Batch options. Could you clarify if the auto and expected_languages options enhance transcription quality in mixed-language contexts (such as better Spanish spelling when English is predominant [compared to just manually selecting English]), or are they mainly for determining the dominant language model to use?

The app doesn't hinge on this; I'm just aiming to keep the offered options simple (including only those that could significantly affect transcription quality).

Finally, unless directed otherwise, I'll assume (based on this article) the difference of quality between Real-time and Batch is still currently negligible or equal. I have max_delay set internally to 20.

nickgerig commented 9 months ago

> I'm finalizing Speechmatics integration in an app with Real-time and Batch options. Could you clarify if the auto and expected_languages options enhance transcription quality in mixed-language contexts (such as better Spanish spelling when English is predominant [compared to just manually selecting English]), or are they mainly for determining the dominant language model to use?

Yes, it's the latter; they are just about determining the correct language.

> Finally, unless directed otherwise, I'll assume (based on this article) the difference of quality between Real-time and Batch is still currently negligible or equal. I have max_delay set internally to 20.

Correct - there should be no difference between Real-time and Batch in this case.