LanguageDetector returns Unknown for long text [BUG]

jingwora commented 1 year ago

SynapseML version

com.microsoft.azure:synapseml_2.12:0.9.5

System information

Language : pyspark
Spark Platform: Databricks Runtime version 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12)

Describe the problem

With my reproduction code, LanguageDetector detects the language of very short strings only; longer text comes back as Unknown.

Test results:

日本国 -> Japanese
にほんこく -> Japanese
日本国（にほんこく、にっぽんこく -> Unknown
日本国（にほんこく、にっぽんこく、英: Japan）、または日本（にほん、にっぽん）は、東アジアに位置する民主制国家 [1]。首都は東京都[注 2][2][3 -> Unknown

I used this same code before and it detected the language of long paragraphs correctly. (The environment versions have not changed at all.)

This bug started occurring a few days ago.

Could you check what happened, and how can I solve this issue?

Code to reproduce issue

import synapse.ml
from synapse.ml.cognitive import *
from pyspark.sql.functions import col

print(f"synapse.ml.cognitive version:{synapse.ml.cognitive.__version__}")  # synapse.ml.cognitive version:0.9.5

# Set key
key = ''  # API key
location = 'japaneast' # Location

language = (LanguageDetector()
    .setSubscriptionKey(key)
    .setLocation(location)
    .setTextCol("text")
    .setOutputCol("language")
    .setErrorCol("error"))

# Test Text Analytics
test_data = spark.createDataFrame([(1, 'Japan'),
                                   (2, '日本国'),
                                   (3, 'にほんこく'),
                                   (4, '日本国（にほんこく、にっぽんこく'),
                                   (5, '日本国（にほんこく、にっぽんこく、英: Japan）、または日本（にほん、にっぽん）は、東アジアに位置する民主制国家 [1]。首都は東京都[注 2][2][3]。'),
                                  ], ["id", "text"])
# display(test_data)
test_data2 =  language.transform(test_data)
display(test_data2)

### RETURN
# synapse.ml.cognitive version:0.9.5
# 1
# Japan
# null
# [{"detectedLanguage": {"name": "English", "iso6391Name": "en", "confidenceScore": 0.98}, "warnings": [], "statistics": null, "error-message": null}]
# 2
# 日本国
# null
# [{"detectedLanguage": {"name": "Japanese", "iso6391Name": "ja", "confidenceScore": 1}, "warnings": [], "statistics": null, "error-message": null}]
# 3
# にほんこく
# null
# [{"detectedLanguage": {"name": "Japanese", "iso6391Name": "ja", "confidenceScore": 1}, "warnings": [], "statistics": null, "error-message": null}]
# 4
# 日本国（にほんこく、にっぽんこく
# null
# [{"detectedLanguage": {"name": "(Unknown)", "iso6391Name": "(Unknown)", "confidenceScore": 0}, "warnings": [], "statistics": null, "error-message": null}]
# 5
# 日本国（にほんこく、にっぽんこく、英: Japan）、または日本（にほん、にっぽん）は、東アジアに位置する民主制国家 [1]。首都は東京都[注 2][2][3]。
# null
# [{"detectedLanguage": {"name": "(Unknown)", "iso6391Name": "(Unknown)", "confidenceScore": 0}, "warnings": [], "statistics": null, "error-message": null}]

Other info / logs

No response

What component(s) does this bug affect?

[X] area/cognitive: Cognitive project
[ ] area/core: Core project
[ ] area/deep-learning: DeepLearning project
[ ] area/lightgbm: Lightgbm project
[ ] area/opencv: Opencv project
[ ] area/vw: VW project
[ ] area/website: Website
[ ] area/build: Project build system
[ ] area/notebooks: Samples under notebooks folder
[ ] area/docker: Docker usage
[ ] area/models: models related issue

What language(s) does this bug affect?

[ ] language/scala: Scala source code
[X] language/python: Pyspark APIs
[ ] language/r: R APIs
[ ] language/csharp: .NET APIs
[ ] language/new: Proposals for new client languages

What integration(s) does this bug affect?

[ ] integrations/synapse: Azure Synapse integrations
[ ] integrations/azureml: Azure ML integrations
[X] integrations/databricks: Databricks integrations

github-actions[bot] commented 1 year ago

Hey @jingwora :wave:! Thank you so much for reporting the issue/feature request :rotating_light:. Someone from SynapseML Team will be looking to triage this issue soon. We appreciate your patience.

JessicaXYWang commented 1 year ago

Hi @jingwora, I can confirm that I can repro this issue.

The issue comes from Cognitive Service itself: I can repro it without using SynapseML.

key = '' #cognitive service key
endpoint = "" #cognitive service endpoint, eg: https://{yourworkspacename}.cognitiveservices.azure.com/

from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

# Authenticate the client using your key and endpoint 
def authenticate_client():
    ta_credential = AzureKeyCredential(key)
    text_analytics_client = TextAnalyticsClient(
            endpoint=endpoint, 
            credential=ta_credential)
    return text_analytics_client

client = authenticate_client()

# Example method for detecting the language of text
def language_detection_example(client):
    try:
        documents = ["日本国（にほんこく、にっぽんこく、英: Japan）、または日本（にほん、にっぽん）は、東アジアに位置する民主制国家 [1]。首都は東京都[注 2][2][3]。"]
        response = client.detect_language(documents = documents, country_hint = 'us')[0]
        print("Language: ", response.primary_language.name)

    except Exception as err:
        print("Encountered exception. {}".format(err))
language_detection_example(client)

I have opened a ticket with the Cognitive Service Language Detection team and will keep you updated.

JessicaXYWang commented 1 year ago

Hi @jingwora, the Cognitive Service team has acknowledged the issue and will update the model again soon to address these regressions. For now, you can switch to an older version of the model.

We found an issue with setting the model version in SynapseML, and @serena-ruan helped fix it in https://github.com/microsoft/SynapseML/pull/1756

For now, you can use the latest build that includes the fix, com.microsoft.azure:synapseml_2.12:0.10.2-14-b205cc47-SNAPSHOT, and set the model version like this:

import synapse.ml
from synapse.ml.cognitive import *
from pyspark.sql.functions import col

# Set key
key = ''  # API key
location = 'japaneast' # Location

language = (LanguageDetector()
    .setSubscriptionKey(key)
    .setLocation(location)
    .setModelVersion("2021-11-20") #previous version
    .setTextCol("text")
    .setOutputCol("language")
    .setErrorCol("error"))

# Test Text Analytics
test_data = spark.createDataFrame([(1, 'Japan'),
                                   (2, '日本国'),
                                   (3, 'にほんこく'),
                                   (4, '日本国（にほんこく、にっぽんこく'),
                                   (5, '日本国（にほんこく、にっぽんこく、英: Japan）、または日本（にほん、にっぽん）は、東アジアに位置する民主制国家 [1]。首都は東京都[注 2][2][3]。'),
                                  ], ["id", "text"])
# display(test_data)
test_data2 =  language.transform(test_data)
display(test_data2)

If you want to test directly with Cognitive Service:

key = '' #cognitive service key
endpoint = "" #cognitive service endpoint, eg: https://{yourworkspacename}.cognitiveservices.azure.com/

from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

# Authenticate the client using your key and endpoint 
def authenticate_client():
    ta_credential = AzureKeyCredential(key)
    text_analytics_client = TextAnalyticsClient(
            endpoint=endpoint, 
            credential=ta_credential)
    return text_analytics_client

client = authenticate_client()

# Example method for detecting the language of text
def language_detection_example(client):
    try:
        documents = ["日本国（にほんこく、にっぽんこく、英: Japan）、または日本（にほん、にっぽん）は、東アジアに位置する民主制国家 [1]。首都は東京都[注 2][2][3]。"]
        response = client.detect_language(documents = documents, country_hint = 'us', model_version='2021-11-20')[0]
        print("Language: ", response.primary_language.name)

    except Exception as err:
        print("Encountered exception. {}".format(err))
language_detection_example(client)

I'll let you know once the Cognitive Service team updates the model.
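If pinning every request to the older model is undesirable, one option is to retry only the inputs that come back as (Unknown). The following is a minimal sketch of that idea, not part of SynapseML or the Azure SDK: `detect_with_fallback`, the `Detector` signature, and `fake_detect` are hypothetical names introduced here for illustration, and the stand-in detector only mimics the regression so the example runs without a key.

```python
from typing import Callable, Optional

UNKNOWN = "(Unknown)"                  # name reported when detection fails
FALLBACK_MODEL_VERSION = "2021-11-20"  # older model version mentioned above

# Hypothetical signature: (text, model_version or None) -> detected language name.
Detector = Callable[[str, Optional[str]], str]


def detect_with_fallback(detect: Detector, text: str) -> str:
    """Try the service default model first, then the pinned older model."""
    name = detect(text, None)
    if name == UNKNOWN:
        name = detect(text, FALLBACK_MODEL_VERSION)
    return name


# Stand-in detector that mimics the regression, so the sketch runs offline.
def fake_detect(text: str, model_version: Optional[str]) -> str:
    if model_version is None and len(text) > 5:
        return UNKNOWN
    return "Japanese"


print(detect_with_fallback(fake_detect, "にほんこく"))  # Japanese
print(detect_with_fallback(fake_detect, "日本国（にほんこく、にっぽんこく"))  # Japanese
```

With the real client, `detect` could presumably wrap the `client.detect_language(...)` call shown above and return `primary_language.name`. The retry costs a second request, but only for inputs the default model could not classify.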

jingwora commented 1 year ago

Hi @JessicaXYWang

Thank you for your fast response. Is there any solution that doesn't require switching to the snapshot build (com.microsoft.azure:synapseml_2.12:0.10.2-14-b205cc47-SNAPSHOT)?

JessicaXYWang commented 1 year ago

Hi @jingwora, it will be fixed automatically once the Cognitive Service team releases a new version of the language detection model.

But if you want to pin a previous model version yourself to work around the issue now, you need the snapshot build; earlier builds won't work.

jingwora commented 1 year ago

Hi @JessicaXYWang, thank you for your clarification.

mhamilton723 commented 1 year ago

Thanks @JessicaXYWang for doing this repro. Did the cog service yield any errors? If so, they should show up in the error column of the transformer. @jingwora, do you see anything in the error column? If not, that should be fixed on our side.

jingwora commented 1 year ago

@mhamilton723 Thanks for your help! There is no error in the error column; the language column shows unknown. Here are the results of a couple of experiments:

English short text: en (correct)
English long text: en (correct)
Japanese short text: ja (correct)
Japanese long text: unknown (incorrect)

JessicaXYWang commented 1 year ago

Hi @mhamilton723, the cog service does not yield errors. According to the Cognitive Service Language Detection documentation, the response for languages that cannot be detected is unknown.
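Since a failed detection arrives as a normal result rather than an error, callers have to check for the sentinel themselves. Below is a minimal sketch that assumes the results are available as JSON strings shaped like the output posted at the top of this issue; `is_undetected` and the sample `rows` are illustrative names and data, not part of any library.

```python
import json

UNKNOWN = "(Unknown)"  # sentinel seen in the "name" and "iso6391Name" fields above


def is_undetected(result: dict) -> bool:
    """Return True when a single detection result carries the unknown sentinel."""
    detected = result.get("detectedLanguage") or {}
    return detected.get("name") == UNKNOWN or detected.get("iso6391Name") == UNKNOWN


# Sample payloads copied from the shape of the output in the original report.
rows = [
    (2, '[{"detectedLanguage": {"name": "Japanese", "iso6391Name": "ja", '
        '"confidenceScore": 1}, "warnings": [], "statistics": null, "error-message": null}]'),
    (4, '[{"detectedLanguage": {"name": "(Unknown)", "iso6391Name": "(Unknown)", '
        '"confidenceScore": 0}, "warnings": [], "statistics": null, "error-message": null}]'),
]

# Collect the ids of rows where any result in the payload was not detected.
undetected_ids = [
    row_id
    for row_id, payload in rows
    if any(is_undetected(result) for result in json.loads(payload))
]
print(undetected_ids)  # [4]
```

A check like this makes silent failures visible even though the error column stays null.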
