
[BUG] DocumentTranslator - No TargetInputs definition #2251


haithamshahin333 commented 2 months ago

SynapseML version

1.0.4

System information

Describe the problem

Cannot call the DocumentTranslator setTargets param in the constructor, and it is unclear what the definition of the TargetInputs object should be in PySpark. How should targetInputs be defined in PySpark so that DocumentTranslator can be called?

https://mmlspark.blob.core.windows.net/docs/1.0.4/scala/com/microsoft/azure/synapse/ml/services/translate/TargetInput.html

Code to reproduce issue

DocumentTranslator() .... .setTargets([{"targetUrl": "", "language": ""}])

Other info / logs

No response

What component(s) does this bug affect?

What language(s) does this bug affect?

What integration(s) does this bug affect?

mhamilton723 commented 4 days ago

Try using setTargetsCol("colname")

Here's a quick example of how to make a targets column. Note that the provided values are just toy examples to show the syntax and should be set for your application:


from pyspark.sql.types import StructType, StructField, StringType, ArrayType

# Define the Glossary schema
glossary_schema = StructType([
    StructField("format", StringType(), True),
    StructField("glossaryUrl", StringType(), True),
    StructField("storageSource", StringType(), True),
    StructField("version", StringType(), True)
])

# Define the TargetInput schema
target_input_schema = StructType([
    StructField("category", StringType(), True),
    StructField("glossaries", ArrayType(glossary_schema), True),
    StructField("targetUrl", StringType(), False),
    StructField("language", StringType(), False),
    StructField("storageSource", StringType(), True)
])

from pyspark.sql import Row

# Sample data for the TargetInput column
data = [
    Row(category="Category1",
        glossaries=[
            Row(format="PDF", glossaryUrl="http://example.com/glossary1.pdf", storageSource=None, version="1.0"),
            Row(format="HTML", glossaryUrl="http://example.com/glossary2.html", storageSource="source1", version=None)
        ],
        targetUrl="http://example.com/target1",
        language="en",
        storageSource="sourceA"),

    Row(category=None,
        glossaries=None,
        targetUrl="http://example.com/target2",
        language="fr",
        storageSource=None)
]

df = spark.createDataFrame(data, schema=target_input_schema)
df.show(truncate=False)
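
For reference, here is a rough sketch of how a column like this might then be wired into DocumentTranslator via setTargetsCol. Only setTargetsCol is confirmed above; the import path, the other setter names, and the placeholder keys/URLs are assumptions that should be checked against the SynapseML 1.0.4 DocumentTranslator API reference before use.

from pyspark.sql import functions as F

# Collapse the TargetInput fields into a single array-of-struct column,
# since the Scala API models targets as a sequence of TargetInput objects.
targets_df = df.withColumn(
    "targets",
    F.array(F.struct("category", "glossaries", "targetUrl", "language", "storageSource"))
)

# Hypothetical wiring -- only setTargetsCol is confirmed above; the import path
# and the remaining setters/values are assumptions to verify against the docs.
from synapse.ml.services.translate import DocumentTranslator

document_translator = (
    DocumentTranslator()
    .setSubscriptionKey("<translator-key>")            # placeholder value
    .setServiceName("<translator-resource-name>")      # assumed setter, placeholder value
    .setSourceUrl("<source-container-sas-url>")        # assumed setter, placeholder value
    .setTargetsCol("targets")
    .setOutputCol("translationStatus")                 # assumed output column name
)

results = document_translator.transform(targets_df)
results.show(truncate=False)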