
[BUG] DocumentTranslator - No TargetInputs definition #2251


haithamshahin333 commented 2 months ago

SynapseML version

1.0.4

System information

Describe the problem

Cannot call the DocumentTranslator setTargets param in the constructor, and it is unclear what the definition of the TargetInputs object should be in PySpark. How should targetInputs be defined in PySpark so that DocumentTranslator can be called?

https://mmlspark.blob.core.windows.net/docs/1.0.4/scala/com/microsoft/azure/synapse/ml/services/translate/TargetInput.html

Code to reproduce issue

DocumentTranslator() .... .setTargets([{"targetUrl": "", "language": ""}])

Other info / logs

No response

What component(s) does this bug affect?

What language(s) does this bug affect?

What integration(s) does this bug affect?

mhamilton723 commented 4 days ago

Try using setTargetsCol("colname")

Here's a quick example of how to make a targets column. Note that the provided values are just toy examples to show the syntax and should be set for your application:


from pyspark.sql.types import StructType, StructField, StringType, ArrayType

# Define the Glossary schema
glossary_schema = StructType([
    StructField("format", StringType(), True),
    StructField("glossaryUrl", StringType(), True),
    StructField("storageSource", StringType(), True),
    StructField("version", StringType(), True)
])

# Define the TargetInput schema
target_input_schema = StructType([
    StructField("category", StringType(), True),
    StructField("glossaries", ArrayType(glossary_schema), True),
    StructField("targetUrl", StringType(), False),
    StructField("language", StringType(), False),
    StructField("storageSource", StringType(), True)
])

from pyspark.sql import Row

# Sample data for the TargetInput column
data = [
    Row(category="Category1",
        glossaries=[
            Row(format="PDF", glossaryUrl="http://example.com/glossary1.pdf", storageSource=None, version="1.0"),
            Row(format="HTML", glossaryUrl="http://example.com/glossary2.html", storageSource="source1", version=None)
        ],
        targetUrl="http://example.com/target1",
        language="en",
        storageSource="sourceA"),

    Row(category=None,
        glossaries=None,
        targetUrl="http://example.com/target2",
        language="fr",
        storageSource=None)
]

df = spark.createDataFrame(data, schema=target_input_schema)
df.show(truncate=False)
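
For reference, here is a rough sketch of how a column like this might then be wired into DocumentTranslator via setTargetsCol. Only setTargetsCol is confirmed above; the import path, the other setter names, and the placeholder keys/URLs are assumptions that should be checked against the SynapseML 1.0.4 DocumentTranslator API reference before use.

from pyspark.sql import functions as F

# Collapse the TargetInput fields into a single array-of-struct column,
# since the Scala API models targets as a sequence of TargetInput objects.
targets_df = df.withColumn(
    "targets",
    F.array(F.struct("category", "glossaries", "targetUrl", "language", "storageSource"))
)

# Hypothetical wiring -- only setTargetsCol is confirmed above; the import path
# and the remaining setters/values are assumptions to verify against the docs.
from synapse.ml.services.translate import DocumentTranslator

document_translator = (
    DocumentTranslator()
    .setSubscriptionKey("<translator-key>")            # placeholder value
    .setServiceName("<translator-resource-name>")      # assumed setter, placeholder value
    .setSourceUrl("<source-container-sas-url>")        # assumed setter, placeholder value
    .setTargetsCol("targets")
    .setOutputCol("translationStatus")                 # assumed output column name
)

results = document_translator.transform(targets_df)
results.show(truncate=False)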