microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License
5.05k stars 830 forks source link

Simplify setUrl() on Cognitive Services to not require full URL for different cloud environments #1435

Open dayland opened 2 years ago

dayland commented 2 years ago

Feature Request Details Currently when using any of the Cognitive Service transformers against a sovereign cloud environment such as AzureUSGovernment, you must specify the entire URL including the version. This must match exactly the version that the SynapseML version is built for as the schema models are bound to the return structures from that specific version of the API.

For example, to currently use the TextAnalytics Transformer against the AzureUSGovernment environment you must specify the following:

text_analyze = (TextAnalyze()
        .setTextCol(text_split_col)
        .setLocation(config.location)
        .setSubscriptionKeyCol(text_analytics_key_col)
        .setOutputCol(text_analysis_col)
        .setErrorCol(text_analysis_error_col)
        .setConcurrency(cog_svc_concurrency)
        .setInitialPollingDelay(cog_svc_intial_polling_delay)
        .setPollingDelay(cog_svc_polling_delay)
        .setMaxPollingRetries(cog_svc_maximum_retry_count)
        .setSuppressMaxRetriesExceededException(True)
        .setLanguage(config.prep.target_language)
        .setEntityRecognitionTasks([{"parameters": { "model-version": "latest"}}])
        .setKeyPhraseExtractionTasks([{"parameters": { "model-version": "latest"}}])
        .setEntityRecognitionPiiTasks([{"parameters": { "model-version": "latest"}}])
        .setUrl('https://virginia.api.cognitive.microsoft.us/text/analytics/v3.1/analyze')
        )

What needs to be changed for AzureUSGovernment is really just "https://virginia.api.cognitive.microsoft.us". The rest of the url parts "/text/analytics/v3.1/analyze" which is version specific, should be taken from an already defined property in the SynapseML code base.

The urlPath property is already defined in cognitive/src/main/scala/com/microsoft/azure/synapse/ml/cognitive/TextAnalytics.scala as:

def urlPath: String = "/text/analytics/v3.1/analyze"

When using .setLocation(), the urlPath property is automatically added to the URL for the developer. But when using setUrl() it is ignored requiring the developer to specify the full path including the urlPath parts. This requires the developer to know exactly which version of the Cognitive Services API that Synapse ML was compiled against for the response to be valid.

Describe the solution you'd like The preferred approach would be for setUrl() to be more like setUrlBase() where you can specify a new service url by region or cloud environment. Then the transformer code would automatically append the urlPath content for the developer so that we don't have to know the exact API version that SynapseML was compiled against.

Example:

text_analyze = (TextAnalyze()
        .setTextCol(text_split_col)
        .setLocation(config.location)
        .setSubscriptionKeyCol(text_analytics_key_col)
        .setOutputCol(text_analysis_col)
        .setErrorCol(text_analysis_error_col)
        .setConcurrency(cog_svc_concurrency)
        .setInitialPollingDelay(cog_svc_intial_polling_delay)
        .setPollingDelay(cog_svc_polling_delay)
        .setMaxPollingRetries(cog_svc_maximum_retry_count)
        .setSuppressMaxRetriesExceededException(True)
        .setLanguage(config.prep.target_language)
        .setEntityRecognitionTasks([{"parameters": { "model-version": "latest"}}])
        .setKeyPhraseExtractionTasks([{"parameters": { "model-version": "latest"}}])
        .setEntityRecognitionPiiTasks([{"parameters": { "model-version": "latest"}}])
        .setUrlBase('https://virginia.api.cognitive.microsoft.us')
        )

Additional context This was tested against SynapseML v0.9.5

AB#1889420

mhamilton723 commented 2 years ago

Hey @dayland, thanks so much for reaching out about this. Do you know if this proposal would also work with cognitive service containers? Thats originally why we made it pretty open-ended

dayland commented 2 years ago

That is a valid point, as we have not tested with containerized cognitive services. I guess the easiest way would be to clearly document which version of the APIs each transformer is bound to so that when a user needs to specify the full URL, we know exactly which endpoints to specify.