Closed pamelafox closed 8 months ago
Thanks @pamelafox for all these updates! A few quick notes:
but that requires customizing the Spark environment which takes ~20 minutes, which isn't great for a live demo. I do have the SDK code equivalent if we're interested in that.
I don't think this requires customizing the spark env. If it's just a pip package, you can do a %pip install
in a cell. It'll reboot the kernel but won't take even a minute. And YES I think we should include the SDK code too. We're working on "code tabs" in MOAW (to support multi-language). It'll be easier to add the SDK code once we ship this feature ;)
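Until code tabs ship, here is a rough sketch of what an SDK-equivalent embeddings call could look like using only the standard library. Everything specific here is an assumption for illustration: the proxy's URL path, the api-version, the api-key header name, and the deployment name may all differ from the real setup.

```python
import json
import urllib.request

# Assumed values for illustration; the real proxy endpoint layout,
# deployment name, and auth header may differ.
ENDPOINT = "https://polite-ground-030dc3103.4.azurestaticapps.net/api/v1"
DEPLOYMENT = "text-embedding-ada-002"

def build_embedding_request(texts, api_key):
    """Build (but do not send) an Azure OpenAI-style embeddings request."""
    url = f"{ENDPOINT}/openai/deployments/{DEPLOYMENT}/embeddings?api-version=2023-05-15"
    body = json.dumps({"input": texts}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )

req = build_embedding_request(["hello world"], "my-key-here")
# Actually sending it would be: urllib.request.urlopen(req) --
# skipped here since the proxy requires a real key.
```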
Figure out if the synapse call is compatible with OpenAI proxy.
Good question! The SynapseML Reference seems to be broken. @iemejia Could you send this feedback to the product team?
Consider storage in Key Vault. (Probably out of scope for workshop, but another best practice)
I would consider it part of this workshop (I already made this comment to @Jcardif). After this week's event, we could add a "bonus" section at the end covering this.
@cmaneu The pip install approach only worked for standard cells, but as soon as I tried to do a UDF or map partitions, I got an error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 31.0 failed 4 times, most recent failure: Lost task 0.3 in stage 31.0 (TID 199) (vm-55578899 executor 2): java.io.IOException: Cannot run program "/nfs4/pyenv-cfbe9b5d-27a5-4b3e-acac-6a4be719144b/bin/python": error=2, No such file or directory
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)
I had my partner take a look, as he's a Spark user, and he thinks it's a bug in the environment setup. The error goes away as long as I don't pip install. Is there a good place to report Fabric bugs?
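For context on the error class: "error=2, No such file or directory" is the generic OS-level failure when a process tries to exec a program at a path that doesn't exist; here it's the executors trying to launch the pyenv Python that the %pip install created on the driver. A tiny sketch reproducing just that OS error (the interpreter path below is made up):

```python
import subprocess

# Hypothetical, nonexistent interpreter path, standing in for the
# /nfs4/pyenv-.../bin/python path from the Spark traceback.
missing_python = "/nonexistent/pyenv/bin/python"

try:
    subprocess.run([missing_python, "-c", "print('hello')"])
    raised = False
except FileNotFoundError as exc:
    raised = True
    # exc.errno == 2 is ENOENT, i.e. "No such file or directory"
    errno_value = exc.errno
```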
I'm testing this on an Azure free account right now.
Notes:
1) You can only have one Azure AI Search free tier at a time, so if an attendee already has one, they should probably just reuse it. (That's what I'm going to do.)
2) Azure Document Intelligence itself has a free tier, but the wrapper resource Azure AI Services only has a Standard tier. However, if an attendee still has their $200 free trial, that should hopefully cover it. Alternatively, we could have folks sign up specifically for Azure Document Intelligence itself, and then they could explicitly select the free tier.
3) I got up until "Generating Embeddings" and hit an auth error, as I expected: "Exception: get openai mwc token returns 403:b'{"Message":"FT1 SKU Not Supported","Source":"ML","error_code":"PERMISSION_DENIED"}'" I am hoping we discover a way to set the endpoint URI for the synapse OpenAIEmbedding class, so that we can use the proxy at that point.
I found there's a setEndpoint for OpenAIEmbedding in the Scala reference, so I'm going to try that out and see if it works with the proxy.
I've tried various combinations of setEndpoint, setDefaultInternalEndpoint, and setCustomServiceName. The functions exist in Python, but I always get the same error:
"Traceback (most recent call last): File "/home/trusted-service-user/cluster-env/trident_env/lib/python3.10/site-packages/synapse/ml/fabric/token_utils.py", line 180, in _get_openai_mwc_token raise Exception( Exception: get openai mwc token returns 403:b'{"Message":"FT1 SKU Not Supported","Source":"ML","error_code":"PERMISSION_DENIED"}'"
I wonder whether the Fabric library is pre-emptively checking for the built-in models and blocking non-built-in ones. The _get_openai_mwc_token function doesn't seem to be on GitHub, so I can't inspect the code myself.
The error message says it's trying to use the fabric OpenAI endpoint (FT1 is fabric trial). Can you share the code for SynapseML OpenAI initialization?
@cmaneu Example:
from synapse.ml.services import OpenAIEmbedding

embedding = (
    OpenAIEmbedding()
    .setDefaultInternalEndpoint('https://polite-ground-030dc3103.4.azurestaticapps.net/api/v1')
    .setSubscriptionKey('my-key-here')
    .setDeploymentName("text-embedding-ada-002")
    .setTextCol("chunk")
    .setErrorCol("error")
    .setOutputCol("embeddings")
)
I am guessing at which functions to use; there's also a setEndpoint and a setCustomServiceName, but all of my permutations so far have resulted in that same error.
I chatted with Alvaro and for now, we will just explain the limitations around using OpenAIEmbedding inside Fabric free tier. I went through the workshop several more times in fresh notebooks and made additional improvements, including optional KeyVault usage (which I'll demo) and better support for arbitrary files (no more hardcoded filenames or assumptions around single file datasets).
I am done making changes to this PR, so if anyone is able to merge it before tomorrow morning's workshop, that'd be great!
Thanks @cmaneu
:tada: This PR is included in version 1.5.0 :tada:
The release is available on npm package (@latest dist-tag)
Your semantic-release bot :package::rocket:
This PR includes a number of changes based on my testing of the workshop and a meeting with search engineer @mattgotteiner. Usually I'd prefer to split changes into multiple PRs, but given that the workshop is coming up soon, it's all in this one.
High level:
Still TODO: