microsoft / moaw

Grab-and-go resources to help you learn new skills, with all the tools you need to create, host and share your own workshop
https://aka.ms/moaw
Creative Commons Attribution Share Alike 4.0 International

Updates to Fabric RAG workshop to use latest Azure AI Search #103

Closed pamelafox closed 8 months ago

pamelafox commented 8 months ago

This PR includes a number of changes based on my testing of the workshop and a meeting with search engineer @mattgotteiner. Usually I'd like to split changes into multiple PRs but given the workshop is coming up soon, it's all in this PR.

High level:

Still TODO:

cmaneu commented 8 months ago

Thanks @pamelafox for all these updates! A few quick notes:

but that requires customizing the Spark environment which takes ~20 minutes, which isn't great for a live demo. I do have the SDK code equivalent if we're interested in that.

I don't think this requires customizing the Spark environment. If it's just a pip package, you can run %pip install in a cell. It restarts the kernel, but that takes less than a minute. And YES, I think we should include the SDK code too. We're working on "code tabs" in MOAW (to support multiple languages). It'll be easier to add the SDK code once we ship that feature ;)

Figure out if the synapse call is compatible with OpenAI proxy.

Good question! The SynapseML Reference seems to be broken. @iemejia Could you send this feedback to the product team?

Consider storage in Key Vault. (Probably out of scope for workshop, but another best practice)

I would consider it part of this workshop (I already made that comment to @Jcardif). After this week's event, we could add a "bonus" section at the end to cover it.

pamelafox commented 8 months ago

@cmaneu The pip install approach only worked for standard cells, but as soon as I tried to do a UDF or map partitions, I got an error:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 31.0 failed 4 times, most recent failure: Lost task 0.3 in stage 31.0 (TID 199) (vm-55578899 executor 2): java.io.IOException: Cannot run program "/nfs4/pyenv-cfbe9b5d-27a5-4b3e-acac-6a4be719144b/bin/python": error=2, No such file or directory
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)

I had my partner take a look, as he's a Spark user, and he thinks it's a bug in the environment setup. The error goes away as long as I don't pip install. Is there a good place to report Fabric bugs?

pamelafox commented 8 months ago

I'm testing this on an Azure free account right now.

Notes:

1) You can only have one Azure AI Search free tier at a time, so if an attendee already has one, they should probably just reuse that. (That's what I'm going to do.)

2) Azure Document Intelligence itself has a free tier, but the wrapper resource Azure AI Services only has a Standard tier. However, if an attendee still has their $200 free trial, that should hopefully apply. We could alternatively have folks sign up specifically for Azure Document Intelligence itself, and then they could explicitly select the free tier.

3) I got up until "Generating Embeddings" and got an auth error, as I expected: "Exception: get openai mwc token returns 403:b'{"Message":"FT1 SKU Not Supported","Source":"ML","error_code":"PERMISSION_DENIED"}'" I am hoping that we discover a way to set the endpoint URI for the synapse OpenAIEmbedding class, so that we can use the proxy at that point.

pamelafox commented 8 months ago

I found there's a setEndpoint for OpenAIEmbedding in the Scala reference, so am going to try that out and see if it works with the proxy.

pamelafox commented 8 months ago

I've tried various combinations of setEndpoint, setDefaultInternalEndpoint, and setCustomServiceName. The functions exist in Python, but I always get the same error:

Traceback (most recent call last):
  File "/home/trusted-service-user/cluster-env/trident_env/lib/python3.10/site-packages/synapse/ml/fabric/token_utils.py", line 180, in _get_openai_mwc_token
    raise Exception(
Exception: get openai mwc token returns 403:b'{"Message":"FT1 SKU Not Supported","Source":"ML","error_code":"PERMISSION_DENIED"}'

I wonder whether the Fabric library is pre-emptively checking for the built-in models and preventing us from using non-built-in ones. The _get_openai_mwc_token function doesn't seem to be on GitHub, so I can't inspect the code myself.

cmaneu commented 8 months ago

The error message says it's trying to use the fabric OpenAI endpoint (FT1 is fabric trial). Can you share the code for SynapseML OpenAI initialization?
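In case it helps with triage, here is a minimal sketch of pulling the status code and JSON payload out of that exception message, so a notebook could detect the trial-SKU case and show a friendlier hint. The helper names `parse_mwc_error` and `is_trial_sku_block` are hypothetical, and the message format is assumed from the single error quoted in this thread.

```python
import json
import re

# Pattern assumed from the one message quoted in this thread:
#   get openai mwc token returns 403:b'{...json...}'
_MWC_ERROR = re.compile(r"returns (\d{3}):b'(\{.*\})'")


def parse_mwc_error(message):
    """Return (status_code, payload_dict), or None if the message does not match."""
    match = _MWC_ERROR.search(message)
    if match is None:
        return None
    return int(match.group(1)), json.loads(match.group(2))


def is_trial_sku_block(message):
    """True when the payload reports that the FT1 (Fabric trial) SKU is not supported."""
    parsed = parse_mwc_error(message)
    if parsed is None:
        return False
    status, payload = parsed
    return status == 403 and "FT1 SKU" in payload.get("Message", "")


# The error text from this thread, used as sample input.
sample = (
    "get openai mwc token returns 403:b'{\"Message\":\"FT1 SKU Not Supported\","
    "\"Source\":\"ML\",\"error_code\":\"PERMISSION_DENIED\"}'"
)
```

Other Fabric error messages may not follow this shape, so the helper returns `None` rather than raising when the pattern does not match.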

pamelafox commented 8 months ago

@cmaneu Example:

from synapse.ml.services import OpenAIEmbedding

embedding = (
    OpenAIEmbedding()
    .setDefaultInternalEndpoint('https://polite-ground-030dc3103.4.azurestaticapps.net/api/v1')
    .setSubscriptionKey('my-key-here')
    .setDeploymentName("text-embedding-ada-002")
    .setTextCol("chunk")
    .setErrorCol("error")
    .setOutputCol("embeddings")
)

I am guessing at which functions to use. There's also setEndpoint and setCustomServiceName, but all my permutations so far have resulted in that error.
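As a possible fallback outside SynapseML, here is a rough sketch of building the embeddings request by hand with the standard library. It assumes the proxy mirrors the Azure OpenAI REST layout (`/openai/deployments/<deployment>/embeddings` with an `api-key` header), which nothing in this thread confirms. The function name `build_embedding_request`, the base URL, and the `api-version` value are placeholders.

```python
import json
import urllib.request


def build_embedding_request(base_url, api_key, deployment, texts,
                            api_version="2023-05-15"):
    """Build (but do not send) a POST request for an embeddings call.

    Assumption: the endpoint follows the Azure OpenAI REST path layout.
    A proxy may expose a different path or auth header.
    """
    url = (
        f"{base_url.rstrip('/')}/openai/deployments/{deployment}"
        f"/embeddings?api-version={api_version}"
    )
    body = json.dumps({"input": texts}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


request = build_embedding_request(
    "https://example.invalid/api/v1",  # placeholder, not the real proxy URL
    "my-key-here",
    "text-embedding-ada-002",
    ["first chunk", "second chunk"],
)
# Sending it would be urllib.request.urlopen(request), then json.load on the response.
```

Building the request separately from sending it makes it easy to print and check the URL and headers before spending a call against the proxy.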

pamelafox commented 8 months ago

I chatted with Alvaro, and for now we will just explain the limitations of using OpenAIEmbedding inside the Fabric free tier. I went through the workshop several more times in fresh notebooks and made additional improvements, including optional Key Vault usage (which I'll demo) and better support for arbitrary files (no more hardcoded filenames or assumptions about single-file datasets).
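For reference, a minimal sketch of what optional Key Vault usage could look like in a notebook. This is not necessarily the code in the PR. It assumes `mssparkutils.credentials.getSecret(vault_url, secret_name)` from `notebookutils` is available inside Fabric, and it falls back to an environment variable elsewhere. The helper name `get_secret`, the vault URL, and the variable names are hypothetical.

```python
import os


def get_secret(vault_url, secret_name, env_fallback=None):
    """Fetch a secret from Key Vault inside Fabric, else from an environment variable."""
    try:
        # Assumption: notebookutils is only importable inside Fabric/Synapse notebooks.
        from notebookutils import mssparkutils
    except ImportError:
        mssparkutils = None

    if mssparkutils is not None:
        return mssparkutils.credentials.getSecret(vault_url, secret_name)

    # Outside Fabric: read the value from the environment instead.
    value = os.environ.get(env_fallback or secret_name)
    if value is None:
        raise KeyError(f"Secret {secret_name!r} not found in Key Vault or environment")
    return value
```

A notebook cell could then call `get_secret("https://<vault-name>.vault.azure.net/", "search-key", env_fallback="SEARCH_KEY")` so that keys are never pasted directly into the notebook.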

I am done making changes to this PR, so if anyone is able to merge it before tomorrow morning's workshop, that'd be great!

videlalvaro commented 8 months ago

Thanks @cmaneu

sinedied commented 4 months ago

:tada: This PR is included in version 1.5.0 :tada:

The release is available on npm package (@latest dist-tag)

Your semantic-release bot :package::rocket: