This repository demonstrates how to implement a Machine Learning Operations (MLOps) process for Azure AI Search applications that use a pull model to index data. It creates an indexer with two custom skills that pull PDF documents from a blob storage container, chunk them, create embeddings for the chunks, and add the chunks to an index. Finally, it performs search evaluation on a collection of data and uploads the results to an AI Studio project so that evaluations can be compared across multiple runs to continually improve the custom skills.
Below are some key folders within the project:
Additionally, the root folder contains some important files:
`.env`: sensitive parameters (parameters that cannot be hardcoded in `config.yaml`) should be populated here.

The deployment scripts and GitHub workflows use the git branch name to create a unique naming scheme for all of the deployed entities.
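As a rough sketch of that naming scheme (the actual logic lives in the deployment scripts; `branch_suffix` is a hypothetical helper, not code from this repo), a branch name might be sanitized into a resource-name suffix like this:

```python
import re

def branch_suffix(branch_name: str, max_len: int = 20) -> str:
    """Turn a git branch name into a lowercase, hyphenated suffix
    suitable for appending to Azure resource names.
    (Hypothetical helper; the repo's scripts may differ.)"""
    # Azure resource names are generally lowercase alphanumerics and hyphens,
    # so collapse everything else into single hyphens and trim the ends.
    suffix = re.sub(r"[^a-z0-9]+", "-", branch_name.lower()).strip("-")
    return suffix[:max_len]
```

For example, a branch named `feature/PR-123_new-skill` would yield a suffix such as `feature-pr-123-new-s` (truncated), which every deployed entity for that branch could share.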
Create a `.env` file based on `.env.sample` and populate the appropriate values. Update `config/config.yaml` to reflect any changes that have been made within the project. Sample PDFs are available in `data` to use for indexer testing. To upload the data to blob storage, use the following:
python -m mlops.deployment_scripts.upload_data
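The upload script's exact behavior is defined in the repo, but as a sketch of the kind of work it does, one way to pair local PDFs with target blob names looks like the following (the `collect_pdf_uploads` helper and container layout are assumptions, not the script's actual code):

```python
from pathlib import Path

def collect_pdf_uploads(data_dir: str, container: str) -> list[tuple[str, str]]:
    """Pair each local PDF under data_dir with its target blob path.
    (Hypothetical helper; the actual upload script may differ.)"""
    pairs = []
    for pdf in sorted(Path(data_dir).rglob("*.pdf")):
        # Preserve the relative folder structure as the blob name.
        blob_name = pdf.relative_to(data_dir).as_posix()
        pairs.append((str(pdf), f"{container}/{blob_name}"))
    return pairs
```

The actual upload would then push each pair with something like `BlobServiceClient` from the `azure-storage-blob` package, using the connection details from `.env`.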
The following deployment script will deploy the custom skillset functions to a function app deployment slot and poll the functions until they are ready to be tested:
python -m mlops.deployment_scripts.deploy_azure_functions
To test the two skillset functions after they are deployed, run the following script:
python -m mlops.deployment_scripts.run_functions
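Azure AI Search calls a custom Web API skill with a fixed request shape: a `values` array in which each record carries a `recordId` and a `data` object, and the response mirrors that shape with `data`, `errors`, and `warnings` per record. A minimal sketch for building such a test payload (the field names inside `data` are assumptions, not this project's actual skill inputs):

```python
import json

def make_skill_request(records: list[dict]) -> str:
    """Build the JSON body Azure AI Search sends to a custom Web API skill:
    a "values" array of records, each with a recordId and a data object.
    (The contents of each data object are hypothetical here.)"""
    values = [
        {"recordId": str(i), "data": record}
        for i, record in enumerate(records)
    ]
    return json.dumps({"values": values})
```

A test harness like `run_functions` could POST this body to the deployed function endpoint and verify that the response contains one `values` entry per input record.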
More information about local development of skillset functions can be found in the custom skills readme.
An indexer is composed of four entities: index, datasource, skillset, and indexer. The configuration for each is defined by the files in `mlops/acs_config`. To deploy the indexer and begin indexing the data in blob storage, run the following:
python -m mlops.deployment_scripts.build_indexer
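To illustrate how the four entities relate (the real definitions come from `mlops/acs_config`; every name and field below is a hypothetical placeholder), a deployment could assemble them like this, with each definition ultimately sent to the Azure AI Search REST API:

```python
def indexer_entities(suffix: str) -> dict:
    """Sketch the four entity definitions an indexer deployment needs.
    All names and fields are hypothetical; the real configuration lives
    in mlops/acs_config and is deployed via the Azure AI Search REST API."""
    return {
        "index": {
            "name": f"chunk-index-{suffix}",
            "fields": [{"name": "id", "type": "Edm.String", "key": True}],
        },
        "datasource": {
            "name": f"blob-datasource-{suffix}",
            "type": "azureblob",
        },
        "skillset": {
            "name": f"chunk-embed-skillset-{suffix}",
            "skills": [],  # the two custom Web API skills would go here
        },
        # The indexer ties the other three entities together by name.
        "indexer": {
            "name": f"pdf-indexer-{suffix}",
            "dataSourceName": f"blob-datasource-{suffix}",
            "skillsetName": f"chunk-embed-skillset-{suffix}",
            "targetIndexName": f"chunk-index-{suffix}",
        },
    }
```

The branch-derived suffix keeps each branch's entities distinct, which is what makes the later cleanup step possible.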
The following script performs search evaluation and uploads the results to the specified AI Studio project. For more information about evaluation, see the search evaluation readme.
python -m mlops.evaluation.search_evaluation --gt_path "./mlops/evaluation/data/search_evaluation_data.jsonl" --semantic_config my-semantic-config
Because the git branch name was used to name the deployed entities, the following deployment script can clean up everything by deleting the deployment slot in the function app and the indexer entities:
python -m mlops.deployment_scripts.cleanup_pr
This project contains GitHub workflows for PR validation and Continuous Integration (CI).
The PR workflow executes quality checks using flake8 and unit tests. It then deploys the skillset functions to a deployment slot of the function app. Once the functions are deployed and tested, an indexer is deployed and all of the test data is ingested from blob storage. Search evaluation is then run and the results are uploaded to an AI Studio project.
The CI workflow executes a similar workflow to the PR workflow, but the skillset functions are deployed to the main function app, not a deployment slot.
In order for the cleanup step of the CI Workflow to work correctly, the development branch from a pull request must not be deleted until the cleanup step has run.
Some variables and secrets must be provided to execute the GitHub workflows (primarily the same ones used in the `.env` file for local execution).
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.