This project is part of the MLOps Best Practices series. In it, we build a Bedrock Agent that queries an Athena database. The project is built using AWS CDK and Python.
Most organizations have a data lake where they store their data in different formats and query it with different tools. One of the most popular tools for querying a data lake is Amazon Athena, an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. With Athena, there is no need for complex ETL jobs to prepare the data for analysis, so anyone with SQL knowledge can query the data in the data lake.
However, many people in an organization do not have SQL knowledge, which makes it hard for them to benefit from the wealth of data stored in the data lake. To solve this problem, we can build a Bedrock Agent that understands natural language questions, queries the data on the user's behalf, and replies with the results.
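To make that flow concrete, here is a minimal sketch (not the project's actual handler) of how an agent action group could run a SQL query against Athena with boto3; the database name and output location below are placeholders:

```python
# Minimal sketch: an agent action group running a SQL query against Athena.
# DATABASE and OUTPUT_LOCATION are placeholders, not values from this project.
import time
import boto3

athena = boto3.client("athena")

DATABASE = "my_glue_database"                        # placeholder: database created by the Glue crawler
OUTPUT_LOCATION = "s3://my-bucket/athena-results/"   # placeholder: query result location


def run_query(sql: str) -> list[dict]:
    """Run a SQL query in Athena and return the result rows as dictionaries."""
    execution = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
    )
    query_id = execution["QueryExecutionId"]

    # Athena is asynchronous, so poll until the query reaches a terminal state.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

    results = athena.get_query_results(QueryExecutionId=query_id)
    rows = results["ResultSet"]["Rows"]
    header = [col["VarCharValue"] for col in rows[0]["Data"]]
    return [
        dict(zip(header, [col.get("VarCharValue") for col in row["Data"]]))
        for row in rows[1:]
    ]
```

In this design, the agent turns the user's question into SQL, the action group executes it, and the returned rows are used to compose the reply.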
This project is designed not just as a demo; it is a real-world project that can be used in production and is built following best practices.
The solution is built using the following services:

- Amazon Bedrock (Agents)
- Amazon Athena
- AWS Glue (Data Catalog and Crawler)
- Amazon S3
- AWS App Runner

The infrastructure is defined with the AWS CDK. To deploy the solution:
Clone the repository
git clone https://github.com/guyernest/bedrock-agent.git
Install the dependencies
cd bedrock-agent
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Deploy the stack (wait for the deployment to finish, about 5 minutes, and note the output values for the next steps)
cdk deploy
Upload the data to S3
aws s3 cp sample-data/ s3://<bucket-name>/data --recursive
Trigger Glue Crawler (wait for the crawler to finish, about 2 minutes)
aws glue start-crawler --name <crawler-name>
Open the App Runner URL in the browser (appears in the output of the CDK deployment)
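The web application is the easiest way to interact with the agent, but it can also be invoked programmatically. A minimal sketch using boto3 is shown below; the agent ID and alias ID are placeholders, and the real values come from the deployment outputs or the Bedrock console:

```python
# Minimal sketch: invoking the deployed Bedrock Agent from Python.
# AGENT_ID and AGENT_ALIAS_ID are placeholders for the deployed agent's identifiers.
import uuid
import boto3

client = boto3.client("bedrock-agent-runtime")

response = client.invoke_agent(
    agentId="AGENT_ID",
    agentAliasId="AGENT_ALIAS_ID",
    sessionId=str(uuid.uuid4()),
    inputText="How many orders were placed last month?",
)

# The completion is returned as an event stream of chunks.
answer = ""
for event in response["completion"]:
    if "chunk" in event:
        answer += event["chunk"]["bytes"].decode("utf-8")
print(answer)
```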
The project includes a sample dataset in the sample-data directory, but other datasets can be used as well. The accompanying Jupyter notebook gives an example of how to analyze the data to enrich the AI agent's prompt, so the agent better understands the data and can answer natural language questions about it.
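As a rough illustration of that kind of analysis, the sketch below profiles one table and turns the result into a schema description that can be folded into the agent's instructions; the file name is a placeholder and the actual notebook may work differently:

```python
# Minimal sketch: profile a sample table and build a schema description for the prompt.
# The file name is a placeholder; adapt it to whatever lives in sample-data/.
import pandas as pd

df = pd.read_csv("sample-data/orders.csv")

lines = []
for column in df.columns:
    examples = df[column].dropna().unique()[:3]
    lines.append(f"- {column} ({df[column].dtype}): e.g. {', '.join(map(str, examples))}")

schema_description = "The table has the following columns:\n" + "\n".join(lines)
print(schema_description)
```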
This project is set up like a standard Python project. The initialization process also creates a virtualenv within this project, stored under the .venv directory. To create the virtualenv it assumes that there is a python3 (or python for Windows) executable in your path with access to the venv package. If for any reason the automatic creation of the virtualenv fails, you can create the virtualenv manually.
To manually create a virtualenv on macOS and Linux:
python3 -m venv .venv
After the init process completes and the virtualenv is created, you can use the following step to activate your virtualenv.
source .venv/bin/activate
If you are on a Windows platform, you would activate the virtualenv like this:
.venv\Scripts\activate.bat
Once the virtualenv is activated, you can install the required dependencies.
pip install -r requirements.txt
At this point you can synthesize the CloudFormation template for this code.
cdk synth
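For orientation, this is roughly what a CDK entry point (app.py) looks like in a Python project like this one; the stack class name below is illustrative, not necessarily the stack defined in this repository:

```python
# Minimal sketch of a CDK v2 entry point (app.py). The stack class name is
# illustrative; the real stack in this repository may differ.
import aws_cdk as cdk
from constructs import Construct


class BedrockAgentStack(cdk.Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # In the real stack, resources such as the S3 bucket, Glue crawler,
        # Bedrock agent, and App Runner service are defined here.


app = cdk.App()
BedrockAgentStack(app, "BedrockAgentStack")
app.synth()
```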
To add additional dependencies, for example other CDK libraries, just add them to your setup.py file and rerun the pip install -r requirements.txt command.
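For example, a setup.py excerpt pulling in an extra CDK construct library might look like this (the aws-cdk.aws-glue-alpha package is only an illustrative choice, not a dependency this project necessarily needs):

```python
# Excerpt of a hypothetical setup.py: the first two entries mirror a standard
# CDK v2 Python project; the alpha construct library is only an example.
from setuptools import setup

setup(
    name="bedrock_agent",
    install_requires=[
        "aws-cdk-lib>=2.0.0",
        "constructs>=10.0.0",
        "aws-cdk.aws-glue-alpha",  # example extra dependency
    ],
)
```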
Useful commands:

- cdk ls          list all stacks in the app
- cdk synth       emits the synthesized CloudFormation template
- cdk deploy      deploy this stack to your default AWS account/region
- cdk diff        compare deployed stack with current state
- cdk docs        open CDK documentation

Enjoy!