guyernest / bedrock-agent

Building chat interface to support natural language questions on random datasets, using generative AI on AWS.

MLOps Best Practices: Building Bedrock Agent to query Athena Database

This project is part of the MLOps Best Practices series. In this project, we build a Bedrock Agent that queries an Amazon Athena database. The project is built using the AWS CDK and Python.

The Problem

Most organizations store their data in a data lake, in different formats, and query it using different tools. One of the most popular tools for querying data in a data lake is Amazon Athena, an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. With Athena, there is no need for complex ETL jobs to prepare the data for analysis, so anyone with SQL knowledge can query the data in the data lake.

However, not many people in the organization have SQL knowledge, which makes it difficult for them to benefit from the plethora of data stored in the data lake. To solve this problem, we can build a Bedrock Agent that understands natural language questions, queries the data, and replies with the results.
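The natural-language round trip described above can be sketched with the Bedrock agent runtime API. The agent and alias IDs are placeholders for the values produced by your own deployment; the response arrives as a stream of chunk events that must be joined into the final answer.

```python
def collect_completion(events) -> str:
    """Join the text chunks from an invoke_agent response stream."""
    parts = []
    for event in events:
        chunk = event.get("chunk")
        if chunk is not None:
            parts.append(chunk["bytes"].decode("utf-8"))
    return "".join(parts)


def ask_agent(question: str, agent_id: str, alias_id: str, session_id: str) -> str:
    """Send a natural language question to a Bedrock agent.

    agent_id and alias_id come from the deployed stack; the names here
    are placeholders for illustration.
    """
    import boto3  # imported lazily; requires AWS credentials

    runtime = boto3.client("bedrock-agent-runtime")
    response = runtime.invoke_agent(
        agentId=agent_id,
        agentAliasId=alias_id,
        sessionId=session_id,
        inputText=question,
    )
    # The completion is an event stream, not a single payload.
    return collect_completion(response["completion"])
```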

Solution Example

Bedrock Agent Chat UI

Not only a demo

This project is designed not just as a demo: it is a real-world project that can be used in production, and it is built following best practices.

The Solution

Architecture Diagram

The solution is built using the following services:

Quick Start

  1. Clone the repository

    git clone https://github.com/guyernest/bedrock-agent.git
  2. Install the dependencies

    cd bedrock-agent
    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
  3. Deploy the stack (wait for the deployment to finish, about 5 minutes, and note the output values for the next steps)

    cdk deploy
  4. Upload the data to S3

    aws s3 cp sample-data/ s3://<bucket-name>/data --recursive
  5. Trigger Glue Crawler (wait for the crawler to finish, about 2 minutes)

    aws glue start-crawler --name <crawler-name>
  6. Open the App Runner URL in the browser (appears in the output of the CDK deployment)
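Step 5 requires waiting for the Glue crawler to return to its idle state before the agent can see the new tables. A small polling helper, sketched here with boto3 (the crawler name is the one printed in the CDK output), can automate that wait in a script:

```python
import time


def is_finished(state: str) -> bool:
    """A Glue crawler is done when it is back in the READY state."""
    return state == "READY"


def wait_for_crawler(name: str, poll_seconds: int = 15, timeout_seconds: int = 600) -> None:
    """Block until the named Glue crawler finishes its run."""
    import boto3  # imported lazily; requires AWS credentials

    glue = boto3.client("glue")
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        state = glue.get_crawler(Name=name)["Crawler"]["State"]
        if is_finished(state):
            return
        time.sleep(poll_seconds)
    raise TimeoutError(f"Crawler {name} did not finish within {timeout_seconds}s")
```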

Main Components

Messy Data Analysis

The project includes a sample dataset in the sample-data directory. However, other datasets can be used. The following Jupyter notebook gives an example of how to analyze the data to enrich the AI agent prompt, to better understand the data and answer the natural language questions.
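The kind of analysis the notebook performs can be sketched with a small, dependency-free profiler: it collects each column's name, a guessed type, and a few example values, which can then be pasted into the agent's prompt. The heuristics here are an illustration, not the notebook's actual code.

```python
import csv
import io


def _is_number(value: str) -> bool:
    """Return True if the string parses as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def profile_csv(text: str, sample_values: int = 3) -> dict:
    """Summarize a CSV for inclusion in the agent prompt.

    Returns, per column, a guessed type ("numeric" or "text") and a
    few example values, so the model knows what the data looks like.
    """
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    profile = {}
    for column in reader.fieldnames:
        values = [row[column] for row in rows if row[column] != ""]
        guessed = "numeric" if values and all(_is_number(v) for v in values) else "text"
        profile[column] = {"type": guessed, "examples": values[:sample_values]}
    return profile
```

Feeding such a summary into the agent's instructions helps it map vague question terms ("sales", "region") onto the actual column names in the table.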

CDK Python Instructions

This project is set up like a standard Python project. The initialization process also creates a virtualenv within this project, stored under the .venv directory. To create the virtualenv, it assumes there is a python3 (or python for Windows) executable in your path with access to the venv package. If for any reason the automatic creation of the virtualenv fails, you can create it manually.

To manually create a virtualenv on MacOS and Linux:

python3 -m venv .venv

After the init process completes and the virtualenv is created, you can use the following step to activate your virtualenv.

source .venv/bin/activate

If you are on a Windows platform, you would activate the virtualenv like this:

.venv\Scripts\activate.bat

Once the virtualenv is activated, you can install the required dependencies.

pip install -r requirements.txt

At this point you can synthesize the CloudFormation template for this code.

cdk synth

To add additional dependencies, for example other CDK libraries, just add them to your setup.py file and rerun the pip install -r requirements.txt command.

Useful commands

cdk ls          list all stacks in the app
cdk synth       emit the synthesized CloudFormation template
cdk deploy      deploy this stack to your default AWS account/region
cdk diff        compare the deployed stack with the current state
cdk docs        open the CDK documentation

Enjoy!