
GPT embedding

This repo contains the code needed for embedding and searching with various embedding models (OpenAI, Jina, PyTorch models). The main scripts are described in the sections below.

The general workflow looks like this:

GPT embedding workflow

Requirements

For GPT and Jina related services, an API key is needed. Save your API key in a .env file in the same folder as the script; it should look like this:

# you can have keys for different APIs
OPENAI_API_KEY=<your openai key>
JINAAI_API_KEY=<your jinaai key>
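
The scripts read these keys from the environment; a minimal sketch of how that can be done in Python, assuming the python-dotenv package (the repo's scripts may load the keys differently):

# Minimal sketch, assuming python-dotenv.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory by default
openai_key = os.getenv("OPENAI_API_KEY")
jinaai_key = os.getenv("JINAAI_API_KEY")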

The full test dataset is Amazon Fine Food Reviews.

GPT Embedding Usage

GPT_embedding.py retrieves embeddings from the OpenAI embedding API using the text-embedding-3-small model, which transforms each input string into a float vector of 1536 elements. The embedded data will have an added embedding column.

GPT_embedding.py -h
usage: GPT_embedding.py [-h] -i INPUT_FILE -o OUTPUT_FILE [--out_format {csv,tsv}] [-c COLUMNS [COLUMNS ...]] [--chunk_size CHUNK_SIZE] [--minimize] [--process PROCESS]

Generate GPT embeddings for text data.

options:
  -h, --help            show this help message and exit
  -i INPUT_FILE, --input_file INPUT_FILE
                        Path to the input file, accepts .csv or .txt with tab as separator.
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        Path to the output file.
  --out_format {csv,tsv}
                        Output format: 'csv' or 'tsv' (default: csv).
  -c COLUMNS [COLUMNS ...], --columns COLUMNS [COLUMNS ...]
                        Column names to combine.
  --chunk_size CHUNK_SIZE
                        Number of rows to load into memory at a time. By default the whole file will be loaded into memory.
  --minimize            Minimize output to only the combined and embedding columns.
  --process PROCESS     Number of processes to call. Default will be 1 process per vCPU.

An example run on a small sample:

python GPT_embedding.py -i data/Reviews_1k.csv -o test_embedding_1k.csv --out_format csv -c Summary Text --chunk_size 500
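
For reference, the core of what the script does can be sketched as follows, assuming pandas and the openai>=1.0 Python SDK, and simplified to a single request with no chunking or multiprocessing:

# Illustrative sketch only: combine the selected columns, request embeddings, attach them.
import pandas as pd
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

df = pd.read_csv("data/Reviews_1k.csv")
df["combined"] = df["Summary"].fillna("") + " " + df["Text"].fillna("")

resp = client.embeddings.create(model="text-embedding-3-small",
                                input=df["combined"].tolist())
df["embedding"] = [item.embedding for item in resp.data]
df.to_csv("test_embedding_1k.csv", index=False)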

Search with user query

Files with <10k rows

To search for similarity in a small embedding file (<10k rows), use similarity_search_10k.py; a FAISS index and SQL database are not needed for this method. If .env is not in the same folder, specify its path with --env. Currently only API embedding methods are supported.

# use GPT embedding method
python3 similarity_search_10k.py -q 'I love eating ice cream!' -f "embedding_1k.csv" -n 3 --api openai 

This method goes through every row and retrieves the top-n most similar vectors, as sketched below.
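
The idea can be sketched as follows, assuming the embedding column is stored as a stringified list of floats (as pandas writes it to CSV) and the query is embedded through the OpenAI API:

# Sketch of a brute-force top-n search by cosine similarity (not the repo's exact code).
import ast
import numpy as np
import pandas as pd
from openai import OpenAI

client = OpenAI()
df = pd.read_csv("embedding_1k.csv")
vectors = np.array([ast.literal_eval(v) for v in df["embedding"]], dtype="float32")

query = "I love eating ice cream!"
q = np.array(client.embeddings.create(model="text-embedding-3-small",
                                      input=[query]).data[0].embedding, dtype="float32")

scores = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
top = np.argsort(scores)[::-1][:3]          # indices of the 3 most similar rows
print(df.iloc[top].assign(score=scores[top]))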

Large files

For larger data where brute-force search is not feasible, build_FAISS_index.py can be used to build a FAISS index with the IVFPQ method. This significantly reduces both the size of the data to query and the query time. The text data should be stored with build_SQLite.py to reduce storage size and speed up retrieval.

build_FAISS_index.py

Choosing Parameters for Building a FAISS Index

When building a FAISS index, selecting the correct parameters is crucial for balancing accuracy and performance. Below are some guidelines for choosing the parameters used in the build_FAISS_index.py script:

usage: build_FAISS_index.py [-h] [--chunk_size CHUNK_SIZE] [--file_path FILE_PATH] [--out_path OUT_PATH]
                            [--nrow NROW] [--nlist NLIST] [--dimension DIMENSION] [--nsubvec NSUBVEC] [--nbits NBITS]
                            [--resvoir_sample RESVOIR_SAMPLE]

options:
  -h, --help            show this help message and exit
  --chunk_size CHUNK_SIZE
                        Size of each chunk. If the data is too large to cluster all at once, use this and
                        resvoir_sample to cluster the data in chunks
  --file_path FILE_PATH, -i FILE_PATH
                        Path to the data file
  --out_path OUT_PATH, -o OUT_PATH
                        Path to the output file
  --nrow NROW           Number of rows in the data file, needed only if the data is loaded in chunks
  --nlist NLIST         Number of Voronoi cells to divide the vectors into. Lowering this increases accuracy but
                        decreases speed. Default is sqrt(nrow)
  --dimension DIMENSION, -d DIMENSION
                        Dimension of the embeddings, will use the dimension of the first embedding if not provided
  --nsubvec NSUBVEC     Number of subvectors to divide the embeddings into; dimension must be divisible by nsubvec
  --nbits NBITS         Number of bits for clustering, default is 8
  --resvoir_sample RESVOIR_SAMPLE
                        Perform reservoir sampling to draw the given number of samples for clustering. By default no
                        sampling is performed. Sampling must be used if chunk_size is provided

Example Configuration

For a dataset with 568,428 rows and a vector dimension of 1536, reasonable choices following the guidelines above are nlist ≈ 754 (≈ sqrt(568,428)), nsubvec = 96 (1536 is divisible by 96), and nbits = 8 (the default); see the sketch below.
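
A minimal sketch of building such an index with the faiss Python package, assuming the embeddings are already loaded into a float32 NumPy array from a hypothetical embeddings.npy dump (the real script reads the embedded file in chunks and can reservoir-sample the training set):

# Illustrative IVFPQ build with faiss; embeddings.npy is a hypothetical intermediate file.
import faiss
import numpy as np

d = 1536             # embedding dimension
nlist = 754          # ~sqrt(568,428) Voronoi cells
nsubvec = 96         # 1536 is divisible by 96
nbits = 8            # bits per subvector code

xb = np.load("embeddings.npy").astype("float32")

quantizer = faiss.IndexFlatL2(d)                               # coarse quantizer
index = faiss.IndexIVFPQ(quantizer, d, nlist, nsubvec, nbits)
index.train(xb)                                                # learn cells and PQ codebooks
index.add(xb)                                                  # encode and store all vectors
faiss.write_index(index, "IVFPQ_index.bin")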

build_SQLite.py

After building the FAISS index, the text data also needs to be stored in a SQLite database to speed up retrieval.

(will add more detail later)
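
As a rough illustration of the idea, here is a sketch assuming a simple one-table schema whose integer id matches each vector's position in the FAISS index; the actual schema used by build_SQLite.py may differ:

# Hypothetical schema: table "reviews", id matching the FAISS vector position.
import sqlite3
import pandas as pd

con = sqlite3.connect("Reviews.db")
con.execute("CREATE TABLE IF NOT EXISTS reviews (id INTEGER PRIMARY KEY, combined TEXT)")

df = pd.read_csv("Reviews_embedding.csv", usecols=["combined"])
con.executemany("INSERT INTO reviews (id, combined) VALUES (?, ?)",
                list(enumerate(df["combined"].astype(str))))
con.commit()
con.close()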

faiss_search_CLI.py

This script provides a command-line interface (CLI) for performing a FAISS-based search on a SQLite database.

python faiss_search_CLI.py -h
usage: faiss_search_CLI.py [-h] [--query QUERY] [--db DB] [--index INDEX] [--top TOP] [--verbose]

Faiss Search CLI

options:
  -h, --help            show this help message and exit
  --query QUERY, -q QUERY
                        Query string
  --db DB               Database file path
  --index INDEX, -x INDEX
                        Index file path
  --top TOP, -n TOP     Number of results to return (default: 5)
  --verbose, -v         Print verbose output

The script takes three required arguments: --query, --db, and --index.

The --verbose/-v option returns human-readable text (the example output in the flowchart); without -v, the results are printed line by line, which is easier for other scripts to parse.

Here's an example usage that searches for a query against the pre-built database Reviews.db and IVFPQ index IVFPQ_index.bin:

python3 GPT_embedding/faiss_search_CLI.py --query "Recommand me some spicy chinese food\n" --db /path/to/Reviews.db --index /path/to/IVFPQ_index.bin --top 5 -v
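
The flow behind the CLI can be sketched like this, assuming the same hypothetical SQLite schema as in the build_SQLite.py sketch above: embed the query, look up the nearest vectors in the IVFPQ index, then fetch the matching text rows.

# Sketch of the search flow: query embedding -> FAISS lookup -> text retrieval from SQLite.
import sqlite3
import numpy as np
import faiss
from openai import OpenAI

client = OpenAI()
index = faiss.read_index("IVFPQ_index.bin")
con = sqlite3.connect("Reviews.db")

query = "Recommand me some spicy chinese food"
q = np.array([client.embeddings.create(model="text-embedding-3-small",
                                       input=[query]).data[0].embedding], dtype="float32")

distances, ids = index.search(q, 5)        # approximate top-5 nearest neighbours
for dist, i in zip(distances[0], ids[0]):
    (text,) = con.execute("SELECT combined FROM reviews WHERE id = ?", (int(i),)).fetchone()
    print(f"{dist:.3f}\t{text}")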

Benchmark

The test data, Amazon Fine Food Reviews, has 568,428 rows and is 290 MB in size.

After embedding, the test data is about 19.5 GB in size.

The IVFPQ index is around 60 MB, and the SQLite database is around 300 MB (with just the combined column stored).

568,428 rows / 19.5 GB ≈ 29,150.15 rows/GB

Determining the --chunk_size parameter for GPT_embedding.py

To run with 3 GB of RAM and 12 processes (via multiprocessing), the --chunk_size parameter should be no greater than

29,150.15 rows/GB × 3 GB ≈ 87,450 rows per process
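
A quick back-of-the-envelope check of that bound:

# Sanity check of the chunk_size bound above.
rows_per_gb = 568428 / 19.5     # ~29,150 rows per GB of embedded data
max_chunk = rows_per_gb * 3     # rows that fit in 3 GB of RAM
print(round(max_chunk))         # 87450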

In practice this number should be set to at most 1/2 of the maximum, to allow for worst-case situations. Here's an example command to build embeddings for the test data on a server:

nohup python3 GPT_embedding/GPT_embedding.py -i GPT_embedding/Reviews_1k.csv -o /disk3/GPT_embedding_output/Reviews_embedding.csv --out_format csv -c Summary Text --chunk_size 10000 --process 12 > process.log 2>&1 &

This runs the script in the background. To see all the processes that are running, use:

ps aux | grep GPT_embedding.py

## To kill all the processes running
ps aux | grep GPT_embedding.py | grep -v grep | awk '{print $2}' | xargs kill

You can check the progress occasionally by looking at the tail of process.log:

tail -n 50 process.log