sgt1796 / GPT_embedding


[Dockerization] need to handle .env path #10

Closed sgt1796 closed 2 months ago

sgt1796 commented 2 months ago

This issue should be resolved before #8

GPT_embedding.py cannot handle a .env file that is not in the same folder as the script.

The .env file cannot be stored in the Docker image; it has to be mounted via the -v option of docker run. This means the .env file will not sit in the same folder as the scripts.

By default, load_dotenv() looks for .env in the current folder. It is loaded at the beginning of the module, and then the client is created:

import backoff
from dotenv import load_dotenv
from openai import OpenAI, RateLimitError

load_dotenv()
client = OpenAI()

@backoff.on_exception(backoff.expo, RateLimitError, max_time=30)
def _get_embedding(text):
    # This function uses the module-level "client"
    ...

def ...

Adding a user-customizable --env option means this declaration has to move from the top of the module into main(), which makes the client a local variable:

@backoff.on_exception(backoff.expo, RateLimitError, max_time=30)
def _get_embedding(text):
    # This function uses "client"
    ...

def ...

def main(args):
    # These become local variables
    load_dotenv(args.env)
    client = OpenAI()
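The scoping problem can be reproduced without the SDK. In this sketch, StubClient and its embed() method are hypothetical stand-ins for the OpenAI client, just to avoid a real API call:

```python
class StubClient:
    """Hypothetical stand-in for OpenAI(); avoids a real API call."""
    def embed(self, text):
        return [0.0]

def _get_embedding(text):
    # looks up `client` in the module's global scope at call time
    return client.embed(text)

def main():
    client = StubClient()  # local to main(); module-level helpers cannot see it
    return _get_embedding("hello")

try:
    main()
except NameError as e:
    print(e)  # name 'client' is not defined
```

Because Python resolves `client` in the helper's enclosing module scope, a client created inside main() is simply invisible to _get_embedding().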

Attempt 1

Declare the client in main(), and use init() to initialize the client globally in each worker process:

## initialize for each worker
def init(count, chunks, embedding_model, GPT_client):
    global counter, nchunks, EMBEDDING_MODEL, client
    counter, nchunks = count, chunks
    EMBEDDING_MODEL = embedding_model
    client = GPT_client

@backoff.on_exception(backoff.expo, RateLimitError, max_time=30)
def _get_embedding(text):
    # This function uses "client"
    ...

def ...

def main(args):
    # These become local variables
    load_dotenv(args.env)
    GPT_client = OpenAI()
    ...
    ...
    with multiprocessing.Pool(process, initializer=init, initargs=(counter, nchunks, embedding_model, GPT_client)) as pool:
        ...

This causes the client to be duplicated across the worker processes and raises an error.

Attempt 2

Pass the client as a parameter to the functions that need it:

# init() does not take the client in this attempt
def init(count, chunks, embedding_model):
    global counter, nchunks, EMBEDDING_MODEL
    counter, nchunks = count, chunks
    EMBEDDING_MODEL = embedding_model

@backoff.on_exception(backoff.expo, RateLimitError, max_time=30)
def _get_embedding(text, client):
    # This function uses "client"
    ...

def process_chunk(chunk_index, chunk, client):
    ...
    _get_embedding(text, client)
    ...

def main(args):
    # These become local variables
    load_dotenv(args.env)
    client = OpenAI()
    ...
    ...
    with multiprocessing.Pool(process, initializer=init, initargs=(counter, nchunks, embedding_model)) as pool:
        results = pool.starmap(process_chunk, [(i, chunk, client) for i, chunk in enumerate(lst)])

This does not work either and raises an error:

Error while running: cannot pickle '_thread.RLock' object

The error occurs because the OpenAI client object contains a _thread.RLock, which cannot be pickled. The multiprocessing library uses pickling to pass objects between processes, so it fails when it encounters the RLock.
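The failure is easy to reproduce without the SDK; any object holding an RLock behaves the same way (ClientLike below is a made-up stand-in, not the real client class):

```python
import pickle
import threading

class ClientLike:
    """Made-up stand-in: holds a lock, as the real OpenAI client does internally."""
    def __init__(self):
        self._lock = threading.RLock()

try:
    pickle.dumps(ClientLike())
except TypeError as e:
    print(e)  # cannot pickle '_thread.RLock' object
```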

ronaldlindev commented 2 months ago

I'm pretty sure you can pass an absolute path into load_dotenv(), which should solve the issue. I'll take a look.
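For reference, python-dotenv's load_dotenv() accepts a dotenv_path argument for exactly this. The idea can be sketched with a minimal stdlib stand-in (load_env_file and the GPT_EMBEDDING_DEMO_KEY variable are hypothetical, for illustration only):

```python
import os
import tempfile

def load_env_file(path):
    """Minimal stand-in for dotenv.load_dotenv(dotenv_path=path):
    parse KEY=VALUE lines from an arbitrary absolute path."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())

# the file can live anywhere, e.g. a docker-mounted config directory
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as f:
    f.write("GPT_EMBEDDING_DEMO_KEY=demo-value\n")
load_env_file(f.name)
print(os.environ["GPT_EMBEDDING_DEMO_KEY"])  # demo-value
```

Since the path is explicit, the current working directory of the container no longer matters.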

sgt1796 commented 2 months ago

Fixed. load_dotenv() is now called in main(), and client creation has moved into init(), so GPT_embedding initializes a new client in each worker when the Pool starts:

# the API key is loaded by load_dotenv(args.env) in main()
def init(count, chunks, embedding_model):
    global counter, nchunks, EMBEDDING_MODEL, client
    counter, nchunks = count, chunks
    EMBEDDING_MODEL = embedding_model

    # initialize a new OpenAI client for the workers
    client = OpenAI()
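The pattern generalizes: construct unpicklable resources inside each worker via the Pool's initializer, so nothing has to cross the process boundary. A runnable sketch, with a plain dict standing in for OpenAI():

```python
import multiprocessing
import os

client = None  # set per worker by init()

def init():
    global client
    # each worker constructs its own resource after it starts,
    # so the (unpicklable) object is never pickled
    client = {"pid": os.getpid()}  # stand-in for OpenAI()

def work(i):
    return (i, client["pid"])

if __name__ == "__main__":
    with multiprocessing.Pool(2, initializer=init) as pool:
        results = pool.map(work, range(4))
    print(sorted(i for i, _ in results))  # [0, 1, 2, 3]
```

Each worker ends up with its own client, which also avoids sharing one HTTP connection pool across processes.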

A --env option has been added to the CLI.

More detail for the change: c4e32e029ff0595394b6cad37a352d8cbac68d0d