ijyliu / data-engineering-project

Building and benchmarking a vector similarity search and retrieval augmented generation (RAG) system with Milvus

Milvus Performance Estimation #3

Closed ijyliu closed 5 months ago

ijyliu commented 7 months ago

Find ways to do performance testing in Milvus, similar to EXPLAIN ANALYZE in SQL databases. Try to get timing, query plans, etc.

ijyliu commented 7 months ago

Here's what I think would be a good set of tests.

Testing Indexes

Write a reference query ("Does Apple sell my information to third-party advertisers?" for example; use something like the code below from the RAG folder to format your query text!) and try it out for the different index types (FLAT, IVF_FLAT, etc.) with a fixed metric (L2, I guess): https://milvus.io/docs/index.md. For each index type, get the query time. Also get the similarity score of the most similar vector returned. We can compare this score with the one from the exhaustive FLAT search to see how much accuracy is lost when using approximate nearest neighbor search (ANNS) methods such as IVF.

def get_mixedbread_of_query(model, query: str):
    '''
    Returns mixedbread embedding for an input text. Text is appropriately formatted to be a query.

    Parameters:
    - model: embedding model
    - query: str: The query to be transformed.
    '''
    transformed_query = f'Represent this sentence for searching relevant passages: {query}'
    return model.encode(transformed_query)

Be sure to reset the index to whatever it is right now (L2, IVF_FLAT, 128 I think) once you are done.
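A rough sketch of how the per-index comparison above could be organized. It is not tied to the actual notebook: `search_fns` is a hypothetical mapping from index type name to a wrapper around whatever Milvus search call we end up using, and each wrapper is assumed to return the L2 distance of the top hit. The index would need to be rebuilt between entries; that step is not shown here.

```python
import time


def time_search(search_fn, query_vector):
    """Run one search and return (elapsed_seconds, top1_distance).

    search_fn is a hypothetical wrapper around the real Milvus search call;
    it is assumed to return the L2 distance of the closest vector.
    """
    start = time.perf_counter()
    top1_distance = search_fn(query_vector)
    elapsed = time.perf_counter() - start
    return elapsed, top1_distance


def compare_index_types(search_fns, query_vector, baseline="FLAT"):
    """Time each index type once and compare its top-1 distance to the baseline.

    search_fns maps an index type name (e.g. "FLAT", "IVF_FLAT") to a callable.
    """
    results = {}
    for name, fn in search_fns.items():
        elapsed, distance = time_search(fn, query_vector)
        results[name] = {"seconds": elapsed, "top1_distance": distance}

    base = results[baseline]["top1_distance"]
    for row in results.values():
        # With L2, smaller means closer, so a positive gap means the
        # approximate index returned a worse match than exhaustive search.
        row["gap_vs_baseline"] = row["top1_distance"] - base
    return results
```

The gap is 0 for FLAT by construction, so any nonzero value for another index type is the accuracy lost on that query.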

Testing Queries

For a fixed metric and index, compare the retrieval time for a variety of queries ("What am I allowed to post on Facebook?", "How can I delete my Reddit account?", etc.)

ijyliu commented 7 months ago

@FayeL6

what you have in https://github.com/ijyliu/data-engineering-project/blob/main/Code/Load%20Embeddings%20and%20Create%20Index/test.ipynb is interesting

i'd skip the TheHersheyCompany queries since we already mention elsewhere milvus is slow for string queries.

for the vector search, i'm not sure it's a good idea to search with a vector you got back as the first result of another query, because i'm not really sure in what order milvus returns the results. essentially, the timing for that first result might be biased/artificially low. so i would try to either generate a vector, or maybe do something like find the vector at the median index and then run the search for that
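A minimal sketch of the two options for picking a query vector, assuming the embeddings are available as a plain Python list of vectors. Both function names are made up for illustration, and whether the stored embeddings are normalized is an assumption that should be checked before using a unit vector.

```python
import math
import random


def random_unit_vector(dim, seed=None):
    """Generate a random vector of length 1 with the given dimension.

    dim should match the embedding dimension of the collection.
    Normalizing to unit length is an assumption about the stored embeddings.
    """
    rng = random.Random(seed)
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def median_index_vector(vectors):
    """Return the vector stored at the middle position of the list."""
    return vectors[len(vectors) // 2]
```

Passing a fixed `seed` keeps the generated vector the same across runs, so timings for different index types are measured against the same query.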

I think the stuff i listed above about testing indices will be super interesting. testing queries may be less so, because the items you search for can be kind of arbitrary, but maybe still good to have

ijyliu commented 7 months ago

i'd also say we should put the timing results for things in this issue (where there's actually a comparison, not a one-off thing) into bar plots with the time in seconds labelled above the bar
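Something like the following could produce those bar plots. It is only a sketch: `plot_timings` is a made-up helper name, the input is assumed to be a dict of label to measured seconds, and `bar_label` needs Matplotlib 3.4 or newer.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend, so this also works without a display
import matplotlib.pyplot as plt


def plot_timings(timings, title="Milvus search time"):
    """Draw one bar per entry, with the time in seconds written above each bar.

    timings maps a label (e.g. an index type) to a measured time in seconds.
    """
    names = list(timings)
    seconds = [timings[name] for name in names]

    fig, ax = plt.subplots()
    bars = ax.bar(names, seconds)
    # Label each bar with its value in seconds, slightly above the bar top
    ax.bar_label(bars, labels=[f"{s:.4f} s" for s in seconds], padding=3)
    ax.set_ylabel("Time (seconds)")
    ax.set_title(title)
    return fig
```

Call it with the measured values and then save the figure with `fig.savefig(...)`.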

FayeL6 commented 7 months ago

Thanks for the advice! I'll try to do the comparison tomorrow.


ijyliu commented 7 months ago

potentially compute as averages over 1,000 or 100 queries

FayeL6 commented 7 months ago

Do you mean the average of the top 100 returned sentences? That's my understanding.


ijyliu commented 7 months ago

No, running the query 100 times. So we aren't just relying on one run

And you can run a t-test or something if you want lol
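To make the "run it 100 times" idea concrete, here is a small sketch using only the standard library. `search_fn` is again a hypothetical wrapper around the real search call, and the statistic shown is Welch's t (unequal variances), which is one reasonable choice here rather than the only one.

```python
import math
import statistics
import time


def time_repeated(search_fn, query_vector, n_runs=100):
    """Run the same search n_runs times and return the list of timings in seconds."""
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        search_fn(query_vector)
        timings.append(time.perf_counter() - start)
    return timings


def welch_t(sample_a, sample_b):
    """Welch's t statistic for the difference in means of two timing samples.

    Each sample needs at least two values, and at least one sample must have
    nonzero variance, otherwise the standard error is zero.
    """
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error
```

The mean of each timing list is what would go into the comparison. If SciPy is installed, `scipy.stats.ttest_ind(a, b, equal_var=False)` computes the same statistic and also returns a p-value.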
