run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

When is the embedding generated by llama-index for the input text? #3780

Closed suraj-gade closed 1 year ago

suraj-gade commented 1 year ago

Hi,

Below is the code that I am using to run inference with the FastChat LLM.

from llama_index import GPTListIndex, SimpleDirectoryReader, GPTVectorStoreIndex, PromptHelper, LLMPredictor
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index import LangchainEmbedding, ServiceContext
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = 'lmsys/fastchat-t5-3b-v1.0'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

from transformers import pipeline

pipe = pipeline(
    "text2text-generation", model=model, tokenizer=tokenizer,
    max_length=1024, temperature=0, top_p=1, no_repeat_ngram_size=4, early_stopping=True
)

from langchain.llms import HuggingFacePipeline
llm = HuggingFacePipeline(pipeline=pipe)

embed_model = LangchainEmbedding(HuggingFaceEmbeddings())

# set maximum input size
max_input_size = 2048
# set number of output tokens
num_outputs = 512
# set maximum chunk overlap
max_chunk_overlap = 20
# set chunk size limit
chunk_size_limit = 512
prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap)

service_context = ServiceContext.from_defaults(embed_model=embed_model, llm_predictor=LLMPredictor(llm), prompt_helper=prompt_helper, chunk_size_limit=chunk_size_limit)

# build index
documents = SimpleDirectoryReader('data').load_data()

new_index = GPTListIndex.from_documents(documents, service_context=service_context)

# query with embed_model specified
query_engine = new_index.as_query_engine(
    retriever_mode="embedding", 
    verbose=True,
    #streaming=True,
    similarity_top_k=1
    #service_context=service_context
)

response = query_engine.query("sample query question?")

Here the "data" folder has my full input text in pdf format, and am using the llama_index and langchain pipeline to build the index on that and fetch the relevant chunk to generate the prompt with context and query the FastChat model as shown in the code.

I want to understand when llama_index generates the embeddings for the input text from the "data" folder. Is it at index construction (new_index = GPTListIndex.from_documents(documents, service_context=service_context)) that the embeddings are generated for all the nodes/chunks of the document, or at query time (query_engine.query("sample query question?")), when the chunk/node whose embedding is most similar to that of the input prompt is fetched?

Please help me understand at what point llama_index generates the embeddings.

logan-markewich commented 1 year ago

Embeddings are generated for all the documents during index construction.

At query time, only the query text is embedded, and then the query + the relevant nodes are sent to the LLM to generate a response.
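
One rough way to see this for yourself (a minimal sketch, assuming the same langchain HuggingFaceEmbeddings wrapper from your snippet; LoggedHuggingFaceEmbeddings is just an illustrative name) is to subclass the embedding class and print whenever it is called:

from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index import LangchainEmbedding

class LoggedHuggingFaceEmbeddings(HuggingFaceEmbeddings):
    # hypothetical wrapper, for illustration only: prints whenever texts are embedded
    def embed_documents(self, texts):
        print(f"embedding {len(texts)} document chunk(s)")
        return super().embed_documents(texts)

    def embed_query(self, text):
        print("embedding query text")
        return super().embed_query(text)

embed_model = LangchainEmbedding(LoggedHuggingFaceEmbeddings())
# pass this embed_model into ServiceContext.from_defaults(...) as before;
# with a vector index the "document chunk" lines appear while the index is built,
# and only the "query text" line appears on each .query(...) call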

suraj-gade commented 1 year ago

Hi @jerryjliu, @logan-markewich, thanks for the response.

In the llama_index documentation here, it says that for the List Index, the embeddings are generated during query() and not during index construction.

Actually, my goal is to generate the embeddings during index construction, assuming this will reduce the inference time at query time.

Please have a look and let me know your thoughts.

logan-markewich commented 1 year ago

@suraj-gade ah, I missed that you were using a list index. You would likely be more interested in using GPTVectorStoreIndex, for example:
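
A minimal sketch of that change, reusing the documents and service_context from your snippet (only the index class differs):

from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader('data').load_data()

# embeddings for every chunk are computed here, at index construction time
vector_index = GPTVectorStoreIndex.from_documents(
    documents, service_context=service_context
)

# at query time only the question itself is embedded, and the top-k most
# similar chunks are retrieved and sent to the LLM
query_engine = vector_index.as_query_engine(similarity_top_k=1)
response = query_engine.query("sample query question?")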