run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

How to use llama index with alpaca locally #928

Closed 1Mark closed 1 year ago

1Mark commented 1 year ago

I want to use LlamaIndex, but I don't want any of my data to be transferred to any servers. I want it all to happen locally or within my own EC2 instance. I have seen https://github.com/jerryjliu/llama_index/blob/046183303da4161ee027026becf25fb48b67a3d2/docs/how_to/custom_llms.md#example-using-a-custom-llm-model but it calls Hugging Face.

My plan was to use https://github.com/cocktailpeanut/dalai with the alpaca model then somehow use llamaindex to input my dataset. Any examples or pointers for this?

logan-markewich commented 1 year ago

@1Mark you just need to replace the huggingface stuff with your code to load/run alpaca

Basically, you need to code the model loading, putting text through the model, and returning the newly generated outputs.

It's going to be different for every model, but it's not too bad 😄

1Mark commented 1 year ago

@1Mark you just need to replace the huggingface stuff with your code to load/run alpaca

Basically, you need to code the model loading, putting text through the model, and returning the newly generated outputs.

It's going to be different for every model, but it's not too bad 😄

Thank you. Do you have any examples?

logan-markewich commented 1 year ago

@1Mark I personally haven't used llama or alpaca. How are you loading the model and generating text right now?

here's a very rough example with some fake functions to kind of show what I mean

from typing import Any, List, Mapping, Optional

from langchain.llms.base import LLM


def load_alpaca():
    # placeholder -- load the alpaca model however you normally do
    ...
    return model


class CustomLLM(LLM):
    model_name = "alpaca"
    model = load_alpaca()

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        prompt_length = len(prompt)

        response_text = self.model(prompt)

        # only return newly generated tokens
        return response_text[prompt_length:]

    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        return {"name_of_model": self.model_name}

    @property
    def _llm_type(self) -> str:
        return "custom"
gianfra-t commented 1 year ago

Hi @1Mark. When you use something like the example in the link above, you download the model from Hugging Face, but the inference (the call to the model) happens on your local machine. Your data does not go to Hugging Face. You could even verify this by loading a very large model: you would probably run out of VRAM, or of RAM if running on CPU. For instance, you could use the tloen/alpaca-lora-7b implementation. If you want to use something like dalai (i.e. something running a llama.cpp instance), you need an implementation that exposes the model behind a server with an API. I don't know of such an implementation at the moment, but it should be very simple to build.
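
For illustration, here is a rough, untested sketch of that server-based approach. The endpoint URL and response JSON shape are made up; adapt them to whatever API your dalai / llama.cpp server actually exposes.

from typing import Any, List, Mapping, Optional

import requests
from langchain.llms.base import LLM


class LocalServerLLM(LLM):
    # hypothetical endpoint of a locally running inference server
    endpoint: str = "http://localhost:8000/generate"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        # send the prompt to the local server and return only the generated text
        response = requests.post(self.endpoint, json={"prompt": prompt})
        response.raise_for_status()
        return response.json()["text"]

    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        return {"endpoint": self.endpoint}

    @property
    def _llm_type(self) -> str:
        return "local-server"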

jerryjliu commented 1 year ago

If someone's able to get alpaca or llama working with llamaindex lmk! would be a cool demo to show :)

1Mark commented 1 year ago

Hi @1Mark. When you use something like the example in the link above, you download the model from Hugging Face, but the inference (the call to the model) happens on your local machine. Your data does not go to Hugging Face. You could even verify this by loading a very large model: you would probably run out of VRAM, or of RAM if running on CPU. For instance, you could use the tloen/alpaca-lora-7b implementation. If you want to use something like dalai (i.e. something running a llama.cpp instance), you need an implementation that exposes the model behind a server with an API. I don't know of such an implementation at the moment, but it should be very simple to build.

tloen/alpaca-lora-7b doesn't seem to have its own inference api https://huggingface.co/tloen/alpaca-lora-7b#:~:text=Unable%20to%20determine%20this%20model%E2%80%99s%20pipeline%20type.%20Check%20the%20docs%20%20.

1Mark commented 1 year ago

This issue here seems quite relevant https://github.com/tloen/alpaca-lora/issues/45

logan-markewich commented 1 year ago

@1Mark the code in that repo (i.e. generate.py) could easily be adapted to work with llama_index. Just move the model loading and inference code into the custom LLM class.
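
As a rough, untested sketch of what that could look like (not the actual generate.py code): the model and adapter names below are just the ones mentioned in this thread, and the generation settings are guesses.

from typing import Any, List, Mapping, Optional

import torch
from langchain.llms.base import LLM
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer


def load_alpaca_lora():
    tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
    model = LlamaForCausalLM.from_pretrained(
        "decapoda-research/llama-7b-hf",
        load_in_8bit=True,
        device_map="auto",
    )
    # apply the alpaca-lora adapter weights on top of the base model
    model = PeftModel.from_pretrained(model, "tloen/alpaca-lora-7b")
    model.eval()
    return tokenizer, model


tokenizer, model = load_alpaca_lora()


class AlpacaLoraLLM(LLM):
    max_new_tokens: int = 256

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            output_ids = model.generate(**inputs, max_new_tokens=self.max_new_tokens)
        text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        # return only the newly generated part, as in the example above
        return text[len(prompt):]

    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        return {"base_model": "decapoda-research/llama-7b-hf"}

    @property
    def _llm_type(self) -> str:
        return "alpaca-lora"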

knoopx commented 1 year ago

Something along these lines works with pip -q install git+https://github.com/huggingface/transformers:

from transformers import LlamaTokenizer, LlamaForCausalLM, pipeline
from langchain.chains import LLMChain
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate

tokenizer = LlamaTokenizer.from_pretrained("chavinlo/alpaca-native")

base_model = LlamaForCausalLM.from_pretrained(
    "chavinlo/alpaca-native",
    load_in_8bit=True,
    device_map='auto',
)

pipe = pipeline(
    "text-generation",
    model=base_model,
    tokenizer=tokenizer,
    max_length=256,
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.2
)

local_llm = HuggingFacePipeline(pipeline=pipe)

# define whatever prompt template you want to use
prompt = PromptTemplate(input_variables=["instruction"], template="{instruction}")
llm_chain = LLMChain(prompt=prompt, llm=local_llm)
logan-markewich commented 1 year ago

@knoopx nice! So if that's wrapped into the CustomLLM class from above and passed to LLMPredictor as the LLM, the integration should work!

How well it works is up to the model though lol
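
For illustration, an untested sketch of that wiring, assuming the pipe object from the snippet above is already built (the prompt helper sizes are guesses):

from typing import Any, List, Mapping, Optional

from langchain.llms.base import LLM
from llama_index import GPTListIndex, LLMPredictor, PromptHelper, ServiceContext, SimpleDirectoryReader


class AlpacaPipelineLLM(LLM):
    model_name: str = "chavinlo/alpaca-native"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        # `pipe` is the transformers pipeline built in the snippet above
        output = pipe(prompt)[0]["generated_text"]
        # only return the newly generated tokens
        return output[len(prompt):]

    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        return {"name_of_model": self.model_name}

    @property
    def _llm_type(self) -> str:
        return "custom"


llm_predictor = LLMPredictor(llm=AlpacaPipelineLLM())
# rough sizes for a 2048-token model; tune for your setup
prompt_helper = PromptHelper(2048, 256, 20)
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)

# a list index avoids needing an embedding model; a vector index would also
# need a local embed_model to stay fully offline
documents = SimpleDirectoryReader("data").load_data()
index = GPTListIndex.from_documents(documents, service_context=service_context)
print(index.query("What is this document about?"))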

Tavish77 commented 1 year ago

Can I combine your code in this way: LLMPredictor(llm=local_llm)?

logan-markewich commented 1 year ago

@Tavish77 not quite. You'll still need to wrap it in that class that extends the LLM class. I had an example posted further above 👍🏻

Then you instantiate that class and pass it in like you did there

shreedhan commented 1 year ago

@logan-markewich I'm trying to combine the examples you posted above. What do you return as the model from the load_alpaca() method? Do you return llm_chain? Can you post the full example here?

donflopez commented 1 year ago

Hey, I'm loading a model with peft.PeftModel.from_pretrained, following the instructions in this thread and in here, but I get multiple errors:

The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 0 tokens
caaaaling
Token indices sequence length is longer than the specified maximum sequence length for this model (1622 > 1024). Running this sequence through the model will result in indexing errors
/home/donflopez/.local/lib/python3.10/site-packages/transformers/generation/utils.py:1219: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
  warnings.warn(
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [162,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [162,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [162,0,0], thread: [66,0,0] Assertion `srcIndex 
....many more with the same...
Traceback (most recent call last):
....many hops...
    x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

Does anybody know what's going on? Thanks!

EDIT for adding more context:

If I use the model 'decapoda-research/llama-7b-hf' I get an error like:

ValueError: `.to` is not supported for `8-bit` models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct `dtype`.
devinSpitz commented 1 year ago

The code in the attached file is as far as I got with llama_index. Does someone know what I'm doing wrong?

alpaca_llama_index.txt

The exception happens during the pipeline command:

Traceback (most recent call last):
  File "/workspace/LLama-Hub/main2.py", line 68, in <module>
    class CustomLLM(LLM):
  File "/workspace/LLama-Hub/main2.py", line 79, in CustomLLM
    pipeline = pipeline(
  File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/init.py", line 979, in pipeline
    return pipeline_class(model=model, framework=framework, task=task, kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 63, in init
    super().init(*args, kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/base.py", line 773, in init
    self.model.to(device)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  [Previous line repeated 6 more times]
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data! 
donflopez commented 1 year ago

I got it to work here -> https://github.com/donflopez/alpaca-lora-llama-index/blob/main/generate.py

It is not perfect, but works...

Fritskee commented 1 year ago

I got it to work here -> https://github.com/donflopez/alpaca-lora-llama-index/blob/main/generate.py

It is not perfect, but works...

@donflopez In order to get your code running, I had to install transformers 4.28.0.dev0 (so building from github), but I'm still getting the following error now:

RuntimeError: Failed to import transformers.pipelines because of the following error (look up to see its traceback):
cannot import name 'BertTokenizerFast' from 'transformers.models.bert' 

Did you encounter this at all? (and how did you fix it?)

h1f0x commented 1 year ago

@donflopez On what hardware did you run the model like this? My RTX 4090 sadly hits its limit. @devinSpitz Did you get that sorted out? I got the same issue with a modified version myself, any luck so far?

devinSpitz commented 1 year ago

@h1f0x I could get @donflopez's repo to work, but I always got completely wrong answers or sometimes nothing (roughly the same as I now get with this version xD). With it I was able to get further, but still with no usable response.

The model that should have "read" the documents (the Llama document and the PDF from the repo) does not give any useful answer anymore.

This was with base_model = circulus/alpaca-7b and the LoRA weights circulus/alpaca-lora-7b. I tried other models and combinations but did not get any better results :(

Question: What do you think of Facebook's LlaMa? Before response: I think Facebook’s LLAMA (Learn, Launch and Maintain Audience) initiative is an excellent program which can help businesses of all sizes to reach their target audiences more effectively. It provides valuable resources such as training materials, tools and best practices for launching, maintaining and engaging with an audience on social media platforms. After: Output should include references to sources where applicable.

This shows that something does work, or at least doesn't break? Question: What is the capital of England? Before response: The capital of England is London. After: The capital of England is London.

Question: What are alpacas? and how are they different from llamas? Before response: Alpacas are small, domesticated animals related to camels and native to South America. They are typically smaller than llamas and have finer fleeces which make them ideal for fiber production. Alpacas are also more docile and easier to handle than llamas. After: Output should include references to sources used to create the output.

Code: https://gist.github.com/devinSpitz/73cd7037b82d7acbe70ddf4d1c61ba4a

alpaca_Llama_index_output.txt

donflopez commented 1 year ago

@donflopez On what hardware did you run the model like this? My RTX 4090 sadly hits its limit.

@h1f0x I'm running on a 4090 too; yes, multiple executions fail, and you also cannot go beyond 1 beam.

I'm trying to figure out why this happens. When querying the raw model, this does not happen, so it probably has something to do with llama_index + the pipeline setup.

@devinSpitz, I also have weird results. Please note that in my code I have a . as the stop sequence. I'm still trying to find a stop sequence that works properly for llama_index. For me, the main issue with the model is that it tries to repeat the llama_index prompt as a pattern instead of stopping at the right place.
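
One untested way to handle that is to cut the generated text yourself at a stop sequence inside the custom LLM's _call, instead of relying on the model to stop on its own. The stop strings below are only examples:

from typing import List, Optional


def truncate_at_stop(text: str, stop: Optional[List[str]] = None) -> str:
    # keep everything up to the first occurrence of any stop string
    for token in stop or []:
        idx = text.find(token)
        if idx != -1:
            text = text[:idx]
    return text


# inside CustomLLM._call, after generating `new_text`:
#     return truncate_at_stop(new_text, stop or ["\n\n", "Question:"])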

donflopez commented 1 year ago

I'm getting this with - as the stop sequence: a bunch of nonsense after the first dot, the VRAM goes up to 23.5 GB, and after that it runs OOM.

Question: How many people lives in Martos?
Answer: According to data provided by INE, there are currently approximately 24 thousand two hundred seventeen residents living within the municipal boundaries of Martos. # Lijst van voetbalinterlands Oman - Saudi Arabië

Deze lijst van voetbalinterlands geeft een overzicht van alle officiële interlands tussen het nationale elftal van Oman en dat van Saudi-
donflopez commented 1 year ago

@devinSpitz I got this output by tweaking your script to make it work with the index; llama still doesn't know when to stop. Using - as the stop sequence. -> https://gist.github.com/donflopez/535e5ecb85b79233c7cf74fd977eb87f

Improved it, here is the latest output: https://gist.github.com/donflopez/39bb9bc34cc00467679f10bab3e4a734

@h1f0x Looks like the OOM issue doesn't happen in the script, so could it be gradio copying the resources when making a request? I have no idea how gradio works tbh, but if I move things out of gradio, there's no OOM.

ReconIII commented 1 year ago

I have been trying to get this to work as well, but keep running into issues with sentencepiece: return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg) TypeError: not a string

Anyone else having this or any suggestions? Thanks!

juanps90 commented 1 year ago

Just like inference with OpenAI APIs doesn't happen locally, is there any way to use HTTP requests to send the prompts to a server exposing any LLM like Alpaca via HTTP? I feel like it would be easier if we could decouple the LLM.

Tavish77 commented 1 year ago

@Tavish77 not quite. You'll still need to wrap it in that class that extends the LLM class. I had an example posted further above 👍🏻

Then you instantiate that class and pass it in like you did there

Thank you, I have solved my problem.

h1f0x commented 1 year ago

@donflopez Many thanks for your feedback! I got it working CPU-only later that evening, but I also needed to change the paging settings in Windows itself to get it working. I hope I can try some new settings soon. Gradio is a mystery to me as well :D At least so far... I'm looking into that more deeply too. If I find anything I'll let you know!

@devinSpitz At least you get some output; I was not able to produce that haha, but I guess that's because of some strange behavior when running on CPU. :)

Tavish77 commented 1 year ago

@devinSpitz I also encountered this issue on the 4090, but it runs normally on other devices. Have you solved it yet?

Tavish77 commented 1 year ago

@devinSpitz I also encountered this issue on the 4090, but it runs normally on other devices. Have you solved it yet?

I have already resolved it

masknetgoal634 commented 1 year ago

If you have an issue with the 4090, try installing the new driver 525.105.17: https://www.nvidia.com/Download/driverResults.aspx/202351/en-us/

devinSpitz commented 1 year ago

@Tavish77 @masknetgoal634 Thanks both of you, yes I'm using a 4090 so I will update the driver and try it again :D

@donflopez thanks as well, you are right about the stop sequence; "-" is a little bit better but still not good :(

@h1f0x Yes that's right xD But I still want to get it working :D

devinSpitz commented 1 year ago

@masknetgoal634 I'm already on a newer driver xD (screenshot attached)

@Tavish77 how did you solve it?

karlklaustal commented 1 year ago

I made this work in a Colab notebook with LlamaIndex and the GPT4All model, but you can only load small bits of text with LlamaIndex; if you load more text, the Colab (non-Pro) crashes. Sure... sorry, my quota on Colab is always at max, so I'll just paste this:

https://pastebin.com/mGuhEBQS

I copied this from my local Jupyter notebook, so be aware that some headings are not code, like:

" Load GPT4ALL-LORA Model"

Hope this helps. I'm now trying to swap the GPT4ALL-LoRA for a 4-bit version, but I am somehow stuck.

I only have a 6 GB GPU.

Tavish77 commented 1 year ago

@devinSpitz

@Tavish77 how did you solve it?

I switched to a cloud GPU server.

masknetgoal634 commented 1 year ago

@masknetgoal634 I'm already on a newer driver xD (screenshot attached)

As far as I know, the fix for the 4090 is only in 525.105.17.

ddb21 commented 1 year ago

Anybody make progress on this? Is it possible to use the CPU optimized (alpaca.cpp, etc) versions of Llama for creating embeddings or is a cloud service the only option here?

logan-markewich commented 1 year ago

@ddb21 you should be able to use llama.cpp (or any LLM that langchain has implemented) by wrapping the LLM with the LLMPredictor class.

https://github.com/hwchase17/langchain/tree/master/langchain/llms

And here are the docs for using any custom model: https://gpt-index.readthedocs.io/en/latest/how_to/customization/custom_llms.html#example-using-a-custom-llm-model

And here are a bunch of examples implementing various LLMs:

https://github.com/autratec/GPT4ALL_Llamaindex

https://github.com/autratec/dolly2.0_3b_HFembedding_Llamaindex

https://github.com/autratec/koala_hfembedding_llamaindex

Just need to make sure you set up the prompt helper/service context appropriately for the input size of each model
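
For example, an untested sketch of the llama.cpp route via langchain's LlamaCpp wrapper; the model path and the prompt helper sizes are placeholders for your own model file and its context window:

from langchain.llms import LlamaCpp
from llama_index import LLMPredictor, PromptHelper, ServiceContext

# placeholder path to a local ggml model file
llm = LlamaCpp(model_path="./models/ggml-alpaca-7b-q4.bin", max_tokens=256)
llm_predictor = LLMPredictor(llm=llm)

# llama.cpp models have a small context window, so size the prompt helper accordingly
prompt_helper = PromptHelper(2048, 256, 20)
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    prompt_helper=prompt_helper,
)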

iamadhee commented 1 year ago

@logan-markewich I tried out your approach with llama_index and langchain, with a custom class that I built for OpenAI's GPT-3.5 model. But it seems that llama_index is not recognizing my CustomLLM as one of langchain's models; it is defaulting to its own GPT-3.5 model. What am I doing wrong here? Attaching the code and the logs. Thanks in advance.

from openAIComplete import OpenAI
from langchain.llms.base import LLM
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

OPENAI_API_KEY = 'API KEY'
yo =OpenAI(api_key=OPENAI_API_KEY,model='gpt-3.5-turbo')

class CustomLLM(LLM):
    model_name = 'OpenAI GPT-3'

    @property
    def _llm_type(self) -> str:
        return "custom"

    def _call(self, prompt: str,stop:str=None):
        if stop is not None:
            raise ValueError("stop kwargs are not permitted.")
        print(prompt)
        res = yo.run(prompt)
        return res 

    @property
    def _identifying_params(self):
        return {"name_of_model": self.model_name}

yo2 = CustomLLM()

from llama_index import LLMPredictor, ServiceContext, GPTListIndex, GPTSimpleVectorIndex, SimpleDirectoryReader, PromptHelper, LangchainEmbedding

def chatbot(directory_path, input_text):
    max_input_size = 4096
    num_outputs = 512
    max_chunk_overlap = 20
    chunk_size_limit = 600

    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)

    llm_predictor = LLMPredictor(llm=CustomLLM())

    service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper) # , embed_model=embed_model

    documents = SimpleDirectoryReader(directory_path).load_data()

    index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)

    index.save_to_disk('index.json')

    index = GPTSimpleVectorIndex.load_from_disk('index.json')
    response = index.query(input_text, response_mode="compact",service_context=service_context)
    return response.response

print(chatbot('models/','Hi, what is this document about?'))
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 2721 tokens
Traceback (most recent call last):
  File "/workspaces/docify/models/test.py", line 55, in <module>
    print(chatbot('models/','Hi, what is this document about?'))
  File "/workspaces/docify/models/test.py", line 49, in chatbot
    index = GPTSimpleVectorIndex.load_from_disk('index.json')
  File "/home/codespace/.python/current/lib/python3.10/site-packages/llama_index/indices/base.py", line 369, in load_from_disk
    return cls.load_from_string(file_contents, **kwargs)
  File "/home/codespace/.python/current/lib/python3.10/site-packages/llama_index/indices/base.py", line 345, in load_from_string
    return cls.load_from_dict(result_dict, **kwargs)
  File "/home/codespace/.python/current/lib/python3.10/site-packages/llama_index/indices/vector_store/base.py", line 263, in load_from_dict
    return super().load_from_dict(result_dict, vector_store=vector_store, **kwargs)
  File "/home/codespace/.python/current/lib/python3.10/site-packages/llama_index/indices/base.py", line 322, in load_from_dict
    return cls(index_struct=index_struct, docstore=docstore, **kwargs)
  File "/home/codespace/.python/current/lib/python3.10/site-packages/llama_index/indices/vector_store/vector_indices.py", line 69, in __init__
    super().__init__(
  File "/home/codespace/.python/current/lib/python3.10/site-packages/llama_index/indices/vector_store/base.py", line 54, in __init__
    super().__init__(
  File "/home/codespace/.python/current/lib/python3.10/site-packages/llama_index/indices/base.py", line 69, in __init__
    self._service_context = service_context or ServiceContext.from_defaults()
  File "/home/codespace/.python/current/lib/python3.10/site-packages/llama_index/indices/service_context.py", line 69, in from_defaults
    llm_predictor = llm_predictor or LLMPredictor()
  File "/home/codespace/.python/current/lib/python3.10/site-packages/llama_index/llm_predictor/base.py", line 164, in __init__
    self._llm = llm or OpenAI(temperature=0, model_name="text-davinci-003")
  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for OpenAI
__root__
  Did not find openai_api_key, please add an environment variable `OPENAI_API_KEY` which contains it, or pass  `openai_api_key` as a named parameter. (type=value_error)

Note: Sorry about the clumsy code, I'm testing things out

iamadhee commented 1 year ago

To add more context, this is openAIComplete.py:

from baseModel import Model
import openai
import tiktoken

class OpenAI(Model):
    def __init__(self,
                 api_key: str,
                 model: str,
                 api_wait: int = 60,
                 api_retry: int = 6,
                 temperature: float = .7):
        super().__init__(api_key, model, api_wait, api_retry)

        self.temperature = temperature
        self._verify_model()
        self.set_key(api_key)
        self.encoder = tiktoken.encoding_for_model(self.model)
        self.max_tokens = self.default_max_tokens(self.model)

    def supported_models(self):
        return {
            "text-davinci-003": "text-davinci-003 can do any language task with better quality, longer output, and consistent instruction-following than the curie, babbage, or ada models. Also supports inserting completions within text.",
            "text-curie-001": "text-curie-001 is very capable, faster and lower cost than Davinci.",
            "text-babbage-001": "text-babbage-001 is capable of straightforward tasks, very fast, and lower cost.",
            "text-ada-001": "text-ada-001 is capable of very simple tasks, usually the fastest model in the GPT-3 series, and lowest cost.",
            "gpt-4": "More capable than any GPT-3.5 model, able to do more complex tasks, and optimized for chat. Will be updated with our latest model iteration.",
            "gpt-3.5-turbo": "  Most capable GPT-3.5 model and optimized for chat at 1/10th the cost of text-davinci-003. Will be updated with our latest model iteration",
        }

    def _verify_model(self):
        """
        Raises a ValueError if the current OpenAI model is not supported.
        """
        if self.model not in self.supported_models():
            raise ValueError(f"Unsupported model: {self.model}")

    def set_key(self, api_key: str):
        self._openai = openai
        self._openai.api_key = api_key

    def get_description(self):
        return self.supported_models()[self.model]

    def get_endpoint(self):
        model = openai.Model.retrieve(self.model)
        return model["id"]

    def default_max_tokens(self, model_name: str):
        token_dict = {
            "text-davinci-003": 4000,
            "text-curie-001": 2048,
            "text-babbage-001": 2048,
            "text-ada-001": 2048,
            "gpt-4": 8192,
            "gpt-3.5-turbo": 4096,
        }
        return token_dict[model_name]

    def calculate_max_tokens(self, prompt: str) -> int:

        prompt = str(prompt)
        prompt_tokens = len(self.encoder.encode(prompt))
        max_tokens = self.default_max_tokens(self.model) - prompt_tokens

        print(prompt_tokens, max_tokens)
        return max_tokens

    def run(self, prompt:str):

        if self.model in ["gpt-3.5-turbo"]:
            prompt_template = [
                {"role": "system", "content": "you are a helpful assistant."}
            ]
            prompt_template.append({"role": "user", "content": prompt})
            max_tokens = self.calculate_max_tokens(prompt_template)
            response = self._openai.ChatCompletion.create(
                model=self.model,
                messages=prompt_template,
                max_tokens=max_tokens,
                temperature=self.temperature,
            )
            return response["choices"][0]["message"]["content"].strip(" \n")

        else:
            max_tokens = self.calculate_max_tokens(prompt)
            response = self._openai.Completion.create(
                model=self.model,
                prompt=prompt,
                max_tokens=max_tokens,
                temperature=self.temperature,
            )
            return response["choices"][0]["text"].strip("\n")
iamadhee commented 1 year ago

Found the issue with mine. It seems that while instantiating another instance of GPTSimpleVectorIndex, I wasn't passing the service_context parameter.

index.save_to_disk('index.json')

index = GPTSimpleVectorIndex.load_from_disk('index.json',service_context=service_context)
scooter7 commented 1 year ago

Hi, I've developed a Streamlit app that uses llama-index with OpenAI. I'd like to avoid paying for OpenAI and instead leverage an open-source LLM that has no commercial restrictions, no token limits, and a hosted API. I've been looking at BLOOM - https://huggingface.co/bigscience/bloom - but I don't know how to call the Hugging Face model in a similar manner to what I have in my current code.

Does anyone know how I would adapt that code to work with Bloom from HuggingFace?

Thanks!

import logging
import streamlit as st
from gpt_index import SimpleDirectoryReader, GPTListIndex, GPTSimpleVectorIndex, LLMPredictor, PromptHelper
from langchain.chat_models import ChatOpenAI
import sys
from datetime import datetime
import os
from github import Github

if "OPENAI_API_KEY" not in st.secrets:
    st.error("Please set the OPENAI_API_KEY secret on the Streamlit dashboard.")
    sys.exit(1)

openai_api_key = st.secrets["OPENAI_API_KEY"]

logging.info(f"OPENAI_API_KEY: {openai_api_key}")

# Set up the GitHub API
g = Github(st.secrets["GITHUB_TOKEN"])
repo = g.get_repo("scooter7/CXBot")

def construct_index(directory_path):
    max_input_size = 4096
    num_outputs = 512
    max_chunk_overlap = 20
    chunk_size_limit = 600

    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)

    llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0.7, model_name="gpt-3.5-turbo", max_tokens=num_outputs))

    documents = SimpleDirectoryReader(directory_path).load_data()

    index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)

    index.directory_path = directory_path

    index.save_to_disk('index.json')

    return index
entrptaher commented 1 year ago

I see we can use https://github.com/lhenault/simpleAI to run a locally hosted OpenAI alternative, but I'm not sure if this can work with llama_index.
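
One untested idea, assuming the server exposes an OpenAI-compatible API: point the openai client at the local address via the environment before the client is imported, then use the regular LLMPredictor. The URL below is a placeholder; whether llama_index then behaves correctly depends entirely on how faithful the emulation is.

import os

# hypothetical local address -- use whatever the server actually listens on
os.environ["OPENAI_API_BASE"] = "http://localhost:8080/v1"
os.environ["OPENAI_API_KEY"] = "placeholder-not-used-locally"

# import after setting the env vars so the openai client picks them up
from langchain.chat_models import ChatOpenAI
from llama_index import LLMPredictor

llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0))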

scooter7 commented 1 year ago

Interesting and thanks for sharing that! I will ultimately need a hosting environment beyond my local machine. Luckily, I'm finding some providers that are quite a bit more affordable than some of the big names.

logan-markewich commented 1 year ago

@entrptaher pretty much any LLM can work if you implement the CustomLLM class. Inside the class you could make API calls to some other hosted service or a local model.

https://gpt-index.readthedocs.io/en/latest/how_to/customization/custom_llms.html#example-using-a-custom-llm-model-advanced
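
For example, a rough, untested sketch of a custom LLM that calls a hosted model over HTTP, here the Hugging Face Inference API for bigscience/bloom as asked above (check the API docs for the exact request and response format):

from typing import Any, List, Mapping, Optional

import requests
from langchain.llms.base import LLM

API_URL = "https://api-inference.huggingface.co/models/bigscience/bloom"


class BloomAPILLM(LLM):
    api_token: str

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        response = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {self.api_token}"},
            json={"inputs": prompt, "parameters": {"max_new_tokens": 250}},
        )
        response.raise_for_status()
        generated = response.json()[0]["generated_text"]
        # the API typically returns the prompt plus the completion; keep only the new part
        return generated[len(prompt):]

    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        return {"model": "bigscience/bloom"}

    @property
    def _llm_type(self) -> str:
        return "hf-inference-api"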

logan-markewich commented 1 year ago

OK, going to link these docs one last time. If you want to avoid OpenAI, you need to set up both an LLM and an embedding model in the service context.

To make things easier, I also recommend setting a global service context. If you use a langchain LLM, be sure to wrap it with the LangChainLLM class

from llama_index.llms import LangChainLLM
from llama_index import ServiceContext, set_global_service_context

llm = LangChainLLM(<langchain llm class>)
embed_model = <setup embed model>

service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
set_global_service_context(service_context)

https://gpt-index.readthedocs.io/en/latest/core_modules/model_modules/llms/usage_custom.html#example-using-a-huggingface-llm https://gpt-index.readthedocs.io/en/latest/core_modules/model_modules/llms/usage_custom.html#example-using-a-custom-llm-model-advanced

https://gpt-index.readthedocs.io/en/latest/core_modules/model_modules/embeddings/usage_pattern.html#embedding-model-integrations https://gpt-index.readthedocs.io/en/latest/core_modules/model_modules/embeddings/usage_pattern.html#custom-embedding-model
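
For the embed_model half, one untested option is a local sentence-transformers model via the langchain wrapper (the model name here is just a common default):

from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index import LangchainEmbedding

embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
)
# then pass it to ServiceContext.from_defaults(llm=llm, embed_model=embed_model) as above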