Closed: @1Mark closed this issue 1 year ago.
@1Mark you just need to replace the huggingface stuff with your code to load/run alpaca
Basically, you need to code the model loading, putting text through the model, and returning the newly generated outputs.
It's going to be different for every model, but it's not too bad 😄
Thank you. Do you have any examples?
@1Mark I personally haven't used llama or alpaca. How are you loading the model and generating text right now?
Here's a very rough example with some fake functions to show what I mean:
from typing import Any, List, Mapping, Optional

from langchain.llms.base import LLM


def load_alpaca():
    # placeholder: load the model however you normally do and return a
    # callable that takes a prompt string and returns the full generated text
    ...
    return model


class CustomLLM(LLM):
    model_name = "alpaca"
    model = load_alpaca()

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        prompt_length = len(prompt)
        response_text = self.model(prompt)
        # only return newly generated tokens
        return response_text[prompt_length:]

    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        return {"name_of_model": self.model_name}

    @property
    def _llm_type(self) -> str:
        return "custom"
Hi @1Mark. When you use something like in the link above, you download the model from huggingface, but the inference (the call to the model) happens on your local machine. Your data does not go to huggingface. You could even verify this by loading a very large model: you will probably run out of VRAM, or of RAM if on CPU. For instance, you could make use of the tloen/alpaca-lora-7b implementation. If you want to use something like dalai (something running a llama.cpp instance), you need to find an implementation that exposes the model behind a server with an API. I don't know of such an implementation at the moment, but it should be very simple to build.
If someone's able to get alpaca or llama working with llamaindex lmk! would be a cool demo to show :)
tloen/alpaca-lora-7b doesn't seem to have its own inference api https://huggingface.co/tloen/alpaca-lora-7b#:~:text=Unable%20to%20determine%20this%20model%E2%80%99s%20pipeline%20type.%20Check%20the%20docs%20%20.
This issue here seems quite relevant https://github.com/tloen/alpaca-lora/issues/45
@1Mark the code in that repo could easily be adapted to work with llama_index (i.e. generate.py). You just need to move the model loading and inference code into the custom LLM class.
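For illustration, a rough sketch of what that could look like (not taken verbatim from the repo; the base checkpoint, LoRA weights, and generation settings below are assumptions):

from typing import Any, List, Mapping, Optional

from langchain.llms.base import LLM
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer

BASE_MODEL = "decapoda-research/llama-7b-hf"  # assumed base checkpoint
LORA_WEIGHTS = "tloen/alpaca-lora-7b"         # assumed LoRA weights

tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL)
model = LlamaForCausalLM.from_pretrained(BASE_MODEL, load_in_8bit=True, device_map="auto")
model = PeftModel.from_pretrained(model, LORA_WEIGHTS)
model.eval()


class AlpacaLoraLLM(LLM):
    @property
    def _llm_type(self) -> str:
        return "custom"

    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        return {"base_model": BASE_MODEL, "lora_weights": LORA_WEIGHTS}

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output_ids = model.generate(**inputs, max_new_tokens=256)
        text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        # naive slice: assumes the decoded text starts with the prompt
        return text[len(prompt):]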
Something along these lines works with pip -q install git+https://github.com/huggingface/transformers:
from transformers import LlamaTokenizer, LlamaForCausalLM, pipeline
from langchain.chains import LLMChain
from langchain.llms import HuggingFacePipeline

tokenizer = LlamaTokenizer.from_pretrained("chavinlo/alpaca-native")

base_model = LlamaForCausalLM.from_pretrained(
    "chavinlo/alpaca-native",
    load_in_8bit=True,
    device_map='auto',
)

pipe = pipeline(
    "text-generation",
    model=base_model,
    tokenizer=tokenizer,
    max_length=256,
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.2
)

local_llm = HuggingFacePipeline(pipeline=pipe)
# `prompt` is a langchain PromptTemplate defined elsewhere
llm_chain = LLMChain(prompt=prompt, llm=local_llm)
@knoopx nice! So if that's wrapped into the CustomLLM class from above and passed as an LLMPredictor LLM, the integration should work!
How well it works is up to the model though lol
Can I combine your code in this way: LLMPredictor(llm=local_llm)?
@Tavish77 not quite. You'll still need to wrap it in that class that extends the LLM class. I had an example posted further above 👍🏻 Then you instantiate that class and pass it in like you did there.
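For reference, a minimal sketch of that wrapping, reusing the `pipe` pipeline from @knoopx's snippet above (the class name and identifying params here are made up for illustration):

from typing import Any, List, Mapping, Optional

from langchain.llms.base import LLM
from llama_index import LLMPredictor


class AlpacaPipelineLLM(LLM):
    @property
    def _llm_type(self) -> str:
        return "custom"

    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        return {"name_of_model": "chavinlo/alpaca-native"}

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        # `pipe` is the transformers text-generation pipeline defined above
        generated = pipe(prompt)[0]["generated_text"]
        # the pipeline echoes the prompt, so return only the newly generated part
        return generated[len(prompt):]


llm_predictor = LLMPredictor(llm=AlpacaPipelineLLM())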
@logan-markewich I'm trying to combine the examples you posted above. What do you return as the model from the load_alpaca() method? Do you return llm_chain? Can you post the full example here?
Hey, I'm loading a peft.PeftModel.from_pretrained and following the instructions in this thread and in here, but I get multiple errors:
The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 0 tokens
caaaaling
Token indices sequence length is longer than the specified maximum sequence length for this model (1622 > 1024). Running this sequence through the model will result in indexing errors
/home/donflopez/.local/lib/python3.10/site-packages/transformers/generation/utils.py:1219: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
warnings.warn(
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [162,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [162,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [162,0,0], thread: [66,0,0] Assertion `srcIndex
....many more with the same...
Traceback (most recent call last):
....many hops...
x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Does anybody know what's going on? Thanks!
EDIT for adding more context:
If I use the model 'decapoda-research/llama-7b-hf' I get an error like:
ValueError: `.to` is not supported for `8-bit` models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct `dtype`.
The code in the file is as far as I got trying to make it work with llama_index. Does anyone know what I'm doing wrong?
The exception happens during the pipeline call:
Traceback (most recent call last):
File "/workspace/LLama-Hub/main2.py", line 68, in <module>
class CustomLLM(LLM):
File "/workspace/LLama-Hub/main2.py", line 79, in CustomLLM
pipeline = pipeline(
File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/init.py", line 979, in pipeline
return pipeline_class(model=model, framework=framework, task=task, kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 63, in init
super().init(*args, kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/base.py", line 773, in init
self.model.to(device)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1145, in to
return self._apply(convert)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
module._apply(fn)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
module._apply(fn)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
module._apply(fn)
[Previous line repeated 6 more times]
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 820, in _apply
param_applied = fn(param)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1143, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!
I got it to work here -> https://github.com/donflopez/alpaca-lora-llama-index/blob/main/generate.py
It is not perfect, but works...
@donflopez In order to get your code running, I had to install transformers 4.28.0.dev0 (so building from GitHub), but I'm still getting the following error now:
RuntimeError: Failed to import transformers.pipelines because of the following error (look up to see its traceback):
cannot import name 'BertTokenizerFast' from 'transformers.models.bert'
Did you encounter this at all? (and how did you fix it?)
@donflopez On what hardware did you run the model like this? My RTX 4090 hits its limit, sadly. @devinSpitz Did you get that sorted out? I'm getting the same issue with a modified version myself, any luck so far?
@h1f0x I could get @donflopez's repo to work, but I always got completely wrong answers or sometimes nothing (roughly the same as I now get with this version xD). It got me further, but still with no usable response.
The model that should have "read" the documents (the LLaMA document and the PDF from the repo) does not give any useful answer anymore.
This was with base_model = circulus/alpaca-7b and the LoRA weights circulus/alpaca-lora-7b; I tried other models and combinations but did not get any better results :(
Question: What do you think of Facebook's LlaMa? Before: I think Facebook’s LLAMA (Learn, Launch and Maintain Audience) initiative is an excellent program which can help businesses of all sizes to reach their target audiences more effectively. It provides valuable resources such as training materials, tools and best practices for launching, maintaining and engaging with an audience on social media platforms. After: Output should include references to sources where applicable.
This shows that something does work, or at least doesn't break? Question: What is the capital of England? Before: The capital of England is London. After: The capital of England is London.
Question: What are alpacas? and how are they different from llamas? Before: Alpacas are small, domesticated animals related to camels and native to South America. They are typically smaller than llamas and have finer fleeces which make them ideal for fiber production. Alpacas are also more docile and easier to handle than llamas. After: Output should include references to sources used to create the output.
Code: https://gist.github.com/devinSpitz/73cd7037b82d7acbe70ddf4d1c61ba4a
@donflopez On what hardware did you run the model like this? My RTX 4090 hits its limit, sadly.
@h1f0x I'm running on a 4090 too; yes, multiple executions fail, and you also cannot go beyond 1 beam.
I'm trying to figure out why this happens. When querying the raw model, this does not happen, it probably has something to do with llama_index + the pipeline setup.
@devinSpitz, I also have weird results. Please note that in my code I have a `.` as stop sequence. I'm still trying to find a stop sequence that works properly for llama_index. For me, the main issue is that the model tries to repeat the llama_index prompt as a pattern instead of stopping at the right place.
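One possible workaround (untested, just a sketch): instead of relying on the model to stop on its own, trim the generated text at the first stop sequence inside `_call`. The example stop strings in the comment are assumptions:

from typing import List, Optional


def trim_at_stop(text: str, stop: Optional[List[str]] = None) -> str:
    # cut the generated text at the first occurrence of any stop sequence
    if not stop:
        return text
    cut = len(text)
    for s in stop:
        idx = text.find(s)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


# e.g. inside CustomLLM._call, after slicing off the prompt:
# return trim_at_stop(new_text, stop or ["\n\n", "###"])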
I'm getting this with `-` as stop sequence: a bunch of nonsense after the first dot, the VRAM goes up to 23.5GB, and after that it runs OOM.
Question: How many people lives in Martos?
Answer: According to data provided by INE, there are currently approximately 24 thousand two hundred seventeen residents living within the municipal boundaries of Martos. # Lijst van voetbalinterlands Oman - Saudi Arabië
Deze lijst van voetbalinterlands geeft een overzicht van alle officiële interlands tussen het nationale elftal van Oman en dat van Saudi-
@devinSpitz I got this output tweaking your script to make it work with the index; llama still doesn't know when to stop. Using `-` as stop sequence. -> https://gist.github.com/donflopez/535e5ecb85b79233c7cf74fd977eb87f
Improved it, here is the latest output: https://gist.github.com/donflopez/39bb9bc34cc00467679f10bab3e4a734
@h1f0x Looks like the OOM issue doesn't happen in the script, so it could be gradio copying the resources when making a request? I have no idea how gradio works tbh, but if I move things out of gradio, there's no OOM.
I have been trying to get this to work as well, but keep running into issues with sentencepiece:
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg) TypeError: not a string
Anyone else having this or any suggestions? Thanks!
Just like inference with OpenAI APIs doesn't happen locally, is there any way to use HTTP requests to send the prompts to a server exposing any LLM like Alpaca via HTTP? I feel like it would be easier if we could decouple the LLM.
@Tavish77 not quite. You'll still need to wrap it in that class that extends the LLM class. I had an example posted further above 👍🏻 Then you instantiate that class and pass it in like you did there.
Thank you, I have solved my problem.
@donflopez Many thanks for your feedback! I got it working with CPU only later that evening, but I also needed to change the page file settings in Windows itself to get it working. I hope I can try some new settings soon. Gradio is a mystery for me as well :D At least so far.. I'm looking into that deeper as well. If I find anything, I'll let you know!
@devinSpitz At least you get some output; I was not able to produce that, haha, but I guess that's because of some strange behavior when running on CPU. :)
@devinSpitz I also encountered this issue on the 4090, but it runs normally on other devices. Have you solved it yet?
I have already resolved it.
If you have an issue with the 4090, try installing the new driver 525.105.17: https://www.nvidia.com/Download/driverResults.aspx/202351/en-us/
@Tavish77 @masknetgoal634 Thanks, both of you. Yes, I'm using a 4090, so I will update the driver and try it again :D
@donflopez Thanks as well; you are right about the stop sequence, `-` is a little bit better but still not good :(
@h1f0x Yes, that's right xD But I still want to get it working :D
@masknetgoal634 I'm already on a newer driver xD
@Tavish77 how did you solve it?
I made this work in a Colab notebook with LlamaIndex and the GPT4All model. But you can only load small text bits with LlamaIndex; if you load more text, the Colab (non-pro) crashes. Sure... sorry, my quota on Colab is always at max, so I paste this.
I copied this from my local Jupyter, so be aware that some headings are not code, like "Load GPT4ALL-LORA Model".
Hope this helps. I'm now trying to exchange the GPT4ALL-LoRA with a 4-bit version, but I am somehow stuck.
I only have a 6GB GPU.
@devinSpitz
@Tavish77 how did you solve it?
I switched to a cloud GPU server instead.
@masknetgoal634 I'm already on a newer driver xD
As far as I know, the only fix for the 4090 is in 525.105.17.
Anybody make progress on this? Is it possible to use the CPU-optimized (alpaca.cpp, etc.) versions of LLaMA for creating embeddings, or is a cloud service the only option here?
@ddb21 you should be able to use llama.cpp (or any LLM that langchain has implemented) by wrapping the LLM with the LLMPredictor class.
https://github.com/hwchase17/langchain/tree/master/langchain/llms
And Here's the docs for using any custom model: https://gpt-index.readthedocs.io/en/latest/how_to/customization/custom_llms.html#example-using-a-custom-llm-model
And here's a ton of examples implementing random llms
https://github.com/autratec/GPT4ALL_Llamaindex
https://github.com/autratec/dolly2.0_3b_HFembedding_Llamaindex
https://github.com/autratec/koala_hfembedding_llamaindex
Just need to make sure you set up the prompt helper/service context appropriately for the input size of each model
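For example, a minimal sketch of sizing the prompt helper/service context for a 2048-token LLaMA-style model (the numbers are illustrative, and CustomLLM stands in for whichever wrapper class from earlier in the thread you use):

from llama_index import LLMPredictor, PromptHelper, ServiceContext

# sizes are illustrative for a 2048-token llama-style model
max_input_size = 2048
num_output = 256
max_chunk_overlap = 20

prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)
llm_predictor = LLMPredictor(llm=CustomLLM())  # the custom LLM wrapper from above
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    prompt_helper=prompt_helper,
)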
@logan-markewich I tried out your approach with llama_index and langchain, with a custom class that I built for OpenAI's GPT-3.5 model. But it seems that llama_index is not recognizing my CustomLLM as one of langchain's models; it is defaulting to its own GPT-3.5 model. What am I doing wrong here? Attaching the code and the logs. Thanks in advance.
from openAIComplete import OpenAI
from langchain.llms.base import LLM
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

OPENAI_API_KEY = 'API KEY'
yo = OpenAI(api_key=OPENAI_API_KEY, model='gpt-3.5-turbo')


class CustomLLM(LLM):
    model_name = 'OpenAI GPT-3'

    @property
    def _llm_type(self) -> str:
        return "custom"

    def _call(self, prompt: str, stop: str = None):
        if stop is not None:
            raise ValueError("stop kwargs are not permitted.")
        print(prompt)
        res = yo.run(prompt)
        return res

    @property
    def _identifying_params(self):
        return {"name_of_model": self.model_name}


yo2 = CustomLLM()

from llama_index import LLMPredictor, ServiceContext, GPTListIndex, GPTSimpleVectorIndex, SimpleDirectoryReader, PromptHelper, LangchainEmbedding


def chatbot(directory_path, input_text):
    max_input_size = 4096
    num_outputs = 512
    max_chunk_overlap = 20
    chunk_size_limit = 600
    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)
    llm_predictor = LLMPredictor(llm=CustomLLM())
    service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)  # , embed_model=embed_model
    documents = SimpleDirectoryReader(directory_path).load_data()
    index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)
    index.save_to_disk('index.json')
    index = GPTSimpleVectorIndex.load_from_disk('index.json')
    response = index.query(input_text, response_mode="compact", service_context=service_context)
    return response.response


print(chatbot('models/', 'Hi, what is this document about?'))
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 2721 tokens
Traceback (most recent call last):
File "/workspaces/docify/models/test.py", line 55, in <module>
print(chatbot('models/','Hi, what is this document about?'))
File "/workspaces/docify/models/test.py", line 49, in chatbot
index = GPTSimpleVectorIndex.load_from_disk('index.json')
File "/home/codespace/.python/current/lib/python3.10/site-packages/llama_index/indices/base.py", line 369, in load_from_disk
return cls.load_from_string(file_contents, **kwargs)
File "/home/codespace/.python/current/lib/python3.10/site-packages/llama_index/indices/base.py", line 345, in load_from_string
return cls.load_from_dict(result_dict, **kwargs)
File "/home/codespace/.python/current/lib/python3.10/site-packages/llama_index/indices/vector_store/base.py", line 263, in load_from_dict
return super().load_from_dict(result_dict, vector_store=vector_store, **kwargs)
File "/home/codespace/.python/current/lib/python3.10/site-packages/llama_index/indices/base.py", line 322, in load_from_dict
return cls(index_struct=index_struct, docstore=docstore, **kwargs)
File "/home/codespace/.python/current/lib/python3.10/site-packages/llama_index/indices/vector_store/vector_indices.py", line 69, in __init__
super().__init__(
File "/home/codespace/.python/current/lib/python3.10/site-packages/llama_index/indices/vector_store/base.py", line 54, in __init__
super().__init__(
File "/home/codespace/.python/current/lib/python3.10/site-packages/llama_index/indices/base.py", line 69, in __init__
self._service_context = service_context or ServiceContext.from_defaults()
File "/home/codespace/.python/current/lib/python3.10/site-packages/llama_index/indices/service_context.py", line 69, in from_defaults
llm_predictor = llm_predictor or LLMPredictor()
File "/home/codespace/.python/current/lib/python3.10/site-packages/llama_index/llm_predictor/base.py", line 164, in __init__
self._llm = llm or OpenAI(temperature=0, model_name="text-davinci-003")
File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for OpenAI
__root__
Did not find openai_api_key, please add an environment variable `OPENAI_API_KEY` which contains it, or pass `openai_api_key` as a named parameter. (type=value_error)
Note: Sorry about the clumsy code, I'm testing things out
To set more context, this is openAIComplete.py:
from baseModel import Model
import openai
import tiktoken


class OpenAI(Model):
    def __init__(self,
                 api_key: str,
                 model: str,
                 api_wait: int = 60,
                 api_retry: int = 6,
                 temperature: float = .7):
        super().__init__(api_key, model, api_wait, api_retry)
        self.temperature = temperature
        self._verify_model()
        self.set_key(api_key)
        self.encoder = tiktoken.encoding_for_model(self.model)
        self.max_tokens = self.default_max_tokens(self.model)

    def supported_models(self):
        return {
            "text-davinci-003": "text-davinci-003 can do any language task with better quality, longer output, and consistent instruction-following than the curie, babbage, or ada models. Also supports inserting completions within text.",
            "text-curie-001": "text-curie-001 is very capable, faster and lower cost than Davinci.",
            "text-babbage-001": "text-babbage-001 is capable of straightforward tasks, very fast, and lower cost.",
            "text-ada-001": "text-ada-001 is capable of very simple tasks, usually the fastest model in the GPT-3 series, and lowest cost.",
            "gpt-4": "More capable than any GPT-3.5 model, able to do more complex tasks, and optimized for chat. Will be updated with our latest model iteration.",
            "gpt-3.5-turbo": " Most capable GPT-3.5 model and optimized for chat at 1/10th the cost of text-davinci-003. Will be updated with our latest model iteration",
        }

    def _verify_model(self):
        """
        Raises a ValueError if the current OpenAI model is not supported.
        """
        if self.model not in self.supported_models():
            raise ValueError(f"Unsupported model: {self.model}")

    def set_key(self, api_key: str):
        self._openai = openai
        self._openai.api_key = api_key

    def get_description(self):
        return self.supported_models()[self.model]

    def get_endpoint(self):
        model = openai.Model.retrieve(self.model)
        return model["id"]

    def default_max_tokens(self, model_name: str):
        token_dict = {
            "text-davinci-003": 4000,
            "text-curie-001": 2048,
            "text-babbage-001": 2048,
            "text-ada-001": 2048,
            "gpt-4": 8192,
            "gpt-3.5-turbo": 4096,
        }
        return token_dict[model_name]

    def calculate_max_tokens(self, prompt: str) -> int:
        prompt = str(prompt)
        prompt_tokens = len(self.encoder.encode(prompt))
        max_tokens = self.default_max_tokens(self.model) - prompt_tokens
        print(prompt_tokens, max_tokens)
        return max_tokens

    def run(self, prompt: str):
        if self.model in ["gpt-3.5-turbo"]:
            prompt_template = [
                {"role": "system", "content": "you are a helpful assistant."}
            ]
            prompt_template.append({"role": "user", "content": prompt})
            max_tokens = self.calculate_max_tokens(prompt_template)
            response = self._openai.ChatCompletion.create(
                model=self.model,
                messages=prompt_template,
                max_tokens=max_tokens,
                temperature=self.temperature,
            )
            return response["choices"][0]["message"]["content"].strip(" \n")
        else:
            max_tokens = self.calculate_max_tokens(prompt)
            response = self._openai.Completion.create(
                model=self.model,
                prompt=prompt,
                max_tokens=max_tokens,
                temperature=self.temperature,
            )
            return response["choices"][0]["text"].strip("\n")
Found the issue with mine. Seems while instantiating another instance of GPTSimpleVectorIndex, I wasn't passing the service_context parameter.
index.save_to_disk('index.json')
index = GPTSimpleVectorIndex.load_from_disk('index.json',service_context=service_context)
Hi, I've developed a streamlit app that uses llama-index with openai. I'd like not to pay for openai and be able to leverage an open source llm that has no commercial restrictions, no token limits, and a hosted api. I've been looking at bloom - https://huggingface.co/bigscience/bloom - but don't know how to call the huggingface model in a similar manner to what I have in my current code.
Does anyone know how I would adapt that code to work with Bloom from HuggingFace?
Thanks!
import logging
import streamlit as st
from gpt_index import SimpleDirectoryReader, GPTListIndex, GPTSimpleVectorIndex, LLMPredictor, PromptHelper
from langchain.chat_models import ChatOpenAI
import sys
from datetime import datetime
import os
from github import Github

if "OPENAI_API_KEY" not in st.secrets:
    st.error("Please set the OPENAI_API_KEY secret on the Streamlit dashboard.")
    sys.exit(1)

openai_api_key = st.secrets["OPENAI_API_KEY"]
logging.info(f"OPENAI_API_KEY: {openai_api_key}")

g = Github(st.secrets["GITHUB_TOKEN"])
repo = g.get_repo("scooter7/CXBot")


def construct_index(directory_path):
    max_input_size = 4096
    num_outputs = 512
    max_chunk_overlap = 20
    chunk_size_limit = 600
    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)
    llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0.7, model_name="gpt-3.5-turbo", max_tokens=num_outputs))
    documents = SimpleDirectoryReader(directory_path).load_data()
    index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)
    index.directory_path = directory_path
    index.save_to_disk('index.json')
    return index
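One way this might be adapted (an untested sketch): keep the rest of construct_index as-is and swap the ChatOpenAI predictor for a langchain HuggingFaceHub LLM pointing at bigscience/bloom, which keeps inference on a hosted API. The secret name and model_kwargs below are assumptions:

import streamlit as st
from langchain.llms import HuggingFaceHub
from gpt_index import LLMPredictor

# assumes a HUGGINGFACEHUB_API_TOKEN entry in Streamlit secrets (name is an assumption)
bloom_llm = HuggingFaceHub(
    repo_id="bigscience/bloom",
    huggingfacehub_api_token=st.secrets["HUGGINGFACEHUB_API_TOKEN"],
    model_kwargs={"temperature": 0.7, "max_new_tokens": 512},
)

# inside construct_index, replace the ChatOpenAI line with:
llm_predictor = LLMPredictor(llm=bloom_llm)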
I see we can use https://github.com/lhenault/simpleAI to run a locally hosted openai alternative, but not sure if this can work with llama_index.
Interesting and thanks for sharing that! I will ultimately need a hosting environment beyond my local machine. Luckily, I'm finding some providers that are quite a bit more affordable than some of the big names.
@entrptaher pretty much any LLM can work if you implement the CustomLLM class. Inside the class you could make API calls to some other hosted service, or run a local model.
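For instance, a rough sketch of a CustomLLM that just forwards the prompt to a hosted endpoint over HTTP; the URL and JSON payload/response shape are hypothetical and depend entirely on the server you expose:

from typing import Any, List, Mapping, Optional

import requests
from langchain.llms.base import LLM

API_URL = "http://localhost:8000/generate"  # hypothetical endpoint


class RemoteLLM(LLM):
    @property
    def _llm_type(self) -> str:
        return "custom"

    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        return {"endpoint": API_URL}

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        # payload and response keys depend on whatever server you run
        resp = requests.post(API_URL, json={"prompt": prompt, "stop": stop}, timeout=600)
        resp.raise_for_status()
        return resp.json()["text"]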
Ok, going to link these docs one last time. If you want to avoid openai, you need to setup both an LLM and an embedding model in the service context.
To make things easier, I also recommend setting a global service context. If you use a langchain LLM, be sure to wrap it with the LangChainLLM class
from llama_index.llms import LangChainLLM
from llama_index import ServiceContext, set_global_service_context
llm = LangChainLLM(<langchain llm class>)
embed_model = <setup embed model>
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
set_global_service_context(service_context)
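For the embedding side, one concrete option is a local HuggingFace embedding model wrapped for llama_index (a sketch; the sentence-transformers model name is just an example):

from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index import LangchainEmbedding

# "sentence-transformers/all-mpnet-base-v2" is just an example model name
embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
)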
https://gpt-index.readthedocs.io/en/latest/core_modules/model_modules/llms/usage_custom.html#example-using-a-huggingface-llm https://gpt-index.readthedocs.io/en/latest/core_modules/model_modules/llms/usage_custom.html#example-using-a-custom-llm-model-advanced
https://gpt-index.readthedocs.io/en/latest/core_modules/model_modules/embeddings/usage_pattern.html#embedding-model-integrations https://gpt-index.readthedocs.io/en/latest/core_modules/model_modules/embeddings/usage_pattern.html#custom-embedding-model
I want to use llamaindex but I don't want any data of mine to be transferred to any servers. I want it all to happen locally or within my own EC2 instance. I have seen https://github.com/jerryjliu/llama_index/blob/046183303da4161ee027026becf25fb48b67a3d2/docs/how_to/custom_llms.md#example-using-a-custom-llm-model but it calls hugging face.
My plan was to use https://github.com/cocktailpeanut/dalai with the alpaca model then somehow use llamaindex to input my dataset. Any examples or pointers for this?