Closed Hisma closed 10 months ago
Thanks for the suggestion, I will try to have a go at it next week!
Hi,
Please check the initial implementation here - https://github.com/snexus/llm-search/tree/feature-webapp-load-configs
This feature is simple on the surface; however, extra care must be taken with GPU memory. Since we are not exiting the app between reloads, all objects occupying GPU memory (from the old configuration) have to be cleared before the new configuration can be loaded.
I've tested it with locally hosted models and it appears to be working. One caveat: GPU memory can spike temporarily when you load a new config, because Python's garbage collector may take a moment to get rid of the old objects.
Edit:
Usage: llmsearch interact webapp -c config_folder
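The clearing step described above can be sketched roughly like this. This is an illustration only, assuming a simple dict-like session state; the state keys and helper names are hypothetical, not the branch's actual internals:

```python
import gc

def unload_current_config(state: dict) -> None:
    """Drop all references to objects from the old configuration so the
    GPU memory they hold can actually be reclaimed."""
    # Hypothetical keys; the real app's session objects may differ.
    for key in ("llm", "embedding_model", "retriever"):
        state.pop(key, None)
    # Force a collection pass -- CPython may otherwise delay freeing,
    # which is what causes the temporary memory spike mentioned above.
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            # Release cached CUDA blocks back to the driver.
            torch.cuda.empty_cache()
    except ImportError:
        pass  # CPU-only or API-backed models: nothing to clear
```

The `empty_cache()` call matters because PyTorch keeps freed blocks in its own allocator cache, so `nvidia-smi` won't show the memory as released until the cache is returned.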
excellent, will do!
once again, I applaud your work and how fast you turn things around.
I did a good test by first loading a 13B model that fits easily in my 48GB of VRAM. No problem. Then I tried to have it load a 70B model, which uses 40GB of VRAM. Unfortunately it ran out of memory. When I exited the program and just loaded the 70B model first without loading the 13B model, it loaded into memory fine. For reference, this is what it looks like when I have the 70B model loaded -
You can see it almost fills my memory completely.
So this is a good test of the limits of this feature capability.
I do know this has to be solvable though. I use other model loaders, particularly textgen-webui.
https://github.com/oobabooga/text-generation-webui
This app is designed to let you download and switch between models at will, and it manages to switch models without any issues.
So there should be a way to pull this off, if you want to keep working at it.
Perhaps look at the textgen-webui code and see if you can gain some insights? It's based on Gradio rather than Streamlit, though, so it may be tough to compare.
Anyway, I don't think this is quite ready without some additional tweaking to get the previous model out of GPU memory before loading the new one. Hope you can try to fix it.
Also worth noting: I went from running a local LLM to running GPT-4 via OpenAI, which obviously shouldn't use any video memory at all. I watched my memory for several minutes, and it never went down. So I think the current implementation is not unloading the previous model at all.
I would expect my memory usage to be very low with the GPT-4 model configuration loaded.
Also, when I ran the same query (same question) after switching from the local LLM to GPT-4, it just repeated the exact same answer from the local LLM. It didn't re-run the query using GPT-4 to get a new response.
After closing the app, here's what my video memory looks like -
So definitely once the app is killed, the model gets unloaded from memory.
Thanks for the extensive testing, really appreciate it. I can confirm the problem - as you mentioned, it can be clearly seen when loading a local model followed by an OpenAI model: the memory should drop significantly.
Will need to investigate deeper and check what oobabooga does.
I believe the memory issue is fixed, at least for llama-cpp based models - hope it works for you.
If you repeat the experiment, please note that some amount of GPU memory (around 1-2 GB) should still be in use, by the embedding model.
Also worth noting, I went from running a local llm to running GPT-4 openai, which obviously shouldn't use any video memory at all.
Good job! Looks like you figured out how to unload the llama models. I watched the console logs and saw that it did indeed unload the model this time before loading the new one - I tested by loading a 13B model, asking a question, then loading a 70B model.
It's okay for llama-cpp models as that's all I'm using at the moment. If you can reach the point where you want to try to get this working with other local llm loaders, feel free, but I'm happy as long as llama.cpp is working :).
However, I did find a bug when I tried to go from the 70B model to OpenAI -
I accidentally clicked the "load" button twice and that happened.
This was just a silly mistake, and I realize this is largely a POC application, but could you at least make it so the program won't crash if someone accidentally presses the "load" button a second time while a model is loading? It's quite easy to double-click when you're trying to single-click, which is what happened in my case.
Perhaps it's possible to make the load button unclickable until the model is fully loaded?
I did another test and checked whether the program would crash or do something funny if I tried to load the same config twice, and it handled that well, so that's good :). If you can just take care of the crash from accidentally clicking the load button twice while a model is actively loading, I think this feature will be ready to go!
Thanks for confirming about the memory - you helped make this project better! I will fix the app crash caused by multiple load-button clicks. Regarding non-llama-cpp models - it might work out of the box, but please feel free to raise an issue if a specific model type isn't unloaded properly.
I've pushed a new version, please check if it works for you. The solution isn't as elegant as I hoped due to the tricky nature of Streamlit's event loop. If you are quick enough to press multiple times before the model starts to load, the loading process will stop; in that case, you might need to press the button again to load the model. The app shouldn't crash, though, when pressing twice or more.
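The guard behaviour described above can be reduced to a small, framework-agnostic sketch. The names here are illustrative assumptions; in the real app the flag would live in Streamlit's `st.session_state` rather than a plain dict:

```python
def guarded_load(state: dict, config_path: str, loader) -> bool:
    """Run loader(config_path) unless a load is already in progress.

    Returns True if the load ran, False if the click was ignored.
    """
    if state.get("loading"):
        return False  # a second click while loading is a no-op
    state["loading"] = True
    try:
        loader(config_path)
    finally:
        state["loading"] = False  # always reset, even if loading fails
    return True
```

Alternatively, the button itself can be rendered with `disabled=True` while the flag is set, which makes the accidental second click impossible in the first place.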
As you probably noticed, there are some limitations with the Streamlit front-end. In the future, I might rewrite it in a different framework, such as Gradio or NiceGUI. Or perhaps there is an option to offload model handling to a 3rd party, e.g. oobabooga via its API, so I can focus more on the RAG logic itself...
Yeah, I admit that offloading the handling of the models and focusing on the RAG would probably be a good pivot to make at some point. I'm using llama.cpp right now because your template supports it well. I know you support other model loaders, but it's such a fast-moving environment that it's hard to keep up. I personally try to use exllama2 (exl2 models) whenever I can, which I'm not sure you currently support. Also, with different models you have to take into account which prompt template to use (you use an Alpaca template by default, while many models use other formats) - there are just so many factors to consider when you start dealing with local models. But I also have to say that as a pure RAG implementation, I find this application very elegant and, most of all, performant.
I see the RAG space is rapidly evolving too, so a singular focus there while, as you said, offloading the model handling to something else, sounds logical.
But anyway, I don't want to sound negative. You fixed the bug from what I can see. I clicked the "load" button multiple times as a model was loading and it didn't crash. Your worst case scenario of having to press "load" again sounds like a rare edge case, and even then it doesn't cause the program to crash.
Thank you again for implementing my requested features. This really does exactly what I need it to do now. In fact, I've found that with a 70B model (synthia v1.5), I get better answers to my queries than I do using GPT-4! Quite amazing what a good local LLM enhanced with a good RAG implementation can achieve. Keep up the good work.
Completely agree, focusing mostly on RAG will be the best way forward. I will investigate what's the best way to integrate with existing solutions to handle the local models. Thank you for the testing and suggestions, glad you are getting value from the app.
Hmm... I tested this today and, strangely, as I tried to load different models it kept reloading the same model over and over, despite choosing different configs. Can you please confirm whether you see the same behavior? I think that when you fixed the bug so that hitting the load button more than once doesn't break the app, it may have stopped switching models correctly. I swear I tested this before, but I just used it today and I hadn't touched the code since.
Can you describe under which circumstances it is happening? I used it yesterday and it was reloading properly. Is it possible that you had multiple instances of the web interface open?
It happened by simply loading the app the same way I always have in the past. It's possible that when I called the issue resolved, I only quickly tested that it didn't crash when hitting the button multiple times, but never re-checked proper loading between config files.
I went as far as deleting my original version of the app, pulling the latest version from git, then rebuilding the venv and trying again. It still has the same issue where pressing the "load" button just reloads the same config file over again.
So I did another test by reverting the webapp.py file back to the version before you fixed the crashing issue.
https://github.com/snexus/llm-search/blob/05b815c7fa11781eb88c1ba2380a11a32a5c3cbf/src/llmsearch/webapp.py
When I use this version of webapp.py, I am able to properly load the files again. It does crash if I hit the load button mid-loading, as expected, but I can properly change config files again.
So I'm not sure why I'm having different results. Ultimately, we're using the same source files, so I can't see how it's only affecting me and not you.
If not too much to ask, can you try rebuilding from the latest like I did and see if it still works for you?
In the meantime, I'll see if I can troubleshoot and find the problem myself, though I may not get to it until tomorrow. But I can confirm that after clearing the app completely and rebuilding it, I still have the same problem.
Thanks again.
Thanks for the details, I will rebuild from scratch and test as you suggested.
Using a clean environment, I can confirm the bug. Please test the fix in https://github.com/snexus/llm-search/tree/fix-webapp-config-reloading - I will merge it to main if it works for you.
Yep, all good now! I spent extra time doing various tests - trying different prompts, loading different LLMs - and made sure I covered my normal use cases.
Can't find any issues. Thanks for the quick fix!
Hi @snexus, I'd like to re-open this one, because I found an issue. It's one we didn't really anticipate, but I think it would be very helpful to address.
Right now, with this feature, you can update existing embeddings. But say you're running the app for the first time and have no embeddings yet, or you want to create new embeddings with an updated config file - the app will crash on start-up if it doesn't detect existing embeddings.
Could you update the code so that it checks for existing embeddings on start-up, and if they don't exist, lets you use the "update embeddings" button to create new embeddings based on the config file?
I'm not sure how much complexity this embeddings check would add, because you'd obviously need to disallow queries until the embeddings exist.
But the whole point of adding this feature is that you can simply install the application and run it, choose a config file and go. No need to do an initial update embeddings command before you run the streamlit UI. You can instead do everything you need to do from within the UI right after installing the app.
Thanks!
Good suggestion, the current approach is indeed not optimal from the UX perspective. Will try to implement it this or next week.
Please check the implementation here - https://github.com/snexus/llm-search/tree/webui-embeddings-generation
The current implementation exposes embeddings generation only if embeddings aren't detected, in order not to clutter the UI with functionality that is used only once, at the beginning. If the user needs to recreate the index, they can simply delete the embeddings folder.
Excellent! I will test it today.
I actually found another bug with this implementation yesterday. See if you can recreate it.
What I did is I have multiple config files that point to files in different subfolders.
i.e. a 'config-hydrogen.yaml' that ingests docs in a folder called 'c:\docs\hydrogen' and another config file 'config-protein.yaml' that is set up to ingest docs in a folder called 'c:\docs\protein'.
What I noticed is that even after ingesting the documents from both folders, once I load a configuration on start-up - say 'config-hydrogen', which looks for files in 'c:\docs\hydrogen' - and then switch my configuration to 'config-protein' and send a query, the application fails to switch the source document folder to 'c:\docs\protein' and instead runs my query using the docs from the previous configuration.
I know this is not easy to explain, so I hope I didn't confuse you. In summary: changing configuration files inside the UI changes the LLM correctly, but it doesn't respect changes in the document path.
Can you look into this? Even though it was difficult to explain, I have a feeling this is probably an easy fix.
But in the meantime, I will also test the update you just made too.
Thanks for being open to my suggestions! I am now trying to deploy this for a research project which is how I'm finding all these little issues.
What I noticed is that even if I ingest all the documents from both folders, once I load a configuration on start-up, let's say 'config-hydrogen', which looks for files in 'c:\docs\hydrogen', then decide to switch my configuration to 'config-protien' and send a query, the application will fail to switch the source document folder to 'c:\docs\protein', and instead run my query using the docs from the previous configuration.
Are you pressing "Load" when selecting the second ('c:\docs\protein') configuration? There is no way around it - just selecting it without pressing the button won't load it automatically (that was the initial approach, but quite fragile to implement in Streamlit). Configurations are considered independent, with their own settings, model, etc., so they have to be reloaded completely, not just pointed at a new embedding folder.
Thanks for being open to my suggestions! I am now trying to deploy this for a research project which is how I'm finding all these little issues.
No worries at all. Thanks for finding and reporting bugs, and suggesting useful features.
Yes, of course I am actually loading the configuration. As I said, it loads fine and changes LLMs correctly, but if I query the app for something that should be in the protein folder per the configuration, it gives me a result with a negative score, and all the sources it lists point to files from the 'hydrogen' folder, which was loaded in the previous configuration.
let me see if I can quickly recreate it again...
Good job! Looks like you fixed the issue where the program would crash if no embeddings were detected. I also appreciate how clean it looks from a UI perspective.
Just have a small spelling error to fix. Very minor, but "creading" should be "creating"
I will move on to try and recreate the bug I was talking about...
I recreated the bug.
Here's what I did -
The doc_path contains a user manual for an IoT device. When I loaded it, I had to update embeddings, so I updated them so the files could be added to the db. I asked a question about the document after it loaded, and it answered just fine.
The doc_path should have changed, since config_tech.yaml points to the c:\docs\tech directory. The app loads the config_tech configuration file and acknowledges the correct doc_path, yet it used the doc_path from the previous configuration when I ran the query, giving me a negative result. Hope this is clear now.
I will say, however, that if I choose "update embeddings" after loading the config file and then ask the same question, it answers correctly. But I shouldn't need to update my embeddings every time I change my doc_path, should I? I would only expect updating embeddings to be necessary if I actually add or modify files within the doc_path.
Here's a correct result after I chose "update embeddings".
Thanks for the detailed explanation, will try to reproduce it on my side.
I will say however, if I do choose "update embeddings" after loading the config file, then ask the same question, it will answer it correctly. But I shouldn't need to update my embeddings every time I change my doc_path, should I?
You definitely shouldn't need to update embeddings after changing the doc path.
Unfortunately I couldn't reproduce the issue - I followed the exact steps with 2 document sets/configs.
I have a hypothesis though - are you able to share the configs you used? Is it possible that the embedding_path was the same for both of your tests?
```yaml
cache_folder: /home/hisma/rag/cache  ## specify a cache folder for embeddings models, huggingface and sentence transformers

embeddings:
  embeddings_path: /home/hisma/rag/embeddings  ## specify a folder where embeddings will be saved
  embedding_model:  # Optional embedding model specification, default is e5-large-v2. Swap to a smaller model if out of CUDA memory
    type: sentence_transformer  # other supported types - "huggingface" and "instruct"
    model_name: "BAAI/bge-large-en-v1.5"
  splade_config:  # Optional batch size of sparse embeddings. Reduce if getting out-of-memory errors on CUDA.
    n_batch: 5
  chunk_sizes:  # Specify one or more chunk sizes to split (querying multi-chunk results will be slower)
    - 1024
  document_settings:
    scan_extensions:  # specifies file extensions to scan recursively in doc_path
    passage_prefix: "passage: "  # Often, a specific prefix needs to be included in the source text for embedding models to work properly
    label: "tech"  # Add a label to the current collection

semantic_search:
  search_type: similarity  # Currently, only similarity is supported
  replace_output_path:  # Can specify a list of search/replace settings
    substring_search: "/mnt/c/"  ## Specifies substring to replace in the output path of the document
    substring_replace: "C:/"  ## Replaces with this string
  append_suffix:  # Specifies an additional template to append to an output path, useful for deep linking
    append_template: "#page={page}"  # For example, will append a page from the metadata of the document parser
  max_char_size: 4096
  query_prefix: "query: "  # Often queries have to be prefixed for embedding models, such as e5

persist_response_db_path: "/home/hisma/rag/responsedb/response.db"  # optional sqlite database filename. Allows saving responses offline to sqlite, for future analysis.

############ An example how to use OpenAI model, requires .env file with the OpenAI key
llm:
  type: openai
  params:
    prompt_template: |
      Context information is provided below. Given the context information and not prior knowledge, provide a detailed answer to the question.

      ### Context:
      ---------------------
      {context}
      ---------------------
      ### Question: {question}
    model_kwargs:
      temperature: 0.2
      model_name: gpt-3.5-turbo-0301
```
That's my standard config when using ChatGPT. Yes, embeddings_path is consistent across both config files.
Are you suggesting that we should be creating multiple dbs?
Yes, the embedding path should be unique per config file. That's where the vector database embeddings are generated for the specific document collection. Using the same embedding_path and clicking "update" would overwrite the embeddings, which explains the behaviour.
Perhaps worth mentioning in the readme...
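For example, giving each configuration its own embeddings folder avoids the overwrite. A sketch with illustrative paths, using field names from the config posted above (the exact schema may differ):

```yaml
# config-hydrogen.yaml
embeddings:
  embeddings_path: /home/hisma/rag/embeddings-hydrogen  # unique per config
  document_settings:
    doc_path: /mnt/c/docs/hydrogen

# config-protein.yaml
embeddings:
  embeddings_path: /home/hisma/rag/embeddings-protein   # unique per config
  document_settings:
    doc_path: /mnt/c/docs/protein
```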
Ahh... that's very good to know! Yeah, definitely worth mentioning in the readme.
I will update my configurations with unique embedding paths for each file and hopefully that works. I'll test it later and report the results, then we can close this again.
Tested - it worked as you suggested. And it makes total sense that you'd want different embedding paths for different sub-topics; I didn't want to mix different "domains" of information, for fear of making searches less accurate, especially as the number of files grows.
Please close this and you can merge this change.
Thanks for confirming.
By the way, in case you do need multiple sub-topics in the same config file (and the same embeddings path), it is already supported. You can specify multiple document paths in the config, pointing to different source folders and scan settings, as shown below. The advantage is that you can filter by label in the UI rather than loading a new configuration file, though for large collections it might work slower.
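A sketch of what such a multi-collection config might look like, reusing the field names from the config posted earlier in the thread (paths and labels are illustrative; the exact schema may differ):

```yaml
embeddings:
  embeddings_path: /home/hisma/rag/embeddings  # one shared index
  document_settings:
    - doc_path: /mnt/c/docs/hydrogen
      scan_extensions:
        - pdf
        - md
      label: "hydrogen"  # filterable in the UI
    - doc_path: /mnt/c/docs/protein
      scan_extensions:
        - pdf
        - md
      label: "protein"
```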
Wow, I did not know about that either! That would make the "document collection" tag actually useful, since you can filter by tag. I always just had one per document collection, so I didn't think the feature had any practical use. Now it makes sense.
OK I have one more request I'm hoping you can implement for this wonderful app.
The way I use the app - and I imagine others would too - is to test various models and verify their performance to see which gives the best results. To do this, I create a config.yaml file for each model I use, set up as needed. I then created a bash script that runs the webapp with a different config file as an argument, depending on which model I want to test. If you had this feature inside the UI it would be extremely handy, and I can see it as something other users would appreciate too. A summary
This is really the only "quality of life" feature this app is missing for me, since I have about 4 different config files I switch between to test the various LLMs' performance against my questions.
Hope this is possible!