snexus / llm-search

Querying local documents, powered by LLM
MIT License

feature request - load config files from inside of UI #58

Closed - Hisma closed this issue 10 months ago

Hisma commented 11 months ago

OK I have one more request I'm hoping you can implement for this wonderful app.

The way I use the app, and I imagine others do too, is to test various models and compare their performance to see which gives the best results. To do this, I create a config.yaml file for each model, set up as needed. I then created a bash script that runs the webapp with the different config files as an argument, depending on which model I want to test. Having this feature inside the UI would be extremely handy, and I can see it being something other users would appreciate too.

This is really the only "quality of life" feature this app is missing for me, since I have about 4 different config files I switch between to test the various LLMs' performance against my questions.

Hope this is possible!

snexus commented 11 months ago

Thanks for the suggestion, I will try to have a go at it next week!

snexus commented 11 months ago

Hi,

Please check the initial implementation here - https://github.com/snexus/llm-search/tree/feature-webapp-load-configs

This feature is simple on the surface; however, extra care needs to be taken with GPU memory. Since we are not exiting the app between reloads, all objects taking up GPU memory (from the old configuration) have to be cleared before the new configuration can be loaded.

I've tested it with locally hosted models and it appears to be working. One caveat: GPU memory can spike temporarily when you load a new config, because Python's garbage collector might take a bit of time to get rid of the old objects.

Edit: Usage: `llmsearch interact webapp -c config_folder`

Hisma commented 11 months ago

excellent, will do!

Hisma commented 11 months ago

Once again, I applaud your work and how fast you turn things around.

I did a good test by first loading a 13B model that fits easily in my 48GB of VRAM. No problem. Then I tried to have it load a 70B model, which uses 40GB of VRAM. Unfortunately it ran out of memory. When I exited the program and just loaded the 70B model first without loading the 13B model, it loaded into memory fine. For reference, this is what it looks like when I have the 70B model loaded - image

You can see it almost fills my memory completely, so this is a good test of the limits of this feature. I do know this has to be solvable, though. I use other model loaders, particularly textgen-webui (https://github.com/oobabooga/text-generation-webui). That app is designed to let you download and switch between models at will, and it manages to switch models without any issues. So there should be a way to pull this off, if you want to keep working at it.
Perhaps look at the textgen-webui code and see if you can gain some insights? It's based on Gradio rather than Streamlit, though, so it's probably hard to compare. Anyway, I don't think this is quite ready without some additional tweaking to get the previous model out of GPU memory before loading the new one. Hope you can try to fix it.

Hisma commented 11 months ago

Also worth noting: I went from running a local LLM to running GPT-4 via OpenAI, which obviously shouldn't use any video memory at all. I watched my memory for several minutes, and it never went down. So I think the way you currently have this implemented, it is not unloading the previous model at all. image

I would expect my memory usage to be very low with the GPT-4 model configuration loaded.

Hisma commented 11 months ago

Also, when I ran the same query (same question) after switching from the local LLM to GPT-4, it just repeated the exact same answer from the local LLM. It didn't re-run the query using GPT-4 to get a new response.

Hisma commented 11 months ago

After closing the app, here's what my video memory looks like - image

So definitely once the app is killed, the model gets unloaded from memory.

snexus commented 11 months ago

Thanks for the extensive testing, I really appreciate it. I can confirm the problem - as you mentioned, it's clearly visible when loading an offline model followed by an OpenAI model, where the memory should drop significantly but doesn't.

Will need to investigate deeper and check what oobabooga does.

snexus commented 11 months ago

I believe the memory issue is fixed, at least for llama-cpp based models - hope it works for you.

If you repeat the experiment below, please note that there should still be some GPU memory in use (around 1-2 GB), taken by the embedding model.

> Also worth noting: I went from running a local LLM to running GPT-4 via OpenAI, which obviously shouldn't use any video memory at all.

Hisma commented 11 months ago

Good job! Looks like you figured out how to unload the llama models. I ran a test by loading a 13B model, asking a question, then loading a 70B model. Watching the console logs, I saw that it did indeed unload the model this time before loading the new one - image

llama-cpp models are all I'm using at the moment, so that works for me. If you get to the point where you want to make this work with other local LLM loaders, feel free, but I'm happy as long as llama.cpp works :).

However, I did find a bug when I tried to go from the 70B model to OpenAI - image

I accidentally clicked the "load" button twice and that happened.
This was just a silly mistake, and I realize this is largely a POC application, but could you at least make it so the program won't crash if someone accidentally presses the "load" button a second time while the model is loading? It's quite easy to double-click when you're trying to single-click, which is what happened in my case. Perhaps it's possible to make the load button unclickable until the model is fully loaded?

I did another test & checked if the program would crash or do something funny if I tried to load the same config twice, and it handled that well, so that's good :). If you can just take care of the crash from accidentally clicking the load button twice while a model is actively loading, I think this feature will be ready to go!

snexus commented 11 months ago

Thanks for confirming about the memory, you helped make this project better! I will fix the app crash caused by multiple load button clicks. Regarding non-llama-cpp models - they might work out of the box, but please feel free to raise an issue if a specific model type isn't unloaded properly.

snexus commented 11 months ago

I've pushed a new version, please check if it works for you. The solution isn't as elegant as I hoped, due to the tricky nature of Streamlit's event loop. If you are quick enough to press the button multiple times before the model starts to load, the loading process will stop; in that case, you might need to press the button again to load the model. The app shouldn't crash, though, when pressing twice or more.

As you probably noticed, there are some limitations with the Streamlit front-end. In the future, I might rewrite it in a different framework, such as Gradio or NiceGUI. Or perhaps there is an option to offload model handling to a third party, e.g. oobabooga via its API, so I can focus more on the RAG logic itself...

Hisma commented 11 months ago

Yeah, I do admit that offloading the handling of the models and just focusing on the RAG would probably be a good pivot to make at some point. I'm using llama.cpp right now because your template supports it well. I think you support other model loaders, but it's such a fast-moving environment that it's hard to keep up. I personally try to use ExLlamaV2 (exl2 models) whenever I can, which I'm not sure you currently support. Also, with different models you have to take into account which prompt template to use (you use an Alpaca template by default, while many models use other formats); there are just so many factors to consider when you start dealing with local models. But I also have to say that as a pure RAG implementation, I find this application very elegant and, most of all, performant.
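To illustrate what I mean, the part that changes between my per-model config files is mostly the llm section, and in particular the prompt template. Here is a rough sketch - the local-model type name, model path and parameter names are just my guesses from memory, so check the project's sample config templates for the actual schema:

```yaml
# Rough sketch of an llm section for a local model that does NOT use the Alpaca format.
# The type name, model path and parameter names here are illustrative, not the exact schema.
llm:
  type: llamacpp
  params:
    model_path: /home/hisma/models/some-model.Q4_K_M.gguf   # hypothetical path
    prompt_template: |
      SYSTEM: Answer the question using only the provided context.
      USER: ### Context:
      ---------------------
      {context}
      ---------------------
      ### Question: {question}
      ASSISTANT:
    model_kwargs:
      temperature: 0.2
      n_ctx: 4096
      n_gpu_layers: -1
```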

I see the RAG space is rapidly evolving too, so a singular focus there while, as you said, offloading the model handling to something else, sounds logical.

But anyway, I don't want to sound negative. You fixed the bug from what I can see. I clicked the "load" button multiple times as a model was loading and it didn't crash. Your worst case scenario of having to press "load" again sounds like a rare edge case, and even then it doesn't cause the program to crash.

Thank you again for implementing my requested features. This really does exactly what I need it to do now. In fact, I've found that with a 70B model (synthia v1.5), I get better answers to my queries than I do using GPT-4! Quite amazing what a good local LLM enhanced with a good RAG implementation can achieve. Keep up the good work.

snexus commented 11 months ago

Completely agree - focusing mostly on RAG will be the best way forward. I will investigate the best way to integrate with existing solutions for handling local models. Thank you for the testing and suggestions; glad you are getting value from the app.

Hisma commented 11 months ago

Hmmm... I tested this today and, strangely, as I tried to load different models it kept reloading the same model over and over, despite my choosing different configs. Can you please confirm whether you see the same behavior? I suspect that when you fixed the bug where hitting the load button more than once would break the app, it may have stopped the app from switching models correctly. I swear I tested this before, but I just used it today and I hadn't touched the code since.

snexus commented 11 months ago

Can you describe the circumstances under which it happens? I used it yesterday and it was reloading properly. Is it possible that you had multiple instances of the web interface open?

Hisma commented 11 months ago

It happened when simply loading the app the same way I always have in the past. It's possible that when I called the issue resolved, I only quickly checked that it didn't crash when hitting the button multiple times, but never re-checked that it loads different config files properly.

I went as far as deleting my original version of the app, pulling the latest version from git, rebuilding the venv, and trying again. It still has the same issue: pressing the "load" button just reloads the same config file over again.

So I did another test by reverting the webapp.py file back to the version before you fixed the crashing issue. https://github.com/snexus/llm-search/blob/05b815c7fa11781eb88c1ba2380a11a32a5c3cbf/src/llmsearch/webapp.py

When I use this version of webapp.py, I am able to properly load the files again. It does crash if I hit the load button mid-loading, as expected, but I can properly change config files again.

So I'm not sure why I'm getting different results. Ultimately, we're using the same source files, so I can't see how it would affect only me and not you.

If it's not too much to ask, can you try rebuilding from the latest like I did and see if it still works for you?

In the meantime, I'll see if I can troubleshoot and find the problem myself, though I may not get to it until tomorrow. But I can confirm that even after clearing the app completely and rebuilding it, I still have the same problem.

Thanks again.

snexus commented 11 months ago

Thanks for the details, I will rebuild from scratch and test as you suggested.

snexus commented 11 months ago

Using a clean environment, I can confirm the bug. Please test the fix in https://github.com/snexus/llm-search/tree/fix-webapp-config-reloading - I will merge it to main if it works for you.

Hisma commented 11 months ago

Yep, all good now! I spent extra time doing various tests - trying different prompts, loading different LLMs - and made sure I covered my normal use cases.
Can't find any issues. Thanks for the quick fix!

Hisma commented 10 months ago

Hi @snexus, I'd like to re-open this one, because I found an issue. It's one we didn't really anticipate, but I think addressing it would be very helpful. Right now, with this feature, you can update existing embeddings. But say you're running the app for the first time and have no embeddings yet, or you want to create new embeddings with an updated config file - then the app will crash on start-up because it doesn't detect existing embeddings. Could you update the code so that it checks for existing embeddings on start-up, and if they don't exist, lets you use the "update embeddings" button to create them based on the config file?
I'm not sure how much complexity this embeddings check would add, because you'd obviously need to disallow queries until the embeddings exist. But the whole point of adding this feature is that you can simply install the application, run it, choose a config file and go. There would be no need to run an initial update-embeddings command before starting the Streamlit UI; you could instead do everything you need from within the UI right after installing the app.

Thanks!

snexus commented 10 months ago

Good suggestion - the current approach is indeed not optimal from a UX perspective. I will try to implement it this week or next.

snexus commented 10 months ago

Please check the implementation here - https://github.com/snexus/llm-search/tree/webui-embeddings-generation

The current implementation exposes embeddings generation only if no embeddings are detected, in order not to clutter the UI with functionality that is used only once, at the beginning. If the user needs to recreate the index, they can simply delete the embeddings folder.

Hisma commented 10 months ago

Excellent! I will test it today.

I actually found another bug with this implementation yesterday. See if you can recreate it.
What I did: I have multiple config files that point to files in different subfolders,
i.e. a 'config-hydrogen.yaml' that ingests docs from a folder called 'c:\docs\hydrogen', and another config file, 'config-protein.yaml', set up to ingest docs from a folder called 'c:\docs\protein'.
What I noticed is that even if I ingest all the documents from both folders, once I load a configuration on start-up, say 'config-hydrogen' (which looks for files in 'c:\docs\hydrogen'), then switch my configuration to 'config-protein' and send a query, the application fails to switch the source document folder to 'c:\docs\protein' and instead runs my query using the docs from the previous configuration.

I know this is not easy to explain, so I hope I didn't confuse you. In summary: changing configuration files inside the UI does change the LLM correctly, but it doesn't respect changes to the document path (see the sketch below for how my two configs differ).
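A minimal sketch of the two configs, just for reference - the exact nesting and field names are from my memory of the config template, so treat them as illustrative:

```yaml
# config-hydrogen.yaml (sketch - field names/nesting illustrative)
embeddings:
  embeddings_path: /home/hisma/rag/embeddings   # the folder I use in all my configs
  document_settings:
    - doc_path: c:\docs\hydrogen                # "hydrogen" source documents

# config-protein.yaml (sketch)
embeddings:
  embeddings_path: /home/hisma/rag/embeddings   # same folder again
  document_settings:
    - doc_path: c:\docs\protein                 # "protein" source documents
```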

Can you look into this? Even though it was difficult to explain, I have a feeling this is probably an easy fix.

But in the meantime, I will also test the update you just made too.

Thanks for being open to my suggestions! I am now trying to deploy this for a research project which is how I'm finding all these little issues.

snexus commented 10 months ago

> What I noticed is that even if I ingest all the documents from both folders, once I load a configuration on start-up, say 'config-hydrogen' (which looks for files in 'c:\docs\hydrogen'), then switch my configuration to 'config-protein' and send a query, the application fails to switch the source document folder to 'c:\docs\protein' and instead runs my query using the docs from the previous configuration.

Are you pressing "Load" when selecting the second ('c:\docs\protein') configuration? There is no way around it - just selecting it without pressing the button won't load it automatically (that was the initial approach, but it proved quite fragile to implement in Streamlit). Configurations are considered independent, with their own settings, model, etc., so they have to be reloaded completely, not just pointed at a new embedding folder.

> Thanks for being open to my suggestions! I am now trying to deploy this for a research project which is how I'm finding all these little issues.

No worries at all. Thanks for finding and reporting bugs, and suggesting useful features.

Hisma commented 10 months ago

Yes, of course - I am actually loading the configuration. As I said, it loads fine and changes LLMs correctly, but if I query the app for something that should be in the protein folder, per the configuration, it gives me a result with a negative score, and all the sources it lists point to files from the 'hydrogen' folder, which was loaded by the previous configuration.

Hisma commented 10 months ago

Let me see if I can quickly recreate it...

Hisma commented 10 months ago

Good job! Looks like you did fix the issue where the program would crash if no embeddings were detected. I also appreciate how clean it looks from a UI perspective. image

There's just a small spelling error to fix. Very minor, but "creading" should be "creating". image

I will move on to try and recreate the bug I was talking about...

Hisma commented 10 months ago

I recreated the bug.

Here's what I did -

  1. Loaded the config-tech.yaml file. Its doc_path contains a user manual for an IoT device. When I loaded it, I had to update embeddings, so I did that to get the files added to the db. I asked a question about the document after it loaded, and it answered just fine. image
  2. I then loaded config-protein.yaml. I had to update the embeddings for that as well, since those files hadn't been added yet. I then asked a question about protein, and it gave me a good answer.
  3. I then re-loaded config-tech.yaml. I did NOT update the embeddings again, as they should already exist from when I added them in step 1. BUT the doc_path should have changed, since config-tech.yaml points to the c:\docs\tech directory.
  4. I asked a question about the IoT manual and got a negative result. You can see that even though the application shows it properly loaded the config-tech configuration file and acknowledges the correct doc_path, it used the doc_path from the previous configuration when I ran the query, giving me a negative result. image

Hope this is clear now.

I will say, however, that if I do choose "update embeddings" after loading the config file and then ask the same question, it answers correctly. But I shouldn't need to update my embeddings every time I change my doc_path, should I? I would only expect updating embeddings to be necessary if I actually add or modify files within the doc_path. Here's a correct result after I chose update embeddings. image

snexus commented 10 months ago

Thanks for the detailed explanation, will try to reproduce it on my side.

> I will say, however, that if I do choose "update embeddings" after loading the config file and then ask the same question, it answers correctly. But I shouldn't need to update my embeddings every time I change my doc_path, should I?

You definitely shouldn't need to update embeddings after changing the doc path.

snexus commented 10 months ago

Unfortunately, I couldn't reproduce the issue - I followed the exact steps with two document sets/configs.

I do have a hypothesis, though - are you able to share the configs you used? Is it possible that embeddings_path was the same for both of your tests?

Hisma commented 10 months ago

```yaml
cache_folder: /home/hisma/rag/cache  ## specify a cache folder for embedding models, huggingface and sentence transformers

embeddings:
  embeddings_path: /home/hisma/rag/embeddings  ## specify a folder where embeddings will be saved

  embedding_model:  # Optional embedding model specification, default is e5-large-v2. Swap to a smaller model if out of CUDA memory
    type: sentence_transformer  # other supported types - "huggingface" and "instruct"
    model_name: "BAAI/bge-large-en-v1.5"

  splade_config:  # Optional batch size of sparse embeddings. Reduce if getting out-of-memory errors on CUDA.
    n_batch: 5

  chunk_sizes:  # Specify one or more chunk sizes to split into (querying multi-chunk results will be slower)

semantic_search:
  search_type: similarity  # Currently, only similarity is supported
  replace_output_path:  # Can specify a list of search/replace settings

persist_response_db_path: "/home/hisma/rag/responsedb/response.db"  # optional sqlite database filename. Allows saving responses offline to sqlite, for future analysis.

############ An example of how to use an OpenAI model, requires a .env file with the OpenAI key
llm:
  type: openai
  params:
    prompt_template: |
      Context information is provided below. Given the context information and not prior knowledge, provide a detailed answer to the question.

      ### Context:
      ---------------------
      {context}
      ---------------------

      ### Question: {question}
    model_kwargs:
      temperature: 0.2
      model_name: gpt-3.5-turbo-0301
```

That's my standard config when using ChatGPT. Yes, embeddings_path is consistent across both config files. Are you suggesting that we should be creating multiple databases?

snexus commented 10 months ago

Yes, the embeddings_path should be unique per config file. That's where the vector database embeddings are generated for the specific document collection. Using the same embeddings_path and clicking "update" overwrites the embeddings, which explains the behaviour.
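For example, something along these lines - a minimal sketch, with paths chosen purely for illustration:

```yaml
# config-hydrogen.yaml (sketch - paths are illustrative)
embeddings:
  embeddings_path: /home/hisma/rag/embeddings/hydrogen   # unique index folder for this collection
  document_settings:
    - doc_path: c:\docs\hydrogen

# config-protein.yaml (sketch)
embeddings:
  embeddings_path: /home/hisma/rag/embeddings/protein    # different folder, so the indexes don't overwrite each other
  document_settings:
    - doc_path: c:\docs\protein
```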

Perhaps worth mentioning in the readme...

Hisma commented 10 months ago

Ahh... that's very good to know! Yeah, definitely worth mentioning in the readme.

I will update my configurations with unique embedding paths for each file and hopefully that works. I'll test it later and report the results, then we can close this again.

Hisma commented 10 months ago

Tested - it worked as you suggested. And it makes total sense that you'd want different embeddings paths for different sub-topics. I didn't want to mix different "domains" of information, for fear of making the searches less accurate, especially as the number of files grows.

Please close this and you can merge this change.

snexus commented 10 months ago

Thanks for confirming.

By the way - in case you do need multiple sub-topics in the same config file (and the same embeddings path), that is already supported. You can specify multiple document paths in the config, pointing to different source folders and scan settings, as shown below. The advantage of doing that is that you can filter by label in the UI rather than loading a new configuration file, though for large collections it might be slower.

image
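In other words, something like this - a sketch from memory of the config template, so adjust the paths, labels and other settings to your setup:

```yaml
# Sketch only - field names follow the config template from memory
embeddings:
  embeddings_path: /path/to/shared/embeddings   # one index shared by both collections
  document_settings:
    - doc_path: /docs/hydrogen
      label: hydrogen     # appears as a "document collection" filter in the UI
    - doc_path: /docs/protein
      label: protein
```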

Hisma commented 10 months ago

Wow, I did not know about that either! That makes the "document collection" tag actually useful, since you can filter by tag. I always just had one per config, so I didn't think the feature had any practical use. Now it makes sense.