neuml / txtai

💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows
https://neuml.github.io/txtai
Apache License 2.0

Using locally saved vector model #467

Closed AndromedaGit closed 1 year ago

AndromedaGit commented 1 year ago

I've successfully built a search engine project using txtai. I've incorporated this script line:

model = VectorsFactory.create({"path": "sentence-transformers/all-MiniLM-L6-v2"}, None)

...which creates a local model folder (models--sentence-transformers--all-MiniLM-L6-v2) as expected.

Because of the time required for this step (which delays my search results), I would like to load the local folder copy now that it exists rather than "re-create" everything with the script line above. I notice that if I remove the local folder, this script line does create it again. However, every attempt at removing that script line causes the program to fail.

Is there a way to reference the local folder copy to define the "model" object above without rebuilding everything each time with VectorsFactory.create? Thanks so much for any guidance you can give.

davidmezzetti commented 1 year ago

Thank you for reaching out with the question.

The code you are sharing doesn't build any models, it downloads sentence-transformers/all-MiniLM-L6-v2 to a cache folder. Each time you call VectorsFactory.create, it loads the saved model into memory, which does take time. Perhaps there is a way to cache the result of the VectorsFactory.create call.

Calling this method isn't typical. Usually you would work through an Embeddings instance, which does much of the heavy lifting.
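For reference, here is a rough sketch of that pattern: index and save once, then load the saved index at query time (the documents and index path here are illustrative):

```python
from txtai.embeddings import Embeddings

documents = ["first document", "second document", "third document"]

# One-time build: vectorize the documents and save the index to disk
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2"})
embeddings.index([(uid, text, None) for uid, text in enumerate(documents)])
embeddings.save("index")

# Query time: load the saved index instead of wiring up the vector model by hand
# (the vector model still loads here, but the document vectors aren't recomputed)
embeddings = Embeddings()
embeddings.load("index")
results = embeddings.search("first", 1)  # list of (id, score) tuples
```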

AndromedaGit commented 1 year ago

Thanks so much for your reply, David. I'm very new to AI, txtai, etc., so I apologize if some of my issue is just naivete on my part :-) That script line:

model = VectorsFactory.create({"path": "sentence-transformers/all-MiniLM-L6-v2"}, None)

...comes from my building a search engine project based on your demo here: https://neuml.hashnode.dev/embeddings-index-components

Oops! My mistake, see next comment...

AndromedaGit commented 1 year ago

Hi David, Sorry, my mistake. On closer examination, the script line which is largely delaying my search execution (6 seconds out of a total of 8 seconds) is actually this line:

from txtai.vectors import VectorsFactory

Is it possible I'm somehow still going outside (i.e., on the internet) for something there? Is there anything I can do to optimize, cache, etc. that you can think of? (Imports of everything else, like numpy, ANNFactory, etc., seem to execute almost instantly.) Thanks so much for any ideas you may have.

davidmezzetti commented 1 year ago

That is odd. I've never seen an import take that long, repeatedly.

The only thing I can think of doing is opening a Python command prompt and individually importing modules from this file (https://github.com/neuml/txtai/blob/master/src/python/txtai/vectors/factory.py) until you can see which one is taking a long time.
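If it helps, something along these lines in a fresh interpreter should narrow down which dependency is slow (Python's built-in `-X importtime` flag gives a more detailed breakdown of the same thing):

```python
import time

# Import the heavy dependencies one at a time and report each cost.
# Order matters: once a module is imported, later imports that share
# it appear fast, so the big number shows up at the first culprit.
for name in ("numpy", "torch", "transformers", "sentence_transformers", "txtai.vectors"):
    start = time.time()
    __import__(name)
    print(f"{name}: {time.time() - start:.2f}s")
```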

AndromedaGit commented 1 year ago

Thanks for that, David! I'm not great with command-line work around imports. I'm already using pre-saved corpus embeddings, so, in the actual search program, my only use of anything related to VectorsFactory after its import is:

model = VectorsFactory.create({"path": "sentence-transformers/all-MiniLM-L6-v2"}, None)
query = model.encode([qstring])

(Those two lines actually execute very fast.) That called-out transformer model should already be saved in a local folder and pointed to by a previous line in my code: os.environ['TRANSFORMERS_CACHE'] = 'C:/... if I'm understanding correctly how that part works.
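For completeness, here's roughly how I have that set up at the very top of the file (the path shown is just an illustrative stand-in for mine), since my understanding is the cache variable has to be set before anything from txtai/transformers is imported:

```python
import os

# Set the cache location BEFORE any txtai/transformers import;
# the value is read when those modules first load.
os.environ["TRANSFORMERS_CACHE"] = "C:/models/cache"  # illustrative path

from txtai.vectors import VectorsFactory
```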

So I'm thinking that the import is pulling in (and possibly executing) lots of things I don't need in this specific use case, given that I'm willing to restrict this program to always using "sentence-transformers/all-MiniLM-L6-v2".

With that in mind, I started peeling down imports and inserting some classes/defs/etc. directly into my py file to see what still works. I got down to having just the TransformersVectors(Vectors) class inserted directly into my program, with the line from above modified to:

model = TransformersVectors({"path": "sentence-transformers/all-MiniLM-L6-v2"}, None)

The imports under there still take about 6-7 seconds. I got stuck trying to shortcut the next step, but I suspect the long delay comes from an import within the models imports?

Possibly I'm building things under there rather than simply grabbing "sentence-transformers/all-MiniLM-L6-v2" from what should exist in a local folder? In my particular case of using an already cached "sentence-transformers/all-MiniLM-L6-v2", is there some minimal amount of imported code that would allow me to run these lines?

model = TransformersVectors({"path": "sentence-transformers/all-MiniLM-L6-v2"}, None)
query = model.encode([qstring])

I'm not sure exactly what the model = line really produces, but it seems I just need an object defined which incorporates "sentence-transformers/all-MiniLM-L6-v2", hopefully from my already cached folder. Is there a simpler way to make that happen? I'm certainly willing to peel apart some py files under txtai or do some very directed importing for this particular use case.
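In case it clarifies what I'm after, the most stripped-down version I can picture skips txtai's factory entirely and calls sentence-transformers (which I believe TransformersVectors wraps for this model) directly, though I assume this still drags in the same big imports underneath:

```python
from sentence_transformers import SentenceTransformer

# Loads from the local cache once downloaded; no txtai involved
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
query = model.encode(["my query string"])  # numpy array, shape (1, 384)
```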

Any guidance would be greatly appreciated!

AndromedaGit commented 1 year ago

Hi David, Back to this issue of lengthy FULL imports for txtai:

In my particular case of using an already cached "sentence-transformers/all-MiniLM-L6-v2", is there some minimal amount of imported code that would allow me to run these lines?

model = TransformersVectors({"path": "sentence-transformers/all-MiniLM-L6-v2"}, None)
query = model.encode([qstring])

It seems a large part of the delay is the import of torch and its dependencies. Since I'm only doing the above (i.e., encoding a query string with an already cached model), is anything to do with torch really needed? I'm hoping to modify some py code under my txtai install to remove dependencies, and thus remove imports that are huge and, in my case, unnecessary.

Any guidance on torch and/or any other large imports not specifically needed to encode with the above canned/cached model would be greatly appreciated! Anything to do with creating or training models would certainly be fair game, I would think; hopefully lots more.
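For what it's worth, the one torch-free route I've seen mentioned elsewhere is exporting the model to ONNX once up front: the export itself needs torch, but query-time encoding then only needs onnxruntime and a tokenizer. A rough sketch of what I have in mind, assuming a prior one-time export (the paths are illustrative):

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer  # tokenizers don't need torch

# Assumes a prior one-time export, e.g.:
#   optimum-cli export onnx --model sentence-transformers/all-MiniLM-L6-v2 onnx_model
tokenizer = AutoTokenizer.from_pretrained("onnx_model")
session = ort.InferenceSession("onnx_model/model.onnx")

def encode(texts):
    # Tokenize to numpy arrays for onnxruntime
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="np")
    hidden = session.run(None, dict(inputs))[0]  # token embeddings: (batch, tokens, 384)

    # Mean pooling over tokens, weighted by the attention mask, then
    # L2 normalization, mirroring this model's pooling configuration
    mask = inputs["attention_mask"][..., None].astype(np.float32)
    vectors = (hidden * mask).sum(axis=1) / mask.sum(axis=1)
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

query = encode(["my query string"])
```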

Thanks!

davidmezzetti commented 1 year ago

Closing this due to inactivity. Please re-open or open a new issue to continue the conversation.