Dynamic FewShot GPTClassifier: does it cache Embd locally?

iryna-kondr / scikit-llm

Seamlessly integrate LLMs into scikit-learn.

https://beastbyte.ai/

MIT License


Dynamic FewShot GPTClassifier: does it cache Embd locally? #55

Open KennyNg-19 opened 1 year ago

KennyNg-19 commented 1 year ago

I wonder:

Does DynamicFewShotGPTClassifier cache the OpenAI embeddings locally the first time it is called?
And can we access them? The embeddings could be reused in other cases, which would save some budget.

iryna-kondr commented 1 year ago

Hello, @KennyNg-19. The embeddings are stored inside the estimator and in theory can be accessed. However, reusing them for other use cases might not be easily achievable. Could you elaborate on how exactly you would like to reuse the embeddings?

KennyNg-19 commented 1 year ago

Hi, @iryna-kondr. Since the whole dataset is embedded before the few-shot classifier runs, those locally cached embeddings could be used in other downstream tasks after the classification task, such as semantic search or similarity comparison.

If we cannot use the embeddings generated here, the embedding function (especially a paid API service) will be called again, which increases cost.
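To make the cost concern concrete, here is a minimal sketch of a disk-backed cache that calls an embedding function only on a cache miss. It is not part of scikit-llm: `EmbeddingCache`, `fake_embed`, and the JSON file layout are hypothetical names chosen for illustration, and `fake_embed` stands in for a real (paid) embedding call so the example runs offline.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


class EmbeddingCache:
    """Disk-backed cache keyed by a hash of the input text (hypothetical helper)."""

    def __init__(self, path: str, embed_fn: Callable[[str], List[float]]):
        self.path = Path(path)
        self.embed_fn = embed_fn  # e.g. a wrapper around a paid embedding API
        self.store: Dict[str, List[float]] = {}
        if self.path.exists():
            # Reuse embeddings written by an earlier run.
            self.store = json.loads(self.path.read_text(encoding="utf-8"))

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str) -> List[float]:
        key = self._key(text)
        if key not in self.store:
            # Cache miss: the only place the embedding function is called.
            self.store[key] = self.embed_fn(text)
            self.path.write_text(json.dumps(self.store), encoding="utf-8")
        return self.store[key]


calls: List[str] = []  # records how often the stand-in "API" is hit


def fake_embed(text: str) -> List[float]:
    """Stand-in for a real embedding call; returns a made-up 2-d vector."""
    calls.append(text)
    return [float(len(text)), float(text.count(" "))]


cache_file = os.path.join(tempfile.mkdtemp(), "embeddings.json")
cache = EmbeddingCache(cache_file, fake_embed)
first = cache.get("hello world")    # miss: calls fake_embed once
second = cache.get("hello world")   # hit: served from memory
reloaded = EmbeddingCache(cache_file, fake_embed).get("hello world")  # hit: served from disk
```

Rewriting the whole JSON file on every miss keeps the sketch short; for a large dataset, a batched write or a proper key-value store would likely be a better fit.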

math-sasso commented 1 year ago

I am having the same problem. I don't want to recreate the embeddings on every request. I want to do it once and reuse both the embeddings and the fitted classifier for future calls in my system.

AndreasKarasenko commented 6 months ago

One additional point to consider: if we rerun experiments at a later date, it would be nice to simply point to pre-existing embeddings instead of re-embedding the data. So same exact task, same exact data.

@iryna-kondr is this something you might consider implementing?

iryna-kondr commented 5 months ago

Hi, @AndreasKarasenko. You can pickle the estimator (with embeddings) and then load it at a later date. See our discussion here: https://discord.com/channels/1112768381406425138/1125476385750782012/1125478710427009044
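The pickling approach can be sketched as a plain round trip with Python's standard `pickle` module. To keep the example runnable without an API key, a `types.SimpleNamespace` stands in for the fitted estimator; its `embeddings` and `labels` attributes are made up for illustration and are not scikit-llm's actual attribute names. In real use, `clf` would be the fitted `DynamicFewShotGPTClassifier`.

```python
import os
import pickle
import tempfile
from types import SimpleNamespace

# Hypothetical stand-in for a fitted estimator that holds its embeddings.
clf = SimpleNamespace(
    embeddings=[[0.1, 0.2], [0.3, 0.4]],  # made-up vectors
    labels=["pos", "neg"],
)

path = os.path.join(tempfile.mkdtemp(), "clf.pkl")

# Save once, after fitting (i.e. after the embeddings were paid for).
with open(path, "wb") as f:
    pickle.dump(clf, f)

# Load in a later session instead of refitting and re-embedding.
with open(path, "rb") as f:
    restored = pickle.load(f)
```

Two general caveats about pickle apply: only unpickle files you trust, since loading can execute arbitrary code, and a pickled object may fail to load if the library versions differ between saving and loading.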

AndreasKarasenko commented 5 months ago

Thanks for the info! Based off of that I figured out a way to get the data and embedding lists so I can store them locally. I think this issue can be closed now?
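For readers landing here later, a minimal sketch of storing such lists locally and reusing them for a similarity lookup, one of the downstream uses mentioned earlier in the thread. It assumes you have already extracted parallel lists of texts and embeddings; how to pull them out of the estimator is not shown in this thread, so the `texts` and `embeddings` values below are made-up placeholders, and `cosine` and `most_similar` are hypothetical helper names.

```python
import json
import math
import os
import tempfile

# Placeholder data: in practice these would come from the fitted estimator.
texts = ["great movie", "terrible plot", "loved it"]
embeddings = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]]  # made-up vectors

path = os.path.join(tempfile.mkdtemp(), "embeddings.json")

# Store texts and embeddings side by side so they stay aligned.
with open(path, "w", encoding="utf-8") as f:
    json.dump({"texts": texts, "embeddings": embeddings}, f)

# Later: load them back without calling the embedding API again.
with open(path, encoding="utf-8") as f:
    saved = json.load(f)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def most_similar(query_vec, data):
    """Return the stored text whose embedding is closest to query_vec."""
    scores = [cosine(query_vec, e) for e in data["embeddings"]]
    best = max(range(len(scores)), key=scores.__getitem__)
    return data["texts"][best]


nearest = most_similar([1.0, 0.0], saved)
```

JSON is used here only because it is human-readable and needs no extra dependencies; a binary array format would be more compact for large embedding sets.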