AndreasKarasenko closed this 5 months ago
Just FYI: if we solve the AttributeError one way or another I can make some headway on using the scikit-learn internals to parallelize the predict function to gain some speedup.
@AndreasKarasenko, regarding the parallelization: implementing it would be a very nice improvement, as OpenAI is more generous with the rate limits and parallel requests won't immediately consume the whole quota. However, I don't think there is a strong need to use any scikit-learn internals for that. `predict()` is basically just a for-loop, which can be parallelized with a `ThreadPoolExecutor` (the GIL is also released during the API call).
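To illustrate the suggestion, here is a minimal sketch of replacing a sequential predict loop with a thread pool. The `_predict_single` function is a hypothetical stand-in for the per-sample API call; the names are not from the actual codebase.

```python
from concurrent.futures import ThreadPoolExecutor

def _predict_single(x):
    # Hypothetical stand-in for the per-sample work in predict();
    # the real method would issue an OpenAI completion request here.
    return f"label-for-{x}"

def predict_parallel(X, num_workers=4):
    # Threads suffice here: the GIL is released while waiting on the
    # network, so API calls overlap without spawning processes.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() preserves input order, matching the sequential loop.
        return list(pool.map(_predict_single, X))

print(predict_parallel(["a", "b", "c"]))
# ['label-for-a', 'label-for-b', 'label-for-c']
```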
I have made the minor updates requested and will work on a separate PR for the parallelization, though I'm not sure how quick I will be. I agree that threads are a better choice than spawning whole processes. I'm also not sure yet how I will handle the KNN part of the Dynamic FSL, but let's see.
For now I have a rudimentary parallel predict that I'll use for a paper over the weekend. `kneighbors` on the memory index should be thread-safe, so calling the retrieval from multiple threads should not be an issue. In that sense, maybe I will add a PR for this?
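To make the thread-safety point concrete, here is a sketch of querying a fitted `sklearn.neighbors.NearestNeighbors` index from multiple worker threads; the toy data is illustrative and not the actual memory index used by the Dynamic FSL estimator.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy memory index standing in for the few-shot example store.
rng = np.random.default_rng(0)
index = NearestNeighbors(n_neighbors=3).fit(rng.random((100, 8)))

def retrieve(query):
    # kneighbors() only reads the fitted index, so concurrent calls
    # from worker threads should be safe.
    _, idx = index.kneighbors(query.reshape(1, -1))
    return idx[0]

queries = rng.random((10, 8))
with ThreadPoolExecutor(max_workers=4) as pool:
    neighbor_ids = list(pool.map(retrieve, queries))
```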
Throttling when too many calls are made would also be nice. The issue remains that the RPM and TPM limits depend on the model and are subject to change.
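Since the limits are model-dependent, one option is a small thread-safe limiter that takes the RPM cap as a parameter rather than hard-coding it. This is only a sketch of the idea, not code from the repository:

```python
import threading
import time

class RateLimiter:
    """Spaces out requests across threads to stay under an RPM cap.

    The cap is a constructor argument because RPM varies by model
    and may change over time.
    """

    def __init__(self, rpm):
        self.interval = 60.0 / rpm
        self._lock = threading.Lock()
        self._next_slot = time.monotonic()

    def acquire(self):
        # Reserve the next send slot under the lock, then sleep
        # outside it so other threads can queue up concurrently.
        with self._lock:
            now = time.monotonic()
            wait = self._next_slot - now
            self._next_slot = max(now, self._next_slot) + self.interval
        if wait > 0:
            time.sleep(wait)

limiter = RateLimiter(rpm=600)  # e.g. at most 10 requests per second
for _ in range(3):
    limiter.acquire()  # blocks just long enough to respect the cap
```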
Hi @AndreasKarasenko
Thanks a lot for the changes. I think we can merge now, but I will have a closer look a bit later.
Regarding the parallelism: `num_workers` as an init argument. So maybe you can immediately add a deprecation warning or something similar (when `num_workers > 1`) to indicate that the API is temporary.
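A minimal sketch of what such a warning could look like, assuming a hypothetical estimator class (the class name and structure here are illustrative, not from the repository):

```python
import warnings

class ParallelEstimator:
    # Hypothetical sketch: num_workers is accepted but flagged as a
    # temporary API surface, as suggested in the review.
    def __init__(self, num_workers=1):
        if num_workers > 1:
            warnings.warn(
                "The num_workers init argument is temporary and may be "
                "replaced in a future release.",
                DeprecationWarning,
                stacklevel=2,
            )
        self.num_workers = num_workers
```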
Fixes #93. Also propagates `key` and `org` to the vectorizer at an earlier stage if they are set through the model. However, `key` and `org` might be exposed more easily. At the same time, it is currently possible to set `key` and `org` as below, which still exposes them through `__dict__`.
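To illustrate the exposure concern with a toy stand-in (this is not the actual estimator class): any value assigned to `self` in `__init__` lands in the instance `__dict__`, so credentials set this way remain readable.

```python
class Classifier:
    # Minimal stand-in estimator: attributes assigned in __init__
    # are stored in the instance __dict__ and are trivially readable.
    def __init__(self, key=None, org=None):
        self.key = key
        self.org = org

clf = Classifier(key="sk-secret", org="org-123")
print(clf.__dict__)  # {'key': 'sk-secret', 'org': 'org-123'}
```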