iryna-kondr / scikit-llm

Seamlessly integrate LLMs into scikit-learn.
https://beastbyte.ai/
MIT License

[BUG] Target parsing for `MultiLabelFewShotGPTClassifier` (`extract_labels` and `_to_numpy`) #114

Closed CarloLepelaars closed 3 weeks ago

CarloLepelaars commented 2 months ago

Hello, I'm having trouble understanding MultiLabelFewShotGPTClassifier. The dummy example with skllm.datasets.get_multilabel_classification_dataset works fine but it breaks as soon as I start applying it to my own data.

Using DataFrame input for y

I'm using a multi-label DataFrame with 2 targets target_0, target_1. The expected label parsing is:

y_train.columns.tolist()
# ['target_0', 'target_1']

y shape is (5, 2) (pd.DataFrame).

However, when fitting the object, extract_labels parses:

clf.fit(X_train, y_train)
clf.classes_
# ['t', 'a', 'r', 'g', 'e', '_', '0', '1']

X shape is (5,) (pd.Series with strings)

Using list or np.array input for y

If I instead input y as an array I get the following error coming from _to_numpy:

clf.fit(X_train, y_train)
# --> y = _to_numpy(y)
# ---> X = np.squeeze(X, axis=tuple([i for i in range(1, len(X.shape))]))
# -> return squeeze(axis=axis)
# ValueError: cannot select an axis to squeeze out which has size not equal to one

The same error occurs when converting y to a list (list[list[str]]).

Do you have an idea why the classes are parsed incorrectly, or why it fails when trying to squeeze? I would be happy to work on a PR, but first I wanted to figure out whether it's a bug or my input is wrong.

AndreasKarasenko commented 2 months ago

To trace your issue back to how scikit-llm works: start here, which leads to here. y_train is a DataFrame, so it matches none of pd.Series, list, or np.ndarray. None of the conversions in to_numpy apply, and it is returned as is. self.classes_ is then built using self._get_unique_targets(y), which leads you here and, since it is multi-label, then here.

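Since your y is the unaltered df, you pass a DataFrame to a nested for loop. Iterating over a DataFrame yields its column names, and iterating over each column name yields its characters, which is why classes_ ends up as a list of single characters.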

from skllm.models.gpt.classification.few_shot import MultiLabelFewShotGPTClassifier
import pandas as pd

X = [
    "I love reading science fiction novels, they transport me to other worlds.", # example 1 - book - sci-fi
    "A good mystery novel keeps me guessing until the very end.", # example 2 - book - mystery
    "Historical novels give me a sense of different times and places.", # example 3 - book - historical
    "I love watching science fiction movies, they transport me to other galaxies.", # example 4 - movie - sci-fi
    "A good mystery movie keeps me on the edge of my seat.", # example 5 - movie - mystery
    "Historical movies offer a glimpse into the past.", # example 6 - movie - historical
]

y = ["books", "books", "books", "movies", "movies", "movies"]
df = pd.DataFrame({"text": X, "label": y})

clf = MultiLabelFewShotGPTClassifier()
clf.fit(df.text, df)
clf.classes_
# > ['t', 'e', 'x', 'l', 'a', 'b']

Note that it does not matter what df.text (the X) contains since the issue is how you pass y.
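The character-level classes can be reproduced without the library. The sketch below is only an illustration: collect_unique is a hypothetical stand-in for the nested loop described above, not scikit-llm's actual code, and the label values are made up.

```python
import pandas as pd

# Made-up multi-label targets with the column names from the original report.
y_df = pd.DataFrame({"target_0": ["a", "b"], "target_1": ["c", "d"]})


def collect_unique(y):
    """Hypothetical stand-in for a nested loop that gathers unique labels."""
    seen = []
    for labels in y:          # iterating a DataFrame yields its column names
        for label in labels:  # iterating a column name (str) yields characters
            if label not in seen:
                seen.append(label)
    return seen


chars = collect_unique(y_df)
print(chars)
# ['t', 'a', 'r', 'g', 'e', '_', '0', '1']

# The same loop over a list of label lists sees the actual labels.
labels = collect_unique(y_df.values.tolist())
print(labels)
# ['a', 'c', 'b', 'd']
```

The first result matches the classes_ output from the original report, which suggests the column names, not the cell values, are what gets iterated.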

If you instead pass a list, the issue in your case is that np.asarray(y_list, dtype=object) returns an array of shape (n, 2) because every sample has the same number of labels (note how y in their example can have 2 or 3 items per sample, making it an array of shape (n,)). Next they squeeze the array, and it fails because axis 1 does not have size 1.
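The shape difference and the resulting error can be checked with NumPy alone. This is a minimal sketch with made-up labels; squeeze_trailing is a hypothetical helper that mirrors the np.squeeze call quoted in the traceback above.

```python
import numpy as np

# Every sample has exactly two labels: NumPy builds a 2-D array.
uniform = [["a", "b"], ["c", "d"], ["e", "f"]]
# Samples have two or three labels: NumPy builds a 1-D array of lists.
ragged = [["a", "b"], ["c", "d", "e"], ["f", "g"]]

uniform_arr = np.asarray(uniform, dtype=object)
ragged_arr = np.asarray(ragged, dtype=object)
print(uniform_arr.shape)  # (3, 2)
print(ragged_arr.shape)   # (3,)


def squeeze_trailing(arr):
    """Squeeze every axis after the first, as in the quoted traceback."""
    return np.squeeze(arr, axis=tuple(range(1, arr.ndim)))


# 1-D input: the axis tuple is empty, so nothing is squeezed.
squeezed = squeeze_trailing(ragged_arr)

# 2-D input: axis 1 has size 2, so NumPy refuses to squeeze it.
error = None
try:
    squeeze_trailing(uniform_arr)
except ValueError as exc:
    error = exc
    print(exc)
```

So the failure depends only on whether all samples happen to have the same number of labels, not on the label values themselves.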

Hope that helps you debug it.

tldr: scikit-llm does not support DataFrames.

EDIT: just for the sake of it I commented out lines 25-27 and it at least produces the classes. I have not tested actual prediction though.

OKUA1 commented 3 weeks ago

This should be fixed by #117 at least when passing the labels as a list (either flat or 2d).
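For anyone starting from a DataFrame, one way to get a 2D list is to convert the targets before calling fit. This is a sketch under assumptions: the column names come from the original report, the label values are made up, and each cell is assumed to hold a single label string. Whether fit accepts the result depends on the installed version including the fix mentioned above.

```python
import pandas as pd

# Made-up targets; each cell is assumed to hold one label string.
y_train = pd.DataFrame(
    {
        "target_0": ["books", "movies"],
        "target_1": ["sci-fi", "mystery"],
    }
)

# One inner list of labels per sample, instead of a DataFrame.
y_list = y_train.values.tolist()
print(y_list)
# [['books', 'sci-fi'], ['movies', 'mystery']]

# The list would then be passed in place of the DataFrame, e.g.:
# clf.fit(X_train, y_list)
```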