[BUG] Target parsing for `MultiLabelFewShotGPTClassifier` (`extract_labels` and `_to_numpy`)

iryna-kondr / scikit-llm

Seamlessly integrate LLMs into scikit-learn.

Hello, I'm having trouble understanding `MultiLabelFewShotGPTClassifier`. The dummy example with `skllm.datasets.get_multilabel_classification_dataset` works fine, but it breaks as soon as I apply it to my own data.

Using DataFrame input for `y`

I'm using a multi-label DataFrame with two targets, `target_0` and `target_1`. The expected label parsing is:

y_train.columns.tolist()
# ['target_0', 'target_1']

`y` has shape (5, 2) (`pd.DataFrame`).

However, when fitting the classifier, `extract_labels` parses the classes as individual characters of the column names:

clf.fit(X_train, y_train)
clf.classes_
# ['t', 'a', 'r', 'g', 'e', '_', '0', '1']

X shape is (5,) (pd.Series with strings)
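To illustrate what may be happening here, below is a minimal sketch, not the library's actual `extract_labels` code. It assumes the labels are collected by iterating over `y` and then over each item. The helper name `unique_in_order` and the cell values are hypothetical. Iterating over a DataFrame yields its column names, and iterating over a string yields its characters, which reproduces the output above.

```python
import pandas as pd


def unique_in_order(labels_per_sample):
    """Collect unique labels, assuming each item is an iterable of labels.

    Hypothetical helper for illustration only; it is not taken from skllm.
    """
    seen = []
    for sample in labels_per_sample:
        for label in sample:
            if label not in seen:
                seen.append(label)
    return seen


# Illustrative frame with the same column names as in the report
# (the cell values are made up).
y_train = pd.DataFrame({"target_0": ["a", "b"], "target_1": ["c", "d"]})

# Iterating a DataFrame yields column names, so each "sample" is a string
# such as "target_0", and iterating that string yields single characters.
print(unique_in_order(y_train))
# ['t', 'a', 'r', 'g', 'e', '_', '0', '1']
```

The same helper returns the expected labels when each sample really is a list of labels, for example `unique_in_order([["books"], ["movies"]])`.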

Using `list` or `np.array` input for `y`.

If I instead pass `y` as an array, I get the following error from `_to_numpy`:

clf.fit(X_train, y_train)
# --> y = _to_numpy(y)
# ---> X = np.squeeze(X, axis=tuple([i for i in range(1, len(X.shape))]))
# -> return squeeze(axis=axis)
# ValueError: cannot select an axis to squeeze out which has size not equal to one

The same error occurs when converting `y` to a list (`list[list[str]]`).
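The squeeze failure can be reproduced outside the library. The sketch below mirrors only the `np.squeeze` expression quoted in the traceback; the wrapper name `squeeze_trailing_axes` and the sample arrays are hypothetical. Squeezing every axis after the first only works when those axes have size one, so an array with two label columns raises the `ValueError`.

```python
import numpy as np


def squeeze_trailing_axes(arr):
    """Squeeze every axis after the first, as in the quoted traceback line.

    Hypothetical wrapper for illustration; not the library's `_to_numpy`.
    """
    arr = np.asarray(arr)
    return np.squeeze(arr, axis=tuple(range(1, arr.ndim)))


# One label column: the trailing axis has size 1 and can be squeezed.
single = np.array([["a"], ["b"], ["c"]])
print(squeeze_trailing_axes(single).shape)  # (3,)

# Two label columns: the trailing axis has size 2, so squeezing fails.
multi = np.array([["a", "b"], ["c", "d"]])
try:
    squeeze_trailing_axes(multi)
except ValueError as exc:
    print(f"ValueError: {exc}")
```

A `list[list[str]]` in which every inner list has the same length converts to the same 2-D array under `np.asarray`, which would be consistent with the list input failing in the same way.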

Do you have an idea why the classes are parsed incorrectly, or why it fails when trying to squeeze? I would be happy to work on a PR, but first I wanted to figure out whether it's a bug or my input is wrong.

from skllm.models.gpt.classification.few_shot import MultiLabelFewShotGPTClassifier
import pandas as pd

X = [
    "I love reading science fiction novels, they transport me to other worlds.",  # example 1 - book - sci-fi
    "A good mystery novel keeps me guessing until the very end.",  # example 2 - book - mystery
    "Historical novels give me a sense of different times and places.",  # example 3 - book - historical
    "I love watching science fiction movies, they transport me to other galaxies.",  # example 4 - movie - sci-fi
    "A good mystery movie keeps me on the edge of my seat.",  # example 5 - movie - mystery
    "Historical movies offer a glimpse into the past.",  # example 6 - movie - historical
]
y = ["books", "books", "books", "movies", "movies", "movies"]

df = pd.DataFrame({"text": X, "label": y})

clf = MultiLabelFewShotGPTClassifier()
clf.fit(df.text, df)
clf.classes_
# > ['t', 'e', 'x', 'l', 'a', 'b']

[BUG] Target parsing for `MultiLabelFewShotGPTClassifier` (`extract_labels` and `_to_numpy`) #114