Closed CarloLepelaars closed 3 weeks ago
To trace your issue back to how scikit-llm works:
Start here, which leads to here.
y_train is a DataFrame, so it matches none of pd.Series, list, or np.ndarray. None of the conversions in to_numpy apply, and it is returned as is.
self.classes_ is then built using self._get_unique_targets(y), which leads you here and, since it is multi-label, then here.
Since your y is the unaltered df, you pass a DataFrame to a nested for loop. Iterating over a DataFrame yields its column names, and iterating over each column name yields its characters, so the classes end up being single letters:
from skllm.models.gpt.classification.few_shot import MultiLabelFewShotGPTClassifier
import pandas as pd
X = [
"I love reading science fiction novels, they transport me to other worlds.", # example 1 - book - sci-fi
"A good mystery novel keeps me guessing until the very end.", # example 2 - book - mystery
"Historical novels give me a sense of different times and places.", # example 3 - book - historical
"I love watching science fiction movies, they transport me to other galaxies.", # example 4 - movie - sci-fi
"A good mystery movie keeps me on the edge of my seat.", # example 5 - movie - mystery
"Historical movies offer a glimpse into the past.", # example 6 - movie - historical
]
y = ["books", "books", "books", "movies", "movies", "movies"]
df = pd.DataFrame({"text": X, "label": y})
clf = MultiLabelFewShotGPTClassifier()
clf.fit(df.text, df)
clf.classes_
# > ['t', 'e', 'x', 'l', 'a', 'b']
Note that it does not matter what df.text (the X) contains since the issue is how you pass y.
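The single-character classes can be reproduced with pandas alone. The snippet below is a minimal sketch, not scikit-llm's actual code: the nested loop is a hypothetical stand-in for the unique-target collection, and the only thing it demonstrates is what iterating over a DataFrame yields.

```python
import pandas as pd

df = pd.DataFrame({"text": ["a", "b"], "label": ["books", "movies"]})

# Iterating over a DataFrame yields its column names, not its rows.
outer = list(df)

# Hypothetical stand-in for a nested "collect unique labels" loop:
# the outer loop sees column names, the inner loop sees their characters.
seen = []
for labels in df:
    for label in labels:
        if label not in seen:
            seen.append(label)

print(outer)
print(seen)
```

The values in the columns never enter the loop, which is why the content of X and y makes no difference to the result.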
If you instead pass a list, the issue in your case is that np.asarray(y_list, dtype=object) returns an array of shape (n, 2), because every sample has the same number of labels (note how y in their example has 2 or 3 items per sample, making it an array of shape (n,)).
Next they flatten the array, and it fails because axis=1 is not of size 1.
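The shape difference is easy to check with NumPy alone. This is a minimal sketch with made-up labels; it assumes the flattening step behaves like np.squeeze(arr, axis=1), which is a guess based on the squeeze error mentioned in the question rather than a copy of the library code.

```python
import numpy as np

# Every sample has exactly two labels: NumPy builds a 2-D array.
uniform = [["a", "b"], ["c", "d"], ["e", "f"]]
# Samples have two or three labels: NumPy builds a 1-D array of lists.
ragged = [["a", "b"], ["c", "d", "e"], ["f", "g"]]

uniform_arr = np.asarray(uniform, dtype=object)
ragged_arr = np.asarray(ragged, dtype=object)
print(uniform_arr.shape)
print(ragged_arr.shape)

# Assumed flattening step: squeezing axis 1 only works if that axis has size 1.
try:
    np.squeeze(uniform_arr, axis=1)
    squeeze_failed = False
except ValueError as err:
    squeeze_failed = True
    print(err)
```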
Hope that helps you debug it.
tldr: scikit-llm does not support DataFrames.
EDIT: just for the sake of it I commented out lines 25-27 and it at least produces the classes. I have not tested actual prediction though.
This should be fixed by #117 at least when passing the labels as a list (either flat or 2d).
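If your labels live in a DataFrame, converting them to a plain list before fitting avoids the DataFrame path entirely. The sketch below uses only pandas; the column names come from the question, the values are made up, and whether fit accepts the result depends on the fix mentioned above being in your installed version.

```python
import pandas as pd

# Hypothetical label frame with the two target columns from the question.
y_df = pd.DataFrame(
    {
        "target_0": ["a", "b", "a"],
        "target_1": ["x", "y", "y"],
    }
)

# 2-D list: one inner list of labels per sample.
y_2d = y_df.to_numpy().tolist()

# Flat list: a single label per sample, taken from one column.
y_flat = y_df["target_0"].tolist()

print(y_2d)
print(y_flat)
```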
Hello, I'm having trouble understanding MultiLabelFewShotGPTClassifier. The dummy example with skllm.datasets.get_multilabel_classification_dataset works fine, but it breaks as soon as I start applying it to my own data.
Using DataFrame input for y
I'm using a multi-label DataFrame with 2 targets: target_0, target_1. The expected label parsing is:
y shape is (5, 2) (pd.DataFrame).
However, when fitting the objects, extract_labels parses:
X shape is (5,) (pd.Series with strings)
Using list or np.array input for y
If I instead input y as an array, I get the following error coming from _to_numpy:
The same error occurs when converting y to a list (list[list[str]]).
Do you have an idea why the classes are incorrectly parsed, or why it fails on trying to squeeze? Would be happy to work on a PR, but first wanted to figure out if it's a bug or my input is wrong.