BIGdeadLock closed this issue 5 months ago
Never mind, the problem was that I was not using a one-hot vector for the label.
Hello, I just came across this exact problem. If I understood correctly, to fix the issue you one-hot encoded the labels, converted to pandas and then to Dataset, and it just worked? In my case, after one-hot encoding, the SparseVector type is not recognized:
Problem: 6 nominal classes
Steps:
Dataset type for SetFit, get a new error: ArrowInvalid: ('Could not convert SparseVector(6, {2: 1.0}) with type SparseVector: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column label_vector with type object')
Packages:
Both keeping the labels nominal and converting them to numerical trigger the error you mentioned, TypeError: 'numpy.bool_' object is not iterable. It works fine for binary text classification.
Could you post the solution and a data sample of the working dataset?
@ozefreitas
I did not use pyspark for the one-hot encoding; I did it with numpy. It seems pyspark converts the labels to a sparse vector, which may be causing the problem. Try using something else, like numpy, for the encoding.
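For anyone landing here, a minimal sketch of what numpy-based one-hot encoding could look like. This is not the exact code used above: the helper name `one_hot_labels`, the toy texts and the class ids are made up for illustration, and it assumes the labels are already integer class ids in the range 0..num_classes-1.

```python
import numpy as np


def one_hot_labels(labels, num_classes):
    """Return a dense (n_samples, num_classes) one-hot matrix for integer class ids."""
    labels = np.asarray(labels, dtype=np.int64)
    if labels.min() < 0 or labels.max() >= num_classes:
        raise ValueError("labels must lie in [0, num_classes)")
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype=np.float32)[labels]


# Hypothetical toy data: three texts, 6 classes (as in the question above).
texts = ["first document", "second document", "third document"]
class_ids = [2, 0, 5]

encoded = one_hot_labels(class_ids, num_classes=6)

# Plain Python lists of floats are values Arrow can infer a type for,
# unlike pyspark's SparseVector objects.
records = {"text": texts, "label": encoded.tolist()}
```

A dict like `records` should then be convertible with `datasets.Dataset.from_dict(records)`. Whether SetFit needs one-hot labels at all may depend on the version and on how the model head is configured, so treat this as something to try rather than a confirmed fix.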
train_dataset
I am confused about this. I thought we simply provide {'text':
Model Loading:
Dataset creation:
def create_setfit_dataset(documents):
Training:
I am trying to train the model on a multiclass problem and keep getting the error:
A training example:
{'text': <text>, 'label': 10}