shaypal5 / skift

scikit-learn wrappers for Python fastText.
MIT License
234 stars, 23 forks

Add multi-label support #13

Open shaypal5 opened 4 years ago

shaypal5 commented 4 years ago

Add support for providing multi-label targets in a scikit-learn-compliant format, utilizing (under the hood) fastText's support for multi-label scenarios.

e412 commented 2 years ago

Hi, is this implemented? I am having issues with the multilabel case. Transforming the labels with MultiLabelBinarizer leads to the error: ValueError: FastTextClassifier methods must get a one-dimensional numpy array as the y parameter.

What can I do?

Thank you very much.

e412 commented 2 years ago

Or do you have any other recommendation for how to cross-validate the results of fastText supervised training (multilabel)? I have been looking for a solution for weeks now... Any help is very much appreciated.

Kind Regards, Eva

shaypal5 commented 2 years ago

Hey Eva!

I'll try to help you as best as I can. However, I don't have the time to implement it right now. I can guide you through contributing the code yourself. :)

First, as the issue is open, it shouldn't come as a surprise that this isn't implemented.

As you can see in this example file from the fastText tutorial for text classification, this is the format for multilabel problems:

__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
__label__food-safety __label__acidity Dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?

So, very much like the multiclass format, just with multiple __label__ tags at the start of each line.
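
To make the format concrete, here is a minimal sketch that splits one such line back into its labels and its text. The helper name `parse_fasttext_line` is made up for illustration and is not part of skift or fastText; it also assumes that all `__label__` tokens come first on the line, as in the examples above.

```python
LABEL_PREFIX = "__label__"  # fastText's default label prefix


def parse_fasttext_line(line):
    """Split one training line into (labels, text).

    Assumes all label tokens come first, as in the example lines above.
    """
    tokens = line.split()
    labels = []
    for token in tokens:
        if not token.startswith(LABEL_PREFIX):
            break  # first non-label token: the text starts here
        labels.append(token[len(LABEL_PREFIX):])
    text = " ".join(tokens[len(labels):])
    return labels, text


labels, text = parse_fasttext_line(
    "__label__sauce __label__cheese "
    "How much does potato starch affect a cheese sauce recipe?"
)
print(labels)
print(text)
```

A multiclass line is just the special case where the returned list has exactly one label.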

Two main areas of code in skift require adaptation for multilabel problems to be supported:

  1. The FtClassifierABC class must be adapted to accept y arguments that are also of shape (n_samples, n_outputs), as in sklearn:
     y: array-like of shape (n_samples,) or (n_samples, n_outputs)
     This includes such methods as _validate_y and fit.
  2. The util.dump_xy_to_fasttext_format() function must be adapted to properly dump multilabel targets, in the format I linked to above.
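
As a rough illustration of the dumping step, the sketch below writes a binary indicator matrix of shape (n_samples, n_outputs) in the multilabel format. It is not skift's actual code: the names `indicator_rows_to_labels` and `write_multilabel_fasttext`, and their signatures, are hypothetical. It assumes `y` is two-dimensional and that `classes` lists the class names in column order (for example, the `classes_` attribute of a fitted MultiLabelBinarizer).

```python
import io

LABEL_PREFIX = "__label__"  # fastText's default label prefix


def indicator_rows_to_labels(y, classes):
    """Turn each row of a binary indicator matrix into a list of label names."""
    label_lists = []
    for row in y:
        if len(row) != len(classes):
            raise ValueError("each row of y must have one column per class")
        label_lists.append([str(c) for c, flag in zip(classes, row) if flag])
    return label_lists


def write_multilabel_fasttext(X, y, classes, out):
    """Write one fastText training line per sample to the file-like object out."""
    for text, labels in zip(X, indicator_rows_to_labels(y, classes)):
        tags = [LABEL_PREFIX + label for label in labels]
        # fastText reads one sample per line, so collapse any inner newlines.
        words = str(text).split()
        out.write(" ".join(tags + words) + "\n")


X = [
    "How much does potato starch affect a cheese sauce recipe?",
    "How do I cover up the white spots on my cast iron stove?",
]
classes = ["cast-iron", "cheese", "sauce", "stove"]
y = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
]

buf = io.StringIO()
write_multilabel_fasttext(X, y, classes, buf)
print(buf.getvalue(), end="")
```

Writing to a file-like object keeps the sketch easy to test; an adapted dump function would presumably open the target path and pass the handle in the same way.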

e412 commented 2 years ago

Hi Shaypal,

thanks a lot for replying!

I already have my data in the correct format. But unfortunately I don't think I am able to implement the feature by myself.

Do you by any chance have some experience performing cross-validation on the outcome of fastText supervised training? That is the reason I was looking into this wrapper class. I couldn't find a lot of up-to-date information regarding validation of fastText.

Cheers Eva